diff --git a/bookdown_files/figure-html/unnamed-chunk-244-1.png b/bookdown_files/figure-html/unnamed-chunk-246-1.png
similarity index 100%
rename from bookdown_files/figure-html/unnamed-chunk-244-1.png
rename to bookdown_files/figure-html/unnamed-chunk-246-1.png
diff --git a/c04-set-up.html b/c04-set-up.html
index 2134921b..9050e58b 100644
--- a/c04-set-up.html
+++ b/c04-set-up.html
@@ -526,23 +526,23 @@

4.2.1 American National Election Studies (ANES)

The ANES is a study that collects data from election surveys dating back to 1948. These surveys contain information on public opinion and voting behavior in U.S. presidential elections. They cover topics such as party affiliation, voting choice, and level of trust in the government. The 2020 survey, the data we use in the book, was fielded online, through live video interviews, or via computer-assisted telephone interviews (CATI).

When working with new survey data, analysts should review the survey documentation (see Chapter 3) to understand the data collection methods. The original ANES data contains variables starting with V20 (DeBell 2010), so to assist with our analysis throughout the book, we created descriptive variable names. For example, the respondent’s age is now in a variable called Age, and gender is in a variable called Gender. These descriptive variables are included in the {srvyrexploR} package, and Table 4.1 displays the list of these renamed variables. A complete overview of all variables can be found in Appendix A.
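As a quick orientation, below is a minimal sketch of loading the package and peeking at a few of the renamed variables (assuming {srvyrexploR} and {dplyr} are installed; the selected columns are just examples):

library(dplyr)
library(srvyrexploR)

data(anes_2020)

# Look at a few of the descriptive (renamed) variables
anes_2020 %>%
  select(Age, Gender, PartyID) %>%
  glimpse()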

@@ -1035,23 +1035,23 @@

4.2.2 Residential Energy Consumption Survey (RECS)

RECS is a study that measures energy consumption and expenditure in American households. Funded by the Energy Information Administration, the RECS data are collected through interviews with household members and energy suppliers. These interviews take place in person, over the phone, via mail, and on the web. The survey has been fielded 14 times between 1950 and 2020. It includes questions about appliances, electronics, heating, air conditioning (A/C), temperatures, water heating, lighting, energy bills, respondent demographics, and energy assistance.

As mentioned above, analysts should read the survey documentation (see Chapter 3) to understand how the data were collected and how the survey was implemented. Table 4.2 displays the list of variables in the RECS data (not including the weights, which start with NWEIGHT and will be described in more detail in Chapter 10). An overview of all variables can be found in Appendix B.
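As a rough sketch (assuming the packages and data are loaded as in the previous section), we can confirm how many weight columns accompany the data before turning to Chapter 10:

data(recs_2020)

# Count the weight columns: NWEIGHT plus the replicate weights NWEIGHT1-NWEIGHT60
recs_2020 %>%
  select(starts_with("NWEIGHT")) %>%
  ncol()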

diff --git a/c06-statistical-testing.html b/c06-statistical-testing.html
index fa37985b..e82abc6c 100644
--- a/c06-statistical-testing.html
+++ b/c06-statistical-testing.html
@@ -1001,23 +1001,23 @@

Example 2: Test of Independence
`Some of the time` = md("Some of<br />the time"))

chi_ex2_table
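The chunk that builds chi_ex2_table is elided in this diff; for context, a minimal sketch of the design-based test of independence behind such a table (the anes_des design object and the variable pairing are assumptions, not the book's exact code) could be:

library(survey)

# Design-based chi-squared test of independence
chi_ex2 <- svychisq(
  formula = ~ TrustGovernment + TrustPeople,  # hypothetical variable pairing
  design = anes_des,                          # assumed survey design object
  statistic = "Wald"
)
chi_ex2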
@@ -1515,23 +1515,23 @@

Example 2: Test of Independence
tab_options(page.orientation = "landscape")

chi_ex2_obs_table
@@ -2084,23 +2084,23 @@

Example 3: Test of Homogeneity
tab_stubhead(label = "Age Group")

chi_ex3_obs_table
diff --git a/c08-communicating-results.html b/c08-communicating-results.html
index 0436e2c4..91db34b2 100644
--- a/c08-communicating-results.html
+++ b/c08-communicating-results.html
@@ -593,23 +593,23 @@

8.3.1.1 Transitioning {srvyr} output
trust_gov_gt %>% 
   tab_caption("Example of gt table with trust in government estimate")
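The preceding chunk is elided in this diff; one plausible sketch of producing a {gt} table like trust_gov_gt from a {srvyr} summary (the design object and variable names are assumptions) is:

library(srvyr)
library(gt)

# Estimate the proportion trusting government, then style with gt
trust_gov_gt <- anes_des %>%                 # assumed survey design object
  group_by(TrustGovernment) %>%
  summarize(trust_gov_p = survey_prop()) %>%
  gt() %>%
  fmt_percent(columns = trust_gov_p, decimals = 1)

trust_gov_gt %>%
  tab_caption("Example of gt table with trust in government estimate")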
@@ -1082,23 +1082,23 @@

8.3.1.1 Transitioning {srvyr} output
decimals = 1)

trust_gov_gt2
@@ -1604,23 +1604,23 @@

Expanding tables using {gtsummary}
statistic = list(all_categorical() ~ "{p} ({p.std.error})"))

anes_des_gtsum
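Again, most of the chunk is elided; a sketch of building a {gtsummary} table directly from a survey design object, using the statistic specification shown above (the included variable is an assumption), might be:

library(gtsummary)

anes_des_gtsum <- anes_des %>%               # assumed survey design object
  tbl_svysummary(
    include = TrustGovernment,
    statistic = list(all_categorical() ~ "{p} ({p.std.error})")
  )

anes_des_gtsum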
@@ -2088,23 +2088,23 @@

Expanding tables using {gtsummary}
)

anes_des_gtsum2
@@ -2582,23 +2582,23 @@

Expanding tables using {gtsummary}
)

anes_des_gtsum3
@@ -3090,23 +3090,23 @@

Expanding tables using {gtsummary}
estimates and average age")

anes_des_gtsum4
@@ -3599,23 +3599,23 @@

Expanding tables using {gtsummary}
)

anes_des_gtsum5
diff --git a/c11-missing-data.html b/c11-missing-data.html
index 315dd337..2166d4df 100644
--- a/c11-missing-data.html
+++ b/c11-missing-data.html
@@ -614,8 +614,8 @@

11.3.2 Visualization of missing data
xlab("Voted for President in 2016")

## Scale for fill is already present.
 ## Adding another scale for fill, which will replace the existing scale.
-This chart has x-axis 'Voted for President in 2016' with labels Yes, No and NA and has y-axis 'Variable' with labels Age, AgeGroup, CampaignInterest, EarlyVote2020, Education, Gender, Income, Income7, PartyID, RaceEth, TrustGovernment, TrustPeople, VotedPres2016_selection, VotedPres2020 and VotedPres2020_selection. There is a legend indicating fill is used to show pct_miss, ranging from 0 represented by fill very pale blue to 100 shown as fill dark blue. Among those that voted for president in 2016, they had little missing for other variables (light color) but those that did not vote have more missing data in their 2020 voting patterns and their 2016 president selection.
+This chart has x-axis 'Voted for President in 2016' with labels Yes, No and NA and has y-axis 'Variable' with labels Age, AgeGroup, CampaignInterest, EarlyVote2020, Education, Gender, Income, Income7, PartyID, RaceEth, TrustGovernment, TrustPeople, VotedPres2016_selection, VotedPres2020 and VotedPres2020_selection. There is a legend indicating fill is used to show pct_miss, ranging from 0 represented by fill very pale blue to 100 shown as fill dark blue. Among those that voted for president in 2016, they had little missing for other variables (light color) but those that did not vote have more missing data in their 2020 voting patterns and their 2016 president selection.

FIGURE 11.2: Missingness in variables for each level of VotedPres2016 in the ANES 2020 data
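The plotting code is mostly elided in this diff, but the fill-scale message above suggests a second fill scale was layered onto a {naniar} plot; a hedged sketch that would produce a figure like 11.2 (the exact data frame and colors are assumptions) is:

library(ggplot2)
library(naniar)

# Percent missing per variable, split by 2016 voting status
gg_miss_fct(x = anes_2020, fct = VotedPres2016) +
  # Adding this fill scale triggers the "Scale for fill is already present" message
  scale_fill_gradient(low = "aliceblue", high = "darkblue") +
  ylab("Variable") +
  xlab("Voted for President in 2016")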

diff --git a/reference-keys.txt b/reference-keys.txt
index e9e1fa47..9c074ea7 100644
--- a/reference-keys.txt
+++ b/reference-keys.txt
@@ -34,7 +34,7 @@
 fig:results-plot3
 fig:results-plot4
 tab:apidata
 fig:missing-anes-vismiss
-fig:unnamed-chunk-244
+fig:unnamed-chunk-246
 fig:missing-recs-hist
 tab:missing-anes-shadow-tab
 tab:cb-incident
diff --git a/search_index.json b/search_index.json
index b5b8b73a..f199b54c 100644
--- a/search_index.json
+++ b/search_index.json
@@ -1 +1 @@
-[["index.html", "Exploring Complex Survey Data Analysis in R A Tidy Introduction with srvyr Preface", " Exploring Complex Survey Data Analysis in R A Tidy Introduction with srvyr Stephanie Zimmer, Rebecca J. Powell, and Isabella Velásquez 2024-03-11 Preface "],["c01-intro.html", "Chapter 1 Introduction 1.1 What to expect 1.2 Datasets used in this book", " Chapter 1 Introduction

Surveys are used to gather information about a population. They are frequently used by researchers, governments, and businesses to better understand public opinion and behavior. For example, a non-profit group might be interested in public opinion on a given topic, government agencies may be interested in behaviors to inform policy, or companies may survey potential consumers about what they want from their products. Developing and fielding a survey is a method to gather information about topics that interest us.

This book focuses on how to analyze the data collected from a survey. We assume that you have conducted a survey or obtained a microdata file. Microdata, also known as respondent-level or row-level data, contains individual survey responses, analysis weights, and design variables (as opposed to summarized data in tables). For the purposes of this book, you need the weights and design variables for your survey data. These are required to accurately calculate unbiased estimates1. Understanding the concepts and techniques discussed in this book will help you to extract meaningful insights from your survey data.

To account for the weights and study design, researchers rely on statistical software such as SAS, Stata, SUDAAN, and R. In this book, we will use R to provide an overview of survey analysis. Our goal is to provide a comprehensive guide for individuals who are new to survey analysis but who have some statistics and R programming background. We will use a combination of both the {survey} and {srvyr} packages and present the code following best practices from the tidyverse.

In 2003, the {survey} package was released on CRAN and has been continuously developed over time2. This package, primarily developed by Thomas Lumley, is extensive and includes the following features:
Calculation of point estimates and their associated variances, including means, totals, ratios, quantiles, and proportions
Estimation of regression models, including generalized linear models, log-linear models, and survival curves
Variances by Taylor linearization or by replicate weights (balanced repeated replication, jackknife, bootstrap, multistage bootstrap, or user-supplied)
Hypothesis testing for means, proportions, and more

The {srvyr} package in R builds on the {survey} package. It provides wrappers for functions that align with the tidyverse philosophy, which is our motivation for using and recommending this package. We find that the {srvyr} package is user-friendly for those familiar with tidyverse packages in R.
For example, while many functions in the {survey} package use variables as formulas, the {srvyr} package uses tidy selection to pass variable names3 (a common feature in the tidyverse). Users of the tidyverse are most likely familiar with the magrittr pipe (%>%), which works seamlessly with functions from the {srvyr} package. Moreover, several common functions from {dplyr}, such as filter(), mutate(), and summarize(), can be applied to survey objects. Users can streamline their analysis workflow and capitalize on the benefits of both the {srvyr} and tidyverse packages. There is one limitation to the {srvyr} package: it doesn’t fully incorporate the modeling capabilities of the {survey} package into its tidy versions. This book will use the {survey} package when discussing modeling and hypothesis testing. However, we will guide you on how to apply the pipe to these functions to ensure clarity in your analyses; a short sketch contrasting the two styles appears just after the chapter list below.

1.1 What to expect

This book will cover many aspects of survey design and analysis, from understanding how to create survey design objects to conducting descriptive analysis, statistical tests, and models. Additionally, we emphasize best practices in coding and presenting results. Throughout this book, we use real-world data and present practical examples to help you gain proficiency in survey analysis. While we provide a brief overview of survey methodology and statistical theory, this book is not intended to be the sole resource for these topics. We reference other materials throughout the book and encourage readers to seek those out for more information. Below is a summary of each chapter:

Chapter 2: An overview of surveys and the process of designing surveys. This is only an overview, and we include many references for more in-depth knowledge.
Chapter 3: Understanding survey documentation. How to read the various components of survey documentation, working with missing data, and finding the documentation.
Chapter 4: TO-DO
Chapter 5: Descriptive analyses. Calculating point estimates along with their standard errors, confidence intervals, and design effects.
Chapter 6: Statistical testing. Testing for differences between groups, including comparisons of means and proportions as well as goodness of fit tests, tests of independence, and tests of homogeneity.
Chapter 7: Modeling. Linear regression, ANOVA, and logistic regression.
Chapter 8: Communicating results. Describing results, reproducibility, making publishable tables and graphs, and helpful functions.
Chapter 9: TO-DO
Chapter 10: Specifying sampling designs. Descriptions of common sampling designs, when they are used, the math behind the mean and standard error estimates, how to specify the designs in R, and examples using real data.
Chapter 11: TO-DO
Chapter 12: TO-DO
Chapter 13: National Crime Victimization Survey Vignette. A vignette on how to analyze data from the NCVS, a survey in the U.S. that collects information on crimes and their characteristics. This illustrates an analysis that requires multiple files to calculate victimization rates.
Chapter 14: AmericasBarometer Vignette. A vignette on how to analyze data from the AmericasBarometer, a survey of attitudes, evaluations, experiences, and behavior in countries in the Western Hemisphere. This includes how to make choropleth maps with survey estimates.
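As promised above, here is a minimal sketch contrasting the two styles, assuming a survey design object anes_des built from the ANES data introduced in the next section:

library(survey)
library(srvyr)

# {survey}: variables are passed as formulas
svymean(~Age, design = anes_des, na.rm = TRUE)

# {srvyr}: tidy selection and the pipe
anes_des %>%
  summarize(mean_age = survey_mean(Age, na.rm = TRUE))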
1.2 Datasets used in this book

We work with two key datasets throughout the book: the Residential Energy Consumption Survey (RECS – U.S. Energy Information Administration 2023a) and the American National Election Studies (ANES – DeBell 2010). To ensure that all readers can follow the examples, we have provided analytic datasets in an R package, {srvyrexploR}. Install the package from GitHub using the {remotes} package.

remotes::install_github("tidy-survey-r/srvyrexploR")

To explore the provided datasets in the package, access the documentation using the help() command.

help(package="srvyrexploR")

To load the RECS and ANES datasets, start by running library(srvyrexploR) to load the package. Then, use the data() command to load the datasets into the environment.

library(tidyverse)
library(survey)
library(srvyr)
library(srvyrexploR)
data(recs_2020)
data(anes_2020)

RECS is a study that provides energy consumption and expenditures data on American households. The Energy Information Administration funds RECS, which has been fielded 15 times between 1950 and 2020. The survey has two components - the household survey and the energy supplier survey. In 2020, the household survey was collected by web and paper questionnaires and included questions about appliances, electronics, heating, air conditioning (A/C), temperatures, water heating, lighting, respondent demographics, and energy assistance. The energy supplier survey consists of components relating to energy consumption and energy expenditure. Below is an overview of the recs_2020 data:

recs_2020 %>% select(-starts_with("NWEIGHT"))
## # A tibble: 18,496 × 57
## DOEID ClimateRegion_BA Urbanicity Region REGIONC Division STATE_FIPS
## <dbl> <fct> <fct> <fct> <chr> <fct> <chr>
## 1 100001 Mixed-Dry Urban Area West WEST Mountai… 35
## 2 100002 Mixed-Humid Urban Area South SOUTH West So… 05
## 3 100003 Mixed-Dry Urban Area West WEST Mountai… 35
## 4 100004 Mixed-Humid Urban Area South SOUTH South A… 45
## 5 100005 Mixed-Humid Urban Area North… NORTHE… Middle … 34
## 6 100006 Hot-Humid Urban Area South SOUTH West So… 48
## 7 100007 Mixed-Humid Urban Area South SOUTH West So… 40
## 8 100008 Mixed-Humid Urban Clu… South SOUTH East So… 28
## 9 100009 Mixed-Humid Urban Area South SOUTH South A… 11
## 10 100010 Hot-Dry Urban Area West WEST Mountai… 04
## # ℹ 18,486 more rows
## # ℹ 50 more variables: state_postal <fct>, state_name <fct>,
## # HDD65 <dbl>, CDD65 <dbl>, HDD30YR <dbl>, CDD30YR <dbl>,
## # HousingUnitType <fct>, YearMade <ord>, TOTSQFT_EN <dbl>,
## # TOTHSQFT <dbl>, TOTCSQFT <dbl>, ZTOTSQFT_EN <fct>, ZYearMade <fct>,
## # ZHousingUnitType <fct>, SpaceHeatingUsed <lgl>,
## # ZSpaceHeatingUsed <fct>, ACUsed <lgl>, ZACUsed <fct>, …

recs_2020 %>% select(starts_with("NWEIGHT"))
## # A tibble: 18,496 × 61
## NWEIGHT NWEIGHT1 NWEIGHT2 NWEIGHT3 NWEIGHT4 NWEIGHT5 NWEIGHT6
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3284. 3273. 3349. 3345. 3437. 3416. 3355.
## 2 9007. 9020. 9081. 9020. 9213. 9117. 9179.
## 3 5669. 5793. 5914. 5763. 5870. 5721. 5663.
## 4 5294. 5361. 5362. 5371. 5393. 5328. 5354.
## 5 9935. 10048. 10262. 10037. 9961. 10108. 10298.
## 6 7250. 7339. 7435. 7336. 7426. 7309. 7472.
## 7 5684. 5733. 5831. 5664. 5874. 5906. 5734.
## 8 9700. 9725. 9774. 9658. 10221. 9792. 9906.
## 9 1236. 1268. 1263. 1258. 1250. 1267. 1252.
## 10 7084. 7131. 7287. 7184. 7176. 7198. 7174.
## # ℹ 18,486 more rows
## # ℹ 54 more variables: NWEIGHT7 <dbl>, NWEIGHT8 <dbl>, NWEIGHT9 <dbl>,
## # NWEIGHT10 <dbl>, NWEIGHT11 <dbl>, NWEIGHT12 <dbl>, NWEIGHT13 <dbl>,
## # NWEIGHT14 <dbl>, NWEIGHT15 <dbl>, NWEIGHT16 <dbl>, NWEIGHT17 <dbl>,
## # NWEIGHT18 <dbl>, NWEIGHT19 <dbl>, NWEIGHT20 <dbl>, NWEIGHT21 <dbl>,
## # NWEIGHT22 <dbl>, NWEIGHT23 <dbl>, NWEIGHT24 <dbl>, NWEIGHT25 <dbl>,
## # NWEIGHT26 <dbl>, NWEIGHT27 <dbl>, NWEIGHT28 <dbl>, …

From this output, we can see that there are 18,496 rows and 118 variables. We can see that there are variables containing an ID (DOEID), geographic information (e.g., Region, state_postal, Urbanicity), along with information about the house, including the type of house (HousingUnitType) and when the house was built (YearMade). Additionally, there is a long list of weighting variables that we will use in the analysis (e.g., NWEIGHT, NWEIGHT1, …, NWEIGHT60). We will discuss using these weighting variables in Chapter 10. For a more detailed codebook, see Appendix B.

The ANES is a series of studies that has collected data from election surveys since 1948. These surveys contain data on public opinion and voting behavior in U.S. presidential elections. The 2020 survey (the data we will be using) was fielded to individuals over the web, through live video interviewing, or with computer-assisted telephone interviewing (CATI). The survey includes questions on party affiliation, voting choice, and level of trust in the government. Here is an overview of the anes_2020 data. First, we show the variables starting with “V” followed by a number; these are the original variables. Then, we show you the remaining variables that we created based on the original data:

anes_2020 %>% select(matches("^V\\\\d"))
## # A tibble: 7,453 × 42
## V200001 V200002 V200010b V200010c V200010d V201006 V201024 V201025x
## <dbl> <hvn_lbl> <dbl> <dbl> <dbl> <hvn_l> <hvn_l> <hvn_lb>
## 1 200015 3 1.01 2 9 2 -1 3
## 2 200022 3 1.16 2 26 3 -1 3
## 3 200039 3 0.769 1 41 2 -1 3
## 4 200046 3 0.521 2 29 3 -1 3
## 5 200053 3 0.966 1 23 2 -1 3
## 6 200060 3 0.235 2 37 1 -1 3
## 7 200084 3 0.441 1 7 2 -1 3
## 8 200091 3 0.769 2 37 3 -1 2
## 9 200107 3 1.42 2 32 2 2 4
## 10 200114 3 1.84 2 41 2 -1 3
## # ℹ 7,443 more rows
## # ℹ 34 more variables: V201029 <hvn_lbll>, V201101 <hvn_lbll>,
## # V201102 <hvn_lbll>, V201103 <hvn_lbll>, V201228 <hvn_lbll>,
## # V201229 <hvn_lbll>, V201230 <hvn_lbll>, V201231x <hvn_lbll>,
## # V201233 <hvn_lbll>, V201237 <hvn_lbll>, V201507x <hvn_lbll>,
## # V201510 <hvn_lbll>, V201546 <hvn_lbll>, V201547a <hvn_lbll>,
## # V201547b <hvn_lbll>, V201547c <hvn_lbll>, V201547d <hvn_lbll>, …

anes_2020 %>% select(-matches("^V\\\\d"))
## # A tibble: 7,453 × 21
## CaseID InterviewMode Weight VarUnit Stratum CampaignInterest
## <dbl> <fct> <dbl> <fct> <fct> <fct>
## 1 200015 Web 1.01 2 9 Somewhat interested
## 2 200022 Web 1.16 2 26 Not much interested
## 3 200039 Web 0.769 1 41 Somewhat interested
## 4 200046 Web 0.521 2 29 Not much interested
## 5 200053 Web 0.966 1 23 Somewhat interested
## 6 200060 Web 0.235 2 37 Very much interested
## 7 200084 Web 0.441 1 7 Somewhat interested
## 8 200091 Web 0.769 2 37 Not much interested
## 9 200107 Web 1.42 2 32 Somewhat interested
## 10 200114 Web 1.84 2 41 Somewhat interested
## # ℹ 7,443 more rows
## # ℹ 15 more variables: VotedPres2016 <fct>,
## # VotedPres2016_selection <fct>, PartyID <fct>,
## # TrustGovernment <fct>, TrustPeople <fct>, Age <dbl>,
## # AgeGroup <fct>, Education <fct>, RaceEth <fct>, Gender <fct>,
## # Income <fct>, Income7 <fct>, VotedPres2020 <fct>,
## # VotedPres2020_selection <fct>, EarlyVote2020 <fct>

From this output we can see that there are 7,453 rows and 63 variables. Most of the variables start with V20, so referencing the survey documentation will be crucial to avoid getting lost (see Chapter 3). We have created some more descriptive variables for you to use throughout this book, such as the age (Age) and gender (Gender) of the respondent, along with variables that represent their party affiliation (PartyID). Additionally, we need the variables Weight and Stratum to analyze this data accurately. We will discuss how to use these weighting variables in Chapters 3 and 10. For a more detailed codebook, see Appendix A.

In most chapters, you’ll find code that you can follow. Each of these chapters starts with a “setup” section. The setup section includes the code needed to load the necessary packages and datasets in the chapter. We then provide the main idea of the chapter and examples on how to use the functions. Most chapters end with exercises to work through. Solutions to the exercises can be found in the Appendix.

References
DeBell, Matthew. 2010. “How to Analyze ANES Survey Data.” ANES Technical Report Series nes012492. Palo Alto, CA: Stanford University; Ann Arbor, MI: the University of Michigan; https://electionstudies.org/wp-content/uploads/2018/05/HowToAnalyzeANESData.pdf.
U.S. Energy Information Administration. 2023a. “2020 Residential Energy Consumption Survey: Household Characteristics Technical Documentation Summary.” https://www.eia.gov/consumption/residential/data/2020/pdf/2020%20RECS_Methodology%20Report.pdf.
Valliant, Richard, and Jill A. Dever. 2018. Survey Weights: A Step-by-Step Guide to Calculation. Stata Press.
If you do not already have weights created for the survey data you are using, we recommend reviewing other resources focused on weight creation such as Valliant and Dever (2018)↩︎
https://cran.r-project.org/src/contrib/Archive/survey/↩︎
https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html↩︎

"],["c02-overview-surveys.html", "Chapter 2 Overview of Surveys 2.1 Introduction 2.2 Searching for public-use survey data 2.3 Pre-Survey Planning 2.4 Study Design 2.5 Data Collection 2.6 Post-Survey Processing 2.7 Post-survey data analysis and reporting", " Chapter 2 Overview of Surveys

2.1 Introduction

Developing surveys to gather accurate information about populations involves a more intricate and time-intensive process compared to surveys that use non-random criteria for selecting samples. Researchers can spend months, or even years, developing the study design, questions, and other methods for a single survey to ensure high-quality data is collected. While this book focuses on the analysis methods of complex surveys, understanding the entire survey life cycle can provide a better insight into what types of analyses should be conducted on the data. The survey life cycle consists of the necessary stages to execute a survey project successfully. Each stage influences the survey’s timing, costs, and feasibility, consequently impacting the data collected and how we should analyze it. The survey life cycle starts with a research topic or question of interest (e.g., what impact does childhood trauma have on health outcomes later in life).
Researchers typically review existing data sources to determine if data are already available that can answer this question, as drawing from available resources can result in a reduced burden on respondents, cheaper research costs, and faster research outcomes. However, if existing data cannot answer the nuances of the research question, a survey can be used to capture the exact data that the researcher needs through a questionnaire, or a set of questions. To gain a deeper understanding of survey design and implementation, we recommend reviewing several pieces of existing literature in detail (e.g., Dillman, Smyth, and Christian 2014; Groves et al. 2009; Tourangeau, Rips, and Rasinski 2000; Bradburn, Sudman, and Wansink 2004; Valliant, Dever, and Kreuter 2013; Biemer and Lyberg 2003).

2.2 Searching for public-use survey data

Throughout this book, we use public-use datasets from different surveys, including the American National Election Studies (ANES), the Residential Energy Consumption Survey (RECS), the National Crime Victimization Survey (NCVS), and the AmericasBarometer surveys. As mentioned above, researchers should look for existing data that can provide insights into their research questions before embarking on a new survey. One of the greatest sources of data is the government. For example, in the U.S., we can get data directly from the various statistical agencies, as with RECS and NCVS. Other countries often have data available through official statistics offices, such as the Office for National Statistics in the United Kingdom. In addition to government data, many researchers will make their data publicly available through repositories such as the Inter-university Consortium for Political and Social Research (ICPSR) variable search or the Odum Institute Data Archive. Searching these repositories or other compiled lists (e.g., Analyze Survey Data for Free - asdfree.com) can be an efficient way to identify surveys with questions related to the researcher’s topic of interest.

2.3 Pre-Survey Planning

There are multiple things to consider when starting a survey. Errors are the differences between the true values of the variables being studied and the values obtained through the survey. Each step and decision made before the launch of the survey impacts the types of errors that are introduced into the data, which in turn impacts how to interpret the results. Generally, survey researchers consider there to be seven main sources of error that fall under either Representation or Measurement (Groves et al. 2009):

Representation
Coverage Error: A mismatch between the population of interest (also known as the target population or study population) and the sampling frame.
Sampling Error: Error produced when selecting a sample, the subset of the population, from the sampling frame, the list from which the sample is drawn (there is no sampling error if conducting a census). This error is due to randomization, and we discuss how to quantify this error in Chapter 10.
Nonresponse Error: Differences between those who responded and did not respond to the survey (unit nonresponse) or a given question (item nonresponse).
Adjustment Error: Error introduced during post-survey statistical adjustments.
Measurement
Validity: A mismatch between the topic of interest and the question(s) used to collect that information.
Measurement Error: A mismatch between what the researcher asked and how the respondent answered.
Processing Error: Edits by the researcher to responses provided by the respondent (e.g., adjustments to data based on illogical responses).

Almost every survey has errors. Researchers attempt to conduct a survey that reduces the total survey error, or the accumulation of all errors that may arise throughout the survey life cycle. By assessing these different types of errors together, researchers can seek strategies to maximize the overall survey quality and improve the reliability and validity of results (Biemer 2010). However, attempts to lower individual error sources (and therefore total survey error) come at the price of time, resources, and money. For example:

Coverage Error Tradeoff: Researchers can search for or create more accurate and updated sampling frames, but they can be difficult to construct or obtain.
Sampling Error Tradeoff: Researchers can increase the sample size to reduce sampling error; however, larger samples can be expensive and time-consuming to field.
Nonresponse Error Tradeoff: Researchers can increase or diversify efforts to improve survey participation, but this may be resource-intensive while not entirely removing nonresponse bias.
Adjustment Error Tradeoff: Weighting, or a statistical technique used to adjust the contribution of individual survey responses to the final survey estimates, is typically done to make the sample more representative of the target population. However, if researchers do not carefully execute the adjustments or base them on inaccurate information, they can introduce new biases, leading to less accurate estimates.
Validity Error Tradeoff: Researchers can increase validity in a variety of ways, such as extensive research, using established scales, or collaborating with a psychometrician during survey design. However, doing so lengthens the amount of time and resources needed to complete survey design.
Measurement Error Tradeoff: Researchers can use techniques such as questionnaire testing and cognitive interviewing to ensure respondents are answering questions as expected. However, these activities also require time and resources to complete.
Processing Error Tradeoff: Researchers can impose rigorous data cleaning and validation processes. However, this requires supervision, training, and time.

The challenge for survey researchers is to find the optimal tradeoffs among these errors. They must carefully consider ways to reduce each error source and total survey error while balancing their study’s objectives and resources. For survey analysts, understanding the decisions that researchers took to minimize these error sources can impact how results are interpreted. The remainder of this chapter dives into critical considerations for survey development. We explore how to consider each of these sources of error and how these error sources can inform the interpretations of the data.

2.4 Study Design

From formulating methodologies to choosing an appropriate sampling frame, the study design phase is where the blueprint for a successful survey takes shape. Study design encompasses multiple parts of the survey life cycle, including decisions on the population of interest, survey mode (the format through which a survey is administered to respondents), timeline, and questionnaire design. Knowing whom to survey and how to reach them depends on the study’s goals and the feasibility of implementation. This section explores the strategic planning that lays the foundation for a survey.
2.4.1 Sampling Design

The set or group we want to survey is known as the population of interest. The population of interest could be broad, such as “all adults age 18+ living in the U.S.” or a specific population based on a particular characteristic or location. For example, we may want to know about “adults aged 18-24 who live in North Carolina” or “eligible voters living in Illinois.” However, a sampling frame with contact information is needed to survey individuals in these populations of interest. If researchers are looking at eligible voters, the sampling frame could be the voting registry for a given state or area. The sampling frame is likely imperfect for broader target populations like all adults in the United States. In these cases, researchers may choose to use a sampling frame of mailing addresses and send the survey to households, or they may choose to use random digit dialing (RDD) and call random phone numbers (that may or may not be assigned, connected, and working). These imperfect sampling frames can result in coverage error where there is a mismatch between the target population and the list of individuals researchers can select. For example, if a researcher is looking to obtain estimates for “all adults aged 18+ living in the U.S.”, a sampling frame of mailing addresses will miss specific types of individuals, such as the homeless, transient populations, and incarcerated individuals. Additionally, many households have more than one adult living there, so researchers would need to consider how to get a specific individual to fill out the survey (called within household selection) or adjust the target population to report on “U.S. households” instead of “individuals.”

Once the researchers have selected the sampling frame, the next step is determining how to select individuals for the survey. In rare cases, researchers may conduct a census and survey everyone on the sampling frame. However, the ability to implement a questionnaire at that scale is something only a few can do (e.g., government censuses). Instead, researchers typically choose to sample individuals and use weights to estimate numbers in the target population. They can use a variety of different sampling methods, and more information on these can be found in Chapter 10. This decision of which sampling method to use impacts sampling error and can be accounted for in weighting.

Example: Number of Pets in a Household

Let’s use a simple example where a researcher is interested in the average number of pets in a household. Our researcher needs to consider the target population for this study. Specifically, are they interested in all households in a given country or households in a more local area (e.g., city or state)? Let’s assume our researcher is interested in the number of pets in a U.S. household with at least one adult (18 years old or older). In this case, a sampling frame of mailing addresses would provide the least coverage error as the frame would closely match our target population. Specifically, our researcher would likely want to use the Computerized Delivery Sequence File (CDSF), which is a file of mailing addresses that the United States Postal Service (USPS) creates and covers nearly 100% of U.S. households (Harter et al. 2016). To sample these households, for simplicity, we use a stratified simple random sample design, where we randomly sample households within each state (i.e., we stratify by state). Throughout this chapter, we build on this example research question to plan a survey.
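To make the stratified design concrete, a small sketch of drawing a stratified simple random sample with {dplyr} is shown below; the frame object (an address list with a state column) is hypothetical:

library(dplyr)

set.seed(2020)

sampled_households <- frame %>%
  group_by(state) %>%        # stratify by state
  slice_sample(n = 100) %>%  # simple random sample within each stratum
  ungroup()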
2.4.2 Data Collection Planning

With the sampling design decided, researchers can then decide how to survey these individuals. Specifically, the modes used for contacting and surveying the sample, how frequently to send reminders and follow-ups, and the overall timeline of the study are four of the major data collection determinations. Traditionally, researchers have considered four main modes4:

Computer Assisted Personal Interview (CAPI; also known as face-to-face or in-person interviewing)
Computer Assisted Telephone Interview (CATI; also known as phone or telephone interviewing)
Computer Assisted Web Interview (CAWI; also known as web or online interviewing)
Paper and Pencil Interview (PAPI)

Researchers can use a single mode to collect data or multiple modes (also called mixed modes). Using mixed modes can allow for broader reach and increase response rates depending on the target population (DeLeeuw 2005, 2018; Biemer et al. 2017). For example, researchers could both call households to conduct a CATI survey and send mail with a PAPI survey to the household. Using both modes, researchers could gain participation through the mail from individuals who do not pick up the phone to unknown numbers or through the phone from individuals who do not open all of their mail. However, mode effects (where responses differ based on the mode of response) can be present in the data and may need to be considered during analysis.

When selecting which mode, or modes, to use, understanding the unique aspects of the chosen target population and sampling frame provides insight into how they can best be reached and engaged. For example, if we plan to survey adults aged 18-24 who live in North Carolina, asking them to complete a survey using CATI (i.e., over the phone) would likely not be as successful as other modes like the web. This age group does not talk on the phone as much as other generations and often does not answer their phones for unknown numbers. Additionally, the mode for contacting respondents relies on what information is available in the sampling frame. For example, if our sampling frame includes an email address, we could email our selected sample members to convince them to complete a survey. Alternatively, if the sampling frame is a list of mailing addresses, we could contact sample members with a letter.

It is important to note that there can be a difference between the contact and survey modes. For example, if we have a sampling frame with addresses, we can send a letter to our sample members and provide information on completing a web survey. Another option is using mixed-mode surveys by sending sample members a paper and pencil survey with our letter and also asking them to complete the survey online. Combining different contact modes and different survey modes can be helpful in reducing unit nonresponse error–where the entire unit (e.g., a household) does not respond to the survey at all–as different sample members may respond better to different contact and survey modes. However, when considering which modes to use, it is important to make access to the survey as easy as possible for sample members to reduce burden and unit nonresponse.

Another way to reduce unit nonresponse error is by varying the language of the contact materials (Dillman, Smyth, and Christian 2014). People are motivated by different things, so constantly repeating the same message may not be helpful.
Instead, mixing up the messaging and the type of contact material the sample member receives can increase response rates and reduce the unit nonresponse error. For example, instead of only sending standard letters, researchers could consider sending mailings that invoke “urgent” or “important” thoughts by sending priority letters or using other delivery services like FedEx, UPS, or DHL.

A study timeline may also determine the number and types of contacts. If the timeline is long, there is plentiful time for follow-ups and diversified messages in contact materials. If the timeline is short, then fewer follow-ups can be implemented. Many studies start with the tailored design method put forth by Dillman, Smyth, and Christian (2014) and implement five contacts:

Prenotification (Prenotice) letting sample members know the survey is coming
Invitation to complete the survey
Reminder that also thanks the respondents that may have already completed the survey
Reminder (with a replacement paper survey if needed)
Final reminder

This method is easily adaptable based on the study timeline and needs but provides a starting point for most studies.

Example: Number of Pets in a Household

Let’s return to our example of a researcher who wants to know the average number of pets in a household. We are using a sampling frame of mailing addresses, so we recommend starting our data collection with letters mailed to households, but later in data collection, we want to send interviewers to the house to conduct an in-person (or CAPI) interview to decrease unit nonresponse error. This means we have two contact modes (paper and in-person). As mentioned above, the survey mode does not have to be the same as the contact mode, so we recommend a mixed-mode study with both Web and CAPI modes. Let’s assume we have six months for data collection, so we may want to recommend the following protocol:

Protocol Example for 6-month Web and CAPI Data Collection

Week | Contact Mode                   | Contact Message                 | Survey Mode Offered
-----|--------------------------------|---------------------------------|--------------------
1    | Mail: Letter                   | Prenotice                       | —
2    | Mail: Letter                   | Invitation                      | Web
3    | Mail: Postcard                 | Thank You/Reminder              | Web
6    | Mail: Letter in large envelope | Animal Welfare Discussion       | Web
10   | Mail: Postcard                 | Inform Upcoming In-Person Visit | Web
14   | In-Person Visit                | —                               | CAPI
16   | Mail: Letter                   | Reminder of In-Person Visit     | Web, but includes a number to call to schedule CAPI
20   | In-Person Visit                | —                               | CAPI
25   | Mail: Letter in large envelope | Survey Closing Notice           | Web, but includes a number to call to schedule CAPI

This is just one possible protocol that we can use that starts respondents with the web (typically done to reduce costs). However, researchers may want to begin in-person data collection earlier during the data collection period or ask their interviewers to attempt more than two visits with a household.
Knowing the details of the sampling plan and what may impact coverage error and sampling error can help researchers determine what types of demographics to include. Researchers can benefit from the work of others by using questions from other surveys. Demographic sections such as race, ethnicity, or education borrow questions from a government census or other official surveys. Question banks such as the Inter-university Consortium for Political and Social Research (ICPSR) variable search can provide additional potential questions. If a question does not exist in a question bank, researchers can craft their own. When developing survey questions, researchers should start with the research topic and attempt to write questions that match the concept. The closer the question asked is to the overall concept, the better validity there is. For example, if the researcher wants to know how people consume T.V. series and movies but only asks a question about how many T.V.s are in the house, then they would be missing other ways that people watch T.V. series and movies, such as on other devices or at places outside of the home. As mentioned above, researchers can employ techniques to increase the validity of their questionnaires. For example, questionnaire testing involves piloting the survey instrument to identify and fix potential issues before conducting the main survey. Additionally, researchers could conduct cognitive interviews – a technique where researchers walk through the survey with participants, encouraging them to speak their thoughts out loud to uncover how they interpret and understand survey questions. Additionally, when designing questions, researchers should consider the mode for the survey and adjust the language appropriately. In self-administered surveys (e.g., web or mail), respondents can see all the questions and response options, but that is not the case in interviewer-administered surveys (e.g., CATI or CAPI). With interviewer-administered surveys, the response options must be read aloud to the respondents, so the question may need to be adjusted to create a better flow to the interview. Additionally, with self-administered surveys, because the respondents are viewing the questionnaire, the formatting of the questions is even more critical to ensure accurate measurement. Incorrect formatting or wording can result in measurement error, so following best practices or using existing validated questions can reduce error. There are multiple resources to help researchers draft questions for different modes (e.g., Dillman, Smyth, and Christian 2014; Fowler and Mangione 1989; Bradburn, Sudman, and Wansink 2004; Tourangeau, Couper, and Conrad 2004). Example: Number of Pets in a Household As part of our survey on the average number of pets in a household, researchers may want to know what animal most people prefer to have as a pet. Let’s say we have the following question in our survey: FIGURE 2.1: Example Question Asking Pet Preference Type This question may have validity issues as it only provides the options of “dogs” and “cats” to respondents, and the interpretation of the data could be incorrect. For example, if we had 100 respondents who answered the question and 50 selected dogs, then the results of this question cannot be “50% of the population prefers to have a dog as a pet,” as only two response options were provided. 
If a respondent taking our survey prefers turtles, they could either be forced to choose a response between these two (i.e., interpret the question as “between dogs and cats, which do you prefer?” and result in measurement error), or they may not answer the question (which results in item nonresponse error). Based on this, the interpretation of this question should be, “When given a choice between dogs and cats, 50% of respondents preferred to have a dog as a pet.”

To avoid this issue, researchers should consider these possibilities and adjust the question accordingly. One simple way could be to add an “other” response option to give respondents a chance to provide a different response. The “other” response option could then include a way for respondents to write their other preference. For example, we could rewrite this question as:

FIGURE 2.2: Example Question Asking Pet Preference Type with Other Specify Option

Researchers can then code the responses from the open-ended box and get a better understanding of the respondent’s choice of preferred pet. Interpreting this question becomes easier as researchers no longer need to qualify the results with the choices provided. This is a simple example of how the presentation of the question and options can impact the findings. For more complex topics and questions, researchers must thoroughly consider how to mitigate any impacts from the presentation, formatting, wording, and other aspects. As survey analysts, reviewing not only the data but also the wording of the questions is crucial to ensure the results are presented in a manner consistent with the question asked. Chapter 3 provides further details on how to review existing survey documentation to inform our analyses.

2.5 Data Collection

Once the data collection starts, researchers try to stick to the data collection protocol designed during pre-survey planning. However, effective researchers adjust their plans and adapt as needed to the current progress of data collection (Schouten, Peytchev, and Wagner 2018). Some extreme examples could be natural disasters that could prevent mail or interviewers from getting to the sample members. Others could be smaller, such as something newsworthy occurring that is connected to the survey, so researchers could choose to play this up in communication materials. In addition to these external factors, there could be factors unique to the survey, such as lower response rates for a specific sub-group, so the data collection protocol may need to find ways to improve response rates for that specific group.

2.6 Post-Survey Processing

After data collection, various activities need to be completed before we can analyze the survey. Multiple decisions made during this post-survey phase can assist researchers in reducing different error sources, such as through weighting to account for the sample selection. Knowing the decisions researchers made in creating the final analytic data can impact how analysts use the data and interpret the results.

2.6.1 Data Cleaning and Imputation

Post-survey cleaning and imputation is one of the first steps researchers take to get the survey responses into a dataset for use by analysts. Data cleaning can consist of cleaning inconsistent data (e.g., resolving skip pattern errors or checking that related questions throughout the survey are consistent with each other), editing numeric entries or open-ended responses for grammar and consistency, or recoding open-ended questions into categories for analysis.
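For example, a minimal sketch of recoding hypothetical write-in responses (foreshadowing the pet example below) could look like:

library(dplyr)

responses <- responses %>%   # hypothetical data frame of write-ins
  mutate(pet_category = case_when(
    other_specify == "puppy"                         ~ "Dog",
    other_specify %in% c("rabit", "rabbit", "bunny") ~ "Bunny or Rabbit",
    TRUE                                             ~ "Other"
  ))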
There is no universal set of fixed rules that every project must adhere to. Instead, each project or research study should establish its own guidelines and procedures for handling various cleaning scenarios based on its specific objectives. Researchers should use their best judgment to ensure data integrity, and all decisions should be documented and available to those using the data in the analysis. Each decision a researcher makes impacts processing error, so often, researchers have multiple people review these rules or recode open-ended data and adjudicate any differences in an attempt to reduce this error.

Another crucial step in post-survey processing is imputation. Often, there is item nonresponse where respondents do not answer specific questions. If the questions are crucial to analysis efforts or the research question, researchers may implement imputation to reduce item nonresponse error. Imputation is a technique for replacing missing or incomplete data values with estimated values. However, as imputation is a way of assigning a value to missing data based on an algorithm or model, it can also introduce processing error, so researchers should consider the overall implications of imputing data compared to having item nonresponse. There are multiple ways to impute data. We recommend reviewing other resources like Kim and Shao (2021) for more information.

Example: Number of Pets in a Household

Let’s return to the question we created to ask about animal preference. The “other specify” option invites respondents to specify the type of animal they prefer to have as a pet. If respondents entered answers such as “puppy,” “turtle,” “rabit,” “rabbit,” “bunny,” “ant farm,” “snake,” “Mr. Purr,” then researchers may wish to categorize these write-in responses to help with analysis. In this example, “puppy” could be assumed to be a reference to a “Dog”, and could be recoded there. The misspelling of “rabit” could be coded along with “rabbit” and “bunny” into a single category of “Bunny or Rabbit”. These are relatively standard decisions that a researcher could make. The remaining write-in responses could be categorized in a few different ways. “Mr. Purr,” which may be someone’s reference to their own cat, could be recoded as “Cat”, or it could remain as “Other” or some category that is “Unknown”. Depending on the number of responses related to each of the others, they could all be combined into a single “Other” category, or maybe categories such as “Reptiles” or “Insects” could be created. Each of these decisions may impact the interpretation of the data, so our researchers should document the types of responses that fall into each of the new categories and any decisions made.

2.6.2 Weighting

We can address some of the error sources identified in the previous sections using weighting. For example, weights can address coverage, sampling, and nonresponse errors. Many published surveys include an “analysis weight” variable that combines these adjustments. However, weighting itself can also introduce adjustment error, so researchers need to balance which types of errors should be corrected with weighting. The construction of weights is outside the scope of this book, and researchers should reference other materials if interested in constructing their own (Valliant and Dever 2018). Instead, this book assumes the survey has been completed, weights are constructed, and data is available to users.
We walk users through how to read the documentation (Chapter 3) and work with the data and analysis weights provided to analyze and interpret survey results correctly.

Example: Number of Pets in a Household

In the simple example of our survey, we decided to use a stratified sample by state to select our sample members. Knowing this sampling design, our researcher can include selection weights for analysis that account for how the sample members were selected for the survey. Additionally, the sampling frame may have the type of building associated with each address, so we could include the building type as a potential nonresponse weighting variable, along with some interviewer observations that may be related to our research topic of the average number of pets in a household. Combining these weights, we can create an analytic weight that researchers need to use when analyzing the data.

2.6.3 Disclosure

Before data is released publicly, researchers need to ensure that individual respondents cannot be identified by the data when confidentiality is required. There are a variety of different methods that can be used, including data swapping, top or bottom coding, coarsening, and perturbation. In data swapping, researchers may swap specific data values across different respondents so that it does not impact insights from the data but ensures that specific individuals cannot be identified. We can use top and bottom coding to mask extreme values. For example, researchers may top-code income values such that households with income greater than $500,000 are coded into a single category of “$500,000 or more”. Other disclosure methods may include aggregating response categories or location information to avoid having only a few respondents in a given group who could thus be identified. For example, researchers may use coarsening to display income in categories instead of as a continuous variable. We can also perturb the data by adding random noise. There is as much art as there is science to the methods used for disclosure. In the survey documentation, researchers should only provide high-level comments about the disclosure and not specific details. This ensures nobody can reverse the disclosure and thus identify individuals. For more information on different disclosure methods, please see Skinner (2009) and AAPOR Standards.

2.6.4 Documentation

Documentation is a critical step of the survey life cycle. Researchers systematically record all the details, decisions, procedures, and methodologies to ensure transparency, reproducibility, and the overall quality of survey research. Proper documentation allows analysts to understand, reproduce, and evaluate the study’s methods and findings. Chapter 3 dives into how analysts should use survey data documentation.

2.7 Post-survey data analysis and reporting

After completing the survey life cycle, the data is ready for analysts to use. The rest of this book continues from this point. For more information on the survey life cycle, please explore the references cited throughout this chapter.

References
Biemer, Paul P. 2010. “Total Survey Error: Design, Implementation, and Evaluation.” Public Opinion Quarterly 74 (5): 817–48. https://doi.org/10.1093/poq/nfq058.
Biemer, Paul P., and Lars E. Lyberg. 2003. Introduction to Survey Quality. John Wiley & Sons.
Biemer, Paul P., Joe Murphy, Stephanie Zimmer, Chip Berry, Grace Deng, and Katie Lewis. 2017. “Using Bonus Monetary Incentives to Encourage Web Response in Mixed-Mode Household Surveys.” Journal of Survey Statistics and Methodology 6 (2): 240–61. https://doi.org/10.1093/jssam/smx015.
Bradburn, Norman M., Seymour Sudman, and Brian Wansink. 2004. Asking Questions: The Definitive Guide to Questionnaire Design. 2nd Edition. Jossey-Bass.
DeLeeuw, Edith D. 2005. “To Mix or Not to Mix Data Collection Modes in Surveys.” Journal of Official Statistics 21: 233–55.
———. 2018. “Mixed-Mode: Past, Present, and Future.” Survey Research Methods 12 (2): 75–89. https://doi.org/10.18148/srm/2018.v12i2.7402.
Dillman, Don A, Jolene D Smyth, and Leah Melani Christian. 2014. Internet, Phone, Mail, and Mixed-Mode Surveys: The Tailored Design Method. John Wiley & Sons.
Fowler, Floyd J, and Thomas W. Mangione. 1989. Standardized Survey Interviewing. SAGE.
Groves, Robert M, Floyd J Fowler Jr, Mick P Couper, James M Lepkowski, Eleanor Singer, and Roger Tourangeau. 2009. Survey Methodology. John Wiley & Sons.
Harter, Rachel, Michael P Battaglia, Trent D Buskirk, Don A Dillman, Ned English, Mansour Fahimi, Martin R Frankel, et al. 2016. “Address-Based Sampling.” Task force report. American Association for Public Opinion Research; https://aapor.org/wp-content/uploads/2022/11/AAPOR_Report_1_7_16_CLEAN-COPY-FINAL-2.pdf.
Kim, Jae Kwang, and Jun Shao. 2021. Statistical Methods for Handling Incomplete Data. Chapman & Hall/CRC Press.
Schouten, Barry, Andy Peytchev, and James Wagner. 2018. Adaptive Survey Design. Chapman & Hall/CRC Press.
Skinner, Chris. 2009. “Chapter 15: Statistical Disclosure Control for Survey Data.” In Handbook of Statistics: Sample Surveys: Design, Methods and Applications, edited by C. R. Rao, 381–96. Elsevier B.V.
Tourangeau, Roger, Mick P. Couper, and Frederick Conrad. 2004. “Spacing, Position, and Order: Interpretive Heuristics for Visual Features of Survey Questions.” Public Opinion Quarterly 68: 368–93.
Tourangeau, Roger, Lance J. Rips, and Kenneth Rasinski. 2000. Psychology of Survey Response. Cambridge University Press.
Valliant, Richard, and Jill A. Dever. 2018. Survey Weights: A Step-by-Step Guide to Calculation. Stata Press.
Valliant, Richard, Jill A Dever, and Frauke Kreuter. 2013. Practical Tools for Designing and Weighting Survey Samples. Vol. 1. Springer.
Other modes such as using mobile apps or text messaging can also be considered, but at the time of publication, have smaller reach or are better for longitudinal studies (i.e., surveying the same individuals over many time periods of a single study).↩︎

"],["c03-understanding-survey-data-documentation.html", "Chapter 3 Understanding Survey Data Documentation 3.1 Introduction 3.2 Types of survey documentation 3.3 Missing data coding 3.4 Example: American National Election Studies (ANES) 2020 Survey Documentation", " Chapter 3 Understanding Survey Data Documentation

3.1 Introduction

Survey documentation helps us prepare before we look at the actual survey data. The documentation includes technical guides, questionnaires, codebooks, errata, and other useful resources. By taking the time to review these materials, we can gain a comprehensive understanding of the survey data (including research and design decisions discussed in Chapters 2 and 10) and conduct our analysis more effectively. Survey documentation can vary in organization, type, and ease of use. The information may be stored in any format - PDFs, Excel spreadsheets, Word documents, and so on.
Some surveys bundle documentation together, such as providing the codebook and questionnaire in a single document. Others keep them in separate files. Despite these variations, we can gain a general understanding of the documentation types and what aspects to focus on in each.

3.2 Types of survey documentation

3.2.1 Technical documentation
The technical documentation, also known as user guides or methodology/analysis guides, highlights the variables necessary to specify the survey design. We recommend concentrating on these key sections:

Introduction: The introduction orients us to the survey. This section provides the project's background, the study's purpose, and the main research questions.
Study design: The study design section describes how researchers prepared and administered the survey.
Sample: The sample section describes the sample frame, any known sampling errors, and the limitations of the sample. This section can contain recommendations on how to use sampling weights. Look for information on the weights and on whether the survey design contains strata, clusters/PSUs, or replicate weights. Also look for population sizes, finite population corrections, or replicate weight scaling information. Additional detail on sample designs is available in Chapter 10.
Notes on fielding: Any additional notes on fielding, such as response rates, may be found in the technical documentation.

The technical documentation may include other helpful resources. Some technical documentation includes syntax for SAS, SUDAAN, Stata, and/or R, so we do not have to create this code from scratch.

3.2.2 Questionnaires
A questionnaire is a series of questions used to collect information from people in a survey. It can ask about opinions, behaviors, demographics, or even just numbers like the count of lightbulbs, square footage, or farm size. Questionnaires can employ different types of questions, such as closed-ended (e.g., select one or check all that apply), open-ended (e.g., numeric or text), Likert scales (e.g., a 5- or 7-point scale specifying a respondent's level of agreement to a statement), or ranking questions (e.g., a list of options that a respondent ranks by preference). A questionnaire may randomize the display order of responses or include instructions that help respondents understand the questions. A survey may have one questionnaire or multiple, depending on its scale and scope.

The questionnaire is another important resource for understanding and interpreting the survey data (see Section 2.4.3), and we should use it alongside any analysis. It provides details about each of the questions asked in the survey, such as the question name, question wording, response options, skip logic, randomizations, display specifications, mode differences, and the universe (the subset of respondents that were asked a question).

Below, in Figure 3.1, we show an example from the ANES 2020 questionnaire (American National Election Studies 2021). The figure shows the question's name (POSTVOTE_RVOTE), description (Did R Vote?), full wording of the question and responses, response order, universe, question logic (this question was only asked if vote_pre = 0), and other specifications. The section also includes the variable name, which we can link to the codebook.

FIGURE 3.1: ANES 2020 Questionnaire Example

The content and structure of questionnaires vary depending on the specific survey. For instance, question names may be informative (like the ANES example above), sequential, or denoted by a code.
In some cases, surveys may not use separate names for questions and variables. Figure 3.2 shows an example from the Behavioral Risk Factor Surveillance System (BRFSS) questionnaire with a sequential question number and a coded variable name (as opposed to a question name) (Centers for Disease Control and Prevention (CDC) 2021).

FIGURE 3.2: BRFSS 2021 Questionnaire Example

We should factor in the details of a survey when conducting our analyses. For example, surveys that use various modes (e.g., web and mail) may have differences in question wording or skip logic, as web surveys can include fills or automate skip logic. These variations could warrant separate analyses for each mode.

3.2.3 Codebooks
While a questionnaire provides information about the questions posed to respondents, the codebook explains how the survey data was coded and recorded. It lists details such as variable names, variable labels, variable meanings, codes for missing data, value labels, and value types (whether categorical, continuous, etc.). The codebook helps us understand and use the variables appropriately in our analysis. In particular, the codebook (as opposed to the questionnaire) often includes information on missing data. Note that the term data dictionary is sometimes used interchangeably with codebook, but a data dictionary may include more details on the structure and elements of the data.

Figure 3.3 is a question from the ANES 2020 codebook (American National Election Studies 2022). This section indicates a particular variable's name (V202066), question wording, value labels, universe, and associated survey question (POSTVOTE_RVOTE).

FIGURE 3.3: ANES 2020 Codebook Example

Reviewing the questionnaires and codebooks in parallel can clarify how to interpret the variables (Figures 3.1 and 3.3), as questions and variables do not always correspond directly to each other in a one-to-one mapping. A single question may have multiple associated variables, or a single variable may summarize multiple questions.

3.2.4 Errata
An erratum (singular) or errata (plural) is a document that lists errors found in a publication or dataset. The purpose of an erratum is to correct or update inaccuracies in the original document. Examples of errata include:

Issuing a corrected data table after realizing a typo or mistake in a table cell
Reporting incorrectly programmed skips in an electronic survey, where questions were skipped by the respondent when they should not have been

For example, the 2004 ANES dataset released an erratum notifying analysts to remove a specific row from the data file due to the inclusion of a respondent who should not have been part of the sample. Adhering to an issued erratum helps us increase the accuracy and reliability of our analysis.

3.2.5 Additional resources
Survey documentation may include additional material, such as interviewer instructions or "show cards" provided to respondents during interviewer-administered surveys to help respondents answer questions. Explore the survey website to find out what resources were used and in what contexts.

3.3 Missing data coding
For some observations in a dataset, there may be missing data. This can be by design or from nonresponse, and these concepts are detailed in Chapter 11. In that chapter, we also discuss how to analyze data with missing values. In this section, we discuss how to understand documentation related to missing data. The survey documentation, often the codebook, represents the missing data with a code. The codebook may list different codes depending on why certain data is missing. In the example of variable V202066 from the ANES (Figure 3.3), -9 represents "Refused," -7 means that the response was deleted due to an incomplete interview, -6 means that there is no response because there was no follow-up interview, and -1 means "Inapplicable" (due to the designed skip pattern).
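As a quick illustration of working with such codes, here is a sketch of ours (with invented values, not the actual ANES file) that converts the negative codes to NA in R once the codebook tells us what they mean:

library(tidyverse)

# Invented values mimicking the ANES missing-data codes described above
toy <- tibble(V202066 = c(1, -9, 2, -1, 4, -7, -6))

# Recode every negative missing-data code to NA before analysis
toy %>%
  mutate(V202066_clean = if_else(V202066 < 0, NA_real_, V202066))

Chapter 11 discusses how to analyze data with missing values in much more depth.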
As another example, there may be a summary variable that describes the missingness of a set of variables, particularly with "select all that apply" or "multiple response" questions. In the National Crime Victimization Survey (NCVS), respondents who are victims of a crime and saw the offender are asked if the offender had a weapon and then asked what type of weapon it was. This part of the questionnaire from 2021 is shown in Figure 3.4.

FIGURE 3.4: Excerpt from the NCVS 2020-2021 Crime Incident Report - Weapon Type

For all multiple-response variables, the NCVS codebook includes a "lead-in" variable that summarizes the individual options. For question 23a on the weapon type, the lead-in variable is V4050, which is shown in Figure 3.5. This variable is then followed by a set of variables, one for each weapon type. An example of one of the individual variables from the codebook, the handgun, is shown in Figure 3.6. We dive deeper into how to analyze this variable in Chapter 11.

FIGURE 3.5: Excerpt from the NCVS 2021 Codebook for V4050 - LI WHAT WAS WEAPON

FIGURE 3.6: Excerpt from the NCVS 2021 Codebook for V4051 - C WEAPON: HAND GUN

When data is read into R, some values may be system missing; that is, they are coded as NA even if that is not evident in a codebook. We discuss in Chapter 11 how to analyze data with NA values and review how R handles missing data in calculations.

3.4 Example: American National Election Studies (ANES) 2020 Survey Documentation
Let's look at the survey documentation for the American National Election Studies (ANES) 2020. The survey website is located at https://electionstudies.org/data-center/2020-time-series-study/. Navigating to "User Guide and Codebook" (American National Election Studies 2022), we can download the PDF that contains the survey documentation, titled "ANES 2020 Time Series Study Full Release: User Guide and Codebook". Do not be daunted by the 796-page PDF. We will focus on the most critical information.

Introduction
The first section in the User Guide explains that the ANES 2020 Time Series Study continues a series of election surveys conducted since 1948. These surveys contain data on public opinion and voting behavior in U.S. presidential elections. The introduction also includes information about the modes used for data collection (web, live video interviewing, or CATI). Additionally, there is a summary of the number of pre-election interviews (8,280) and post-election re-interviews (7,449).

Sample Design and Respondent Recruitment
The section "Sample Design and Respondent Recruitment" provides more detail about the survey's sequential mixed-mode design. All three modes were conducted one after another and not at the same time. Additionally, it indicates that for the 2020 survey, they resampled all respondents who participated in the 2016 ANES, along with a newly drawn cross-section:

The target population for the fresh cross-section was the 231 million non-institutional U.S. citizens aged 18 or older living in the 50 U.S. states or the District of Columbia.

The document continues with more details on the sample groups.
Data Analysis, Weights, and Variance Estimation
The section "Data Analysis, Weights, and Variance Estimation" includes information on weights and strata/cluster variables. Reading through, we can find the full sample weight variables:

For analysis of the complete set of cases using pre-election data only, including all cases and representative of the 2020 electorate, use the full sample pre-election weight, V200010a. For analysis including post-election data for the complete set of participants (i.e., analysis of post-election data only or a combination of pre- and post-election data), use the full sample post-election weight, V200010b. Additional weights are provided for analysis of subsets of the data…

The document provides more information about the variables, summarized in Table 3.1.

TABLE 3.1: Weight and variance information for ANES

For weight | Use variance unit/PSU/cluster | And use variance stratum
V200010a   | V200010c                      | V200010d
V200010b   | V200010c                      | V200010d

Methodology
The User Guide mentions a supplemental document called "How to Analyze ANES Survey Data" (DeBell 2010) as a 'how-to guide' for analyzing the data. From this document, we learn more about the weights: they sum to the sample size, not the population. If our goal is to calculate estimates for the entire U.S. population instead of just the sample, we must adjust the weights to the U.S. population. To create accurate weights for the population, we need to determine the total population size at the time of the survey. Let's review the "Sample Design and Respondent Recruitment" section for more details:

The target population for the fresh cross-section was the 231 million non-institutional U.S. citizens aged 18 or older living in the 50 U.S. states or the District of Columbia.

The documentation suggests that the population should equal around 231 million, but this is a very imprecise count. Upon further investigation in the available resources, we can find the methodology file titled "Methodology Report for the ANES 2020 Time Series Study" (DeBell et al. 2022). This file states that we can use the population total from the Current Population Survey (CPS), a monthly survey sponsored by the U.S. Census Bureau and the U.S. Bureau of Labor Statistics. The CPS provides a more accurate population estimate for a specific month. Therefore, we can use the CPS to get the total population number for March 2020, the time in which the ANES was conducted. Chapter 4 gives detailed instructions on how to calculate and adjust this value in the data.

References
American National Election Studies. 2021. "ANES 2020 Time Series Study: Pre-Election and Post-Election Survey Questionnaires." https://electionstudies.org/wp-content/uploads/2021/07/anes_timeseries_2020_questionnaire_20210719.pdf.
———. 2022. "ANES 2020 Time Series Study Full Release: User Guide and Codebook." https://electionstudies.org/wp-content/uploads/2022/02/anes_timeseries_2020_userguidecodebook_20220210.pdf.
Centers for Disease Control and Prevention (CDC). 2021. "Behavioral Risk Factor Surveillance System Survey Questionnaire." U.S. Department of Health and Human Services, Centers for Disease Control and Prevention. https://www.cdc.gov/brfss/questionnaires/pdf-ques/2021-BRFSS-Questionnaire-1-19-2022-508.pdf.
DeBell, Matthew. 2010. "How to Analyze ANES Survey Data." ANES Technical Report Series nes012492.
Palo Alto, CA: Stanford University; Ann Arbor, MI: The University of Michigan. https://electionstudies.org/wp-content/uploads/2018/05/HowToAnalyzeANESData.pdf.
DeBell, Matthew, Michelle Amsbary, Ted Brader, Shelley Brock, Cindy Good, Justin Kamens, Natalya Maisel, and Sarah Pinto. 2022. "Methodology Report for the ANES 2020 Time Series Study." https://electionstudies.org/wp-content/uploads/2022/08/anes_timeseries_2020_methodology_report.pdf.

Chapter 4 Setup

This chapter provides an overview of the packages, data, and design objects we use throughout this book. For a streamlined learning experience, we recommend taking the time to walk through the code provided and making sure everything is installed. As mentioned in Chapter 2, understanding how a survey was conducted helps us make sense of the results and interpret findings. So, we provide background on the datasets used in examples and exercises. Finally, we walk through how to create the survey design objects necessary to begin analysis. If you have questions or face issues while going through the book, please report them in the book's GitHub repository: https://github.com/tidy-survey-r/tidy-survey-book.

4.1 Packages
We use several packages throughout the book, but let's install and load the specific ones for this chapter. Many functions in the examples and exercises are from three packages: {tidyverse}, {survey}, and {srvyr}. If they are not already installed, use the code below. The {tidyverse} and {survey} packages can both be installed from the Comprehensive R Archive Network (CRAN). We use the GitHub development version of {srvyr} because of its additional functionality compared to the version on CRAN. Install the packages from CRAN and GitHub using the {remotes} package:

install.packages(c("tidyverse", "survey"))
remotes::install_github("https://github.com/gergness/srvyr")

We bundled the datasets used in the book in an R package, {srvyrexploR}. Install it directly from GitHub using the {remotes} package:

remotes::install_github("https://github.com/tidy-survey-r/srvyrexploR")

After installing these packages, load them using the library() function:

library(tidyverse)
library(survey)
library(srvyr)
library(srvyrexploR)

The packages {broom}, {gt}, and {gtsummary} play a role in displaying output and creating formatted tables. Install them with the code below (see the note on {broom} at the end of this chapter):

install.packages(c("gt", "gtsummary"))

After installing these packages, load them using the library() function:

library(broom)
library(gt)
library(gtsummary)

Install and load the {censusapi} package to access the Current Population Survey (CPS), which we use to ensure accurate weighting of a key dataset in the book. Run the code below to install {censusapi}:

install.packages("censusapi")

After installing this package, load it using the library() function:

library(censusapi)

Note that the {censusapi} package requires a Census API key, available for free from the U.S. Census Bureau website (refer to the package documentation for more information). It is recommended to store the Census API key in our R environment instead of directly in the code. After obtaining the API key, save it in your R environment by running Sys.setenv():

Sys.setenv(CENSUS_KEY = "YOUR_API_KEY_HERE")

Then, restart the R session. Once the Census API key is stored, we can retrieve it in our R code with Sys.getenv("CENSUS_KEY"). There are other packages used throughout the book.
We list them in the Prerequisite boxes at the beginning of each chapter. As we work through the book, make sure to check the Prerequisite box and install any missing packages before proceeding.

4.2 Data
As mentioned above, the {srvyrexploR} package contains the datasets used in the book. Once it is installed and loaded, explore the documentation using the help() function. Read the descriptions of the datasets to understand what they contain:

help(package = "srvyrexploR")

This book uses two main datasets: the American National Election Studies (ANES; DeBell 2010) and the Residential Energy Consumption Survey (RECS; U.S. Energy Information Administration 2023a). We can load these datasets individually with the data() function by specifying the dataset name as an argument. In the code below, we load the anes_2020 and recs_2020 datasets into objects with their respective names:

data(anes_2020)
data(recs_2020)

4.2.1 American National Election Studies (ANES) Data
The ANES is a study that collects data from election surveys dating back to 1948. These surveys contain information on public opinion and voting behavior in U.S. presidential elections. They cover topics such as party affiliation, voting choice, and level of trust in the government. The 2020 survey, the data we use in the book, was fielded online, through live video interviews, or via computer-assisted telephone interviews (CATI).

When working with new survey data, analysts should review the survey documentation (see Chapter 3) to understand the data collection methods. The original ANES data contains variables starting with V20 (DeBell 2010), so to assist with our analysis throughout the book, we created descriptive variable names. For example, the respondent's age is now in a variable called Age, and gender is in a variable called Gender. These descriptive variables are included in the {srvyrexploR} package, and Table 4.1 displays the list of these renamed variables. A complete overview of all variables can be found in Appendix A.
TABLE 4.1: List of created variables in the ANES Data

Variable Name: CaseID, InterviewMode, Weight, VarUnit, Stratum, CampaignInterest, VotedPres2016, VotedPres2016_selection, PartyID, TrustGovernment, TrustPeople, Age, AgeGroup, Education, RaceEth, Gender, Income, Income7, VotedPres2020, VotedPres2020_selection, EarlyVote2020

Before beginning an analysis, it is useful to view the data to understand the available variables. The dplyr::glimpse() function produces a list of all variables, their types (e.g., factor, double), and a few example values. Below, we remove variables containing numbers with select(-matches("^V\\d")) before using glimpse() to get a quick overview of the data with descriptive variable names:

anes_2020 %>%
  select(-matches("^V\\d")) %>%
  glimpse()

## Rows: 7,453
## Columns: 21
## $ CaseID <dbl> 200015, 200022, 200039, 200046, 200053…
## $ InterviewMode <fct> Web, Web, Web, Web, Web, Web, Web, Web…
## $ Weight <dbl> 1.0057, 1.1635, 0.7687, 0.5210, 0.9658…
## $ VarUnit <fct> 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2,…
## $ Stratum <fct> 9, 26, 41, 29, 23, 37, 7, 37, 32, 41, …
## $ CampaignInterest <fct> Somewhat interested, Not much interest…
## $ VotedPres2016 <fct> Yes, Yes, Yes, Yes, Yes, No, Yes, No, …
## $ VotedPres2016_selection <fct> Trump, Other, Clinton, Clinton, Trump,…
## $ PartyID <fct> Strong republican, Independent, Indepe…
## $ TrustGovernment <fct> Never, Never, Some of the time, About …
## $ TrustPeople <fct> About half the time, Some of the time,…
## $ Age <dbl> 46, 37, 40, 41, 72, 71, 37, 45, 70, 43…
## $ AgeGroup <fct> 40-49, 30-39, 40-49, 40-49, 70 or olde…
## $ Education <fct> Bachelor's, Post HS, High school, Post…
## $ RaceEth <fct> "Hispanic", "Asian, NH/PI", "White", "…
## $ Gender <fct> Male, Female, Female, Male, Male, Fema…
## $ Income <fct> "$175,000-249,999", "$70,000-74,999", …
## $ Income7 <fct> $125k or more, $60-80k, $100-125k, $20…
## $ VotedPres2020 <fct> NA, Yes, Yes, Yes, Yes, Yes, Yes, NA, …
## $ VotedPres2020_selection <fct> NA, Other, Biden, Biden, Trump, Biden,…
## $ EarlyVote2020 <fct> NA, No, No, No, No, No, No, NA, Yes, N…

From the output, we can see there are 7,453 rows and 21 variables in the ANES data.
This output also indicates that most of the variables are factors (e.g., InterviewMode), while a few variables are in double (numeric) format (e.g., Age).

4.2.2 Residential Energy Consumption Survey (RECS) Data
RECS is a study that measures energy consumption and expenditure in American households. Funded by the Energy Information Administration, the RECS data are collected through interviews with household members and energy suppliers. These interviews take place in person, over the phone, via mail, and on the web. The survey has been fielded 14 times between 1950 and 2020. It includes questions about appliances, electronics, heating, air conditioning (A/C), temperatures, water heating, lighting, energy bills, respondent demographics, and energy assistance.

As mentioned above, analysts should read the survey documentation (see Chapter 3) to understand how the survey was implemented and the data collected. Table 4.2 displays the list of variables in the RECS data (not including the weights, which start with NWEIGHT and are described in more detail in Chapter 10). An overview of all variables can be found in Appendix B.
TABLE 4.2: List of Variables in the RECS Data

Variable Name: DOEID, ClimateRegion_BA, Urbanicity, Region, REGIONC, Division, STATE_FIPS, state_postal, state_name, HDD65, CDD65, HDD30YR, CDD30YR, HousingUnitType, YearMade, TOTSQFT_EN, TOTHSQFT, TOTCSQFT, ZTOTSQFT_EN, ZYearMade, ZHousingUnitType, SpaceHeatingUsed, ZSpaceHeatingUsed, ACUsed, ZACUsed, ZACBehavior, HeatingBehavior, WinterTempDay, WinterTempAway, WinterTempNight, ACBehavior, SummerTempDay, SummerTempAway, SummerTempNight, ZHeatingBehavior, ZWinterTempAway, ZSummerTempAway, ZWinterTempDay, ZSummerTempDay, ZWinterTempNight, ZSummerTempNight, BTUEL, DOLLAREL, ZBTUEL, BTUNG, DOLLARNG, ZBTUNG, BTULP, DOLLARLP, ZBTULP, BTUFO, DOLLARFO, ZBTUFO, BTUWOOD, ZBTUWOOD, TOTALBTU, TOTALDOL

Before starting an analysis, we recommend viewing the data to understand the types of data and variables that are included. The dplyr::glimpse() function produces a list of all variables, the type of each variable (e.g., factor, double), and a few example values.
Below, we remove the weight variables with select(-matches("^NWEIGHT")) before using glimpse() to get a quick overview of the data:

recs_2020 %>%
  select(-matches("^NWEIGHT")) %>%
  glimpse()

## Rows: 18,496
## Columns: 57
## $ DOEID <dbl> 1e+05, 1e+05, 1e+05, 1e+05, 1e+05, 1e+05, 1e…
## $ ClimateRegion_BA <fct> Mixed-Dry, Mixed-Humid, Mixed-Dry, Mixed-Hum…
## $ Urbanicity <fct> Urban Area, Urban Area, Urban Area, Urban Ar…
## $ Region <fct> West, South, West, South, Northeast, South, …
## $ REGIONC <chr> "WEST", "SOUTH", "WEST", "SOUTH", "NORTHEAST…
## $ Division <fct> Mountain South, West South Central, Mountain…
## $ STATE_FIPS <chr> "35", "05", "35", "45", "34", "48", "40", "2…
## $ state_postal <fct> NM, AR, NM, SC, NJ, TX, OK, MS, DC, AZ, CA, …
## $ state_name <fct> New Mexico, Arkansas, New Mexico, South Caro…
## $ HDD65 <dbl> 3844, 3766, 3819, 2614, 4219, 901, 3148, 182…
## $ CDD65 <dbl> 1679, 1458, 1696, 1718, 1363, 3558, 2128, 23…
## $ HDD30YR <dbl> 4451, 4429, 4500, 3229, 4896, 1150, 3564, 26…
## $ CDD30YR <dbl> 1027, 1305, 1010, 1653, 1059, 3588, 2043, 21…
## $ HousingUnitType <fct> Single-family detached, Apartment: 5 or more…
## $ YearMade <ord> 1970-1979, 1980-1989, 1960-1969, 1980-1989, …
## $ TOTSQFT_EN <dbl> 2100, 590, 900, 2100, 800, 4520, 2100, 900, …
## $ TOTHSQFT <dbl> 2100, 590, 900, 2100, 800, 3010, 1200, 900, …
## $ TOTCSQFT <dbl> 2100, 590, 900, 2100, 800, 3010, 1200, 0, 50…
## $ ZTOTSQFT_EN <fct> Not imputed, Not imputed, Not imputed, Not i…
## $ ZYearMade <fct> Not imputed, Not imputed, Not imputed, Not i…
## $ ZHousingUnitType <fct> Not imputed, Not imputed, Not imputed, Not i…
## $ SpaceHeatingUsed <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
## $ ZSpaceHeatingUsed <fct> Not imputed, Not imputed, Not imputed, Not i…
## $ ACUsed <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FA…
## $ ZACUsed <fct> Not imputed, Not imputed, Not imputed, Not i…
## $ ZACBehavior <fct> Not imputed, Imputed, Not imputed, Not imput…
## $ HeatingBehavior <fct> Set one temp and leave it, Turn on or off as…
## $ WinterTempDay <dbl> 70, 70, 69, 68, 68, 76, 74, 70, 68, 70, 72, …
## $ WinterTempAway <dbl> 70, 65, 68, 68, 68, 76, 65, 70, 60, 70, 70, …
## $ WinterTempNight <dbl> 68, 65, 67, 68, 68, 68, 74, 68, 62, 68, 72, …
## $ ACBehavior <fct> Set one temp and leave it, Turn on or off as…
## $ SummerTempDay <dbl> 71, 68, 70, 72, 72, 69, 68, NA, 72, 74, 77, …
## $ SummerTempAway <dbl> 71, 68, 68, 72, 72, 74, 70, NA, 76, 74, 77, …
## $ SummerTempNight <dbl> 71, 68, 68, 72, 72, 68, 70, NA, 68, 72, 77, …
## $ ZHeatingBehavior <fct> Not imputed, Not imputed, Not imputed, Not i…
## $ ZWinterTempAway <fct> Not imputed, Not imputed, Not imputed, Not i…
## $ ZSummerTempAway <fct> Not imputed, Not imputed, Not imputed, Not i…
## $ ZWinterTempDay <fct> Not imputed, Not imputed, Not imputed, Not i…
## $ ZSummerTempDay <fct> Not imputed, Not imputed, Not imputed, Not i…
## $ ZWinterTempNight <fct> Not imputed, Not imputed, Not imputed, Not i…
## $ ZSummerTempNight <fct> Not imputed, Not imputed, Not imputed, Not i…
## $ BTUEL <dbl> 42723, 17889, 8147, 31647, 20027, 48968, 494…
## $ DOLLAREL <dbl> 1955.06, 713.27, 334.51, 1424.86, 1087.00, 1…
## $ ZBTUEL <fct> Not imputed, Not imputed, Imputed amount and…
## $ BTUNG <dbl> 101924.4, 10145.3, 22603.1, 55118.7, 39099.5…
## $ DOLLARNG <dbl> 701.83, 261.73, 188.14, 636.91, 376.04, 439.…
## $ ZBTUNG <fct> Not imputed, Not imputed, Imputed, Not imput…
## $ BTULP <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 17…
## $ DOLLARLP <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,…
## $ ZBTULP <fct> Not applicable, Not applicable, Not applicab…
## $ BTUFO <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 68…
## $ DOLLARFO <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 18…
## $ ZBTUFO <fct> Not applicable, Not applicable, Not applicab…
## $ BTUWOOD <dbl> 0, 0, 0, 0, 0, 3000, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ZBTUWOOD <fct> Not applicable, Not applicable, Not applicab…
## $ TOTALBTU <dbl> 144648, 28035, 30750, 86765, 59127, 85401, 1…
## $ TOTALDOL <dbl> 2656.9, 975.0, 522.6, 2061.8, 1463.0, 2335.1…

From the output, we can see that there are 18,496 rows and 57 non-weight variables in the RECS data. This output also indicates that most of the variables are in double (numeric) format (e.g., TOTSQFT_EN), with some factor (e.g., Region), Boolean (e.g., ACUsed), character (e.g., REGIONC), and ordinal (e.g., YearMade) variables.

4.3 Design objects
The design object is the backbone of survey analysis. It is where we specify the sampling design, weights, and other necessary information to ensure we account for errors in the data. Before creating the design object, analysts should carefully review the survey documentation to understand how to create the design object for accurate analysis. In this chapter, we provide details on how to code the design object for the ANES and RECS data used in the book. However, we only provide a high-level overview to get readers started. For a deeper understanding of creating these design objects for a variety of sampling designs, see Chapter 10.

While we recommend conducting exploratory data analysis on the original data before diving into complex survey analysis (see Chapter 12), the actual analysis and inference should be performed with the survey design objects instead of the original survey data. This ensures that we appropriately apply the details of the survey design to our calculations. For example, the ANES data is called anes_2020. If we create a survey design object called anes_des, our analyses should begin with anes_des and not anes_2020.

4.3.1 American National Election Studies (ANES) Design Object
The ANES documentation (DeBell 2010) details the sampling and weighting implications for analyzing the survey data. From this documentation, and as noted in Chapter 3, the 2020 ANES data is weighted to the sample, not the population. To make generalizations about the population, we need to weight the data to the full population count. The ANES methodology recommends using the Current Population Survey (CPS) to determine the number of non-institutional U.S. citizens aged 18 or older living in the 50 U.S. states or D.C. in March 2020.

We can use the {censusapi} package to obtain the information needed for the survey design object. The getCensus() function allows us to retrieve the CPS data for March (cps/basic/mar) in 2020 (vintage = 2020). Additionally, we extract several variables from the CPS:

month (HRMONTH) and year (HRYEAR4) of the interview: to confirm the correct time period
age (PRTAGE) of the respondent: to narrow the population to those 18 and older (the eligible voting age)
citizenship status (PRCITSHP) of the respondent: to narrow the population to only those eligible to vote
final person-level weight (PWSSWGT)

Detailed information for these variables can be found in the CPS data dictionary (see the link in the notes at the end of this chapter).
cps_state_in <- getCensus(
  name = "cps/basic/mar",
  vintage = 2020,
  region = "state",
  vars = c("HRMONTH", "HRYEAR4", "PRTAGE", "PRCITSHP", "PWSSWGT"),
  key = Sys.getenv("CENSUS_KEY")
)

cps_state <- cps_state_in %>%
  as_tibble() %>%
  mutate(across(.cols = everything(), .fns = as.numeric))

In the code above, we include region = "state". The default region type for the CPS data is the state level. While it is not required to include this, it can be helpful for understanding the geographical context of the data.

In getCensus(), we filtered the dataset by specifying the month (HRMONTH == 3) and year (HRYEAR4 == 2020) of our request. Therefore, we expect that all interviews within our output were conducted during that particular month and year. We can confirm that the data is from March 2020 by running the code below:

cps_state %>%
  distinct(HRMONTH, HRYEAR4)

## # A tibble: 1 × 2
##   HRMONTH HRYEAR4
##     <dbl>   <dbl>
## 1       3    2020

We can narrow down the dataset using the age and citizenship variables to include only individuals who are 18 years or older (PRTAGE >= 18) and have U.S. citizenship (PRCITSHP %in% c(1:4)):

cps_narrow_resp <- cps_state %>%
  filter(PRTAGE >= 18, PRCITSHP %in% c(1:4))

To calculate the U.S. population from the filtered data, we sum the person weights (PWSSWGT):

targetpop <- cps_narrow_resp %>%
  pull(PWSSWGT) %>%
  sum()

targetpop

## [1] "231,034,125"

The target population in 2020 is 231,034,125. This result gives us what we need to create the survey design object for estimating population statistics. Using the anes_2020 data, we adjust the weighting variable (V200010b) using the target population we just calculated (targetpop). We determine the proportion of the total weight for each individual weight (V200010b / sum(V200010b)) and then multiply that proportion by the calculated target population.

anes_adjwgt <- anes_2020 %>%
  mutate(Weight = V200010b / sum(V200010b) * targetpop)

Once we have the adjusted weights, we can refer to the rest of the documentation to create the survey design. The documentation indicates that the study uses a stratified cluster sampling design. This means that we need to specify variables for strata and ids (cluster) and fill in the nest argument. The documentation provides guidance on which strata and cluster variables to use depending on whether we are analyzing pre- or post-election data. In this book, we analyze post-election data, so we use the post-election weight V200010b, strata variable V200010d, and PSU/cluster variable V200010c. Additionally, we set nest = TRUE to ensure the clusters are nested within the strata.

anes_des <- anes_adjwgt %>%
  as_survey_design(
    weights = Weight,
    strata = V200010d,
    ids = V200010c,
    nest = TRUE
  )

anes_des

## Stratified 1 - level Cluster Sampling design (with replacement)
## With (101) clusters.
## Called via srvyr
## Sampling variables:
## - ids: V200010c
## - strata: V200010d
## - weights: Weight
## Data variables:
## - V200001 (dbl), CaseID (dbl), V200002 (hvn_lbll), InterviewMode
## (fct), V200010b (dbl), Weight (dbl), V200010c (dbl), VarUnit (fct),
## V200010d (dbl), Stratum (fct), V201006 (hvn_lbll), CampaignInterest
## (fct), V201024 (hvn_lbll), V201025x (hvn_lbll), V201029 (hvn_lbll),
## V201101 (hvn_lbll), V201102 (hvn_lbll), VotedPres2016 (fct),
## V201103 (hvn_lbll), VotedPres2016_selection (fct), V201228
## (hvn_lbll), V201229 (hvn_lbll), V201230 (hvn_lbll), V201231x
## (hvn_lbll), PartyID (fct), V201233 (hvn_lbll), TrustGovernment
## (fct), V201237 (hvn_lbll), TrustPeople (fct), V201507x (hvn_lbll),
## Age (dbl), AgeGroup (fct), V201510 (hvn_lbll), Education (fct),
## V201546 (hvn_lbll), V201547a (hvn_lbll), V201547b (hvn_lbll),
## V201547c (hvn_lbll), V201547d (hvn_lbll), V201547e (hvn_lbll),
## V201547z (hvn_lbll), V201549x (hvn_lbll), RaceEth (fct), V201600
## (hvn_lbll), Gender (fct), V201607 (hvn_lbll), V201610 (hvn_lbll),
## V201611 (hvn_lbll), V201613 (hvn_lbll), V201615 (hvn_lbll), V201616
## (hvn_lbll), V201617x (hvn_lbll), Income (fct), Income7 (fct),
## V202051 (hvn_lbll), V202066 (hvn_lbll), V202072 (hvn_lbll),
## VotedPres2020 (fct), V202073 (hvn_lbll), V202109x (hvn_lbll),
## V202110x (hvn_lbll), VotedPres2020_selection (fct), EarlyVote2020
## (fct)

We can examine this new object to learn more about the survey design. For example, the output tells us that the ANES is a "Stratified 1 - level Cluster Sampling design (with replacement) With (101) clusters". Additionally, the output displays the sampling variables and then lists the remaining variables in the dataset. This design object is used throughout this book to conduct survey analysis.

4.3.2 Residential Energy Consumption Survey (RECS) Design Object
The RECS documentation (U.S. Energy Information Administration 2023a) provides information on the survey's sampling and weighting implications for analysis. The documentation shows that the 2020 RECS uses jackknife weights, where the main analytic weight is NWEIGHT and the jackknife replicate weights are NWEIGHT1-NWEIGHT60. In the survey design object code, we can specify these in the weights and repweights arguments, respectively. With jackknife weights, additional information is required: type, scale, and mse. Chapter 10 goes into depth about each of these arguments; to quickly get started, the documentation lets us know that type = "JK1", scale = 59/60, and mse = TRUE. We can use the following code to create the survey design object:

recs_des <- recs_2020 %>%
  as_survey_rep(
    weights = NWEIGHT,
    repweights = NWEIGHT1:NWEIGHT60,
    type = "JK1",
    scale = 59 / 60,
    mse = TRUE
  )

recs_des

## Call: Called via srvyr
## Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances.
## Sampling variables:
## - repweights: `NWEIGHT1 + NWEIGHT2 + NWEIGHT3 + NWEIGHT4 + NWEIGHT5 +
## NWEIGHT6 + NWEIGHT7 + NWEIGHT8 + NWEIGHT9 + NWEIGHT10 + NWEIGHT11 +
## NWEIGHT12 + NWEIGHT13 + NWEIGHT14 + NWEIGHT15 + NWEIGHT16 +
## NWEIGHT17 + NWEIGHT18 + NWEIGHT19 + NWEIGHT20 + NWEIGHT21 +
## NWEIGHT22 + NWEIGHT23 + NWEIGHT24 + NWEIGHT25 + NWEIGHT26 +
## NWEIGHT27 + NWEIGHT28 + NWEIGHT29 + NWEIGHT30 + NWEIGHT31 +
## NWEIGHT32 + NWEIGHT33 + NWEIGHT34 + NWEIGHT35 + NWEIGHT36 +
## NWEIGHT37 + NWEIGHT38 + NWEIGHT39 + NWEIGHT40 + NWEIGHT41 +
## NWEIGHT42 + NWEIGHT43 + NWEIGHT44 + NWEIGHT45 + NWEIGHT46 +
## NWEIGHT47 + NWEIGHT48 + NWEIGHT49 + NWEIGHT50 + NWEIGHT51 +
## NWEIGHT52 + NWEIGHT53 + NWEIGHT54 + NWEIGHT55 + NWEIGHT56 +
## NWEIGHT57 + NWEIGHT58 + NWEIGHT59 + NWEIGHT60`
## - weights: NWEIGHT
## Data variables:
## - DOEID (dbl), ClimateRegion_BA (fct), Urbanicity (fct), Region
## (fct), REGIONC (chr), Division (fct), STATE_FIPS (chr),
## state_postal (fct), state_name (fct), HDD65 (dbl), CDD65 (dbl),
## HDD30YR (dbl), CDD30YR (dbl), HousingUnitType (fct), YearMade
## (ord), TOTSQFT_EN (dbl), TOTHSQFT (dbl), TOTCSQFT (dbl),
## ZTOTSQFT_EN (fct), ZYearMade (fct), ZHousingUnitType (fct),
## SpaceHeatingUsed (lgl), ZSpaceHeatingUsed (fct), ACUsed (lgl),
## ZACUsed (fct), ZACBehavior (fct), HeatingBehavior (fct),
## WinterTempDay (dbl), WinterTempAway (dbl), WinterTempNight (dbl),
## ACBehavior (fct), SummerTempDay (dbl), SummerTempAway (dbl),
## SummerTempNight (dbl), ZHeatingBehavior (fct), ZWinterTempAway
## (fct), ZSummerTempAway (fct), ZWinterTempDay (fct), ZSummerTempDay
## (fct), ZWinterTempNight (fct), ZSummerTempNight (fct), NWEIGHT
## (dbl), NWEIGHT1 (dbl), NWEIGHT2 (dbl), NWEIGHT3 (dbl), NWEIGHT4
## (dbl), NWEIGHT5 (dbl), NWEIGHT6 (dbl), NWEIGHT7 (dbl), NWEIGHT8
## (dbl), NWEIGHT9 (dbl), NWEIGHT10 (dbl), NWEIGHT11 (dbl), NWEIGHT12
## (dbl), NWEIGHT13 (dbl), NWEIGHT14 (dbl), NWEIGHT15 (dbl), NWEIGHT16
## (dbl), NWEIGHT17 (dbl), NWEIGHT18 (dbl), NWEIGHT19 (dbl), NWEIGHT20
## (dbl), NWEIGHT21 (dbl), NWEIGHT22 (dbl), NWEIGHT23 (dbl), NWEIGHT24
## (dbl), NWEIGHT25 (dbl), NWEIGHT26 (dbl), NWEIGHT27 (dbl), NWEIGHT28
## (dbl), NWEIGHT29 (dbl), NWEIGHT30 (dbl), NWEIGHT31 (dbl), NWEIGHT32
## (dbl), NWEIGHT33 (dbl), NWEIGHT34 (dbl), NWEIGHT35 (dbl), NWEIGHT36
## (dbl), NWEIGHT37 (dbl), NWEIGHT38 (dbl), NWEIGHT39 (dbl), NWEIGHT40
## (dbl), NWEIGHT41 (dbl), NWEIGHT42 (dbl), NWEIGHT43 (dbl), NWEIGHT44
## (dbl), NWEIGHT45 (dbl), NWEIGHT46 (dbl), NWEIGHT47 (dbl), NWEIGHT48
## (dbl), NWEIGHT49 (dbl), NWEIGHT50 (dbl), NWEIGHT51 (dbl), NWEIGHT52
## (dbl), NWEIGHT53 (dbl), NWEIGHT54 (dbl), NWEIGHT55 (dbl), NWEIGHT56
## (dbl), NWEIGHT57 (dbl), NWEIGHT58 (dbl), NWEIGHT59 (dbl), NWEIGHT60
## (dbl), BTUEL (dbl), DOLLAREL (dbl), ZBTUEL (fct), BTUNG (dbl),
## DOLLARNG (dbl), ZBTUNG (fct), BTULP (dbl), DOLLARLP (dbl), ZBTULP
## (fct), BTUFO (dbl), DOLLARFO (dbl), ZBTUFO (fct), BTUWOOD (dbl),
## ZBTUWOOD (fct), TOTALBTU (dbl), TOTALDOL (dbl)

Viewing this new object provides information about the survey design; for example, the output tells us that the RECS is an "Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances". Additionally, the output shows the sampling variables (NWEIGHT1-NWEIGHT60) and then lists the remaining variables in the dataset. This design object is used throughout this book to conduct survey analysis.
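As a quick check that a design object is ready to use, we can compute a simple design-based estimate. This example is ours, not part of the RECS documentation; survey_mean() is a {srvyr} function covered in Chapter 5:

# Estimate the proportion of households that use air conditioning (ACUsed
# is a logical variable, so its mean is a proportion), with a jackknife
# standard error computed from the replicate weights
recs_des %>%
  summarize(p_ac = survey_mean(ACUsed, na.rm = TRUE))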
This chapter walked through the installation and loading of several packages, introduced the survey data available in the {srvyrexploR} package, and provided context on creating survey design objects for the ANES and RECS datasets. With this foundational knowledge, we can follow the instructions listed in the Prerequisite boxes at the start of each chapter.

References
DeBell, Matthew. 2010. "How to Analyze ANES Survey Data." ANES Technical Report Series nes012492. Palo Alto, CA: Stanford University; Ann Arbor, MI: The University of Michigan. https://electionstudies.org/wp-content/uploads/2018/05/HowToAnalyzeANESData.pdf.
U.S. Energy Information Administration. 2023a. "2020 Residential Energy Consumption Survey: Household Characteristics Technical Documentation Summary." https://www.eia.gov/consumption/residential/data/2020/pdf/2020%20RECS_Methodology%20Report.pdf.

Notes: {broom} is already included in the tidyverse, so no separate installation is required. The CPS data dictionary is available at https://www2.census.gov/programs-surveys/cps/datasets/2020/basic/2020_Basic_CPS_Public_Use_Record_Layout_plus_IO_Code_list.txt.

Chapter 5 Descriptive Analyses in {srvyr}

Prerequisites
For this chapter, load the following packages:

library(tidyverse)
library(survey)
library(srvyr)
library(srvyrexploR)
library(broom)

To help explain the similarities between {dplyr} functions and {srvyr} functions, this chapter uses the mtcars and iris datasets that are built into R, as well as the apistrat data that comes with the {survey} package:

data(api)

dstrata <- apistrat %>%
  as_survey_design(strata = stype, weights = pw)

We also use data from the ANES and RECS described in Chapter 4. As a reminder, here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter 4 for more information).

targetpop <- 231592693

data(anes_2020)

anes_adjwgt <- anes_2020 %>%
  mutate(Weight = Weight / sum(Weight) * targetpop)

anes_des <- anes_adjwgt %>%
  as_survey_design(
    weights = Weight,
    strata = Stratum,
    ids = VarUnit,
    nest = TRUE
  )

For RECS, details are included in the RECS documentation and Chapters 4 and 10.

data(recs_2020)

recs_des <- recs_2020 %>%
  as_survey_rep(
    weights = NWEIGHT,
    repweights = NWEIGHT1:NWEIGHT60,
    type = "JK1",
    scale = 59 / 60,
    mse = TRUE
  )

5.1 Introduction
Descriptive analyses, such as basic counts, cross-tabulations, or means, are one of the first steps a researcher takes before conducting statistical tests or developing models. Reviewing findings from descriptive analyses can help researchers glean insight into the data, the underlying population, and any unique aspects of the data or population. For example, if the data shows a proportion of males of only 10%, this could indicate either a unique population or a potential error in the data. Additionally, researchers can use descriptive analyses to provide means, proportions, or other measures to summarize the data and make estimates about the population.
We will discuss many different types of descriptive analyses in this chapter, but it is important to know what type of data we have and what statistics to use for that type of data. In survey data, we typically consider data to be one of these four main data types: Categorical/nominal data: variables with levels or descriptions that cannot be ordered, such as the region of the country (North, South, East, and West) Ordinal data: variables that can be ordered, such as those from a Likert scale (strongly disagree, disagree, agree, and strongly agree) Discrete data: variables that are counted or measured, such as number of children Continuous data: variables that are measured and whose values can lie anywhere on an interval, such as weight When we pull the data from surveys into R, the data will be listed as character, factor, numeric, or logical/Boolean. These classes will not clearly indicate the type of survey data (e.g., ordinal). When working with survey data, researchers need to properly use the questionnaire and codebook along with the data (see Chapter 3) to understand what the values for each variable represent. For example, our survey data may represent categorical variables (e.g., the North, South, East, and West regions of the United States) using numeric codes (e.g., 1, 2, 3, and 4). Though this is a categorical variable from the survey, this variable might be automatically read as numeric values when we import our data into R. This can lead to the common mistake of applying a mean function to categorical values instead of a proportion function (see the short example below). Choosing appropriate measures is crucial to reach valid conclusions. Different variable types have distinct properties and levels of measurement, and we cannot apply all measures to all variables. This chapter will discuss how to analyze measures of distribution (e.g., cross-tabulations), central tendency (e.g., means), relationship (e.g., ratios), and dispersion (e.g., standard deviations). Measures of distribution describe how often an event or response occurs. These measures include counts and totals. Measures of central tendency find the central (or average) responses. These measures include means and medians. Measures of relationship describe how variables relate to each other. These measures include correlations and ratios. Measures of dispersion describe how data spreads around the central tendency for continuous variables. These measures include standard deviations and variances. Specifically, we will cover the following functions from the {srvyr} package: Count of observations (survey_count() and survey_tally()) Summation of variables (survey_total()) Means and proportions (survey_mean() and survey_prop()) Quantiles and medians (survey_quantile() and survey_median()) Correlations (survey_corr()) Ratios (survey_ratio()) Variances and standard deviations (survey_var() and survey_sd()) To incorporate each of these survey functions, recall the general process for survey estimation from Chapter 10: Create a tbl_svy object using srvyr::as_survey_design() or srvyr::as_survey_rep(). Subset the data for subpopulations using srvyr::filter(), if needed. Specify domains of analysis using srvyr::group_by(), if needed. Analyze the data with survey-specific functions. We have already discussed how to create the survey design objects in Chapter 10, and the code for creating these for the two datasets used in this chapter is provided in the Prerequisites box at the beginning of this chapter. We will apply the survey functions covered in this chapter in Step 4.
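As a short illustration of the coding pitfall described above (using a small made-up vector, not the survey data), taking the mean of numeric category codes produces a meaningless value, while recoding to a factor supports an appropriate proportion summary:

library(tidyverse)

# Hypothetical numeric codes for a categorical variable:
# 1 = North, 2 = South, 3 = East, 4 = West
region_code <- c(1, 2, 2, 4, 3, 2, 1, 4)

# Misleading: the "average region" of 2.375 has no substantive meaning
mean(region_code)

# Appropriate: recode to a factor and summarize as proportions
tibble(region = factor(region_code, levels = 1:4,
                       labels = c("North", "South", "East", "West"))) %>%
  count(region) %>%
  mutate(p = n / sum(n))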
To look at the data by different subgroups, we can choose to filter and/or group the data. It is very important that we filter and group the data only after creating the design object. This is necessary to ensure that the results accurately account for the survey design. Removing any data before creating the survey design object means that the data for those cases is not included in the survey design information or in the estimation of the variance. 5.2 Similarities Between {dplyr} and {srvyr} Functions One of the major advantages of using {srvyr} is that it applies {dplyr}-like syntax to the {survey} package. We can use pipes to specify a tbl_svy object, apply a function, and then feed that output into the next function’s first argument. Functions follow the ‘tidy’ convention of snake_case function names. The example below calculates the mean and median for the variable mpg (miles per gallon) in the mtcars dataset. mtcars %>% summarize(mpg_mean = mean(mpg), mpg_median = median(mpg)) ## mpg_mean mpg_median ## 1 20.09 19.2 Similarly, in the next example, the mean and median of the variable api00 are calculated for the tbl_svy object dstrata. Note the similarity in the syntax. When we dig into the functions later, we will show that the outputs are similar in that one row is returned for each group (if there are groups), but more columns are returned. Specifically, by default, the standard error of the statistic is calculated in addition to the statistic itself. dstrata %>% summarize(api00_mean = survey_mean(api00), api00_med = survey_median(api00)) ## # A tibble: 1 × 4 ## api00_mean api00_mean_se api00_med api00_med_se ## <dbl> <dbl> <dbl> <dbl> ## 1 662. 9.54 668 13.7 The functions in {srvyr} also play nicely with other tidyverse functions. If we wanted to select columns that have something in common, we use {tidyselect} functions such as starts_with(), num_range(), etc. In the examples below, a combination of across() and starts_with() is used to calculate the mean of variables starting with “Sepal” in the iris data frame and then those starting with api in the dstrata survey object. iris %>% summarize(across(starts_with("Sepal"), mean)) ## Sepal.Length Sepal.Width ## 1 5.843 3.057 dstrata %>% summarize(across(starts_with("api"), survey_mean)) ## # A tibble: 1 × 6 ## api00 api00_se api99 api99_se api.stu api.stu_se ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 662. 9.54 629. 10.1 498. 16.4 We can use {dplyr} verbs such as mutate(), filter(), etc., on our survey object.
dstrata_mod <- dstrata %>% mutate(api_diff = api00 - api99) %>% filter(stype == "E") %>% select(stype, api99, api00, api_diff, api_students = api.stu) dstrata_mod ## Stratified Independent Sampling design (with replacement) ## Called via srvyr ## Sampling variables: ## - ids: `1` ## - strata: stype ## - weights: pw ## Data variables: ## - stype (fct), api99 (int), api00 (int), api_diff (int), api_students ## (int) dstrata ## Stratified Independent Sampling design (with replacement) ## Called via srvyr ## Sampling variables: ## - ids: `1` ## - strata: stype ## - weights: pw ## Data variables: ## - cds (chr), stype (fct), name (chr), sname (chr), snum (dbl), dname ## (chr), dnum (int), cname (chr), cnum (int), flag (int), pcttest ## (int), api00 (int), api99 (int), target (int), growth (int), ## sch.wide (fct), comp.imp (fct), both (fct), awards (fct), meals ## (int), ell (int), yr.rnd (fct), mobility (int), acs.k3 (int), ## acs.46 (int), acs.core (int), pct.resp (int), not.hsg (int), hsg ## (int), some.col (int), col.grad (int), grad.sch (int), avg.ed ## (dbl), full (int), emer (int), enroll (int), api.stu (int), pw ## (dbl), fpc (dbl) Instead of data frames or tibbles, {srvyr} functions are meant for tbl_svy objects. Attempting to run survey analysis functions on non-tbl_svy objects will result in an error, as shown in the example below when using the mtcars data frame (which is not a tbl_svy object). mtcars %>% summarize(mpg_mean = survey_mean(mpg)) ## Error in `summarize()`: ## ℹ In argument: `mpg_mean = survey_mean(mpg)`. ## Caused by error in `cur_svy()` at gergness-srvyr-1917f75/R/survey_statistics.r:114:3: ## ! Survey context not set A few functions in {srvyr} parallel functions in {dplyr}, such as srvyr::summarize() and srvyr::group_by(). Unlike {srvyr}-specific verbs, the package recognizes these parallel functions on a non-survey object. It will not error and will instead give the equivalent output from {dplyr}: mtcars %>% srvyr::summarize(mpg_mean = mean(mpg)) ## mpg_mean ## 1 20.09 Because this book focuses on survey analysis, most of our pipes will stem from a survey object. We will not include the namespace for each function (e.g., srvyr::summarize()). Several functions in {srvyr} must be called within srvyr::summarize(), with the exception of srvyr::survey_count() and srvyr::survey_tally(), much like dplyr::count() and dplyr::tally() are not called within dplyr::summarize(). These verbs can be used in conjunction with group_by() or by/.by, applying the functions on a group-by-group basis to create grouped summaries. mtcars %>% group_by(cyl) %>% dplyr::summarize(mpg_mean = mean(mpg)) ## # A tibble: 3 × 2 ## cyl mpg_mean ## <dbl> <dbl> ## 1 4 26.7 ## 2 6 19.7 ## 3 8 15.1 We use a similar setup to summarize data in {srvyr}. dstrata %>% group_by(stype) %>% summarize(api00_mean = survey_mean(api00), api00_median = survey_median(api00)) ## # A tibble: 3 × 5 ## stype api00_mean api00_mean_se api00_median api00_median_se ## <fct> <dbl> <dbl> <dbl> <dbl> ## 1 E 674. 12.5 671 20.7 ## 2 H 626. 15.5 635 21.6 ## 3 M 637. 16.6 648 24.1 5.3 Counts and Cross-Tabulations With survey_count() and survey_tally(), we can calculate the estimated population counts for a given variable or combination of variables. Sometimes, these are referred to as cross-tabulations or crosstabs, for short. These summaries should be applied to categorical data and are used to get estimated counts of the population size of groups from the survey.
5.3.1 Syntax The syntax for survey_count() is very similar to the dplyr::count() syntax; however, as noted above, it can only be called on tbl_svy objects. Let’s explore the syntax: survey_count( x, ..., wt = NULL, sort = FALSE, name = "n", .drop = dplyr::group_by_drop_default(x), vartype = c("se", "ci", "var", "cv") ) The arguments are: x: a tbl_svy object created by as_survey ...: variables to group by, passed to group_by wt: a variable to weight on in addition to the survey weights, defaults to NULL sort: how to sort the variables, defaults to FALSE name: the name of the count variable, defaults to n .drop: whether to drop empty groups vartype: type(s) of variation estimate to calculate including any of c("se", "ci", "var", "cv"), defaults to se (standard error) (see 5.3.1 for more information) To capture counts or crosstabs by different variables, we include them in the (...) argument. This argument can take any number of variables and will break down the counts by all combinations of the provided variables. This is the same as with dplyr::count(). We can also obtain an estimate of the overall population size by not including any variables in the (...) argument or by using the survey_tally() function. The survey_tally() function has a similar syntax to the survey_count() function, but it does not include the (...) or the .drop arguments: survey_tally( x, wt, sort = FALSE, name = "n", vartype = c("se", "ci", "var", "cv") ) Both functions include the vartype argument with four different values: se: standard error The estimated standard deviation of the estimate Output has a column with the variable name specified in the name argument with a suffix of “_se” ci: confidence interval The lower and upper limits of a confidence interval Output has a column with the variable name specified in the name argument with a suffix of “_low” and “_upp” By default, this is a 95% confidence interval, but it can be changed by using the argument level and specifying a number between 0 and 1. For example, level=0.8 would produce an 80% confidence interval. var: variance The estimated variance of the estimate Output has a column with the variable name specified in the name argument with a suffix of “_var” cv: coefficient of variation A ratio of the standard error and the estimate Output has a column with the variable name specified in the name argument with a suffix of “_cv” The confidence intervals are always calculated using a symmetric t-distribution based confidence interval as follows: \[ \text{estimate} \pm t^*_{df} \times SE \] where \(t^*_{df}\) is the critical value from a t-distribution based on the confidence level and the degrees of freedom. By default, the degrees of freedom are calculated based on the design or number of replicates, but they can be specified using the argument df. For survey design objects, the degrees of freedom are calculated as the number of PSUs minus the number of strata. For replicate-based objects, the degrees of freedom are calculated as one less than the rank of the matrix of replicate weights, where the number of replicates is typically the rank. Note that specifying df = Inf is equivalent to using a normal (z-based) confidence interval. These variability types are the same for most of the survey functions, and we will provide examples using different types of variability throughout this chapter. 5.3.2 Examples Example 1: Estimated Population Count If we wanted to obtain the estimated number of households in the U.S.
(the target population) using the Residential Energy Consumption Survey (RECS) data, we could use survey_count(). If we do not specify any variables in the survey_count() function, it will output the estimated population count (n) and standard error (n_se). recs_des %>% survey_count() ## # A tibble: 1 × 2 ## n n_se ## <dbl> <dbl> ## 1 123529025. 0.148 Thus, the estimated number of households in the U.S. is 123,529,025. We could also use the survey_tally() function, and the example below yields the same results as using survey_count() previously. recs_des %>% survey_tally() ## # A tibble: 1 × 2 ## n n_se ## <dbl> <dbl> ## 1 123529025. 0.148 Example 2: Estimated Counts by Subgroups (Crosstabs) To calculate the estimated number of observations for subgroups, such as Region and Division, we can add the variables of interest into the survey_count() function. In the example below, the estimated number of housing units by region and division is calculated. Additionally, one of the arguments allows us to change the name of the count variable from the default (n) using name =. In this case, we are changing the name to "N". recs_des %>% survey_count(Region, Division, name = "N") ## # A tibble: 10 × 4 ## Region Division N N_se ## <fct> <fct> <dbl> <dbl> ## 1 Northeast New England 5876166 0.0000000137 ## 2 Northeast Middle Atlantic 16043503 0.0000000487 ## 3 Midwest East North Central 18546912 0.000000437 ## 4 Midwest West North Central 8495815 0.0000000177 ## 5 South South Atlantic 24843261 0.0000000418 ## 6 South East South Central 7380717. 0.114 ## 7 South West South Central 14619094 0.000488 ## 8 West Mountain North 4615844 0.119 ## 9 West Mountain South 4602070 0.0000000492 ## 10 West Pacific 18505643. 0.00000295 When we run the crosstab, we see there are an estimated 5,876,166 housing units in the New England Division. If we wanted to use survey_tally() to output the same results, we would get an error if we try to use the same syntax as survey_count(): recs_des %>% survey_tally(Region, Division, name = "N") ## Error in `dplyr::summarise()` at gergness-srvyr-1917f75/R/summarise.r:10:3: ## ℹ In argument: `N = survey_total(Region, vartype = vartype, ## na.rm = TRUE)`. ## Caused by error: ## ! Factor not allowed in survey functions, should be used as a grouping variable. Instead, use the group_by() function prior to using survey_tally() to obtain this crosstab: recs_des %>% group_by(Region, Division) %>% survey_tally(name = "N") ## # A tibble: 10 × 4 ## # Groups: Region [4] ## Region Division N N_se ## <fct> <fct> <dbl> <dbl> ## 1 Northeast New England 5876166 0.0000000137 ## 2 Northeast Middle Atlantic 16043503 0.0000000487 ## 3 Midwest East North Central 18546912 0.000000437 ## 4 Midwest West North Central 8495815 0.0000000177 ## 5 South South Atlantic 24843261 0.0000000418 ## 6 South East South Central 7380717. 0.114 ## 7 South West South Central 14619094 0.000488 ## 8 West Mountain North 4615844 0.119 ## 9 West Mountain South 4602070 0.0000000492 ## 10 West Pacific 18505643. 0.00000295 5.4 Totals and Sums The survey_total() function is analogous to sum(). It can be used to find the estimated aggregate sum of an outcome and should be applied to continuous variables to obtain the estimated total quantity in a population. All the functions introduced from this point on in this chapter must be called from within summarize().
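Before moving on, here is a minimal sketch (our own check, assuming the recs_des object from the Prerequisites box) that reconstructs the t-based confidence interval formula from Section 5.3.1 by hand; the by-hand bounds should match the _low and _upp columns produced by vartype = "ci":

# Estimate with standard error and confidence interval
est <- recs_des %>%
  summarize(elec_bill = survey_mean(DOLLAREL, vartype = c("se", "ci")))

# Degrees of freedom from the design (survey::degf())
df <- survey::degf(recs_des)

# Reconstruct the 95% interval: estimate plus or minus t* times SE
est %>%
  mutate(low_by_hand = elec_bill - qt(0.975, df) * elec_bill_se,
         upp_by_hand = elec_bill + qt(0.975, df) * elec_bill_se)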
5.4.1 Syntax Here is the syntax: survey_total( x, na.rm = FALSE, vartype = c("se", "ci", "var", "cv"), level = 0.95, deff = FALSE, df = NULL ) The arguments are: x: a variable, expression, or empty na.rm: an indicator of whether missing values should be dropped, defaults to FALSE vartype: type(s) of variation estimate to calculate including any of c("se", "ci", "var", "cv"), defaults to se (standard error) (see 5.3.1 for more information) level: a number or a vector indicating the confidence level, defaults to 0.95 deff: a logical value stating whether the design effect should be returned, defaults to FALSE (this is described in more detail in Section 5.10.3) df: (for vartype = 'ci'), a numeric value indicating degrees of freedom for the t-distribution 5.4.2 Examples Example 1: Estimated Population Count To calculate a population count estimate with survey_total(), the argument x can be left empty, as shown in the example below: recs_des %>% summarize(Tot = survey_total()) ## # A tibble: 1 × 2 ## Tot Tot_se ## <dbl> <dbl> ## 1 123529025. 0.148 Note that the result from recs_des %>% summarize(survey_total()) is equivalent to the results from the survey_count() and survey_tally() functions. However, the survey_total() function is called within summarize(), whereas survey_count() and survey_tally() are not. Example 2: Overall Summation of Continuous Variables The difference between survey_total() and survey_count() is more evident when specifying continuous variables to sum. Let’s compute the total cost of electricity in whole dollars from the variable DOLLAREL. recs_des %>% summarize(elec_bill = survey_total(DOLLAREL)) ## # A tibble: 1 × 2 ## elec_bill elec_bill_se ## <dbl> <dbl> ## 1 170473527909. 664893504. It is estimated that American residential households spent a total of $170,473,527,909 on electricity in 2020, and the estimate has a standard error of $664,893,504. Example 3: Summation by Groups As we are using the {srvyr} package, we can use group_by() to calculate the cost of electricity by different groups. Let’s see how much the cost of electricity in whole dollars differed between regions and output the confidence interval instead of the default standard error. recs_des %>% group_by(Region) %>% summarize(elec_bill = survey_total(DOLLAREL, vartype = "ci")) ## # A tibble: 4 × 4 ## Region elec_bill elec_bill_low elec_bill_upp ## <fct> <dbl> <dbl> <dbl> ## 1 Northeast 29430369947. 28788987554. 30071752341. ## 2 Midwest 34972544751. 34339576041. 35605513460. ## 3 South 72496840204. 71534780902. 73458899506. ## 4 West 33573773008. 32909111702. 34238434313. The survey results estimate that households in the Northeast spent $29,430,369,947 on electricity in 2020, with a confidence interval of ($28,788,987,554, $30,071,752,341), while households in the South spent an estimated $72,496,840,204, with a confidence interval of ($71,534,780,902, $73,458,899,506). 5.5 Means and Proportions Means and proportions are the backbone of most research. The estimates calculated are often the first things we look for when reviewing research on a given topic. The survey_mean() and survey_prop() functions calculate means and proportions while incorporating the survey design elements. The survey_mean() function should be used on continuous variables of survey data, while the survey_prop() function should be used on categorical variables. These topics are grouped together because a proportion is simply a mean of a logical (Boolean) variable.
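Because ACUsed in the RECS data is a logical variable, a minimal sketch of this equivalence (again assuming the recs_des object from the Prerequisites box) is to take its mean directly; the result is the estimated proportion of housing units with air conditioning:

# The mean of a logical variable is the proportion of TRUE values
recs_des %>%
  summarize(p_ac = survey_mean(ACUsed))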
5.5.1 Syntax The syntax for both means and proportions is very similar: survey_mean( x, na.rm = FALSE, vartype = c("se", "ci", "var", "cv"), level = 0.95, proportion = FALSE, prop_method = c("logit", "likelihood", "asin", "beta", "mean"), deff = FALSE, df = NULL ) survey_prop( na.rm = FALSE, vartype = c("se", "ci", "var", "cv"), level = 0.95, proportion = TRUE, prop_method = c("logit", "likelihood", "asin", "beta", "mean", "xlogit"), deff = FALSE, df = NULL ) Both functions have the following arguments and defaults: na.rm: an indicator of whether missing values should be dropped, defaults to FALSE vartype: type(s) of variation estimate to calculate including any of c("se", "ci", "var", "cv"), defaults to se (standard error) (see 5.3.1 for more information) level: a number or a vector indicating the confidence level, defaults to 0.95 prop_method: the method used to calculate the confidence interval for proportions deff: a logical value stating whether the design effect should be returned, defaults to FALSE (this is described in more detail in Section 5.10.3) df: (for vartype = 'ci'), a numeric value indicating degrees of freedom for the t-distribution There are two main differences in the syntax. The survey_mean() function includes the first argument of x, while survey_prop() does not. The x argument includes the variable or expression on which the mean should be calculated. There is no argument to include variables in the survey_prop() function. Instead, prior to summarize(), we need to use the group_by() function to specify the variables of interest. For survey_mean(), including a group_by() function will allow us to obtain the means by the different groups. The other main difference is with the proportion argument. In the survey_mean() function, this defaults to FALSE, while in the survey_prop() function, this defaults to TRUE. This is because the survey_mean() function can be used to calculate both means and proportions. If we wish to calculate a proportion using this function, we will need to set the proportion argument to TRUE. In section 5.3.1, we provide an overview of the different variability types. The interval used in confidence intervals for most measures, such as means and counts, is referred to as a Wald-type interval. While a Wald-type interval using a symmetric t-based confidence interval is an option for proportions, this generally does not have the correct coverage rate when sample sizes are small and/or the proportion is “near” 0 or 1. Thus, other methods have been developed to calculate confidence intervals and can be specified using the prop_method option in survey_prop(). The options include: logit: fits a logistic regression model and computes a Wald-type interval on the log-odds scale, which is then transformed to the probability scale. This is the default method. likelihood: uses the (Rao-Scott) scaled chi-squared distribution for the log-likelihood from a binomial distribution. asin: uses the variance-stabilizing transformation for the binomial distribution, the arcsine square root, and then back-transforms the interval to the probability scale. beta: uses the incomplete beta function with an effective sample size based on the estimated variance of the proportion. mean: the Wald-type interval. xlogit: uses a logit transformation of the proportion, calculates a Wald-type interval, and then back-transforms to the probability scale. This method is implemented in SUDAAN and SPSS.
Each option will provide slightly different confidence interval bounds when dealing with proportions. Please note that when working with survey_mean(), this method does not need to be specified unless the proportion argument is TRUE. 5.5.2 Examples Example 1: One Variable Proportion If we are interested in obtaining the proportion of people in each region in the RECS data, we can use group_by() and survey_prop() to obtain this. recs_des %>% group_by(Region) %>% summarize(p = survey_prop()) ## When `proportion` is unspecified, `survey_prop()` now defaults to `proportion = TRUE`. ## ℹ This should improve confidence interval coverage. ## This message is displayed once per session. ## # A tibble: 4 × 3 ## Region p p_se ## <fct> <dbl> <dbl> ## 1 Northeast 0.177 2.12e-10 ## 2 Midwest 0.219 2.62e-10 ## 3 South 0.379 7.40e-10 ## 4 West 0.224 8.16e-10 17.7% of the households are in the Northeast, 21.9% in the Midwest, and so on. Note: survey_prop() is essentially the same as using survey_mean() with a categorical variable and without specifying a numeric variable in the x argument. The following code will give us the same results as above: recs_des %>% group_by(Region) %>% summarize(p = survey_mean()) ## # A tibble: 4 × 3 ## Region p p_se ## <fct> <dbl> <dbl> ## 1 Northeast 0.177 2.12e-10 ## 2 Midwest 0.219 2.62e-10 ## 3 South 0.379 7.40e-10 ## 4 West 0.224 8.16e-10 Example 2: Conditional Proportions It is possible to obtain proportions by more than one variable. In the following example, we look at the proportion of housing units by Region and whether air conditioning is used (ACUsed). recs_des %>% group_by(Region, ACUsed) %>% summarize(p = survey_prop()) ## # A tibble: 8 × 4 ## # Groups: Region [4] ## Region ACUsed p p_se ## <fct> <lgl> <dbl> <dbl> ## 1 Northeast FALSE 0.110 0.00590 ## 2 Northeast TRUE 0.890 0.00590 ## 3 Midwest FALSE 0.0666 0.00508 ## 4 Midwest TRUE 0.933 0.00508 ## 5 South FALSE 0.0581 0.00278 ## 6 South TRUE 0.942 0.00278 ## 7 West FALSE 0.255 0.00759 ## 8 West TRUE 0.745 0.00759 When specifying multiple variables, the proportions are conditional. In the results above, notice that the proportions sum to 1 within each region. This can be interpreted as the proportion of housing units with air conditioning within each region. Example 3: Joint Proportions If we want the joint proportion instead, the interact() function is necessary. In the example below, the interact() function is used on Region and ACUsed: recs_des %>% group_by(interact(Region, ACUsed)) %>% summarize(p = survey_prop()) ## # A tibble: 8 × 4 ## Region ACUsed p p_se ## <fct> <lgl> <dbl> <dbl> ## 1 Northeast FALSE 0.0196 0.00105 ## 2 Northeast TRUE 0.158 0.00105 ## 3 Midwest FALSE 0.0146 0.00111 ## 4 Midwest TRUE 0.204 0.00111 ## 5 South FALSE 0.0220 0.00106 ## 6 South TRUE 0.357 0.00106 ## 7 West FALSE 0.0573 0.00170 ## 8 West TRUE 0.167 0.00170 As noted earlier, both the survey_prop() and survey_mean() functions can be used here and will provide the same results. Example 4: Overall Mean We can calculate the estimated average cost of electricity in the U.S. and include both the standard error and the confidence interval: recs_des %>% summarize(elec_bill = survey_mean(DOLLAREL, vartype = c("se", "ci"))) ## # A tibble: 1 × 4 ## elec_bill elec_bill_se elec_bill_low elec_bill_upp ## <dbl> <dbl> <dbl> <dbl> ## 1 1380. 5.38 1369. 1391. Nationally, the average household spent $1,380 in 2020. Example 5: Means by Subgroup We can also calculate the estimated average cost of electricity in the U.S. by each region.
To do this, we include a group_by() function with the variable of interest before the summarize() function: recs_des %>% group_by(Region) %>% summarize(elec_bill = survey_mean(DOLLAREL)) ## # A tibble: 4 × 3 ## Region elec_bill elec_bill_se ## <fct> <dbl> <dbl> ## 1 Northeast 1343. 14.6 ## 2 Midwest 1293. 11.7 ## 3 South 1548. 10.3 ## 4 West 1211. 12.0 Households in the West spent an average of $1,211 on electricity, while households in the South spent an average of $1,548. 5.6 Quantiles and Medians To better understand the distribution of a continuous variable, quantiles can be calculated at specific points to help gain insight. For example, we might want estimates of the quartiles (25%, 50%, 75%) of income in a population to understand how the income is distributed. We use the survey_quantile() function to calculate quantiles in survey data. Medians are often used to find the midpoint of a continuous distribution when the data is considered to be skewed, as medians are less subject to outliers than means. The median in the data is the same as the 50th percentile. In other words, it is the value where 50% of the data is higher than it and 50% is lower. Medians are a special case of quantiles that are used more often; thus, a dedicated function (survey_median()) has been created for them. We can calculate the median of the data using both the survey_median() function and the survey_quantile() function with the 50% quantile provided as an argument. 5.6.1 Syntax The syntax for survey_quantile() and survey_median() is nearly identical: survey_quantile( x, quantiles, na.rm = FALSE, vartype = c("se", "ci", "var", "cv"), level = 0.95, interval_type = c("mean", "beta", "xlogit", "asin", "score", "quantile"), qrule = c("math", "school", "shahvaish", "hf1", "hf2", "hf3", "hf4", "hf5", "hf6", "hf7", "hf8", "hf9"), df = NULL ) survey_median( x, na.rm = FALSE, vartype = c("se", "ci", "var", "cv"), level = 0.95, interval_type = c("mean", "beta", "xlogit", "asin", "score", "quantile"), qrule = c("math", "school", "shahvaish", "hf1", "hf2", "hf3", "hf4", "hf5", "hf6", "hf7", "hf8", "hf9"), df = NULL ) The arguments that are in both functions are: x: a variable, expression, or empty na.rm: an indicator of whether missing values should be dropped, defaults to FALSE vartype: type(s) of variation estimate to calculate, defaults to se (standard error) level: a number or a vector indicating the confidence level, defaults to 0.95 interval_type: method for calculating a confidence interval qrule: rule for defining quantiles. The default is the lower end of the quantile interval (“math”). The midpoint of the quantile interval is the “school” rule. “hf1” to “hf9” are weighted analogs to type=1 to 9 in quantile(). “shahvaish” corresponds to a rule proposed by Shah and Vaish (2006). See vignette("qrule", package="survey") for more information. df: (for vartype = 'ci'), a numeric value indicating degrees of freedom for the t-distribution The only difference between survey_quantile() and survey_median() is the inclusion of the quantiles argument in the survey_quantile() function. This argument takes a vector with values between 0 and 1 to indicate which quantiles to calculate. For example, if we wanted the quartiles of a variable, we would provide quantiles = c(0.25, 0.5, 0.75). While we can specify quantiles of 0 and 1, which represent the minimum and maximum, this is not recommended.
It only returns the minimum and maximum of the respondents and cannot be extrapolated to the population, as there is no valid definition of the standard error. In section 5.3.1, we provide an overview of the different variability types. The interval used in confidence intervals for most measures, such as means and counts, is referred to as a Wald-type interval. However, like confidence intervals for proportions, this is not always the most accurate interval for quantiles. With quantiles, many of the interval type methods are the same as those for proportions (asin, beta, mean, and xlogit; see section 5.5.1), with the addition of two more methods: score: the Francisco & Fuller confidence interval based on inverting a score test (only available for design-based survey objects and not replicate-based objects) quantile: based on the replicates of the quantile. This is not valid for jackknife-type replicates but is available for bootstrap and BRR replicates. One thing of note with the score method is that when there are many ties in the data, this method can produce confidence intervals that do not contain the estimate. When dealing with a high propensity for ties (e.g., many respondents will have the same age), it is recommended to use another method. The score method is the one implemented in SUDAAN; however, SUDAAN adds noise to the values to prevent the issue with the ties, while the documentation in the {survey} package indicates this method generally has lower performance than the beta and logit intervals. 5.6.2 Examples Example 1: Overall Quartiles Quantiles are useful in learning about the distribution of a variable. Let’s look into the quartiles, specifically, the first quartile (p=0.25), the median (p=0.5), and the third quartile (p=0.75) of electric bills. recs_des %>% summarize(elec_bill = survey_quantile(DOLLAREL, quantiles = c(0.25, 0.5, 0.75))) ## # A tibble: 1 × 6 ## elec_bill_q25 elec_bill_q50 elec_bill_q75 elec_bill_q25_se ## <dbl> <dbl> <dbl> <dbl> ## 1 795. 1215. 1770. 5.69 ## # ℹ 2 more variables: elec_bill_q50_se <dbl>, elec_bill_q75_se <dbl> The output above shows the three quartiles and their respective standard errors. Example 2: Quartiles by Subgroup We can also estimate the quantiles of electric bills by region by incorporating the group_by() function: recs_des %>% group_by(Region) %>% summarize(elec_bill = survey_quantile(DOLLAREL, quantiles = c(0.25, 0.5, 0.75))) ## # A tibble: 4 × 7 ## Region elec_bill_q25 elec_bill_q50 elec_bill_q75 elec_bill_q25_se ## <fct> <dbl> <dbl> <dbl> <dbl> ## 1 Northeast 740. 1148. 1712. 13.7 ## 2 Midwest 769. 1149. 1632. 8.88 ## 3 South 968. 1402. 1945. 10.6 ## 4 West 623. 1028. 1568. 10.8 ## # ℹ 2 more variables: elec_bill_q50_se <dbl>, elec_bill_q75_se <dbl> Example 3: Minimum and Maximum As mentioned in the syntax section, we can specify quantiles of 0 (minimum) and 1 (maximum). R will calculate these two values and provide results. However, these are only the minimum and maximum values in the data. There is not sufficient information to determine what the standard errors should be: recs_des %>% summarize(elec_bill = survey_quantile(DOLLAREL, quantiles = c(0, 1))) ## # A tibble: 1 × 4 ## elec_bill_q00 elec_bill_q100 elec_bill_q00_se elec_bill_q100_se ## <dbl> <dbl> <dbl> <dbl> ## 1 -151. 15680. NaN 0 Example 4: Overall Median We can calculate the estimated median cost of electricity in the U.S.
using the survey_median() function: recs_des %>% summarize(elec_bill = survey_median(DOLLAREL)) ## # A tibble: 1 × 2 ## elec_bill elec_bill_se ## <dbl> <dbl> ## 1 1215. 6.33 Nationally, the median household spent $1,215 in 2020. This is the same result as we obtained using the survey_quantile() function ($1,215). It is also interesting to note that the average electric bill for households that we calculated in section 5.5 is $1,380, but the estimated median electric bill is $1,215, indicating the distribution is likely right-skewed. Example 5: Medians by Subgroup We can also calculate the estimated median cost of electricity in the U.S. by each region. This is similar to finding the mean by region in that we include a group_by() function with the variable of interest before the summarize() function: recs_des %>% group_by(Region) %>% summarize(elec_bill = survey_median(DOLLAREL)) ## # A tibble: 4 × 3 ## Region elec_bill elec_bill_se ## <fct> <dbl> <dbl> ## 1 Northeast 1148. 16.6 ## 2 Midwest 1149. 11.6 ## 3 South 1402. 9.17 ## 4 West 1028. 14.3 Households in the West spent a median of $1,028 on electricity, while households in the South spent a median of $1,402. 5.7 Ratios Many analysts are not familiar with the ratio estimate. The ratio is the ratio of the sums of two variables, specifically of the form: \[ \frac{\sum x_i}{\sum y_i}. \] The ratio is not the same as calculating the following: \[ \frac{1}{N} \sum \frac{x_i}{y_i} \] which could be calculated with survey_mean() by creating a derived variable \(z = x/y\) and then calculating the mean of \(z\). Consider a survey of police agencies in the United States. We might want to estimate the ratio of female police officers to total police officers. We could run survey_ratio(Female_Officers, Total_Officers). If, instead, we used survey_mean(Female_Officers/Total_Officers), we would be estimating the average percentage of female officers across agencies, which is a different quantity. 5.7.1 Syntax The syntax for survey_ratio() is as follows: survey_ratio( numerator, denominator, na.rm = FALSE, vartype = c("se", "ci", "var", "cv"), level = 0.95, deff = FALSE, df = NULL ) The arguments are: numerator: The numerator of the ratio denominator: The denominator of the ratio na.rm: A logical value to indicate whether missing values should be dropped vartype: type(s) of variation estimate to calculate including any of c("se", "ci", "var", "cv"), defaults to se (standard error) (see 5.3.1 for more information) level: A single number or vector of numbers indicating the confidence level deff: A logical value to indicate whether the design effect should be returned (this is described in more detail in Section 5.10.3) df: (For vartype = "ci" only) A numeric value indicating the degrees of freedom for the t-distribution 5.7.2 Examples Example 1: Overall Ratios Suppose we wanted to find the ratio of dollars spent on liquid propane per unit (in British thermal units [Btu]) nationally. If we wanted to find the average cost to a household, we could use survey_mean(), but to find the national unit rate, we can use survey_ratio().
In the following example, we will show both methods and discuss the interpretation of each: recs_des %>% summarize(DOLLARLP_Tot = survey_total(DOLLARLP, vartype = NULL), BTULP_Tot = survey_total(BTULP, vartype = NULL), DOL_BTU_Rat = survey_ratio(DOLLARLP, BTULP), DOL_BTU_Avg = survey_mean(DOLLARLP/BTULP, na.rm = TRUE)) ## # A tibble: 1 × 6 ## DOLLARLP_Tot BTULP_Tot DOL_BTU_Rat DOL_BTU_Rat_se DOL_BTU_Avg ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 8122911173. 391425311586. 0.0208 0.000240 0.0240 ## # ℹ 1 more variable: DOL_BTU_Avg_se <dbl> The ratio of the total spent on liquid propane to the total consumption was 0.0208, but the average rate was 0.0240. With a bit of calculation, we can show that the ratio is the ratio of the totals: DOLLARLP_Tot/BTULP_Tot = 8,122,911,173/391,425,311,586 = 0.0208. While the ratio could be calculated manually in this manner, the standard error requires the use of the survey_ratio() function. The average can be interpreted as the average rate paid by a household. Example 2: Ratios by Subgroup As previously done with other estimates, we can use group_by() to examine whether this rate varies by region. recs_des %>% group_by(Region) %>% summarize(DOL_BTU_Rat = survey_ratio(DOLLARLP, BTULP)) ## # A tibble: 4 × 3 ## Region DOL_BTU_Rat DOL_BTU_Rat_se ## <fct> <dbl> <dbl> ## 1 Northeast 0.0247 0.000488 ## 2 Midwest 0.0158 0.000240 ## 3 South 0.0245 0.000388 ## 4 West 0.0246 0.000875 Though not a statistical test, it does appear the cost rates for liquid propane in the Midwest are the lowest. 5.8 Correlations The correlation is a measure of the linear relationship between two continuous variables, which ranges between -1 and 1. The most common one used is Pearson’s correlation (referred to as correlation henceforth). A sample correlation for a simple random sample is calculated as follows: \[ \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum (x_i-\bar{x})^2} \sqrt{\sum(y_i-\bar{y})^2}} \] When using survey_corr() for designs other than a simple random sample, the weights are applied when estimating the correlation. 5.8.1 Syntax The syntax for survey_corr() is as follows: survey_corr( x, y, na.rm = FALSE, vartype = c("se", "ci", "var", "cv"), level = 0.95, df = NULL ) The arguments are: x: A variable or expression y: A variable or expression na.rm: A logical value to indicate whether missing values should be dropped vartype: type(s) of variation estimate to calculate including any of c("se", "ci", "var", "cv"), defaults to se (standard error) (see 5.3.1 for more information) level: (For vartype = "ci" only) A single number or vector of numbers indicating the confidence level df: (For vartype = "ci" only) A numeric value indicating the degrees of freedom for the t-distribution 5.8.2 Examples Example 1: Overall Correlation We can calculate the correlation between total square footage (TOTSQFT_EN) and electricity consumption (BTUEL). recs_des %>% summarize(SQFT_Elec_Corr = survey_corr(TOTSQFT_EN, BTUEL)) ## # A tibble: 1 × 2 ## SQFT_Elec_Corr SQFT_Elec_Corr_se ## <dbl> <dbl> ## 1 0.417 0.00689 Example 2: Correlations by Subgroup Like with other statistics, we can do this by subgroups. For example, we can examine the correlation between square footage and electricity expenditure (DOLLAREL) by whether air conditioning is used (ACUsed).
recs_des %>% group_by(ACUsed) %>% summarize(SQFT_Elec_Corr = survey_corr(TOTSQFT_EN, DOLLAREL)) ## # A tibble: 2 × 3 ## ACUsed SQFT_Elec_Corr SQFT_Elec_Corr_se ## <lgl> <dbl> <dbl> ## 1 FALSE 0.290 0.0240 ## 2 TRUE 0.401 0.00808 5.9 Standard Deviation and Variance All survey functions produce an estimate of the variability of a given estimate, so no additional function is needed to obtain standard errors. However, if estimates of the population variance and population standard deviation are needed, we can use the survey_var() and survey_sd() functions. In our experience, most researchers will not use these functions. These are sometimes used when designing a future study, as understanding the variability in the population can help inform the precision of a future sampling design. 5.9.1 Syntax As with non-survey data, the standard deviation estimate is the square root of the variance estimate, and thus, the functions have the same arguments, except the standard deviation does not allow the usage of vartype. survey_var( x, na.rm = FALSE, vartype = c("se", "ci", "var"), level = 0.95, df = NULL ) survey_sd( x, na.rm = FALSE ) The arguments are: x: A variable or expression, or empty na.rm: A logical value to indicate whether missing values should be dropped vartype: type(s) of variation estimate to calculate including any of c("se", "ci", "var"), defaults to se (standard error) (see 5.3.1 for more information) level: (For vartype = "ci" only) A single number or vector of numbers indicating the confidence level. df: (For vartype = "ci" only) A numeric value indicating the degrees of freedom for the t-distribution 5.9.2 Examples Example 1: Overall Variability Returning to electricity bills, we look at the variability in electricity expenditure. recs_des %>% summarize(var_elbill = survey_var(DOLLAREL), sd_elbill = survey_sd(DOLLAREL)) ## Warning: There were 2 warnings in `dplyr::summarise()`. ## The first warning was: ## ℹ In argument: `var_elbill = survey_var(DOLLAREL)`. ## Caused by warning in `thetas - meantheta`: ## ! Recycling array of length 1 in vector-array arithmetic is deprecated. ## Use c() or as.vector() instead. ## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning. ## # A tibble: 1 × 3 ## var_elbill var_elbill_se sd_elbill ## <dbl> <dbl> <dbl> ## 1 704906. 13926. 840. A warning message may be displayed if using a replicate design. The results are still valid. The results above give an estimate of the population variance of electricity bills (var_elbill), the standard error of that variance (var_elbill_se), and the estimated population standard deviation of electricity bills (sd_elbill). Note that no standard error is associated with the standard deviation - this is the only estimate that does not include a standard error. Example 2: Variability by Subgroup Like other estimates, we can calculate the variance by region. This would be useful to learn if the variability is similar across regions: recs_des %>% group_by(Region) %>% summarize(var_elbill = survey_var(DOLLAREL), sd_elbill = survey_sd(DOLLAREL)) ## Warning: There were 8 warnings in `dplyr::summarise()`. ## The first warning was: ## ℹ In argument: `var_elbill = survey_var(DOLLAREL)`. ## ℹ In group 1: `Region = Northeast`. ## Caused by warning in `thetas - meantheta`: ## ! Recycling array of length 1 in vector-array arithmetic is deprecated. ## Use c() or as.vector() instead. ## ℹ Run `dplyr::last_dplyr_warnings()` to see the 7 remaining warnings.
## # A tibble: 4 × 4 ## Region var_elbill var_elbill_se sd_elbill ## <fct> <dbl> <dbl> <dbl> ## 1 Northeast 775450. 38843. 881. ## 2 Midwest 552423. 25252. 743. ## 3 South 702521. 30641. 838. ## 4 West 717886. 30597. 847. 5.10 Additional Topics 5.10.1 Unweighted Analysis Sometimes, it is helpful to calculate an unweighted estimate of a given variable. For this, we use the unweighted() function in the summarize() function. The unweighted() function calculates unweighted summaries from a tbl_svy object, which reflects the summary among the respondents and does not extrapolate to a population estimate. The unweighted() function can be used in conjunction with any {dplyr} functions. Here is an example looking at the average household electricity cost. recs_des %>% summarize(elec_bill = survey_mean(DOLLAREL), elec_unweight = unweighted(mean(DOLLAREL))) ## # A tibble: 1 × 3 ## elec_bill elec_bill_se elec_unweight ## <dbl> <dbl> <dbl> ## 1 1380. 5.38 1425. It is estimated that American residential households spent an average of $1,380 on electricity in 2020, and the estimate has a standard error of $5. The unweighted() function calculates the unweighted average and illustrates the average amount of money the respondents spent on electricity in 2020, which was $1,425. 5.10.2 Subpopulation Analysis Briefly, we mentioned using filter() to subset a survey object for analysis. This operation should be done after creating the design object. In rare circumstances, subsetting data before creating the object can lead to incorrect variability estimates. This can occur if subsetting removes an entire PSU. Suppose we wanted estimates of the average amount spent on natural gas among housing units that use natural gas, using the variable BTUNG. This could be obtained by first filtering records to only include records where BTUNG > 0 and then finding the average amount of money spent. recs_des %>% filter(BTUNG > 0) %>% summarize(NG_mean = survey_mean(DOLLARNG, vartype = c("se", "ci"))) ## # A tibble: 1 × 4 ## NG_mean NG_mean_se NG_mean_low NG_mean_upp ## <dbl> <dbl> <dbl> <dbl> ## 1 631. 4.64 621. 640. Note that this yields a higher mean than when not applying the filter. When including housing units that do not use natural gas, many $0 amounts are included in the mean calculation. recs_des %>% summarize(NG_mean = survey_mean(DOLLARNG, vartype = c("se", "ci"))) ## # A tibble: 1 × 4 ## NG_mean NG_mean_se NG_mean_low NG_mean_upp ## <dbl> <dbl> <dbl> <dbl> ## 1 382. 3.41 375. 389. 5.10.3 Design Effects The design effect measures how the precision of an estimate is impacted by the sampling design. A design effect is calculated as the ratio of the variance of an estimate under the design at hand to the variance of the estimate under a simple random sample without replacement (SRS). A design effect less than 1 indicates that the design is more statistically efficient than an SRS design. This is rare but possible in a stratified sampling design where the outcome is correlated with the stratification variable(s). A design effect greater than 1 indicates that the design is less statistically efficient than an SRS design. From a design effect, we can calculate the effective sample size as follows: \[ n_{eff} = \frac{n}{D_{eff}} \] where \(n\) is the nominal sample size (number of survey responses) and \(D_{eff}\) is the estimated design effect.
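As a small numeric illustration with made-up values (not from the RECS data), a design effect of 1.25 on a nominal sample of 1,000 responses implies an effective sample size of 800:

n <- 1000    # hypothetical nominal sample size
deff <- 1.25 # hypothetical design effect
n_eff <- n / deff
n_eff
## [1] 800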
The effective sample size has a useful interpretation: a survey using an SRS design would need a sample size of \(n_{eff}\) to obtain the same precision as the design at hand, which is where the efficiency interpretation comes in. Design effects are outcome-specific; outcomes that are less clustered in the population have smaller design effects than outcomes that are clustered. In the {srvyr} package, design effects can be calculated for totals, proportions, means, and ratio estimates by setting the deff argument to TRUE in the corresponding functions. For example, the design effect can be calculated for the average consumption of electricity (BTUEL), natural gas (BTUNG), liquid propane (BTULP), fuel oil (BTUFO), and wood (BTUWOOD). recs_des %>% summarize(across(c(BTUEL, BTUNG, BTULP, BTUFO, BTUWOOD), ~survey_mean(.x, deff = TRUE, vartype = NULL))) %>% select(ends_with("deff")) ## # A tibble: 1 × 5 ## BTUEL_deff BTUNG_deff BTULP_deff BTUFO_deff BTUWOOD_deff ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.597 0.938 1.21 0.720 1.10 5.10.4 Creating Summary Rows When using group_by() in analysis, results are returned with a row for each group or group combination. Often, we want both the breakdowns by group and a summary row for the estimate for the entire population. For example, we may want the average electricity consumption by region AND nationally. The {srvyr} package has a function cascade(), which adds summary rows for the total of a group. It is used in place of summarize() and has similar functionality along with some additional features. Syntax The syntax is as follows: cascade( .data, ..., .fill = NA, .fill_level_top = FALSE, .groupings = NULL ) where the arguments are: .data: A tbl_svy object ...: Name-value pairs of summary functions (same as the summarize() function) .fill: Value to fill in for group summaries (defaults to NA) .fill_level_top: When filling factor variables, whether to put the value ‘.fill’ in the first position (defaults to FALSE, placing it at the bottom) .groupings: (Experimental) A list of quosures to manually specify the groupings to use, rather than the default. Example First, let’s look at a simple example and then build on it to examine the features of the function. In the first example, all default values are used. recs_des %>% group_by(Region) %>% cascade(DOLLAREL_mn = survey_mean(DOLLAREL)) ## # A tibble: 5 × 3 ## Region DOLLAREL_mn DOLLAREL_mn_se ## <fct> <dbl> <dbl> ## 1 Northeast 1343. 14.6 ## 2 Midwest 1293. 11.7 ## 3 South 1548. 10.3 ## 4 West 1211. 12.0 ## 5 <NA> 1380. 5.38 The last row where Region = NA is the national average electricity bill. We might wish to have a better name for it and can do that using the .fill argument. recs_des %>% group_by(Region) %>% cascade(DOLLAREL_mn = survey_mean(DOLLAREL), .fill = "National") ## # A tibble: 5 × 3 ## Region DOLLAREL_mn DOLLAREL_mn_se ## <fct> <dbl> <dbl> ## 1 Northeast 1343. 14.6 ## 2 Midwest 1293. 11.7 ## 3 South 1548. 10.3 ## 4 West 1211. 12.0 ## 5 National 1380. 5.38 We can also have more than one grouping variable as follows: recs_des %>% group_by(Region, Urbanicity) %>% cascade(DOLLAREL_mn = survey_mean(DOLLAREL), .fill = "Total") %>% ungroup() ## # A tibble: 17 × 4 ## Region Urbanicity DOLLAREL_mn DOLLAREL_mn_se ## <fct> <fct> <dbl> <dbl> ## 1 Northeast Urban Area 1315. 15.9 ## 2 Northeast Urban Cluster 1218. 59.5 ## 3 Northeast Rural 1529. 38.1 ## 4 Northeast Total 1343. 14.6 ## 5 Midwest Urban Area 1186. 13.6 ## 6 Midwest Urban Cluster 1214. 33.7 ## 7 Midwest Rural 1633.
32.1 ## 8 Midwest Total 1293. 11.7 ## 9 South Urban Area 1466. 12.9 ## 10 South Urban Cluster 1473. 29.3 ## 11 South Rural 1812. 22.1 ## 12 South Total 1548. 10.3 ## 13 West Urban Area 1179. 13.2 ## 14 West Urban Cluster 1174. 43.4 ## 15 West Rural 1544. 43.5 ## 16 West Total 1211. 12.0 ## 17 Total Total 1380. 5.38 We can move the summary row to the first row: recs_des %>% group_by(Region) %>% cascade(DOLLAREL_mn = survey_mean(DOLLAREL), .fill = "National", .fill_level_top = TRUE) %>% ungroup() ## # A tibble: 5 × 3 ## Region DOLLAREL_mn DOLLAREL_mn_se ## <fct> <dbl> <dbl> ## 1 National 1380. 5.38 ## 2 Northeast 1343. 14.6 ## 3 Midwest 1293. 11.7 ## 4 South 1548. 10.3 ## 5 West 1211. 12.0 5.10.5 Calculating Estimates for Many Outcomes Often, we are interested in a summary statistic across many variables. Two useful tools for doing this are the across() function in {dplyr}, which has been shown a few times above, and the map() function in {purrr}. The across() function allows you to apply the same function to several columns within summarize(). This works well with all of the functions shown above except survey_prop(); we will tackle several proportions in a later example. Example 1: across() Suppose we want to calculate the total consumption for each fuel type and the average consumption for each fuel type with coefficients of variation. These include the consumption of electricity (BTUEL), natural gas (BTUNG), liquid propane (BTULP), fuel oil (BTUFO), and wood (BTUWOOD), as illustrated in the discussion on design effects. These are the only variables that start with “BTU”, so we can use that to our advantage. consumption_ests <- recs_des %>% summarize(across(starts_with("BTU"), list(Total = ~survey_total(.x, vartype = "cv"), Mean = ~survey_mean(.x, vartype = "cv")), .unpack = "{outer}.{inner}")) consumption_ests ## # A tibble: 1 × 20 ## BTUEL_Total.coef BTUEL_Total._cv BTUEL_Mean.coef BTUEL_Mean._cv ## <dbl> <dbl> <dbl> <dbl> ## 1 4453284510065 0.00377 36051. 0.00377 ## # ℹ 16 more variables: BTUNG_Total.coef <dbl>, BTUNG_Total._cv <dbl>, ## # BTUNG_Mean.coef <dbl>, BTUNG_Mean._cv <dbl>, ## # BTULP_Total.coef <dbl>, BTULP_Total._cv <dbl>, ## # BTULP_Mean.coef <dbl>, BTULP_Mean._cv <dbl>, ## # BTUFO_Total.coef <dbl>, BTUFO_Total._cv <dbl>, ## # BTUFO_Mean.coef <dbl>, BTUFO_Mean._cv <dbl>, ## # BTUWOOD_Total.coef <dbl>, BTUWOOD_Total._cv <dbl>, … In the example above, this results in a very wide table. We may instead want a row for each fuel type. Using the pivot_longer() and pivot_wider() functions from {tidyr} can help us get there.
We will first make the data longer and split out the components of the name with pivot_longer(): consumption_ests_long <- consumption_ests %>% pivot_longer(cols = everything(), names_to = c("FuelType", "Stat", "Type"), names_pattern = "BTU(.*)_(.*)\\.(.*)") consumption_ests_long ## # A tibble: 20 × 4 ## FuelType Stat Type value ## <chr> <chr> <chr> <dbl> ## 1 EL Total coef 4.45e+12 ## 2 EL Total _cv 3.77e- 3 ## 3 EL Mean coef 3.61e+ 4 ## 4 EL Mean _cv 3.77e- 3 ## 5 NG Total coef 4.24e+12 ## 6 NG Total _cv 9.08e- 3 ## 7 NG Mean coef 3.43e+ 4 ## 8 NG Mean _cv 9.08e- 3 ## 9 LP Total coef 3.91e+11 ## 10 LP Total _cv 3.80e- 2 ## 11 LP Mean coef 3.17e+ 3 ## 12 LP Mean _cv 3.80e- 2 ## 13 FO Total coef 3.96e+11 ## 14 FO Total _cv 3.43e- 2 ## 15 FO Mean coef 3.20e+ 3 ## 16 FO Mean _cv 3.43e- 2 ## 17 WOOD Total coef 3.45e+11 ## 18 WOOD Total _cv 4.54e- 2 ## 19 WOOD Mean coef 2.79e+ 3 ## 20 WOOD Mean _cv 4.54e- 2 Then, we make the names for each element more descriptive and informative before using pivot_wider() to create a table that is almost ready for publication. A bit more on that will be covered in Chapter 8. consumption_ests_long %>% mutate(Type = case_when(Type == "coef" ~ "", Type == "_cv" ~ " (CV)")) %>% pivot_wider(id_cols = FuelType, names_from = c(Stat, Type), names_glue = "{Stat}{Type}", values_from = value) ## # A tibble: 5 × 5 ## FuelType Total `Total (CV)` Mean `Mean (CV)` ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 EL 4.45e12 0.00377 36051. 0.00377 ## 2 NG 4.24e12 0.00908 34330. 0.00908 ## 3 LP 3.91e11 0.0380 3169. 0.0380 ## 4 FO 3.96e11 0.0343 3203. 0.0343 ## 5 WOOD 3.45e11 0.0454 2794. 0.0454 Example 2: Proportions with across() As mentioned earlier, proportions will not work as well directly with the across() method. If we want the proportion of houses with air conditioning and the proportion of houses with heating, we need two group_by() statements as follows: recs_des %>% group_by(ACUsed) %>% summarize(p = survey_prop()) ## # A tibble: 2 × 3 ## ACUsed p p_se ## <lgl> <dbl> <dbl> ## 1 FALSE 0.113 0.00306 ## 2 TRUE 0.887 0.00306 recs_des %>% group_by(SpaceHeatingUsed) %>% summarize(p = survey_prop()) ## # A tibble: 2 × 3 ## SpaceHeatingUsed p p_se ## <lgl> <dbl> <dbl> ## 1 FALSE 0.0469 0.00207 ## 2 TRUE 0.953 0.00207 If we are only interested in the TRUE outcomes, that is, the proportion that have air conditioning and the proportion that have heating, we can use the fact that survey_mean() applied to a logical variable is the same as using survey_prop(), as shown below: cool_heat_tab <- recs_des %>% summarize(across(c(ACUsed, SpaceHeatingUsed), ~survey_mean(.x), .unpack = "{outer}.{inner}")) cool_heat_tab ## # A tibble: 1 × 4 ## ACUsed.coef ACUsed._se SpaceHeatingUsed.coef SpaceHeatingUsed._se ## <dbl> <dbl> <dbl> <dbl> ## 1 0.887 0.00306 0.953 0.00207 Note that the estimates are the same as when using the separate group_by() statements. As previously done, we can use pivot_longer() to create a table in a format better suited for distribution. cool_heat_tab %>% pivot_longer(everything(), names_to = c("Comfort", ".value"), names_pattern = "(.*)\\.(.*)") %>% rename(p = coef, se = `_se`) ## # A tibble: 2 × 3 ## Comfort p se ## <chr> <dbl> <dbl> ## 1 ACUsed 0.887 0.00306 ## 2 SpaceHeatingUsed 0.953 0.00207 Example 3: purrr::map() Loops are a common tool if we want to calculate the same thing for many elements. The {purrr} package has the map() functions. Like a loop, they allow you to do something in the same way many times.
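As a toy illustration unrelated to the survey data, map() and its typed variants apply the same function to each element of a vector or list; map_dbl() returns a numeric vector:

library(purrr)

# Apply sqrt() to each of three elements
map_dbl(c(1, 4, 9), sqrt)
## [1] 1 2 3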
In our case, we may want to calculate proportions from the same design multiple times. An easy way to do this is to think about how we would do it for one outcome, build a function from there, and then iterate. Suppose we want to create a table that shows the proportion of people that trust in their government (TrustGovernment) as well as those that trust in people (TrustPeople). First, we do this for a single variable. We create a table that has the variable name as a column, the answer as a column, and then the percentage and its standard error. anes_des %>% drop_na(TrustGovernment) %>% group_by(TrustGovernment) %>% summarize(p = survey_prop() * 100) %>% mutate(Variable = "TrustGovernment") %>% rename(Answer = TrustGovernment) %>% select(Variable, everything()) ## # A tibble: 5 × 4 ## Variable Answer p p_se ## <chr> <fct> <dbl> <dbl> ## 1 TrustGovernment Always 1.55 0.204 ## 2 TrustGovernment Most of the time 13.2 0.553 ## 3 TrustGovernment About half the time 30.9 0.829 ## 4 TrustGovernment Some of the time 43.4 0.855 ## 5 TrustGovernment Never 11.0 0.566 Then, we create a function to replace TrustGovernment with the function’s argument. To do this, we need to use a bit of tidy evaluation, which is a more advanced skill. If you want to learn more, we recommend Wickham (2019). calcps <- function(var) { anes_des %>% drop_na(!!sym(var)) %>% group_by(!!sym(var)) %>% summarize(p = survey_prop() * 100) %>% mutate(Variable = var) %>% rename(Answer := !!sym(var)) %>% select(Variable, everything()) } We can then run this function on the two variables of interest: calcps("TrustGovernment") ## # A tibble: 5 × 4 ## Variable Answer p p_se ## <chr> <fct> <dbl> <dbl> ## 1 TrustGovernment Always 1.55 0.204 ## 2 TrustGovernment Most of the time 13.2 0.553 ## 3 TrustGovernment About half the time 30.9 0.829 ## 4 TrustGovernment Some of the time 43.4 0.855 ## 5 TrustGovernment Never 11.0 0.566 calcps("TrustPeople") ## # A tibble: 5 × 4 ## Variable Answer p p_se ## <chr> <fct> <dbl> <dbl> ## 1 TrustPeople Always 0.809 0.164 ## 2 TrustPeople Most of the time 41.4 0.857 ## 3 TrustPeople About half the time 28.2 0.776 ## 4 TrustPeople Some of the time 24.5 0.670 ## 5 TrustPeople Never 5.05 0.422 Finally, we can use map() to iterate over as many variables as we want. It will output a tibble with the variable name in the column “Variable”, the responses in “Answer”, the percentage, and then the standard error. This example extends nicely if we have many variables for which we want the percentage estimate. c("TrustGovernment", "TrustPeople") %>% map(calcps) %>% list_rbind() ## # A tibble: 10 × 4 ## Variable Answer p p_se ## <chr> <fct> <dbl> <dbl> ## 1 TrustGovernment Always 1.55 0.204 ## 2 TrustGovernment Most of the time 13.2 0.553 ## 3 TrustGovernment About half the time 30.9 0.829 ## 4 TrustGovernment Some of the time 43.4 0.855 ## 5 TrustGovernment Never 11.0 0.566 ## 6 TrustPeople Always 0.809 0.164 ## 7 TrustPeople Most of the time 41.4 0.857 ## 8 TrustPeople About half the time 28.2 0.776 ## 9 TrustPeople Some of the time 24.5 0.670 ## 10 TrustPeople Never 5.05 0.422 5.11 Exercises The exercises use the design objects anes_des and recs_des as provided in the Prerequisites box in the beginning of the chapter. How many females have a graduate degree? Hint: the variables Gender and Education will be useful. # Option 1: femgd_option1 <- anes_des %>% filter(Gender == "Female", Education == "Graduate") %>% survey_count(name = "n") femgd_option1 ## # A tibble: 1 × 2 ## n n_se ## <dbl> <dbl> ## 1 15072196.
837872. # Option 2: femgd_option2 <- anes_des %>% filter(Gender == "Female", Education == "Graduate") %>% summarize(N = survey_total(), .groups = "drop") femgd_option2 ## # A tibble: 1 × 2 ## N N_se ## <dbl> <dbl> ## 1 15072196. 837872. What percentage of people identify as “Strong Democrat”? Hint: The variable PartyID indicates someone’s party affiliation. psd <- anes_des %>% group_by(PartyID) %>% summarize(p = survey_mean()) %>% filter(PartyID == "Strong democrat") psd ## # A tibble: 1 × 3 ## PartyID p p_se ## <fct> <dbl> <dbl> ## 1 Strong democrat 0.219 0.00646 What percentage of people who voted in the 2020 election identify as “Strong Republican”? Hint: The variable VotedPres2020 indicates whether someone voted in 2020. psr <- anes_des %>% filter(VotedPres2020 == "Yes") %>% group_by(PartyID) %>% summarize(p = survey_mean()) %>% filter(PartyID == "Strong republican") psr ## # A tibble: 1 × 3 ## PartyID p p_se ## <fct> <dbl> <dbl> ## 1 Strong republican 0.228 0.00815 What percentage of people voted in both the 2016 election and the 2020 election? Include the logit confidence interval. Hint: The variable VotedPres2016 indicates whether someone voted in 2016. pvb <- anes_des %>% filter(!is.na(VotedPres2016), !is.na(VotedPres2020)) %>% group_by(interact(VotedPres2016, VotedPres2020)) %>% summarize(p = survey_prop(vartype = "ci", method = "logit")) %>% filter(VotedPres2016 == "Yes", VotedPres2020 == "Yes") pvb ## # A tibble: 1 × 5 ## VotedPres2016 VotedPres2020 p p_low p_upp ## <fct> <fct> <dbl> <dbl> <dbl> ## 1 Yes Yes 0.796 0.777 0.813 What is the design effect for the proportion of people who voted early? Hint: The variable EarlyVote2020 indicates whether someone voted early in 2020. pdeff <- anes_des %>% filter(!is.na(EarlyVote2020)) %>% group_by(EarlyVote2020) %>% summarize(p = survey_mean(deff = TRUE)) %>% filter(EarlyVote2020 == "Yes") pdeff ## # A tibble: 1 × 4 ## EarlyVote2020 p p_se p_deff ## <fct> <dbl> <dbl> <dbl> ## 1 Yes 0.0535 0.00426 2.27 What is the average temperature people set their thermostats to at night during the winter? Hint: The variable WinterTempNight indicates the temperature that people set their thermostat to at night in the winter. mean_wintertempnight <- recs_des %>% summarize(wtn_mean = survey_mean(x = WinterTempNight, na.rm = TRUE)) mean_wintertempnight ## # A tibble: 1 × 2 ## wtn_mean wtn_mean_se ## <dbl> <dbl> ## 1 68.3 0.0446 People sometimes set their thermostats differently across seasons and between day and night. What median temperatures do people set their thermostats to in the summer and winter, both during the day and at night? Include standard errors. Hint: Use the variables WinterTempDay, WinterTempNight, SummerTempDay, and SummerTempNight.
# Option 1 med_wintertempday <- recs_des %>% summarize(wtd_mean = survey_median(WinterTempDay, vartype = "se", na.rm = TRUE)) med_wintertempday ## # A tibble: 1 × 2 ## wtd_mean wtd_mean_se ## <dbl> <dbl> ## 1 70 0.250 med_wintertempnight <- recs_des %>% summarize(wtn_mean = survey_median(WinterTempNight, vartype = "se", na.rm = TRUE)) med_wintertempnight ## # A tibble: 1 × 2 ## wtn_mean wtn_mean_se ## <dbl> <dbl> ## 1 68 0.250 med_summertempday <- recs_des %>% summarize(std_mean = survey_median(SummerTempDay, vartype = "se", na.rm = TRUE)) med_summertempday ## # A tibble: 1 × 2 ## std_mean std_mean_se ## <dbl> <dbl> ## 1 72 0.250 med_summertempnight <- recs_des %>% summarize(stn_mean = survey_median(SummerTempNight, vartype = "se", na.rm = TRUE)) med_summertempnight ## # A tibble: 1 × 2 ## stn_mean stn_mean_se ## <dbl> <dbl> ## 1 72 0.250 # Alternatively, could use `survey_quantile()` as shown below for # WinterTempNight: quant_wintertemp <- recs_des %>% summarize(wnt_quant = survey_quantile(WinterTempNight, quantiles = 0.5, vartype = "se", na.rm = TRUE)) quant_wintertemp ## # A tibble: 1 × 2 ## wnt_quant_q50 wnt_quant_q50_se ## <dbl> <dbl> ## 1 68 0.250 What is the correlation between the temperatures that people set their thermostats to at night and during the day in the summer? corr_summer_temp <- recs_des %>% summarize(summer_corr = survey_corr(SummerTempNight, SummerTempDay, na.rm = TRUE)) corr_summer_temp ## # A tibble: 1 × 2 ## summer_corr summer_corr_se ## <dbl> <dbl> ## 1 0.806 0.00806 What are the 1st, 2nd, and 3rd quartiles of the amount of money spent on energy by Building America (BA) climate zone? Hint: TOTALDOL indicates the total amount spent on energy, and ClimateRegion_BA indicates the BA climate zones. quant_baenergyexp <- recs_des %>% group_by(ClimateRegion_BA) %>% summarize(dol_quant = survey_quantile(TOTALDOL, quantiles = c(0.25, 0.5, 0.75), vartype = "se", na.rm = TRUE)) quant_baenergyexp ## # A tibble: 8 × 7 ## ClimateRegion_BA dol_quant_q25 dol_quant_q50 dol_quant_q75 ## <fct> <dbl> <dbl> <dbl> ## 1 Mixed-Dry 1091. 1541. 2139. ## 2 Mixed-Humid 1317. 1840. 2462. ## 3 Hot-Humid 1094. 1622. 2233. ## 4 Hot-Dry 926. 1513. 2223. ## 5 Very-Cold 1195. 1986. 2955. ## 6 Cold 1213. 1756. 2422. ## 7 Marine 938. 1380. 1987. ## 8 Subarctic 2404. 3535. 5219. ## # ℹ 3 more variables: dol_quant_q25_se <dbl>, dol_quant_q50_se <dbl>, ## # dol_quant_q75_se <dbl> References Shah, Babubhai V, and Akhil K Vaish. 2006. “Confidence Intervals for Quantile Estimation from Complex Survey Data.” In Proceedings of the Section on Survey Research Methods. Wickham, Hadley. 2019. Advanced R. CRC Press. https://adv-r.hadley.nz/. RECS has two components: a household survey and an energy supplier survey. For each household that responds, their energy provider(s) are contacted to obtain their energy consumption and expenditure. This value reflects the dollars spent on electricity in 2020, according to the energy supplier.
See https://www.eia.gov/consumption/residential/data/2020/pdf/2020%20RECS%20CE%20Methodology_Final.pdf for more details.↩︎ Question text: Is any air conditioning equipment used in your home?↩︎ The value of DOLLARLP reflects the annualized amount spent on liquid propane and BTULP reflects the annualized consumption in Btu of liquid propane.↩︎ Question text: What is the square footage of your home?↩︎ BTUEL is derived from the supplier side component of the survey where BTUEL represents the electricity consumption in British thermal units (Btus) converted from kilowatt hours (kWh) in a year↩︎ BTUNG is derived from the supplier side component of the survey where BTUNG represents the natural gas consumption in British thermal units (Btus) in a year↩︎ Question: How often can you trust the federal government in Washington to do what is right? (Always, most of the time, about half the time, some of the time, or never / Never, some of the time, about half the time, most of the time, or always)?↩︎ Question: Generally speaking, how often can you trust other people? (Always, most of the time, about half the time, some of the time, or never / Never, some of the time, about half the time, most of the time, or always)? ↩︎ Chapter 6 Statistical testing Prerequisites For this chapter, load the following packages: library(tidyverse) library(survey) library(srvyr) library(srvyrexploR) library(broom) library(gt) We will be using data from ANES and RECS described in Chapter 4. As a reminder, here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter 4 for more information). targetpop <- 231592693 data(anes_2020) anes_adjwgt <- anes_2020 %>% mutate(Weight = Weight / sum(Weight) * targetpop) anes_des <- anes_adjwgt %>% as_survey_design( weights = Weight, strata = Stratum, ids = VarUnit, nest = TRUE ) For RECS, details are included in the RECS documentation and Chapters 4 and 10. data(recs_2020) recs_des <- recs_2020 %>% as_survey_rep( weights = NWEIGHT, repweights = NWEIGHT1:NWEIGHT60, type = "JK1", scale = 59/60, mse = TRUE ) 6.1 Introduction When analyzing results from a survey, the point estimates described in Chapter 5 help us understand the data at a high level. Still, researchers and the public often want to make comparisons between different groups. These comparisons are calculated through statistical testing. The general idea of statistical testing is the same for data obtained through surveys and data obtained through other methods, where we compare the point estimates and variance estimates of each statistic to see if statistically significant differences exist. However, statistical testing for complex surveys involves additional considerations due to the need to account for the sampling design in order to obtain accurate variance estimates. Statistical testing, also called hypothesis testing, involves declaring a null and alternative hypothesis. A null hypothesis is denoted as \\(H_0\\) and the alternative hypothesis is denoted as \\(H_A\\). The null hypothesis is the default assumption in that there are no differences in the data, or that the data is operating under “standard” behaviors.
On the other hand, the alternative hypothesis is the break from the “standard” and is what we are trying to determine whether the data support. Let’s review an example outside of survey data. If we are flipping a coin, a null hypothesis would be that the coin is fair and that each side has an equal chance of being flipped. In other words, the probability of the coin landing on each side is 1/2. In contrast, an alternative hypothesis could be that the coin is unfair and that one side has a higher probability of being flipped (e.g., a probability of 1/4 to get heads, but a probability of 3/4 to get tails). We write this set of hypotheses as: \\(H_0: \\rho_{heads} = \\rho_{tails}\\), where \\(\\rho_{x}\\) is the probability of flipping the coin and having it land on heads (\\(\\rho_{heads}\\)) or tails (\\(\\rho_{tails}\\)) \\(H_A: \\rho_{heads} \\neq \\rho_{tails}\\) When we conduct hypothesis testing, the statistical models calculate a p-value, which shows how likely we are to observe the data if the null hypothesis is true. If the p-value (a probability between 0 and 1) is small, we have strong evidence to reject the null hypothesis as it is unlikely to see the data we are observing if the null hypothesis is true. However, if the p-value is large, we say we do not have evidence to reject the null hypothesis. The p-value cutoff is determined by the type 1 error rate, known as \\(\\alpha\\). A common choice for statistical testing is \\(\\alpha = 0.05\\).15 It is common for explanations of statistical testing to refer to the confidence level. The confidence level is the complement of the type 1 error rate (\\(1 - \\alpha\\)). Thus, if \\(\\alpha = 0.05\\), the confidence level would be 95%. The functions in the {survey} package allow for the correct estimation of the variances. This chapter will cover the following statistical tests with survey data and functions: Comparison of proportions svyttest() Comparison of means svyttest() Goodness of fit tests svygofchisq() Tests of independence svychisq() Tests of homogeneity svychisq() 6.2 Dot Notation Up to this point, we have shown functions that use wrappers from the {srvyr} package. This means that the functions work with tidyverse syntax. However, the functions in this chapter do not have wrappers in the {srvyr} package and are instead used directly from the {survey} package. Therefore, the design object is not the first argument, and to use these functions with the magrittr pipe (%>%) and tidyverse syntax, we will need to use dot (.) notation.16 Functions that work with the magrittr pipe (%>%) have the data as the first argument. When we run a function with the pipe, it automatically places anything to the left of the pipe into the first argument of the function to the right of the pipe. For example, if we wanted to take the mtcars data and filter to cars with six cylinders, we can write the code in at least four different ways: filter(mtcars, cyl == 6) mtcars %>% filter(cyl == 6) mtcars %>% filter(., cyl == 6) mtcars %>% filter(.data = ., cyl == 6) Each of these lines of code will produce the same output since the argument that takes the data is in the first spot in filter(). The first two are probably familiar to those who have worked with the tidyverse. The third option functions the same way as the second one but is explicit that mtcars goes into the first argument, and the fourth option indicates that mtcars is going into the named argument of .data.
Here, we are telling R to take what’s on the left side of the pipe (mtcars) and pipe it into the spot with the dot (.)—the first argument. In functions that are not part of the tidyverse, the data argument may not be in the first spot. For example, in svyttest(), the data argument is in the second spot, which means we need to place the dot (.) in the second spot and not the first. For example: svydata_des %>% svyttest(x ~ y, .) By default, the pipe places the left-hand object in the first argument spot. Placing the dot (.) in the second argument spot indicates that the survey design object svydata_des should be used in the second argument and not the first. Alternatively, named arguments could be used to place the dot first as named arguments can appear at any location, as in the following: svydata_des %>% svyttest(design = ., x ~ y) However, the following code will not work as the svyttest() function expects the formula as the first argument when arguments are not named: svydata_des %>% svyttest(., x ~ y) 6.3 Comparison of Proportions and Means We use t-tests to compare two proportions or means. T-tests allow us to determine if one proportion or mean is statistically different from another. They are commonly used to determine if a single estimate differs from a known value (e.g., 0 or 50%) or to compare two group means (e.g., North versus South). Comparing a single estimate to a known value is called a one sample t-test, and we can set up the hypothesis test as follows: \\(H_0: \\mu = 0\\) where \\(\\mu\\) is the mean outcome and \\(0\\) is the value we are comparing it to \\(H_A: \\mu \\neq 0\\) For comparing two estimates, this is called a two-sample t-test and we can set up the hypothesis test as follows: \\(H_0: \\mu_1 = \\mu_2\\) where \\(\\mu_i\\) is the mean outcome for group \\(i\\) \\(H_A: \\mu_1 \\neq \\mu_2\\) Two sample t-tests can also be paired or unpaired. If the data come from two different populations (e.g., North versus South), the t-test run will be an unpaired or independent samples t-test. Paired t-tests occur when the data come from the same population. This is commonly seen with data from the same population in two different time periods (e.g., before and after an intervention). The difference between t-tests with non-survey data and survey data is based on the underlying variance estimation difference. Chapter 10 provides a detailed overview of the math behind the mean and sampling error calculations for various sample designs. The functions in the {survey} package will account for these nuances, provided the design object is correctly defined. 6.3.1 Syntax When we do not have survey data, we can use the t.test() function from the {stats} package. This function does not allow for weights or the variance structure that need to be accounted for with survey data. Therefore, we need to use the svyttest() function from {survey} when using survey data. Many of the arguments are the same between the two functions, but there are a few key differences: We need to use the survey design object instead of the original data frame We can only use a formula and not separate x and y data The confidence level cannot be specified and will always be set to 95%. However, we will show examples of how the confidence level can be changed after running the svyttest() function by using the confint() function. Here is the syntax for the svyttest() function: svyttest(formula, design, ...) 
The arguments are: formula: Formula, outcome~group for two-sample, outcome~0 or outcome~1 for one-sample. The group variable must be a factor or character with two levels, or be coded 0/1 or 1/2. We give more details on formula set-up below for different types of tests. design: survey design object ...: This passes options on for one-sided tests only, and thus, we can specify na.rm=TRUE Notice that the first argument here is the formula and not the design. This means we must use the dot (.) if we pipe in the survey design object (as described in Section 6.2). The formula argument can take several different forms depending on what we are measuring. Here are a few common scenarios: One-sample t-test: Comparison to 0: var ~ 0, where var is the measure of interest, and we compare it to the value 0. For example, we could test if the population mean of household debt is different from 0 given the sample data collected. Comparison to a different value: var - value ~ 0, where var is the measure of interest and value is what we are comparing to. For example, we could test if the proportion of the population that has blue eyes is different from 25% by using var - 0.25 ~ 0. Note that specifying the formula as var ~ 0.25 is not equivalent and will result in a syntax error. Two-sample t-test: Unpaired: 2 level grouping variable: var ~ groupVar, where var is the measure of interest and groupVar is a variable with two categories. For example, we could test if the average age of the population who voted for president in 2020 differed from the age of people who did not vote. In this case, age would be used for var, and a binary variable indicating voting activity would be the groupVar. 3+ level grouping variable: var ~ groupVar == level, where var is the measure of interest, groupVar is the categorical variable, and level is the category level to isolate. For example, we could test if the test scores in one classroom differed from all other classrooms where groupVar would be the variable holding the values for classroom IDs and level is the classroom ID we want to compare to the others. Paired: var_1 - var_2 ~ 0, where var_1 is the first variable of interest and var_2 is the second variable of interest. For example, we could test if test scores on a subject differed between the start and the end of a course so var_1 would be the test score at the beginning of the course and var_2 would be the score at the end of the course. The na.rm argument defaults to FALSE, which means if any data is missing, the t-test will not compute. Throughout this chapter, we will always set na.rm = TRUE, but before analyzing the survey data, review the notes provided in Chapter 3 to better understand how to handle missing data. Let’s walk through a few examples using the ANES and RECS data. 6.3.2 Examples Example 1: One-sample t-test for Mean RECS asks respondents to indicate what temperature they set their house to during the summer at night.17 In our data, we have called this variable SummerTempNight. If we want to see if the average U.S. household sets its temperature at a value different from 68\\(^\\circ\\)F18, we could set up the hypothesis as follows: \\(H_0: \\mu = 68\\) where \\(\\mu\\) is the average temperature U.S. 
households set their thermostat to in the summer at night \\(H_A: \\mu \\neq 68\\) To conduct this in R, we use svyttest() and subtract the temperature on the left-hand side of the formula: ttest_ex1 <- recs_des %>% svyttest( formula = SummerTempNight - 68 ~ 0, design = ., na.rm = TRUE ) ttest_ex1 ## ## Design-based one-sample t-test ## ## data: SummerTempNight - 68 ~ 0 ## t = 85, df = 58, p-value <2e-16 ## alternative hypothesis: true mean is not equal to 0 ## 95 percent confidence interval: ## 3.288 3.447 ## sample estimates: ## mean ## 3.367 To pull out specific output, we can use R’s built-in $ operator. For instance, to obtain the estimate \\(\\mu - 68\\), we run ttest_ex1$estimate. If we want the average, we take our t-test estimate and add it to 68: ttest_ex1$estimate + 68 ## mean ## 71.37 Or, we can use the survey_mean() function described in Chapter 5: recs_des %>% summarize(mu = survey_mean(SummerTempNight, na.rm = TRUE)) ## # A tibble: 1 × 2 ## mu mu_se ## <dbl> <dbl> ## 1 71.4 0.0397 The result is the same in both methods, so we see that the average temperature U.S. households set their thermostat to in the summer at night is 71.4\\(^\\circ\\)F. Looking at the output from svyttest(), the t-statistic is 84.8, and the p-value is \\(<0.0001\\), indicating that the average is statistically different from 68\\(^\\circ\\)F at an \\(\\alpha\\) level of \\(0.05\\). If we want an 80% confidence interval for the test statistic, we can use the function confint() to change the confidence level. Below, we print both the original 95% confidence interval and the 80% confidence interval: confint(ttest_ex1, level = 0.95) ## 2.5 % 97.5 % ## as.numeric(SummerTempNight - 68) 3.288 3.447 ## attr(,"conf.level") ## [1] 0.95 confint(ttest_ex1, level = 0.8) ## [1] 3.316 3.419 ## attr(,"conf.level") ## [1] 0.8 In this case, neither confidence interval contains 0, and we draw the same conclusion from either that the average temperature households set their thermostat in the summer at night is significantly higher than 68\\(^\\circ\\)F. Example 2: One-sample t-test for Proportion RECS asked respondents if they use any air conditioning (AC) in their home.19 In our data, we call this variable ACUsed. Let’s look at the proportion of U.S. households that use AC in their homes using the survey_prop() function we learned in Chapter 5. acprop <- recs_des %>% group_by(ACUsed) %>% summarize(p = survey_prop()) acprop ## # A tibble: 2 × 3 ## ACUsed p p_se ## <lgl> <dbl> <dbl> ## 1 FALSE 0.113 0.00306 ## 2 TRUE 0.887 0.00306 Based on this, 88.7% of U.S. households use AC in their homes. If we wanted to know if this differs from 90%, we could set up our hypothesis as follows: \\(H_0: p = 0.90\\) where \\(p\\) is the proportion of the U.S. households that use AC in their homes \\(H_A: p \\neq 0.90\\) To conduct this in R, we use the svyttest() function as follows: ttest_ex2 <- recs_des %>% svyttest( formula = (ACUsed == TRUE) - 0.90 ~ 0, design = ., na.rm = TRUE ) ttest_ex2 ## ## Design-based one-sample t-test ## ## data: (ACUsed == TRUE) - 0.9 ~ 0 ## t = -4.4, df = 58, p-value = 5e-05 ## alternative hypothesis: true mean is not equal to 0 ## 95 percent confidence interval: ## -0.019603 -0.007348 ## sample estimates: ## mean ## -0.01348 The output from the svyttest() function can be a bit hard to read. Using the {broom} package from tidymodels, a collection of packages for modeling using the tidyverse principles, we can clean up the output into a tibble to more easily understand what the test tells us. 
broom::tidy(ttest_ex2) ## # A tibble: 1 × 8 ## estimate statistic p.value parameter conf.low conf.high method ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 -0.0135 -4.40 0.0000466 58 -0.0196 -0.00735 Design-base… ## # ℹ 1 more variable: alternative <chr> Here, the estimate displayed is the difference between the proportion of U.S. households that use AC and the value we are comparing to (0.90). We can see that there is a difference of -1.35 percentage points. Additionally, the t-statistic value in the statistic column is -4.4, and the p-value is <0.0001. These results indicate that fewer than 90% of U.S. households use AC in their homes. Example 3: Unpaired two-sample t-test Two additional variables in the RECS data are the electric bill cost (DOLLAREL) and whether the house used AC or not (ACUsed).20 If we want to know if the U.S. households that used AC had higher electrical bills compared to those that did not, we could set up the hypothesis as follows: \\(H_0: \\mu_{AC} = \\mu_{noAC}\\) where \\(\\mu_{AC}\\) is the electrical bill cost for U.S. households that used AC and \\(\\mu_{noAC}\\) is the electrical bill cost for U.S. households that did not use AC \\(H_A: \\mu_{AC} \\neq \\mu_{noAC}\\) Let’s take a quick look at the data to see the format the data are in: recs_des %>% group_by(ACUsed) %>% summarize(mean = survey_mean(DOLLAREL, na.rm = TRUE)) ## # A tibble: 2 × 3 ## ACUsed mean mean_se ## <lgl> <dbl> <dbl> ## 1 FALSE 1056. 16.0 ## 2 TRUE 1422. 5.69 To conduct this in R, we use svyttest(): ttest_ex3 <- recs_des %>% svyttest(formula = DOLLAREL ~ ACUsed, design = ., na.rm = TRUE) broom::tidy(ttest_ex3) ## # A tibble: 1 × 8 ## estimate statistic p.value parameter conf.low conf.high method ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 366. 21.3 4.29e-29 58 331. 400. Design-based… ## # ℹ 1 more variable: alternative <chr> The results indicate that the difference in electrical bills for those that used AC and those that did not is, on average, $365.72. The difference appears to be statistically significant as the t-statistic is 21.3 and the p-value is \\(<0.0001\\). Households that used AC spent, on average, $365.72 more in 2020 on electricity than households without AC. Example 4: Paired two-sample t-test Let’s say we want to test whether the temperature that U.S. households set their thermostat at night differs depending on the season (comparing summer21 and winter22 temperatures). We could set up the hypothesis as follows: \\(H_0: \\mu_{summer} = \\mu_{winter}\\) where \\(\\mu_{summer}\\) is the temperature that U.S. households set their thermostat to during summer nights, and \\(\\mu_{winter}\\) is the temperature that U.S. households set their thermostat to during winter nights \\(H_A: \\mu_{summer} \\neq \\mu_{winter}\\) To conduct this in R, we use svyttest() by calculating the temperature difference on the left-hand side as follows: ttest_ex4 <- recs_des %>% svyttest( design = ., formula = SummerTempNight - WinterTempNight ~ 0, na.rm = TRUE ) broom::tidy(ttest_ex4) ## # A tibble: 1 × 8 ## estimate statistic p.value parameter conf.low conf.high method ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 2.85 50.8 8.45e-50 58 2.74 2.96 Design-based… ## # ℹ 1 more variable: alternative <chr> U.S. households set their thermostat on average 2.9\\(^\\circ\\)F warmer on summer nights than on winter nights, which is statistically significant (t = 50.8, p-value = \\(<0.0001\\)).
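As a quick cross-check of the paired test above, the same mean difference can be estimated directly with survey_mean() from Chapter 5. This is a sketch using the chapter's design object; the column name diff_mean is ours, and the point estimate should match the 2.85 shown in the tidy output:

# Estimate the mean summer-winter nighttime thermostat difference directly
recs_des %>%
  summarize(diff_mean = survey_mean(SummerTempNight - WinterTempNight,
                                    na.rm = TRUE))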
6.4 Chi-Square Tests Chi-square tests (\\(\\chi^2\\)) allow us to examine multiple proportions using a goodness-of-fit test, a test of independence, or a test of homogeneity. These three tests have the same \\(\\chi^2\\) distributions but with slightly different underlying assumptions. First, goodness-of-fit tests are used when comparing observed data to expected data. For example, this could be used to determine if respondent demographics (the observed data in the sample) match known population information (the expected data). In this case, we can set up the hypothesis test as follows: \\(H_0: p_1 = \\pi_1, ~ p_2 = \\pi_2, ~ ..., ~ p_k = \\pi_k\\) where \\(p_i\\) is the observed proportion for category \\(i\\), \\(\\pi_i\\) is the expected proportion for category \\(i\\), and \\(k\\) is the number of categories \\(H_A:\\) at least one level of \\(p_i\\) does not match \\(\\pi_i\\) Second, tests of independence are used when comparing two types of observed data to see if there is a relationship. For example, this could be used to determine if the proportion of respondents who voted for each political party in the presidential election matches the proportion of respondents who voted for each political party in a local election. In this case, we can set up the hypothesis test as follows: \\(H_0:\\) The two variables/factors are independent \\(H_A:\\) The two variables/factors are not independent Third, tests of homogeneity are used to compare two distributions to see if they match. For example, this could be used to determine if the highest education achieved is the same for both men and women. In this case, we can set up the hypothesis test as follows: \\(H_0: p_{1a} = p_{1b}, ~ p_{2a} = p_{2b}, ~ ..., ~ p_{ka} = p_{kb}\\) where \\(p_{ia}\\) is the observed proportion of category \\(i\\) for subgroup \\(a\\), \\(p_{ib}\\) is the observed proportion of category \\(i\\) for subgroup \\(b\\), and \\(k\\) is the number of categories \\(H_A:\\) at least one category of \\(p_{ia}\\) does not match \\(p_{ib}\\) As with t-tests, the difference between using \\(\\chi^2\\) tests with non-survey data and survey data is based on the underlying variance estimation. The functions in the {survey} package will account for these nuances, provided the design object is correctly defined. For basic variance estimation formulas for different survey design types, refer to Chapter 10. 6.4.1 Syntax When we do not have survey data, we may be able to use the chisq.test() function from the {stats} package. However, this function does not allow for weights or the variance structure to be accounted for with survey data. Therefore, when using survey data, we need to use one of two functions: svygofchisq(): For goodness of fit tests svychisq(): For tests of independence and homogeneity The non-survey data function of chisq.test() requires either a single set of counts and given proportions (for goodness of fit tests) or two sets of counts for tests of independence and homogeneity. The functions we use with survey data require respondent-level data and formulas instead of counts. This ensures that the variances are correctly calculated. First, the function for the goodness of fit tests is svygofchisq(): svygofchisq(formula, p, design, na.rm = TRUE, ...) The arguments are: formula: Formula specifying a single factor variable p: Vector of probabilities for the categories of the factor in the correct order. If the probabilities do not sum to 1, they will be rescaled to sum to 1.
design: Survey design object …: Other arguments to pass on, such as na.rm Based on the order of the arguments, we again must use the dot (.) notation if we pipe in the survey design object or explicitly name the arguments as described in Section 6.2. For the goodness of fit tests, the formula will be a single variable formula = ~var as we compare the observed data from this variable to the expected data. The expected probabilities are then entered in the p argument and need to be a vector of the same length as the number of categories in the variable. For example, if we want to know if the proportion of males and females matches a distribution of 30/70, then the sex variable (with two categories) would be used formula = ~SEX, and the proportions would be included as p = c(.3, .7). It is important to note that the variable entered into the formula should be formatted as either a factor or a character. The examples below provide more detail and tips on how to make sure the levels match up correctly. For tests of homogeneity and independence, the svychisq() function should be used. The syntax is as follows: svychisq( formula, design, statistic = c("F", "Chisq", "Wald", "adjWald", "lincom", "saddlepoint"), na.rm = TRUE ) The arguments are: formula: Model formula specifying the table (shown in examples) design: Survey design object statistic: Type of test statistic to use in test (details below) na.rm: Remove missing values There are six statistics that are accepted in this formula. For tests of homogeneity (when comparing cross-tabulations), the F or Chisq statistics should be used.23 The F statistic is the default and uses the Rao-Scott second-order correction. This correction is designed to assist with complicated sampling designs (i.e., those other than a simple random sample) (Scott 2007). The Chisq statistic is an adjusted version of the Pearson \\(\\chi^2\\) statistic. The version of this statistic in the svychisq() function compares the design effect estimate from the provided survey data to what the \\(\\chi^2\\) distribution would have been if the data came from a simple random sampling. For tests of independence, the Wald and adjWald are recommended as they provide a better adjustment for variable comparisons (Lumley 2010). If the data has a small number of primary sampling units (PSUs) compared to the degrees of freedom, then the adjWald statistic should be used to account for this. The lincom and saddlepoint statistics are available for more complicated data structures. The formula argument will always be one-sided, unlike the svyttest() function. The two variables of interest should be included with a plus sign: formula = ~ var_1 + var_2. As with the svygofchisq() function, the variables entered into the formula should be formatted as either a factor or a character. Additionally, as with the t-test function, both svygofchisq() and svychisq() have the na.rm argument. If any data is missing, the \\(\\chi^2\\) tests will assume that NA is a category and include it in the calculation. Throughout this chapter, we will always set na.rm = TRUE, but before analyzing the survey data, review the notes provided in Chapter 3 to better understand how to handle missing data. 6.4.2 Examples Let’s walk through a few examples using the ANES data. 
Example 1: Goodness of Fit Test ANES asked respondents about their highest education level.24 Based on the data from the 2020 American Community Survey (ACS) 5-year estimates25, the education distribution of those aged 18+ in the United States (among the 50 states and District of Columbia) is as follows: 11% had less than High School degree 27% had a High School degree 29% had some college or associate’s degree 33% had a bachelor’s degree or higher If we want to see if the weighted distribution from the ANES 2020 data matches this distribution, we could set up the hypothesis as follows: \\(H_0: p_1 = 0.11, ~ p_2 = 0.27, ~ p_3 = 0.29, ~ p_4 = 0.33\\) \\(H_A:\\) at least one of the education levels does not match between the ANES and the ACS To conduct this in R, let’s first look at the education variable (Education) we have on the ANES data. Using the survey_mean() function discussed in Chapter 5, we can see the education levels and estimated proportions. anes_des %>% drop_na(Education) %>% group_by(Education) %>% summarize(p = survey_mean()) ## # A tibble: 5 × 3 ## Education p p_se ## <fct> <dbl> <dbl> ## 1 Less than HS 0.0805 0.00568 ## 2 High school 0.277 0.0102 ## 3 Post HS 0.290 0.00713 ## 4 Bachelor's 0.226 0.00633 ## 5 Graduate 0.126 0.00499 Based on this output, we can see that we have different levels than the ACS data provides. Specifically, the education data from ANES has two levels for Bachelor’s Degree or Higher (Bachelor’s and Graduate), so these two categories need to be collapsed into a single category to match the ACS data. For this, among other methods, we can use the {forcats} package from the tidyverse. The package’s fct_collapse() function helps us create a new variable by collapsing categories into a single one. Then, we will use the svygofchisq() function to compare the ANES data to the ACS data where we specify the updated design object, the formula using the collapsed education variable, the ACS estimates for education levels as p, and removing NA values. anes_des_educ <- anes_des %>% mutate(Education2 = fct_collapse(Education, "Bachelor or Higher" = c("Bachelor's", "Graduate"))) anes_des_educ %>% drop_na(Education2) %>% group_by(Education2) %>% summarize(p = survey_mean()) ## # A tibble: 4 × 3 ## Education2 p p_se ## <fct> <dbl> <dbl> ## 1 Less than HS 0.0805 0.00568 ## 2 High school 0.277 0.0102 ## 3 Post HS 0.290 0.00713 ## 4 Bachelor or Higher 0.352 0.00732 chi_ex1 <- anes_des_educ %>% svygofchisq( formula = ~ Education2, p = c(0.11, 0.27, 0.29, 0.33), design = ., na.rm = TRUE ) chi_ex1 ## ## Design-based chi-squared test for given probabilities ## ## data: ~Education2 ## X-squared = 2172220, scale = 1.1e+05, df = 2.3e+00, p-value = ## 9e-05 The output from the svygofchisq() indicates that at least one proportion from ANES does not match the ACS data (\\(\\chi^2 =\\) 2,172,220; p-value <0.0001). 
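Because the result prints like a standard base R hypothesis-test object, its components can typically be pulled out with the $ operator, just as we did with ttest_ex1 earlier in the chapter. A minimal sketch, assuming the usual htest component names:

# Extract pieces of the goodness of fit test result
chi_ex1$statistic # the X-squared test statistic
chi_ex1$p.value   # the p-value for the test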
To get a better idea of the differences, we can use the expected output along with survey_mean() to create a comparison table: ex1_table <- anes_des_educ %>% drop_na(Education2) %>% group_by(Education2) %>% summarize(Observed = survey_mean(vartype = "ci")) %>% rename(Education = Education2) %>% mutate(Expected = c(0.11, 0.27, 0.29, 0.33)) %>% select(Education, Expected, everything()) ex1_table ## # A tibble: 4 × 5 ## Education Expected Observed Observed_low Observed_upp ## <fct> <dbl> <dbl> <dbl> <dbl> ## 1 Less than HS 0.11 0.0805 0.0691 0.0919 ## 2 High school 0.27 0.277 0.257 0.298 ## 3 Post HS 0.29 0.290 0.276 0.305 ## 4 Bachelor or Higher 0.33 0.352 0.337 0.367 This output includes our expected proportions from the ACS that we provided to the svygofchisq() function along with the output of the observed proportions and their confidence intervals. This table shows that the “High school” and “Post HS” categories have nearly identical proportions but that the other two categories are slightly different. Looking at the confidence intervals, we can see that the ANES data skews to include fewer people in the “Less than HS” category and more people in the “Bachelor or Higher” category. This may be easier to see in a plot. The code below uses the tabular output to create Figure 6.1. ex1_table %>% pivot_longer( cols = c("Expected", "Observed"), names_to = "Names", values_to = "Proportion" ) %>% mutate( Observed_low = if_else(Names == "Observed", Observed_low, NA_real_), Observed_upp = if_else(Names == "Observed", Observed_upp, NA_real_), Names = if_else(Names == "Observed", "ANES (observed)", "ACS (expected)") ) %>% ggplot(aes(x = Education, y = Proportion, color = Names)) + geom_point(alpha = 0.75, size = 2) + geom_errorbar(aes(ymin = Observed_low, ymax = Observed_upp), width = 0.25) + theme_bw() + scale_color_manual(name = "Type", values = book_colors[c(4, 1)]) + theme(legend.position = "bottom", legend.title = element_blank()) FIGURE 6.1: Expected and observed proportions of education, showing the confidence intervals for the observed proportions and whether the expected proportions lie within them. Example 2: Test of Independence ANES asked respondents two questions about trust: How often can you trust the federal government to do what is right? How often can you trust other people? If we want to see if the distributions of these two questions are similar or not, we can conduct a test of independence. Here is how the hypothesis could be set up: \\(H_0:\\) People’s trust in the federal government and their trust in other people are independent (i.e., not related) \\(H_A:\\) People’s trust in the federal government and their trust in other people are not independent (i.e., they are related) To conduct this in R, we use the svychisq() function to compare the two variables: chi_ex2 <- anes_des %>% svychisq( formula = ~ TrustGovernment + TrustPeople, design = ., statistic = "Wald", na.rm = TRUE ) chi_ex2 ## ## Design-based Wald test of association ## ## data: NextMethod() ## F = 21, ndf = 16, ddf = 51, p-value <2e-16 The output from svychisq() indicates that the distribution of people’s trust in the federal government and their trust in other people are not independent, meaning that they are related. Let’s output the distributions in a table to see the relationship.
The observed output from the test provides a cross-tabulation of the counts for each category: chi_ex2$observed ## TrustPeople ## TrustGovernment Always Most of the time About half the time ## Always 16.470 25.009 31.848 ## Most of the time 11.020 539.377 196.258 ## About half the time 11.772 934.858 861.971 ## Some of the time 17.007 1353.779 839.863 ## Never 3.174 236.785 174.272 ## TrustPeople ## TrustGovernment Some of the time Never ## Always 36.854 5.523 ## Most of the time 206.556 27.184 ## About half the time 428.871 65.024 ## Some of the time 932.628 89.596 ## Never 217.994 189.307 However, as researchers, we often want to know about the proportions and not just the respondent counts from the survey. There are a couple of different ways that we can do this. The first is using the counts from chi_ex2$observed to calculate the proportion. We can then pivot the table to create a cross-tabulation similar to the counts table above. Adding group_by() to the code means that we are obtaining the proportions within each level of that variable. In this case, we are looking at the distribution of TrustGovernment for each level of TrustPeople. The resulting table is shown in Table 6.1 and in Chapter 8, we will discuss more on how to make publication-quality tables like this. chi_ex2_table <- chi_ex2$observed %>% as_tibble() %>% group_by(TrustPeople) %>% mutate(prop = round(n / sum(n), 3)) %>% select(-n) %>% pivot_wider(names_from = TrustPeople, values_from = prop) %>% gt(rowname_col = "TrustGovernment") %>% tab_stubhead(label = "Trust in Government") %>% tab_spanner(label = "Trust in People", columns = everything()) %>% cols_label(`Most of the time` = md("Most of<br />the time"), `About half the time` = md("About half<br />the time"), `Some of the time` = md("Some of<br />the time")) chi_ex2_table
TABLE 6.1: Proportion of adults in the U.S.
by levels of trust in people and government, ANES 2020 (columns give trust in people; rows give trust in government)

Trust in Government    Always   Most of the time   About half the time   Some of the time   Never
Always                  0.277       0.008               0.015                 0.020          0.015
Most of the time        0.185       0.175               0.093                 0.113          0.072
About half the time     0.198       0.303               0.410                 0.235          0.173
Some of the time        0.286       0.438               0.399                 0.512          0.238
Never                   0.053       0.077               0.083                 0.120          0.503

In Table 6.1, each column sums to 1. For example, we can say that it is estimated that of people who always trust in people, 27.7% also always trust in government based on the top-left cell but 5.3% never trust in government. The second option is to use group_by() and survey_mean() functions to calculate the proportions from the ANES design object. A reminder that with more than one variable listed in the group_by() statement, the proportions are within the first variable listed. As mentioned above, we are looking at the distribution of TrustGovernment for each level of TrustPeople. chi_ex2_obs <- anes_des %>% drop_na(TrustPeople, TrustGovernment) %>% group_by(TrustPeople, TrustGovernment) %>% summarize(Observed = round(survey_mean(vartype = "ci"), 3), .groups = "drop") chi_ex2_obs_table <- chi_ex2_obs %>% mutate(prop = paste0(Observed, " (", Observed_low, ", ", Observed_upp, ")")) %>% select(TrustGovernment, TrustPeople, prop) %>% pivot_wider(names_from = TrustPeople, values_from = prop) %>% gt(rowname_col = "TrustGovernment") %>% tab_stubhead(label = "Trust in Government") %>% tab_spanner(label = "Trust in People", columns = everything()) %>% tab_options(page.orientation = "landscape") chi_ex2_obs_table
TABLE 6.2: Proportion of adults in the U.S.
TABLE 6.2: Proportion of adults in the U.S. by levels of trust in people and government with confidence intervals, ANES 2020

Trust in Government | Trust in People: Always | Most of the time | About half the time | Some of the time | Never
Always | 0.277 (0.11, 0.444) | 0.008 (0.004, 0.012) | 0.015 (0.006, 0.024) | 0.02 (0.008, 0.033) | 0.015 (0, 0.029)
Most of the time | 0.185 (-0.009, 0.38) | 0.175 (0.157, 0.192) | 0.093 (0.078, 0.109) | 0.113 (0.085, 0.141) | 0.072 (0.021, 0.123)
About half the time | 0.198 (0.046, 0.35) | 0.303 (0.281, 0.324) | 0.41 (0.378, 0.441) | 0.235 (0.2, 0.271) | 0.173 (0.099, 0.246)
Some of the time | 0.286 (0.069, 0.503) | 0.438 (0.415, 0.462) | 0.399 (0.365, 0.433) | 0.512 (0.481, 0.543) | 0.238 (0.178, 0.298)
Never | 0.053 (-0.01, 0.117) | 0.077 (0.064, 0.089) | 0.083 (0.063, 0.103) | 0.12 (0.097, 0.142) | 0.503 (0.422, 0.583)

Both methods produce the same estimates because svychisq() also accounts for the survey design. However, calculating the proportions directly from the design object means we can also obtain the variance information. In this case, the table output displays the survey estimate followed by the confidence interval. Based on the output, we can see that of those who never trust people, 50.3% also never trust the government, while the proportions of never trusting the government are much lower for each of the other levels of trusting people.

We may find it easier to look at these proportions graphically. We can use ggplot() and facets as shown below to create Figure 6.2:

chi_ex2_obs %>%
  mutate(TrustPeople = fct_reorder(str_c("Trust in People:\\n", TrustPeople),
                                   order(TrustPeople))) %>%
  ggplot(aes(x = TrustGovernment, y = Observed, color = TrustGovernment)) +
  facet_wrap(~TrustPeople, ncol = 5) +
  geom_point() +
  geom_errorbar(aes(ymin = Observed_low, ymax = Observed_upp)) +
  ylab("Proportion") +
  xlab("") +
  theme_bw() +
  scale_color_manual(name = "Trust in Government", values = book_colors) +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        legend.position = "bottom") +
  guides(col = guide_legend(nrow = 2))

FIGURE 6.2: Proportion of adults in the U.S. by levels of trust in people and government with confidence intervals, ANES 2020

Example 3: Test of Homogeneity

Researchers and politicians often look at specific demographics each election cycle to understand how each group is leaning toward or voting for candidates. The ANES data are collected post-election, but we can still see if there are differences in how specific demographic groups voted.
If we want to see if there is a difference in how each age group voted for the 2020 candidates, this would be a test of homogeneity, and we can set up the hypothesis as follows:

\\[\\begin{align*} H_0: p_{1_{Biden}} &= p_{1_{Trump}} = p_{1_{Other}},\\\\ p_{2_{Biden}} &= p_{2_{Trump}} = p_{2_{Other}},\\\\ p_{3_{Biden}} &= p_{3_{Trump}} = p_{3_{Other}},\\\\ p_{4_{Biden}} &= p_{4_{Trump}} = p_{4_{Other}},\\\\ p_{5_{Biden}} &= p_{5_{Trump}} = p_{5_{Other}},\\\\ p_{6_{Biden}} &= p_{6_{Trump}} = p_{6_{Other}} \\end{align*}\\]

where \\(p_{i_{Biden}}\\) is the observed proportion of each age group (\\(i\\)) that voted for Joseph Biden, \\(p_{i_{Trump}}\\) is the observed proportion of each age group (\\(i\\)) that voted for Donald Trump, and \\(p_{i_{Other}}\\) is the observed proportion of each age group (\\(i\\)) that voted for another candidate.

\\(H_A:\\) at least one category of \\(p_{i_{Biden}}\\) does not match \\(p_{i_{Trump}}\\) or \\(p_{i_{Other}}\\)

To conduct this in R, we use the svychisq() function to compare the two variables:

chi_ex3 <- anes_des %>%
  drop_na(VotedPres2020_selection, AgeGroup) %>%
  svychisq(
    formula = ~ AgeGroup + VotedPres2020_selection,
    design = .,
    statistic = "Chisq",
    na.rm = TRUE
  )

chi_ex3
##
##  Pearson's X^2: Rao & Scott adjustment
##
## data:  NextMethod()
## X-squared = 171, df = 10, p-value <2e-16

The output from svychisq() indicates a difference in how each age group voted in the 2020 election. To get a better idea of the different distributions, let's output proportions to see the relationship. As we learned in Example 2 above, we can use chi_ex3$observed, or if we want to get the variance information (which is crucial with survey data), we can use survey_mean(). Remember, when we have two variables in group_by(), we obtain the proportions within each level of the first variable listed. In this case, we are looking at the distribution of AgeGroup for each level of VotedPres2020_selection.
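To make that grouping behavior concrete, here is a minimal toy sketch (plain {dplyr} on made-up data, not survey-weighted): with two grouping variables, the proportions sum to 1 within each level of the first one.

library(dplyr)

# Made-up proportions: within each candidate, the age-group shares sum to 1
toy <- tibble(
  candidate = rep(c("A", "B"), each = 3),
  age_group = rep(c("young", "middle", "old"), times = 2),
  prop = c(0.5, 0.3, 0.2, 0.2, 0.3, 0.5)
)

toy %>%
  group_by(candidate) %>%
  summarize(total = sum(prop)) # both totals are 1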
chi_ex3_obs <- anes_des %>%
  filter(VotedPres2020 == "Yes") %>%
  drop_na(VotedPres2020_selection, AgeGroup) %>%
  group_by(VotedPres2020_selection, AgeGroup) %>%
  summarize(Observed = round(survey_mean(vartype = "ci"), 3))

chi_ex3_obs_table <- chi_ex3_obs %>%
  mutate(prop = paste0(Observed, " (", Observed_low, ", ",
                       Observed_upp, ")")) %>%
  select(AgeGroup, VotedPres2020_selection, prop) %>%
  pivot_wider(names_from = VotedPres2020_selection, values_from = prop) %>%
  gt(rowname_col = "AgeGroup") %>%
  tab_stubhead(label = "Age Group")

chi_ex3_obs_table
TABLE 6.3: Distribution of age group by presidential candidate selection with confidence intervals

Age Group | Biden | Trump | Other
18-29 | 0.204 (0.177, 0.231) | 0.114 (0.095, 0.133) | 0.228 (0.151, 0.306)
30-39 | 0.169 (0.153, 0.185) | 0.147 (0.124, 0.17) | 0.303 (0.21, 0.396)
40-49 | 0.163 (0.146, 0.18) | 0.157 (0.136, 0.178) | 0.21 (0.129, 0.291)
50-59 | 0.154 (0.136, 0.173) | 0.234 (0.207, 0.261) | 0.107 (0.041, 0.173)
60-69 | 0.179 (0.16, 0.199) | 0.192 (0.172, 0.213) | 0.102 (0.026, 0.179)
70 or older | 0.13 (0.118, 0.143) | 0.156 (0.139, 0.174) | 0.049 (0, 0.099)

We can see that those who voted for Biden and other candidates tended to be younger than those who voted for Trump. For example, of those who voted for Biden, 20.4% were in the 18-29 age group, compared to only 11.4% of those who voted for Trump. On the other hand, 23.4% of those who voted for Trump were in the 50-59 age group, compared to only 15.4% of those who voted for Biden.

6.5 Exercises

The exercises use the design objects anes_des and recs_des as provided in the Prerequisites box at the beginning of the chapter. Here are some exercises for practicing conducting t-tests using svyttest():

Using the RECS data, do more than 50% of U.S. households use AC (ACUsed)?

ttest_solution1 <- recs_des %>%
  svyttest(design = .,
           formula = ((ACUsed == TRUE) - 0.5) ~ 0,
           na.rm = TRUE)

ttest_solution1
##
##  Design-based one-sample t-test
##
## data:  ((ACUsed == TRUE) - 0.5) ~ 0
## t = 126, df = 58, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.3804 0.3927
## sample estimates:
##   mean
## 0.3865
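Note the trick in this formula: subtracting 0.5 from the logical indicator turns the question "is the proportion 0.5?" into "is the mean 0?", which svyttest() can test directly. The reported mean is therefore the estimated proportion minus 0.5, so as a quick back-of-the-envelope check:

0.3865 + 0.5 # about 0.887, i.e., an estimated 89% of households use AC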
Using the RECS data, does the average temperature that U.S. households set their thermostats to differ between the day and night in the winter (WinterTempDay and WinterTempNight)?

ttest_solution2 <- recs_des %>%
  svyttest(
    design = .,
    formula = WinterTempDay - WinterTempNight ~ 0,
    na.rm = TRUE
  )

ttest_solution2
##
##  Design-based one-sample t-test
##
## data:  WinterTempDay - WinterTempNight ~ 0
## t = 46, df = 58, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  1.594 1.740
## sample estimates:
##  mean
## 1.667

Using the ANES data, does the average age (Age) of those who voted for Joseph Biden in 2020 (VotedPres2020_selection) differ from those who voted for another candidate?

ttest_solution3 <- anes_des %>%
  svyttest(
    design = .,
    formula = Age ~ VotedPres2020_selection == "Biden",
    na.rm = TRUE
  )

ttest_solution3
##
##  Design-based t-test
##
## data:  Age ~ VotedPres2020_selection == "Biden"
## t = -6, df = 50, p-value = 2e-07
## alternative hypothesis: true difference in mean is not equal to 0
## 95 percent confidence interval:
##  -4.809 -2.388
## sample estimates:
## difference in mean
##             -3.598

If you wanted to determine if political party affiliation differed for males and females, what test would you use?

a. Goodness of fit test (svygofchisq())
b. Test of independence (svychisq())
c. Test of homogeneity (svychisq())

chisq_solution1 <- "c. Test of homogeneity (`svychisq()`)"
chisq_solution1
## [1] "c. Test of homogeneity (`svychisq()`)"

In the RECS data, is there a relationship between the type of housing unit (HousingUnitType) and the year the house was built (YearMade)?

chisq_solution2 <- recs_des %>%
  svychisq(
    formula = ~ HousingUnitType + YearMade,
    design = .,
    statistic = "Wald",
    na.rm = TRUE
  )

chisq_solution2
##
##  Design-based Wald test of association
##
## data:  NextMethod()
## F = 68, ndf = 32, ddf = 59, p-value <2e-16

In the ANES data, is there a difference in the distribution of gender (Gender) across early voting status in 2020 (EarlyVote2020)?

chisq_solution3 <- anes_des %>%
  svychisq(
    formula = ~ Gender + EarlyVote2020,
    design = .,
    statistic = "F",
    na.rm = TRUE
  )

chisq_solution3
##
##  Pearson's X^2: Rao & Scott adjustment
##
## data:  NextMethod()
## F = 0.32, ndf = 1, ddf = 51, p-value = 0.6

References
Lumley, Thomas. 2010. Complex Surveys: A Guide to Analysis Using R. John Wiley & Sons.
Scott, Alastair. 2007. "Rao-Scott Corrections and Their Impact." Section on Survey Research Methods, ASA. http://www.asasrms.org/Proceedings/y2007/Files/JSM2007-000874.pdf
For more information on statistical testing, we recommend reviewing introduction to statistics textbooks.↩︎ This could change in the future if another package is built or {srvyr} is expanded to work with tidymodels, but no such plans are known at this time.↩︎ During the summer, what is your home's typical indoor temperature inside your home at night?↩︎ This is the temperature that Stephanie prefers at night during the summer, and she wanted to see if she was different from the population.↩︎ Is any air conditioning equipment used in your home?↩︎ Is any air conditioning equipment used in your home?↩︎ During the summer, what is your home's typical indoor temperature inside your home at night?↩︎ During the winter, what is your home's typical indoor temperature inside your home at night?↩︎ These two statistics can also be used for goodness of fit tests if the svygofchisq() function is not used.↩︎ What is the highest level of school you have completed or the highest degree you have received?↩︎ Data were pulled from data.census.gov using the S1501 Educational Attainment 2020: ACS 5-Year Estimates Subject Tables.↩︎

"],["c07-modeling.html", "Chapter 7 Modeling 7.1 Introduction 7.2 Analysis of Variance (ANOVA) 7.3 Gaussian Linear Regression 7.4 Logistic Regression 7.5 Exercises", " Chapter 7 Modeling

Prerequisites

For this chapter, load the following packages:

library(tidyverse)
library(survey)
library(srvyr)
library(srvyrexploR)
library(broom)

We will be using data from ANES and RECS described in Chapter 4. As a reminder, here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter 4 for more information).

targetpop <- 231592693
data(anes_2020)

anes_adjwgt <- anes_2020 %>%
  mutate(Weight = Weight / sum(Weight) * targetpop)

anes_des <- anes_adjwgt %>%
  as_survey_design(
    weights = Weight,
    strata = Stratum,
    ids = VarUnit,
    nest = TRUE
  )

For RECS, details are included in the RECS documentation and Chapters 4 and 10.

data(recs_2020)

recs_des <- recs_2020 %>%
  as_survey_rep(
    weights = NWEIGHT,
    repweights = NWEIGHT1:NWEIGHT60,
    type = "JK1",
    scale = 59/60,
    mse = TRUE
  )

7.1 Introduction

Modeling data is a way for researchers to investigate the relationship between a single dependent variable and one or more independent variables. This builds upon the analyses conducted in Chapter 6, which looked at the relationships between just two variables. For example, in Example 3 in Section 6.3.2, we investigated whether there is a relationship between electric bill cost and whether or not the household used air-conditioning. However, there are potentially other elements that could influence a household's electric bill (e.g., outside temperature, desired internal temperature, types and number of appliances, etc.). T-tests only allow us to investigate the relationship of one independent variable at a time, but using models we can look into multiple variables and even explore interactions between these variables. There are several types of models, but in this chapter we will cover Analysis of Variance (ANOVA) and linear regression models following common Gaussian and logit distributions. Jonas Kristoffer Lindeløv has an interesting discussion of many statistical tests and models being equivalent to a linear model.
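This equivalence is easy to check directly. Below is a minimal, hedged sketch in base R (unweighted, using the built-in mtcars data rather than a survey design, so purely illustrative): a two-sample t-test with equal variances recovers the slope of a one-predictor linear model.

# Two-sample t-test and one-predictor linear model on the same data
t_out <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)
lm_out <- lm(mpg ~ am, data = mtcars)

diff(t_out$estimate) # difference in group means (about 7.24)
coef(lm_out)["am"]   # the lm() slope: the same value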
For example, a one-way ANOVA is a linear model with one categorical independent variable, and a two-sample t-test is an ANOVA where the independent variable has exactly two levels.

When modeling data, it is helpful to first create an equation that provides an overview of what we are modeling. The main structure of these models is as follows:

\\[y_i=\\beta_0 +\\sum_{j=1}^p \\beta_j x_{ij} + \\epsilon_i\\]

where \\(y_i\\) is the outcome, \\(\\beta_0\\) is an intercept, \\(x_1, \\cdots, x_p\\) are the predictors with \\(\\beta_1, \\cdots, \\beta_p\\) as the associated coefficients, and \\(\\epsilon_i\\) is the error. Different models may not include an intercept, may have interactions between different independent variables (\\(x_i\\)), or may have different underlying structures for the dependent variable (\\(y_i\\)). However, all linear models relate the independent variables to the dependent variable in a linear form.

To specify these models in R, the formulas are the same with both survey data and other data. The left side of the formula is the response/dependent variable, and the right side of the formula has the predictor/independent variable(s). There are many symbols used in R to specify the formula. For example, a linear formula mathematically specified as \\[Y_i=\\beta_0+\\beta_1 X_i+\\epsilon_i\\] would be specified in R as y~x, where the intercept is not explicitly included. To fit a model with no intercept, that is, \\[Y_i=\\beta_1 X_i+\\epsilon_i\\] it can be specified as y~x-1. Formula notation details in R can be found in the help file for formula26. A quick overview of the common formula notation is in the following table:

Common symbols in formula notation

Symbol | Example | Meaning
+ | +X | include this variable
- | -X | delete this variable
: | X:Z | include the interaction between these variables
* | X*Z | include these variables and the interactions between them
^n | (X+Z+Y)^3 | include these variables and all interactions up to n-way
I | I(X-Z) | as-is: include a new variable which is the difference of these variables

There are often multiple ways to specify the same formula. For example, consider the following equation using the mtcars data:

\\[mpg_i=\\beta_0+\\beta_1cyl_{i}+\\beta_2disp_{i}+\\beta_3hp_{i}+\\beta_4cyl_{i}disp_{i}+\\beta_5cyl_{i}hp_{i}+\\beta_6disp_{i}hp_{i}+\\epsilon_i\\]

This could be specified as any of the following:

mpg~(cyl+disp+hp)^2
mpg~cyl+disp+hp+cyl:disp+cyl:hp+disp:hp
mpg~cyl*disp+cyl*hp+disp*hp

Note that the following two specifications are not the same:

mpg~cyl:disp:hp (this only has the interactions and not the main effects)
mpg~cyl*disp*hp (this also has the 3-way interaction in addition to the main effects and 2-way interactions)

(A short verification sketch of these formula equivalences appears at the end of this section.)

When using non-survey data, such as experimental or observational data, researchers use the glm() function for linear models. With survey data, however, we use svyglm() from the {survey} package to ensure that we account for the survey design and weights in modeling27. This allows us to generalize a model to the target population and accounts for the fact that the observations in the survey data may not be independent. As discussed in Chapter 6, modeling survey data cannot be directly done in {srvyr} but can be done in the {survey} package (Lumley 2010, 2023). In this chapter, we will provide syntax and examples for linear models, including ANOVA, Gaussian linear regression, and logistic regression.
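As promised above, here is a minimal verification sketch of the formula equivalences, using base R's lm() and the built-in mtcars data (unweighted, so purely illustrative; this is not the survey-design workflow used in this book):

f1 <- mpg ~ (cyl + disp + hp)^2
f2 <- mpg ~ cyl + disp + hp + cyl:disp + cyl:hp + disp:hp

# If the two formulas expand to the same model, the fitted coefficient
# vectors (names and values) match exactly
all.equal(coef(lm(f1, data = mtcars)), coef(lm(f2, data = mtcars)))
## [1] TRUE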
For details on other types of regression, including ordinal regression, log-linear models, and survival analysis, refer to Lumley (2010). Lumley (2010) also discusses custom models such as a negative binomial or Poisson model in Appendix E of his book.

7.2 Analysis of Variance (ANOVA)

In ANOVA, we are testing whether the mean of an outcome is the same across two or more groups. Statistically, we set this up as follows:

\\(H_0: \\mu_1 = \\mu_2= \\dots = \\mu_k\\) where \\(\\mu_i\\) is the mean outcome for group \\(i\\)
\\(H_A: \\text{At least one mean is different}\\)

Some assumptions when using ANOVA on survey data include:

The outcome variable is normally distributed within each group
The variances of the outcome variable between each group are approximately equal
We do NOT assume independence between the groups as with general ANOVA. The covariance is accounted for in the survey design

7.2.1 Syntax

To perform this type of analysis in R, the general syntax is as follows:

des_obj %>%
  svyglm(
    formula = outcome ~ group,
    design = .,
    na.action = na.omit,
    df.resid = NULL
  )

The arguments are:

formula: Formula in the form of outcome~group. The group variable must be a factor or character.
design: a tbl_svy object created by as_survey
na.action: handling of missing data
df.resid: degrees of freedom for Wald tests (optional) - defaults to using degf(design)-(g-1) where \\(g\\) is the number of groups

The function svyglm() does not have the design as the first argument, so the dot (.) notation is used to pass it with a pipe (see Chapter 6 for more details). The default for missing data is na.omit, which means that we remove all records with any missing data in either predictors or outcomes from analyses. There are other options for handling missing data, and we recommend looking at the help documentation for na.omit (run help(na.omit) or ?na.omit) for more information on options to use for na.action. For a discussion of how to handle missing data, see Chapter 3.

7.2.2 Example

Looking at an example will help us discuss the output and how to interpret the results. In RECS, respondents are asked what temperature they set their thermostat to during the day and evening when using the air-conditioning during the summer. To analyze these data, we filter the respondents to only those using AC (ACUsed). Then, if we want to see if there are differences by region, we can use group_by(). A descriptive analysis of the temperature at night (SummerTempNight) set by region and the sample sizes is displayed below.

recs_des %>%
  filter(ACUsed) %>%
  group_by(Region) %>%
  summarize(
    SMN = survey_mean(SummerTempNight, na.rm = TRUE),
    n = unweighted(n()),
    n_na = unweighted(sum(is.na(SummerTempNight)))
  )
## # A tibble: 4 × 5
##   Region      SMN SMN_se     n  n_na
##   <fct>     <dbl>  <dbl> <int> <int>
## 1 Northeast  69.7 0.103   3204     0
## 2 Midwest    71.0 0.0897  3619     0
## 3 South      71.8 0.0536  6065     0
## 4 West       72.5 0.129   3283     0

In the following code, we test whether this temperature varies by region by first using svyglm() to run the test and then using broom::tidy() to display the output. Note that the temperature setting is set to NA when the household does not use air-conditioning, and thus na.action=na.omit is specified to ignore these cases.

anova_out <- recs_des %>%
  svyglm(design = .,
         formula = SummerTempNight ~ Region,
         na.action = na.omit)

tidy(anova_out)
## # A tibble: 4 × 5
##   term          estimate std.error statistic   p.value
##   <chr>            <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)      69.7      0.103    674.   3.69e-111
## 2 RegionMidwest     1.34     0.138      9.68 1.46e- 13
## 3 RegionSouth       2.05     0.128     16.0  1.36e- 22
## 4 RegionWest        2.80     0.177     15.9  2.27e- 22

In the output above, we can see the estimated coefficients (estimate), estimated standard errors of the coefficients (std.error), the t-statistic (statistic), and the p-value for each coefficient. In this output, the intercept represents the reference value of the Northeast region28. The other coefficients indicate the difference in temperature relative to the Northeast region. For example, in the Midwest, temperatures are set, on average, 1.34 degrees higher than in the Northeast during summer nights.

7.3 Gaussian Linear Regression

Gaussian linear regression is a more generalized method than ANOVA, where we fit a model of a continuous outcome with any number of categorical or continuous predictors such that

\\[y_i=\\beta_0 +\\sum_{j=1}^p \\beta_j x_{ij} + \\epsilon_i\\]

where \\(y_i\\) is the outcome, \\(\\beta_0\\) is an intercept, \\(x_1, \\cdots, x_p\\) are the predictors with \\(\\beta_1, \\cdots, \\beta_p\\) as the associated coefficients, and \\(\\epsilon_i\\) is the error.

Assumptions in Gaussian linear regression using survey data include:

The residuals (\\(\\epsilon_i\\)) are normally distributed, but there is not an assumption of independence, and the correlation structure is captured in the survey design object
There is a linear relationship between the outcome variable and the independent variables
The residuals are homoscedastic; that is, the error term is the same across all values of independent variables

7.3.1 Syntax

The syntax for this regression uses the same function as ANOVA but can have more than one variable listed on the right-hand side of the formula:

des_obj %>%
  svyglm(
    formula = outcomevar ~ x1 + x2 + x3,
    design = .,
    na.action = na.omit,
    df.resid = NULL
  )

The arguments are:

formula: Formula in the form of y~x
design: a tbl_svy object created by as_survey
na.action: handling of missing data
df.resid: degrees of freedom for Wald tests (optional) - defaults to using degf(design)-p where \\(p\\) is the rank of the design matrix

As discussed at the beginning of the chapter, the formula on the right-hand side can be specified in many ways, for example, whether interactions are desired or not.

7.3.2 Examples

Example 1: Linear Regression with a Single Variable

In RECS, we can obtain information on the square footage of homes and the electric bills. We assume that square footage is related to the amount of money spent on electricity and examine a model for this. Before any modeling, we first plot the data to determine whether it is reasonable to assume a linear relationship. In Figure 7.1, each hexagon represents the weighted count of households in the bin, and we can see a general positive linear trend (as the square footage increases, so does the amount of money spent on electricity).

FIGURE 7.1: Relationship between square footage and dollars spent on electricity, RECS 2020

Given that the plot shows a potential relationship, fitting a model will allow us to determine if the relationship is statistically significant. The model is fit below with electricity expenditure as the outcome.

m_electric_sqft <- recs_des %>%
  svyglm(design = .,
         formula = DOLLAREL ~ TOTSQFT_EN,
         na.action = na.omit)

tidy(m_electric_sqft)
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  837.      12.8         65.5 4.43e-56
## 2 TOTSQFT_EN     0.299    0.00717     41.7 6.34e-45

In the output above, we can see the estimated coefficients (estimate), estimated standard errors of the coefficients (std.error), the t-statistic (statistic), and the p-value for each coefficient. In these results, we can say that, on average, for every additional square foot of house size, the electricity bill increases by 29.9 cents and that square footage is significantly associated with electricity expenditure. This is a very simple model, and there are likely many more factors related to electricity expenditure, including the type of cooling, number of appliances, location, and more. However, starting with one-variable models can help researchers understand what potential relationships there are between variables before fitting more complex models. Often, researchers start with known relationships before building models to determine what impact additional variables have on the model.

Example 2: Linear Regression with Additional Variables and Interactions

In the following example, a model is fit to predict electricity expenditure, including Census region (factor/categorical), urbanicity (factor/categorical), square footage (double/numeric), and whether air-conditioning is used (logical/categorical), with all two-way interactions also included. As a reminder, using -1 means that we are fitting this model without an intercept.

m_electric_multi <- recs_des %>%
  svyglm(
    design = .,
    formula = DOLLAREL ~ (Region + Urbanicity + TOTSQFT_EN + ACUsed)^2 - 1,
    na.action = na.omit
  )

tidy(m_electric_multi) %>% print(n = 50)
## # A tibble: 25 × 5
##    term                            estimate std.error statistic  p.value
##    <chr>                              <dbl>     <dbl>     <dbl>    <dbl>
##  1 RegionNortheast                  5.44e+2   56.6        9.61  2.37e-11
##  2 RegionMidwest                    7.02e+2   78.1        8.99  1.28e-10
##  3 RegionSouth                      9.39e+2   47.0       20.0   1.02e-20
##  4 RegionWest                       6.03e+2   36.3       16.6   3.54e-18
##  5 UrbanicityUrban Cluster          7.30e+1   81.5        0.896 3.76e- 1
##  6 UrbanicityRural                  2.04e+2   80.7        2.53  1.61e- 2
##  7 TOTSQFT_EN                       2.41e-1    0.0279     8.65  3.28e-10
##  8 ACUsedTRUE                       2.52e+2   54.1        4.66  4.42e- 5
##  9 RegionMidwest:UrbanicityUrban …  1.83e+2   82.4        2.22  3.28e- 2
## 10 RegionSouth:UrbanicityUrban Cl…  1.53e+2   76.0        2.01  5.26e- 2
## 11 RegionWest:UrbanicityUrban Clu…  9.80e+1   75.2        1.30  2.01e- 1
## 12 RegionMidwest:UrbanicityRural    3.13e+2   50.9        6.15  4.92e- 7
## 13 RegionSouth:UrbanicityRural      2.20e+2   55.0        4.00  3.12e- 4
## 14 RegionWest:UrbanicityRural       1.81e+2   58.7        3.08  3.98e- 3
## 15 RegionMidwest:TOTSQFT_EN        -4.88e-2    0.0234    -2.09  4.41e- 2
## 16 RegionSouth:TOTSQFT_EN           2.97e-3    0.0264     0.113 9.11e- 1
## 17 RegionWest:TOTSQFT_EN           -2.93e-2    0.0294    -0.997 3.25e- 1
## 18 RegionMidwest:ACUsedTRUE        -2.93e+2   60.2       -4.86  2.42e- 5
## 19 RegionSouth:ACUsedTRUE          -2.94e+2   57.4       -5.12  1.12e- 5
## 20 RegionWest:ACUsedTRUE           -7.77e+1   47.0       -1.65  1.08e- 1
## 21 UrbanicityUrban Cluster:TOTSQF… -3.93e-2    0.0241    -1.63  1.11e- 1
## 22 UrbanicityRural:TOTSQFT_EN      -6.45e-2    0.0248    -2.60  1.37e- 2
## 23 UrbanicityUrban Cluster:ACUsed… -1.30e+2   60.3       -2.16  3.77e- 2
## 24 UrbanicityRural:ACUsedTRUE      -3.38e+1   59.3       -0.570 5.72e- 1
## 25 TOTSQFT_EN:ACUsedTRUE            8.29e-2    0.0238     3.48  1.35e- 3

As shown above, there are many terms in this model. To test whether coefficients for a term are different from zero, the function regTermTest() can be used.
For example, in the above regression, we can test whether the interaction of region and urbanicity is significant as follows:

urb_reg_test <- regTermTest(m_electric_multi, ~Urbanicity:Region)
urb_reg_test
## Wald test for Urbanicity:Region
##  in svyglm(design = ., formula = DOLLAREL ~ (Region + Urbanicity +
##     TOTSQFT_EN + ACUsed)^2 - 1, na.action = na.omit)
## F = 6.851 on 6 and 35 df: p= 7.2e-05

This output indicates there is a significant interaction between urbanicity and region (p-value \\(<0.0001\\)).

To examine the predictions, residuals, and more from the model, the function augment() from {broom} can be used. The augment() function returns a tibble with the independent and dependent variables and other fit statistics. The augment() function has not been specifically written for objects of class svyglm, so a warning is displayed indicating this. Because it was not written exactly for this class of objects, a little tweaking needs to be done after using augment() to get the predicted (.fitted) and standard error (.se.fit) values. To obtain the standard error of the fitted values, we need to use the attr() function on the .fitted values created by augment().

fitstats <- augment(m_electric_multi) %>%
  mutate(.se.fit = sqrt(attr(.fitted, "var")),
         .fitted = as.numeric(.fitted))

fitstats
## # A tibble: 18,496 × 13
##    DOLLAREL Region    Urbanicity   TOTSQFT_EN ACUsed `(weights)` .fitted
##       <dbl> <fct>     <fct>             <dbl> <lgl>        <dbl>   <dbl>
##  1    1955. West      Urban Area         2100 TRUE         0.492   1397.
##  2     713. South     Urban Area          590 TRUE         1.35    1090.
##  3     335. West      Urban Area          900 TRUE         0.849   1043.
##  4    1425. South     Urban Area         2100 TRUE         0.793   1584.
##  5    1087  Northeast Urban Area          800 TRUE         1.49    1055.
##  6    1896. South     Urban Area         4520 TRUE         1.09    2375.
##  7    1418. South     Urban Area         2100 TRUE         0.851   1584.
##  8    1237. South     Urban Clust…        900 FALSE        1.45    1349.
##  9     538. South     Urban Area          750 TRUE         0.185   1142.
## 10     625. West      Urban Area          760 TRUE         1.06    1002.
## # ℹ 18,486 more rows
## # ℹ 6 more variables: .resid <dbl>, .hat <dbl>, .sigma <dbl>,
## #   .cooksd <dbl>, .std.resid <dbl>, .se.fit <dbl>

These results can then be used in a variety of ways, including examining residual plots as illustrated in the code below and Figure 7.2.

fitstats %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  theme_minimal() +
  xlab("Fitted value of electricity cost") +
  ylab("Residual of model") +
  scale_y_continuous(labels = scales::dollar_format()) +
  scale_x_continuous(labels = scales::dollar_format())

FIGURE 7.2: Residual plot of electric cost model with covariates Region, Urbanicity, TOTSQFT_EN, and ACUsed

Additionally, augment() can be used to predict outcomes for data not used in modeling. Perhaps we would like to predict the energy expenditure for a home in an urban area in the South that uses air-conditioning and is 2,500 square feet. To do this, we first make a tibble including that additional data and then use the newdata argument in the augment() function. As before, to obtain the standard error of the predicted values, we need to use the attr() function.
add_data <- recs_2020 %>%
  select(DOEID, Region, Urbanicity, TOTSQFT_EN, ACUsed, DOLLAREL) %>%
  rbind(
    tibble(
      DOEID = NA,
      Region = "South",
      Urbanicity = "Urban Area",
      TOTSQFT_EN = 2500,
      ACUsed = TRUE,
      DOLLAREL = NA
    )
  ) %>%
  tail(1)

pred_data <- augment(m_electric_multi, newdata = add_data) %>%
  mutate(.se.fit = sqrt(attr(.fitted, "var")),
         .fitted = as.numeric(.fitted))

pred_data
## # A tibble: 1 × 8
##   DOEID Region Urbanicity TOTSQFT_EN ACUsed DOLLAREL .fitted .se.fit
##   <dbl> <fct>  <fct>           <dbl> <lgl>     <dbl>   <dbl>   <dbl>
## 1    NA South  Urban Area       2500 TRUE         NA   1715.    22.6

In the above example, it is predicted that the energy expenditure would be $1,714.57.

7.4 Logistic Regression

Logistic regression is used to model a binary outcome and is a specific case of the generalized linear model (GLM). A GLM uses a link function to connect the response variable to the linear model. In logistic regression, that link is the logit function. Specifically, the model is specified as follows:

\\[ y_i \\sim \\text{Bernoulli}(\\pi_i)\\]

\\[\\begin{equation} \\log \\left(\\frac{\\pi_i}{1-\\pi_i} \\right)=\\beta_0 +\\sum_{j=1}^p \\beta_j x_{ij} \\tag{7.1} \\end{equation}\\]

which can be re-expressed as

\\[ \\pi_i=\\frac{\\exp \\left(\\beta_0 +\\sum_{j=1}^p \\beta_j x_{ij} \\right)}{1+\\exp \\left(\\beta_0 +\\sum_{j=1}^p \\beta_j x_{ij} \\right)}.\\]

where \\(y_i\\) is the outcome, \\(\\beta_0\\) is an intercept, and \\(x_1, \\cdots, x_p\\) are the predictors with \\(\\beta_1, \\cdots, \\beta_p\\) as the associated coefficients.

Assumptions in logistic regression using survey data include:

The outcome variable has two levels
There is a linear relationship between the independent variables and the log odds (Equation (7.1))
The residuals are homoscedastic; that is, the error term is the same across all values of independent variables

7.4.1 Syntax

The syntax for logistic regression is as follows:

des_obj %>%
  svyglm(
    formula = outcomevar ~ x1 + x2 + x3,
    design = .,
    na.action = na.omit,
    df.resid = NULL,
    family = quasibinomial
  )

The arguments are:

formula: Formula in the form of y~x
design: a tbl_svy object created by as_survey
na.action: handling of missing data
df.resid: degrees of freedom for Wald tests (optional) - defaults to using degf(design)-p where \\(p\\) is the rank of the design matrix
family: the error distribution/link function to be used in the model

Note that svyglm() is the same function used in both ANOVA and linear regression. However, we have now added the family argument quasibinomial. While we could use the binomial family, the quasibinomial is recommended because our weights may not be integers and it also allows for overdispersion. The quasibinomial family uses a logit link by default, which is what is specified in the equations above. When specifying the outcome variable, it will likely be specified in one of three ways with survey data:

A two-level factor variable where the first level of the factor indicates a "failure" and the second level indicates a "success"
A numeric variable which is 1 or 0, where 1 indicates a success
A logical variable where TRUE indicates a success

7.4.2 Examples

Example 1: Logistic Regression with a Single Variable

In the following example, the ANES data are used, and we are modeling whether someone usually has trust in the government29 by who someone voted for president in 2020. As a reminder, the leading candidates were Biden and Trump, though people could vote for someone else not in the Democratic or Republican parties. Those votes are all grouped into an "Other" category.
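The example below uses a logical outcome (the third specification above). As a hedged toy sketch (base R's glm() on made-up data, not the survey workflow), the logical and 0/1 numeric codings fit identically:

set.seed(1)
x <- rnorm(100)
y_logical <- runif(100) < plogis(x) # TRUE indicates a success
y_numeric <- as.numeric(y_logical)  # 1 indicates a success

coef(glm(y_logical ~ x, family = quasibinomial))
coef(glm(y_numeric ~ x, family = quasibinomial)) # identical coefficients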
We first create a binary outcome for trusting in the government and plot the data. A scatter plot of the raw data is not useful as it is all 0 and 1 outcomes, so instead, we plot a summary of the data.

anes_des_der <- anes_des %>%
  mutate(TrustGovernmentUsually = case_when(
    is.na(TrustGovernment) ~ NA,
    TRUE ~ TrustGovernment %in% c("Always", "Most of the time")
  ))

anes_des_der %>%
  group_by(VotedPres2020_selection) %>%
  summarize(pct_trust = survey_mean(TrustGovernmentUsually,
                                    na.rm = TRUE,
                                    proportion = TRUE,
                                    vartype = "ci"),
            .groups = "drop") %>%
  filter(complete.cases(.)) %>%
  ggplot(aes(x = VotedPres2020_selection, y = pct_trust,
             fill = VotedPres2020_selection)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = pct_trust_low, ymax = pct_trust_upp),
                width = .2) +
  scale_fill_manual(values = c("#0b3954", "#bfd7ea", "#8d6b94")) +
  xlab("Election choice (2020)") +
  ylab("Usually trust the government") +
  scale_y_continuous(labels = scales::percent) +
  guides(fill = "none") +
  theme_minimal()

FIGURE 7.3: Relationship between candidate selection and trust in government, ANES 2020

By looking at Figure 7.3, it appears that people who voted for Trump are more likely to say that they usually have trust in the government compared to those who voted for Biden and other candidates. To determine if this insight is accurate, we next fit the model.

logistic_trust_vote <- anes_des_der %>%
  svyglm(design = .,
         formula = TrustGovernmentUsually ~ VotedPres2020_selection,
         family = quasibinomial)

tidy(logistic_trust_vote)
## # A tibble: 3 × 5
##   term                         estimate std.error statistic  p.value
##   <chr>                           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)                    -1.96     0.0714    -27.5  2.07e-31
## 2 VotedPres2020_selectionTrump    0.435    0.0920      4.72 1.98e- 5
## 3 VotedPres2020_selectionOther   -0.655    0.440      -1.49 1.43e- 1

In the output above, we can see the estimated coefficients (estimate), estimated standard errors of the coefficients (std.error), the t-statistic (statistic), and the p-value for each coefficient. Because these estimates are on the link scale, this output indicates that the log odds of usually trusting the government are 0.435 higher for respondents who voted for Trump than for those who voted for Biden (the reference level). Sometimes it is easier to talk about the odds instead of the log odds. In this case, we can look at the exponentiated coefficients, which are odds ratios:

tidy(logistic_trust_vote, exponentiate = TRUE) %>%
  select(term, estimate)
## # A tibble: 3 × 2
##   term                         estimate
##   <chr>                           <dbl>
## 1 (Intercept)                     0.141
## 2 VotedPres2020_selectionTrump    1.54
## 3 VotedPres2020_selectionOther    0.520

We can interpret this as saying that the odds of usually trusting the government for someone who voted for Trump are 154% of the odds for a person who voted for Biden (the reference level). In comparison, the odds for a person who voted for neither Biden nor Trump are 52% of the odds for someone who voted for Biden. As with linear regression, the augment() function can be used to predict values. By default, the prediction is on the scale of the link function (the log odds), not the probability.
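As a quick, hedged check of that default, converting the link-scale coefficients above by hand with base R's inverse logit, plogis(), recovers the fitted probabilities shown in the output below:

plogis(-1.96)         # Biden (the reference level): about 0.123
plogis(-1.96 + 0.435) # Trump: about 0.178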
To predict the probability, add an argument of type.predict="response" as demonstrated below:

logistic_trust_vote %>%
  augment(type.predict = "response") %>%
  mutate(.se.fit = sqrt(attr(.fitted, "var")),
         .fitted = as.numeric(.fitted)) %>%
  select(TrustGovernmentUsually, VotedPres2020_selection, .fitted, .se.fit)
## # A tibble: 6,212 × 4
##    TrustGovernmentUsually VotedPres2020_selection .fitted .se.fit
##    <lgl>                  <fct>                     <dbl>   <dbl>
##  1 FALSE                  Other                    0.0681 0.0279
##  2 FALSE                  Biden                    0.123  0.00772
##  3 FALSE                  Biden                    0.123  0.00772
##  4 FALSE                  Trump                    0.178  0.00919
##  5 FALSE                  Biden                    0.123  0.00772
##  6 FALSE                  Trump                    0.178  0.00919
##  7 FALSE                  Biden                    0.123  0.00772
##  8 FALSE                  Biden                    0.123  0.00772
##  9 TRUE                   Biden                    0.123  0.00772
## 10 FALSE                  Biden                    0.123  0.00772
## # ℹ 6,202 more rows

Example 2: Interaction Effects

Let's look at another example with interaction effects. If we're interested in understanding the demographics of people who voted for Biden, we could include Gender and Education in our model. First, we need to create an indicator for whether someone voted for Biden. Note that this indicator places anyone who did not vote at all into VoteBiden = 0.

anes_des_ind <- anes_des %>%
  mutate(VoteBiden = case_when(VotedPres2020_selection == "Biden" ~ 1,
                               TRUE ~ 0))

Let's first look at the main effects of gender and education.

log_biden_main <- anes_des_ind %>%
  svyglm(design = .,
         formula = VoteBiden ~ Gender + Education,
         family = quasibinomial)

tidy(log_biden_main)
## # A tibble: 6 × 5
##   term                 estimate std.error statistic       p.value
##   <chr>                   <dbl>     <dbl>     <dbl>         <dbl>
## 1 (Intercept)            -1.24     0.191      -6.48 0.0000000545
## 2 GenderFemale            0.157    0.0763      2.05 0.0458
## 3 EducationHigh school    0.384    0.202       1.90 0.0631
## 4 EducationPost HS        0.619    0.186       3.32 0.00175
## 5 EducationBachelor's     1.20     0.191       6.32 0.0000000961
## 6 EducationGraduate       1.53     0.211       7.26 0.00000000371

This main effects model indicates that the log odds of voting for Biden are 1.53 higher for respondents with a graduate degree than for respondents with less than a high school degree (the reference level). However, we see that gender is not significant. It is possible that there is an interaction between gender and education. To determine this, we can create a model that includes the interaction effects:

log_biden_int <- anes_des_ind %>%
  svyglm(design = .,
         formula = VoteBiden ~ (Gender + Education)^2,
         family = quasibinomial)

tidy(log_biden_int)
## # A tibble: 10 × 5
##    term                             estimate std.error statistic p.value
##    <chr>                               <dbl>     <dbl>     <dbl>   <dbl>
##  1 (Intercept)                      -0.994      0.260     -3.82  4.32e-4
##  2 GenderFemale                     -0.377      0.441     -0.856 3.97e-1
##  3 EducationHigh school              0.0762     0.290      0.263 7.94e-1
##  4 EducationPost HS                  0.411      0.273      1.51  1.39e-1
##  5 EducationBachelor's               1.01       0.270      3.75  5.30e-4
##  6 EducationGraduate                 1.13       0.282      4.02  2.36e-4
##  7 GenderFemale:EducationHigh scho…  0.665      0.490      1.36  1.82e-1
##  8 GenderFemale:EducationPost HS     0.474      0.452      1.05  3.00e-1
##  9 GenderFemale:EducationBachelor's  0.436      0.451      0.967 3.39e-1
## 10 GenderFemale:EducationGraduate    0.844      0.463      1.82  7.56e-2

The results from the interaction model show a single interaction effect, between gender and a graduate degree, that is marginally significant (p = 0.076). To better understand what this interaction means, we want to plot the predicted probabilities. Let's first obtain the predicted probabilities for each possible combination of variables using the augment() function.
log_biden_pred <- log_biden_int %>%
  augment(type.predict = "response") %>%
  mutate(.se.fit = sqrt(attr(.fitted, "var")),
         .fitted = as.numeric(.fitted)) %>%
  select(VoteBiden, Gender, Education, .fitted, .se.fit)

We can then use this information to plot the predicted probabilities to better understand the interaction effects. In an interaction plot, the y-axis shows the predicted probabilities, one of our x-variables is on the x-axis, and the other is represented by multiple lines. Figure 7.4 shows the interaction plot with gender on the x-axis and education represented by the lines.

biden_int_plot <- log_biden_pred %>%
  filter(VoteBiden == 1) %>%
  distinct() %>%
  arrange(Gender, Education) %>%
  mutate(Education = fct_reorder2(Education, Gender, .fitted)) %>%
  ggplot(aes(x = Gender, y = .fitted, group = Education,
             color = Education, linetype = Education)) +
  geom_line(linewidth = 1.1) +
  scale_color_manual(values = book_colors) +
  ylab("Predicted Probability of Voting for Biden") +
  guides(fill = "none") +
  theme_minimal()

biden_int_plot

FIGURE 7.4: Interaction Plot of Gender and Education Predicting the Probability of Voting for Biden

From this plot, we can see that among respondents with less than a high school education, males were more likely to vote for Biden than females, while among respondents with a graduate degree, females were more likely to vote for Biden than males. Interactions in models can be difficult to understand from the coefficients alone. Using interaction plots can help others understand the nuances of the results.

7.5 Exercises

The type of housing unit may have an impact on energy expenses. Is there any relationship between housing unit type (HousingUnitType) and total energy expenditure (TOTALDOL)? First, find the average energy expenditure by housing unit type as a descriptive analysis and then do the test. The reference level in the comparison should be the housing unit type that is most common.

recs_des %>%
  group_by(HousingUnitType) %>%
  summarize(Expense = survey_mean(TOTALDOL, na.rm = TRUE),
            HUs = survey_total()) %>%
  arrange(desc(HUs))
## # A tibble: 5 × 5
##   HousingUnitType            Expense Expense_se       HUs       HUs_se
##   <fct>                        <dbl>      <dbl>     <dbl>        <dbl>
## 1 Single-family detached       2205.       9.36 77067692. 0.00000277
## 2 Apartment: 5 or more units   1108.      13.7  22835862. 0.000000226
## 3 Apartment: 2-4 Units         1407.      24.2   9341795. 0.119
## 4 Single-family attached       1653.      22.3   7451177. 0.114
## 5 Mobile home                  1773.      26.2   6832499. 0.0000000927

exp_unit_out <- recs_des %>%
  mutate(HousingUnitType = fct_infreq(HousingUnitType, NWEIGHT)) %>%
  svyglm(
    design = .,
    formula = TOTALDOL ~ HousingUnitType,
    na.action = na.omit
  )

tidy(exp_unit_out)
## # A tibble: 5 × 5
##   term                             estimate std.error statistic  p.value
##   <chr>                               <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)                         2205.      9.36     236.  2.53e-84
## 2 HousingUnitTypeApartment: 5 or …   -1097.     16.5      -66.3 3.52e-54
## 3 HousingUnitTypeApartment: 2-4 U…    -798.     28.0      -28.5 1.37e-34
## 4 HousingUnitTypeSingle-family at…    -551.     25.0      -22.1 5.28e-29
## 5 HousingUnitTypeMobile home          -431.     27.4      -15.7 5.36e-22

# Single-family detached units are most common
# There is a significant relationship between energy expenditure and
# housing unit type

Does temperature play a role in energy expenditure? Cooling degree days are a measure of how hot a place is. CDD65 for a given day indicates the number of degrees Fahrenheit warmer than 65°F (18.3°C) it is in a location.
On a day that averages 65°F or below, CDD65 = 0, while a day that averages 85°F would have CDD65 = 20 because it is 20 degrees warmer. For each day in the year, this is summed to give an indicator of how hot the place is throughout the year. Similarly, HDD65 indicates the number of degrees colder than 65°F (18.3°C) a day is30. Can energy expenditure be predicted using these temperature indicators along with square footage? Is there a significant relationship? Include main effects and two-way interactions.

temps_sqft_exp <- recs_des %>%
  svyglm(
    design = .,
    formula = DOLLAREL ~ (TOTSQFT_EN + CDD65 + HDD65)^2,
    na.action = na.omit
  )

tidy(temps_sqft_exp)
## # A tibble: 7 × 5
##   term                estimate  std.error statistic  p.value
##   <chr>                  <dbl>      <dbl>     <dbl>    <dbl>
## 1 (Intercept)      741.        70.5          10.5   1.44e-14
## 2 TOTSQFT_EN         0.272      0.0471        5.77  4.27e- 7
## 3 CDD65              0.0293     0.0227        1.29  2.02e- 1
## 4 HDD65             -0.00111    0.0104       -0.107 9.15e- 1
## 5 TOTSQFT_EN:CDD65   0.0000459  0.0000154     2.97  4.43e- 3
## 6 TOTSQFT_EN:HDD65  -0.00000840 0.00000633   -1.33  1.90e- 1
## 7 CDD65:HDD65        0.00000533 0.00000355    1.50  1.39e- 1

Continuing with our results from question 2, create a plot between the actual and predicted expenditures and a residual plot for the predicted expenditures.

temps_sqft_exp_fit <- temps_sqft_exp %>%
  augment() %>%
  mutate(.se.fit = sqrt(attr(.fitted, "var")), # extract the variance of the fitted value
         .fitted = as.numeric(.fitted))

temps_sqft_exp_fit %>%
  ggplot(aes(x = DOLLAREL, y = .fitted)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  xlab("Actual expenditures") +
  ylab("Predicted expenditures") +
  theme_minimal()

FIGURE 7.5: Actual and predicted electricity expenditures

temps_sqft_exp_fit %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  xlab("Predicted expenditure") +
  ylab("Residual value of expenditure") +
  theme_minimal()

FIGURE 7.6: Residual plot of electric cost model with covariates TOTSQFT_EN, CDD65, and HDD65

Early voting expanded in 202031. Build a logistic model predicting early voting in 2020 (EarlyVote2020) using age (Age), education (Education), and party identification (PartyID). Include two-way interactions.

earlyvote_mod <- anes_des %>%
  filter(!is.na(EarlyVote2020)) %>%
  svyglm(
    design = .,
    formula = EarlyVote2020 ~ (Age + Education + PartyID)^2,
    family = quasibinomial
  )

tidy(earlyvote_mod) %>% arrange(p.value)
## # A tibble: 46 × 5
##    term                             estimate std.error statistic p.value
##    <chr>                               <dbl>     <dbl>     <dbl>   <dbl>
##  1 Age:PartyIDIndependent            -0.0587    0.0165     -3.55  0.0121
##  2 PartyIDIndependent                 4.98      1.71        2.92  0.0268
##  3 Age:PartyIDNot very strong repu…  -0.0501    0.0197     -2.54  0.0440
##  4 PartyIDNot very strong republic…   4.14      1.65        2.51  0.0462
##  5 (Intercept)                        1.47      0.863       1.70  0.139
##  6 EducationGraduate                  1.52      0.954       1.59  0.163
##  7 PartyIDStrong republican           1.77      1.29        1.37  0.221
##  8 EducationHigh school:PartyIDStr…  -1.37      1.01       -1.35  0.226
##  9 EducationGraduate:PartyIDStrong…  -1.28      1.00       -1.28  0.249
## 10 EducationPost HS:PartyIDIndepen…  -1.47      1.39       -1.06  0.331
## # ℹ 36 more rows

Continuing from the previous exercise, predict the probability of early voting for two people. Both are 28 years old and have a graduate degree, but one person is a strong Democrat, and the other is a strong Republican.
add_vote_dat <- anes_2020 %>%
  select(EarlyVote2020, Age, Education, PartyID) %>%
  rbind(tibble(
    EarlyVote2020 = NA,
    Age = 28,
    Education = "Graduate",
    PartyID = c("Strong democrat", "Strong republican")
  )) %>%
  tail(2)

log_ex_2_out <- earlyvote_mod %>%
  augment(newdata = add_vote_dat, type.predict = "response") %>%
  mutate(.se.fit = sqrt(attr(.fitted, "var")), # extract the variance of the fitted value
         .fitted = as.numeric(.fitted))

References
Bollen, Kenneth A., Paul P. Biemer, Alan F. Karr, Stephen Tueller, and Marcus E. Berzofsky. 2016. "Are Survey Weights Needed? A Review of Diagnostic Tests in Regression Analysis." Annual Review of Statistics and Its Application 3 (1): 375–92. https://doi.org/10.1146/annurev-statistics-011516-012958.
Gelman, Andrew. 2007. "Struggles with Survey Weighting and Regression Modeling." Statistical Science 22 (2): 153–64. https://doi.org/10.1214/088342306000000691.
Lumley, Thomas. 2010. Complex Surveys: A Guide to Analysis Using R. John Wiley & Sons.
———. 2023. survey: Analysis of Complex Survey Samples. http://r-survey.r-forge.r-project.org/survey/.

Use help(formula) or ?formula in R or find the documentation online at https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html↩︎ There is some debate about whether weights should be used in regression (Gelman 2007; Bollen et al. 2016). However, for the purposes of providing complete information on how to analyze complex survey data, this chapter includes weights.↩︎ To change the reference level, reorder the factor before modeling using the function relevel() from {stats} or using one of many factor ordering functions in {forcats} such as fct_relevel() or fct_infreq().↩︎ Question: How often can you trust the federal government in Washington to do what is right?↩︎ https://www.eia.gov/energyexplained/units-and-calculators/degree-days.php↩︎ https://www.npr.org/2020/10/26/927803214/62-million-and-counting-americans-are-breaking-early-voting-records↩︎

"],["c08-communicating-results.html", "Chapter 8 Communicating results 8.1 Introduction 8.2 Describing results through text 8.3 Visualizing data", " Chapter 8 Communicating results

Prerequisites

For this chapter, load the following packages:

library(tidyverse)
library(survey)
library(srvyr)
library(srvyrexploR)
library(gt)
library(gtsummary)

We will be using data from ANES as described in Chapter 4. As a reminder, here is the code to create the design object to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter 4 for more information).

targetpop <- 231592693
data(anes_2020)

anes_adjwgt <- anes_2020 %>%
  mutate(Weight = Weight / sum(Weight) * targetpop)

anes_des <- anes_adjwgt %>%
  as_survey_design(
    weights = Weight,
    strata = Stratum,
    ids = VarUnit,
    nest = TRUE
  )

8.1 Introduction

After finishing the analysis and modeling, we proceed to the important task of communicating the survey results. Our audience may range from seasoned researchers familiar with our survey data to newcomers encountering the information for the first time. We should aim to explain the methodology and analysis while presenting findings in an accessible way, and it is our responsibility to report information with care. Before beginning any dissemination of results, consider questions such as:

How will we present results? Examples include a website, print, or other media.
Based on the media type, we might limit or enhance the use of graphical representation.

What is the audience’s familiarity with the study and/or data? Audiences can range from the general public to data experts. If we anticipate limited knowledge about the study, we should provide detailed descriptions (we discuss recommendations later in the chapter).

What are we trying to communicate? It could be summary statistics, trends, patterns, or other insights. Tables might suit summary statistics, while plots are better at conveying trends and patterns.

Is the audience accustomed to interpreting plots? If not, include explanatory text to guide them on how to interpret the plots effectively.

What is the audience’s statistical knowledge? If the audience does not have a strong statistics background, provide text on standard errors, confidence intervals, and other estimate types to enhance understanding.

8.2 Describing results through text

As analysts, our emphasis is often on the data, and communicating results can sometimes be overlooked. First, we need to identify the appropriate information to share with our audience. Chapters 2 and 3 provide insights into factors we need to consider during analysis, and they remain relevant when presenting results to others.

8.2.1 Methodology

If we are using existing data, methodologically sound surveys provide documentation about how the survey was fielded, the questionnaires, and other information necessary for analyses. For example, a survey’s methodology reports should include the population of interest, sampling procedures, response rates, questionnaire documentation, weighting, and a general overview of disclosure statements. Many American organizations follow the American Association for Public Opinion Research’s (AAPOR) Transparency Initiative, which requires organizations to include specific details in their methodology, making it clear how we can and should analyze the results. Being transparent about these methods is vital for the scientific rigor of the field.

The details provided in Chapter 2 about the survey process should be shared with the audience when presenting the results. When using publicly available data, like the examples in this book, we can often link to the methodology report in our final output. We should also provide high-level information so the audience can quickly grasp the context around the findings. For example, we can mention when and where the study was conducted, the population’s age range, or other contextual details. This information helps the audience understand how generalizable the results are. Providing this material is especially important when no methodology report is available for the analyzed data. For example, if a researcher conducted a new survey for a specific purpose, we should document and present all the pertinent information during the analysis and reporting process. Adhering to the AAPOR Transparency Initiative guidelines is a reliable way to guarantee that all essential information is communicated to the audience.

8.2.2 Analysis

Along with the survey methodology and weight calculations, we should also share our approach to preparing, cleaning, and analyzing the data. For example, in Chapter 6, we compared education distributions from the ANES survey to the American Community Survey (ACS). To make the comparison, we had to collapse the education categories provided in the ANES data to match the ACS; a minimal sketch of this kind of recoding follows.
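One common way to collapse factor levels is fct_collapse() from {forcats}. The sketch below shows only the general pattern, not the book’s exact recoding: the grouping and the level names “Bachelor's” and “Graduate” are illustrative assumptions about the ANES factor levels.

library(tidyverse)
library(srvyrexploR)

data(anes_2020)

anes_2020 %>%
  mutate(
    # Collapse two detailed levels into one broader category;
    # the level names here are assumptions for illustration only
    EducationGroup = fct_collapse(
      Education,
      "Bachelor's or higher" = c("Bachelor's", "Graduate")
    )
  ) %>%
  count(EducationGroup)

Documenting a recode like this alongside the original question text makes it easy for readers to see exactly how the two surveys were aligned.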
The process for this particular example may seem straightforward (like combining Bachelor’s and Graduate degrees into a single category), but there are multiple ways to deal with the data, and our choice is just one of many. We should document both the original ANES question and response options and the steps we took to match them to the ACS data. This transparency helps clarify our analysis for our audience.

Missing data is another instance where we want to be unambiguous and upfront with our audience. In this book, numerous examples and exercises remove missing data, as this is often the easiest way to handle them. However, there are circumstances where missing data holds substantive importance, and excluding them could introduce bias (see Chapter 11). Being transparent about our handling of missing data is important to maintaining the integrity of our analysis and ensuring a comprehensive understanding of the results.

8.2.3 Results

While tables and graphs are commonly used to communicate results, there are instances where text can be more effective in sharing information. Narrative details, such as context around point estimates or model coefficients, can go a long way in improving our communication. We have several strategies to effectively convey the significance of the data to the audience through text.

First, we can highlight important data points in a sentence using plain language. For example, if we were looking at election polling data conducted before an election, we could say something like:

As of [DATE], an estimated XX% of registered U.S. voters say they will vote for [CANDIDATE NAME] for president in the [YEAR] general election.

This sentence provides key pieces of information in a straightforward way:

[DATE]: Given that polling data is time-specific, providing the date of reference lets the audience know when this data was valid.

Registered U.S. voters: This tells the audience who we surveyed, letting them know the target population.

XX%: This part provides the estimated percentage of people voting for a specific candidate for a specific office.

[YEAR] general election: As with the bullet above, adding this gives more context about the election type and year. The estimate would take on a different meaning if we changed it to a primary election instead of a general election.

We also included the word “estimated.” When presenting aggregate survey results, we have errors around each estimate. We want to convey this uncertainty rather than talk in absolutes. Words like “estimated,” “on average,” or “around” can help communicate this uncertainty to the audience. Instead of saying “XX%,” we can also say “XX% (+/- Y%)” to show the margin of error. Confidence intervals can also be incorporated into the text to assist readers.

Second, providing context and discussing the meaning behind a point estimate can help the audience glean some insight into why the data is important. For example, when comparing two values, it can be helpful to highlight whether there are statistically significant differences and explain the impact and relevance of this information. This is where we, as analysts, should do our best to be mindful of biases and present the facts logically. Keep in mind that how we discuss these findings can greatly influence how the audience interprets them. If we include speculation, phrases like “the authors speculate” or “these findings may indicate” convey the uncertainty around the idea while still offering a plausible interpretation. The sketch below shows one way to turn an estimate and its standard error into this kind of sentence.
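As a concrete illustration of the “XX% (+/- Y%)” convention, here is a minimal sketch that builds a reportable sentence from a proportion and its standard error. It assumes a summary tibble like trust_gov (created with survey_prop() in Section 8.3.1.1 below) and uses the normal-approximation multiplier 1.96 for a 95% margin of error; a design-based critical value may be more appropriate in practice.

library(tidyverse)

# trust_gov is assumed to have columns TrustGovernment, trust_gov_p,
# and trust_gov_p_se, as produced by survey_prop()
trust_gov %>%
  mutate(
    moe = 1.96 * trust_gov_p_se, # approximate 95% margin of error
    sentence = str_glue(
      "An estimated {round(100 * trust_gov_p, 1)}% ",
      "(+/- {round(100 * moe, 1)} percentage points) of U.S. voters say ",
      "they can trust the federal government '{TrustGovernment}'."
    )
  ) %>%
  pull(sentence)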
Additionally, we can present alternative viewpoints or competing discussion points to explain the uncertainty in the results.

8.3 Visualizing data

Although discussing key findings in the text is important, presenting large amounts of data is often more digestible for the audience in tables or visualizations. Effectively combining text, tables, and graphs can be powerful in communicating results. This section provides examples of using the {gt}, {gtsummary}, and {ggplot2} packages to enhance the dissemination of results.

8.3.1 Tables

Tables are a great way to provide a large amount of data when individual data points need to be examined. However, it is important to present tables in a reader-friendly format: numbers should align, rows and columns should be easy to follow, and the table size should not compromise readability. Using key visualization techniques, we can create tables that are informative and nice to look at. Many packages create easy-to-read tables (e.g., {kable} + {kableExtra}, {gt}, {gtsummary}, {DT}, {formattable}, {flextable}, {reactable}). While we focus on {gt} here, we encourage learning about the others, as they may have additional helpful features. We appreciate the {gt} package’s flexibility, support for pipes (e.g., %>%), and numerous extensions. Please note that, at this time, {gtsummary} needs additional features to be widely used for survey analysis, particularly because of its inability to work with replicate designs. We provide one example using {gtsummary} and hope it evolves into a more comprehensive tool over time.

8.3.1.1 Transitioning {srvyr} output to a {gt} table

Let’s start by using some of the data we calculated earlier in this book. In Chapter 6, we looked at data on trust in government with the proportions calculated below:

trust_gov <- anes_des %>%
  drop_na(TrustGovernment) %>%
  group_by(TrustGovernment) %>%
  summarize(trust_gov_p = survey_prop())

trust_gov

## # A tibble: 5 × 3
##   TrustGovernment     trust_gov_p trust_gov_p_se
##   <fct>                     <dbl>          <dbl>
## 1 Always                   0.0155        0.00204
## 2 Most of the time         0.132         0.00553
## 3 About half the time      0.309         0.00829
## 4 Some of the time         0.434         0.00855
## 5 Never                    0.110         0.00566

The default output generated by R may work for initial viewing inside RStudio or when creating basic output in an R Markdown or Quarto document. However, when presenting these results in other publications, such as the print version of this book or other formal dissemination modes, modifying the display can improve our reader’s experience. Looking at the output from trust_gov, a couple of improvements are obvious: (1) switching to percentages instead of proportions and (2) replacing the variable names in the column headers with informative labels. The {gt} package is a good tool for implementing better labeling and creating publishable tables. Let’s walk through some code as we make a few changes to improve the table’s usefulness.

First, we initiate the table with the gt() function. Next, we use the rowname_col argument to designate the TrustGovernment column as the labels for each row (called the table “stub”). We apply the cols_label() function to create informative column labels instead of variable names, and then the tab_spanner() function to add a label across multiple columns. In this case, we label all columns except the stub with “Trust in Government, 2020”. We then format the proportions into percentages with the fmt_percent() function and reduce the number of decimals shown with decimals = 1.
Finally, the tab_caption() function adds a table title for the HTML version of the book. We can use the caption for cross-referencing in R Markdown, Quarto, and bookdown, as well as for adding it to the list of tables in the book.

trust_gov_gt <- trust_gov %>%
  gt(rowname_col = "TrustGovernment") %>%
  cols_label(trust_gov_p = "%",
             trust_gov_p_se = "s.e. (%)") %>%
  tab_spanner(label = "Trust in Government, 2020",
              columns = c(trust_gov_p, trust_gov_p_se)) %>%
  fmt_percent(decimals = 1)

trust_gov_gt %>%
  tab_caption("Example of gt table with trust in government estimate")
TABLE 8.1: Example of gt table with trust in government estimate

                       Trust in Government, 2020
                       %         s.e. (%)
Always                 1.6%      0.2%
Most of the time       13.2%     0.6%
About half the time    30.9%     0.8%
Some of the time       43.4%     0.9%
Never                  11.0%     0.6%

We can add a few more enhancements, such as a title, a data source note, and a footnote with the question information, using the functions tab_header(), tab_source_note(), and tab_footnote(). If having the percentage sign in both the header and the cells seems redundant, we can opt for fmt_number() instead of fmt_percent() and scale the numbers by 100 with scale_by = 100.

trust_gov_gt2 <- trust_gov_gt %>%
  tab_header("American voter's trust in the federal government, 2020") %>%
  tab_source_note("American National Election Studies, 2020") %>%
  tab_footnote(
    "Question text: How often can you trust the federal government in Washington to do what is right?"
  ) %>%
  fmt_number(scale_by = 100, decimals = 1)

trust_gov_gt2
TABLE 8.2: Example of gt table with trust in government estimates and additional context

American voter's trust in the federal government, 2020
                       Trust in Government, 2020
                       %        s.e. (%)
Always                 1.6      0.2
Most of the time       13.2     0.6
About half the time    30.9     0.8
Some of the time       43.4     0.9
Never                  11.0     0.6
American National Election Studies, 2020
Question text: How often can you trust the federal government in Washington to do what is right?

Expanding tables using {gtsummary}

The {gtsummary} package simultaneously summarizes data and creates publication-ready tables. Initially designed for clinical trial data, it has been extended to include survey analysis in certain capacities. At this time, it is only compatible with survey objects that use Taylor series linearization, not replicate methods. While it offers a restricted set of summary statistics, the following are available for categorical variables:

{n} frequency
{N} denominator, or cohort size
{p} percentage
{p.std.error} standard error of the sample proportion
{deff} design effect of the sample proportion
{n_unweighted} unweighted frequency
{N_unweighted} unweighted denominator
{p_unweighted} unweighted formatted percentage

The following summary statistics are available for continuous variables:

{median} median
{mean} mean
{mean.std.error} standard error of the sample mean
{deff} design effect of the sample mean
{sd} standard deviation
{var} variance
{min} minimum
{max} maximum
{p##} any integer percentile, where ## is an integer from 0 to 100
{sum} sum

In the following example, we build a table using {gtsummary}, similar to the table in the {gt} example. The main function we use is tbl_svysummary(). In this function, we include the variables we want to analyze in the include argument and define the statistics we want to display in the statistic argument. To specify the statistics, we apply the syntax from the {glue} package, enclosing the statistic names listed above within curly brackets. For example, to specify that we want the proportion followed by the standard error of the proportion in parentheses, we use {p} ({p.std.error}).
anes_des_gtsum <- anes_des %>%
  tbl_svysummary(include = TrustGovernment,
                 statistic = list(all_categorical() ~ "{p} ({p.std.error})"))

anes_des_gtsum
TABLE 8.3: Example of gtsummary table with trust in government estimates

Characteristic                                                      N = 231,034,125¹
PRE: How often trust government in Washington to do what is right [revised]
    Always                                                          1.6 (0.00)
    Most of the time                                                13 (0.01)
    About half the time                                             31 (0.01)
    Some of the time                                                43 (0.01)
    Never                                                           11 (0.01)
    Unknown                                                         673,773
¹ % (SE(%))

The default table includes the weighted number of missing (or Unknown) records. The standard error is reported as a proportion, while the proportion itself is styled as a percentage. In the next step, we remove the Unknown category by setting the missing argument to “no” and format the standard error as a percentage using the digits argument. To improve the table for publication, we provide a more polished label for the TrustGovernment variable using the label argument.
anes_des_gtsum2 <- anes_des %>%
  tbl_svysummary(
    include = TrustGovernment,
    statistic = list(all_categorical() ~ "{p} ({p.std.error})"),
    missing = "no",
    digits = list(TrustGovernment ~ style_percent),
    label = list(TrustGovernment ~ "Trust in Government, 2020")
  )

anes_des_gtsum2
TABLE 8.4: Example of gtsummary table with trust in government estimates with labeling and digits options

Characteristic                      N = 231,034,125¹
Trust in Government, 2020
    Always                          1.6 (0.2)
    Most of the time                13 (0.6)
    About half the time             31 (0.8)
    Some of the time                43 (0.9)
    Never                           11 (0.6)
¹ % (SE(%))

To exclude the term “Characteristic” and the estimated population size, we can update the label with the modify_header() function. Further adjustments can be made based on personal preferences, organizational guidelines, or other style guides. If we prefer having the standard error in the header, similar to the {gt} table, instead of in the footnote (the {gtsummary} default), we can make this change by specifying stat_0 in the modify_header() function. Additionally, using modify_footnote() with update = everything() ~ NA removes the standard error from the footnote. After transforming the object into a gt table with as_gt(), we can add footnotes and a title using the same methods explained in Section 8.3.1.1.

anes_des_gtsum3 <- anes_des %>%
  tbl_svysummary(
    include = TrustGovernment,
    statistic = list(all_categorical() ~ "{p} ({p.std.error})"),
    missing = "no",
    digits = list(TrustGovernment ~ style_percent),
    label = list(TrustGovernment ~ "Trust in Government, 2020")
  ) %>%
  modify_footnote(update = everything() ~ NA) %>%
  modify_header(label = " ", stat_0 = "% (s.e.)") %>%
  as_gt() %>%
  tab_header("American voter's trust in the federal government, 2020") %>%
  tab_source_note("American National Election Studies, 2020") %>%
  tab_footnote(
    "Question text: How often can you trust the federal government in Washington to do what is right?"
  )

anes_des_gtsum3
TABLE 8.5: Example of gtsummary table with trust in government estimates with more labeling options and context

American voter's trust in the federal government, 2020
                              % (s.e.)
Trust in Government, 2020
    Always                    1.6 (0.2)
    Most of the time          13 (0.6)
    About half the time       31 (0.8)
    Some of the time          43 (0.9)
    Never                     11 (0.6)
American National Election Studies, 2020
Question text: How often can you trust the federal government in Washington to do what is right?

We can also include continuous variables in the table. Below, we add a summary of the age variable by updating the include, statistic, and digits arguments.

anes_des_gtsum4 <- anes_des %>%
  tbl_svysummary(
    include = c(TrustGovernment, Age),
    statistic = list(
      all_categorical() ~ "{p} ({p.std.error})",
      all_continuous() ~ "{mean} ({mean.std.error})"
    ),
    missing = "no",
    digits = list(TrustGovernment ~ style_percent,
                  Age ~ c(1, 2)),
    label = list(TrustGovernment ~ "Trust in Government, 2020")
  ) %>%
  modify_footnote(update = everything() ~ NA) %>%
  modify_header(label = " ", stat_0 = "% (s.e.)") %>%
  as_gt() %>%
  tab_header("American voter's trust in the federal government, 2020") %>%
  tab_source_note("American National Election Studies, 2020") %>%
  tab_footnote(
    "Question text: How often can you trust the federal government in Washington to do what is right?"
  ) %>%
  tab_caption("Example of gtsummary table with trust in government estimates and average age")

anes_des_gtsum4
TABLE 8.6: Example of gtsummary table with trust in government estimates and average age

American voter's trust in the federal government, 2020
                              % (s.e.)
Trust in Government, 2020
    Always                    1.6 (0.2)
    Most of the time          13 (0.6)
    About half the time       31 (0.8)
    Some of the time          43 (0.9)
    Never                     11 (0.6)
PRE: SUMMARY: Respondent age  47.3 (0.36)
American National Election Studies, 2020
Question text: How often can you trust the federal government in Washington to do what is right?

With {gtsummary}, we can also calculate statistics by different groups. Let's modify the previous example to analyze data on whether a respondent voted for president in 2020. We update the by argument and refine the header.

anes_des_gtsum5 <- anes_des %>%
  drop_na(VotedPres2020) %>%
  tbl_svysummary(
    include = TrustGovernment,
    statistic = list(all_categorical() ~ "{p} ({p.std.error})"),
    missing = "no",
    digits = list(TrustGovernment ~ style_percent),
    label = list(TrustGovernment ~ "Trust in Government, 2020"),
    by = VotedPres2020
  ) %>%
  modify_footnote(update = everything() ~ NA) %>%
  modify_header(label = " ", stat_1 = "Voted", stat_2 = "Didn't vote") %>%
  as_gt() %>%
  tab_header(
    "American voter's trust in the federal government by whether they voted in the 2020 presidential election"
  ) %>%
  tab_source_note("American National Election Studies, 2020") %>%
  tab_footnote(
    "Question text: How often can you trust the federal government in Washington to do what is right?"
  )

anes_des_gtsum5
TABLE 8.7: Example of gtsummary table with trust in government estimates by voting status

American voter's trust in the federal government by whether they voted in the 2020 presidential election
                              Voted        Didn't vote
Trust in Government, 2020
    Always                    1.1 (0.2)    1.0 (1.1)
    Most of the time          13 (0.6)     19 (5.6)
    About half the time       31 (0.9)     27 (7.4)
    Some of the time          45 (0.9)     47 (7.7)
    Never                     9.0 (0.7)    5.1 (2.3)
American National Election Studies, 2020
Question text: How often can you trust the federal government in Washington to do what is right?

8.3.2 Charts and plots

Survey analysis can yield an abundance of printed summary statistics and models. Even with the most careful analysis, interpreting the results can be overwhelming. This is where charts and plots play a key role in our work. By transforming complex data into a visual representation, we can recognize patterns, relationships, and trends with greater ease.

R has numerous packages for creating compelling and insightful charts. In this section, we focus on {ggplot2}, a member of the {tidyverse} collection of packages. Known for its power and flexibility, {ggplot2} is an invaluable tool for creating a wide range of data visualizations. The {ggplot2} package follows the "grammar of graphics," a framework that incrementally adds layers of chart components. This approach allows us to customize visual elements such as scales, colors, labels, and annotations to enhance the clarity of our results.

After creating the survey design object, we can modify it to include additional outcomes and calculate estimates for our desired data points. Below, we create a binary variable TrustGovernmentUsually, which is TRUE when TrustGovernment is "Always" or "Most of the time" and FALSE otherwise. Then, we calculate the percentage of people who usually trust the government based on their vote in the 2020 presidential election (VotedPres2020_selection). We remove the cases where people did not vote or did not indicate their choice.
anes_des_der <- anes_des %>%
  mutate(TrustGovernmentUsually = case_when(
    is.na(TrustGovernment) ~ NA,
    TRUE ~ TrustGovernment %in% c("Always", "Most of the time")
  )) %>%
  drop_na(VotedPres2020_selection) %>%
  group_by(VotedPres2020_selection) %>%
  summarize(
    pct_trust = survey_mean(
      TrustGovernmentUsually,
      na.rm = TRUE,
      proportion = TRUE,
      vartype = "ci"
    ),
    .groups = "drop"
  )

anes_des_der
## # A tibble: 3 × 4
##   VotedPres2020_selection pct_trust pct_trust_low pct_trust_upp
##   <fct>                       <dbl>         <dbl>         <dbl>
## 1 Biden                      0.123         0.109         0.140
## 2 Trump                      0.178         0.161         0.198
## 3 Other                      0.0681        0.0290        0.152

Now, we can begin creating our chart with {ggplot2}. First, we set up our plot with ggplot(). Next, we define the data points to be displayed using aesthetics, or aes. Aesthetics represent the visual properties of the objects in the plot. In the example below, we map the x variable to VotedPres2020_selection from the dataset and the y variable to pct_trust. Finally, we specify the type of plot with geom_*(), in this case, geom_bar(). The resulting plot is displayed in Figure 8.1.

p <- anes_des_der %>%
  ggplot(aes(x = VotedPres2020_selection, y = pct_trust)) +
  geom_bar(stat = "identity")

p

FIGURE 8.1: Bar chart of trust in government by chosen 2020 presidential candidate

This is a great starting point: we observe a higher percentage of people who say they usually trust the government among those who voted for Trump than among those who voted for Biden or other candidates. Now, what if we want to introduce color to better differentiate the three groups? We can add fill under aesthetics, indicating that we want to use distinct values of VotedPres2020_selection to color the bars. In this instance, Biden and Trump will be displayed in different colors.

pcolor <- anes_des_der %>%
  ggplot(aes(x = VotedPres2020_selection, y = pct_trust,
             fill = VotedPres2020_selection)) +
  geom_bar(stat = "identity")

pcolor

FIGURE 8.2: Bar chart of trust in government by chosen 2020 presidential candidate with colors

Let's say we wanted to follow proper statistical analysis practice and incorporate variability in our plot. We can add another geom, geom_errorbar(), to display the confidence intervals on top of our existing geom_bar() layer. We add the layer using a plus sign +.

pcol_error <- anes_des_der %>%
  ggplot(aes(x = VotedPres2020_selection, y = pct_trust,
             fill = VotedPres2020_selection)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = pct_trust_low, ymax = pct_trust_upp),
                width = .2)

pcol_error

FIGURE 8.3: Bar chart of trust in government by chosen 2020 presidential candidate with colors and error bars

We can continue adding to our plot until we achieve our desired look. For example, we can eliminate the color legend, as it doesn't contribute meaningful information, with guides(fill = "none"). We can set specific colors for fill using scale_fill_manual(). Inside the function, we provide a vector of values corresponding to the colors in our plot. These values are hexadecimal (hex) color codes, denoted by a leading pound sign # followed by six letters or numbers. The hex code #0b3954 used below is a dark blue. There are many tools online that help pick hex codes, such as htmlcolorcodes.com/.
pfull <- anes_des_der %>%
  ggplot(aes(x = VotedPres2020_selection, y = pct_trust,
             fill = VotedPres2020_selection)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = pct_trust_low, ymax = pct_trust_upp),
                width = .2) +
  scale_fill_manual(values = c("#0b3954", "#bfd7ea", "#8d6b94")) +
  xlab("Election choice (2020)") +
  ylab("Usually trust the government") +
  scale_y_continuous(labels = scales::percent) +
  guides(fill = "none") +
  labs(title = "Percent of voters who usually trust the government by chosen 2020 presidential candidate",
       caption = "Source: American National Election Studies, 2020")

pfull

FIGURE 8.4: Bar chart of trust in government by chosen 2020 presidential candidate with colors, labels, error bars, and title

What we've explored in this section are just the foundational aspects of {ggplot2}; the capabilities of this package extend far beyond what we've covered. Advanced features such as annotation, faceting, and theming allow for more sophisticated and customized visualizations. The book by Wickham (2023) is a comprehensive guide to learning more about this powerful tool.

References

Wickham, Hadley. 2023. ggplot2: Elegant Graphics for Data Analysis. 3rd Edition. Springer. https://ggplot2-book.org/.

Chapter 9 Reproducible research

9.1 Introduction

Reproducing a data analysis's results is a crucial aspect of any research. First, reproducibility serves as a form of quality assurance. If we pass an analysis project to another person, they should be able to run the entire project from start to finish and obtain the same results. They can critically assess the methodology and code while detecting potential errors. Another goal of reproducibility is enabling the verification of our analysis. When someone else is able to check our results, it ensures the integrity of the analyses by determining that the conclusions are not dependent on a particular person running the code or workflow on a particular day or in a particular environment.

Not only is reproducibility a key component of ethical and accurate research, but it is also a requirement for many scientific journals. For example, the Journal of Survey Statistics and Methodology (JSSAM) and Public Opinion Quarterly (POQ) require authors to make code, data, and methodology transparent and accessible to other researchers who wish to verify or build on existing work.

Reproducible research requires that the key components of analysis are available, discoverable, documented, and shared with others.
The four main components that we should consider are:

Code: the source code used for data cleaning, analysis, modeling, and reporting
Data: the raw data used in the workflow, or if data are sensitive or proprietary, as much data as possible that would allow others to run our workflow (e.g., access to a restricted use file (RUF))
Environment: the environment of the project, including the R version, packages, operating system, and other dependencies used in the analysis
Methodology: the analysis methodology, including the rationale behind decisions, interpretations, and assumptions

In Chapter 8, we briefly mention how each of these is important to include in the methodology report and when communicating the findings of a study. However, to be transparent and effective researchers, we need to ensure we not only discuss these through text but also provide files and additional information when requested. Often, when starting a project, analysts dive into the data and make decisions as they go without full documentation, which can be challenging if we need to go back and make changes or understand what we did even a few months ago. It benefits other analysts, and potentially our future selves, to document everything from the start.

The good news is that many tools, practices, and project management techniques make survey analysis projects easy to reproduce. For best results, analysts should decide which techniques and tools to use before starting a project (or very early on). This chapter covers some of our suggestions for tools and techniques we can use in projects. This list is not comprehensive but aims to provide a starting point for those looking to create a reproducible workflow.

9.2 Project-based workflows

We recommend a project-based workflow for analysis projects as described by Wickham, Çetinkaya-Rundel, and Grolemund (2023). A project-based workflow maintains a "source of truth" for our analyses. It helps with file system discipline by putting everything related to a project in a designated folder. Since all associated files are in a single location, they are easy to find and organize. When we reopen the project, we can recreate the environment in which we originally ran the code to reproduce our results.

The RStudio IDE has built-in support for projects. When we create a project in RStudio, it creates a .Rproj file that stores settings specific to that project. Once we have created a project, we can create folders that help us organize our workflow. For example, a project directory could look like this:

| anes_analysis/
|   anes_analysis.Rproj
|   README.md
|   codebooks
|     codebook2020.pdf
|     codebook2016.pdf
|   rawdata
|     anes2020_raw.csv
|     anes2016_raw.csv
|   scripts
|     data-prep.R
|   data
|     anes2020_clean.csv
|     anes2016_clean.csv
|   report
|     anes_report.Rmd
|     anes_report.html
|     anes_report.pdf

In a project-based workflow, all paths are relative and, by default, relative to the project's folder. By using relative paths, others can open and run our files even if their directory configuration differs from ours. The {here} package enables easy file referencing, and we can use the here::here() function to build the path for loading or saving data. Below, we ask R to read the CSV file anes2020_clean.csv in the project directory's data folder:

anes <- read_csv(here::here("data", "anes2020_clean.csv"))

The combination of projects and the {here} package keeps all associated files organized. This workflow makes it more likely that our analyses can be reproduced by us or our colleagues.

9.3 Functions and packages

We may find ourselves repeating the same code in a script, and the chance of errors increases whenever we copy and paste code. By creating a function, we build a consistent set of commands that reduces the likelihood of mistakes. Functions also organize our code, improve its readability, and allow others to execute the same commands. Throughout this book, we have created functions, such as in Chapter 13, to run sequences of rename, filter, group_by, and summarize statements across different variables. Such functions help us avoid overlooking necessary steps; a minimal sketch follows.
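As an illustration, here is a minimal sketch of such a helper, assuming a survey design object like anes_des from earlier chapters; the function name calc_pct is a hypothetical stand-in:

library(srvyr)

# Calculate a weighted percentage (with confidence interval) for any
# grouping variable in a survey design object
calc_pct <- function(design, var) {
  design %>%
    group_by({{ var }}) %>%                       # pass the variable unquoted
    summarize(pct = survey_mean(vartype = "ci"))  # proportion in each group
}

# Usage: the same commands, applied consistently to different variables
# calc_pct(anes_des, TrustGovernment)
# calc_pct(anes_des, VotedPres2020)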
A package is made up of a collection of functions. If we find ourselves sharing functions with others to replicate the same series of commands in a separate project, creating a package can be a useful way to share the code along with data and documentation.

9.4 Version control with Git

Often, a survey analysis project produces a lot of code. Keeping track of the latest version can become challenging as files evolve throughout a project. If a team of analysts is working on the same script, someone may use an outdated version, resulting in incorrect results or redundant work. Version control systems like Git can help alleviate these pains.

Git is a system that helps track changes in computer files. Analysts can use Git to follow the evolution of code and manage asynchronous work. With Git, it is easy to see any changes made in a script, revert changes, and resolve differences between code versions (called conflicts). Services such as GitHub or GitLab provide hosting and sharing of files as well as version control with Git. For example, we can visit the GitHub repository for this book (https://github.com/tidy-survey-r/tidy-survey-book) and see the files that build the book, when they were committed to the repository, and the history of modifications over time.

In addition to code scripts, platforms like GitHub can store data and documentation. They provide a way to maintain a history of data modifications through versioning and timestamps. By saving the data and documentation alongside the code, it becomes easier for others to refer to and access everything they need in one place.

Using version control in analysis projects makes collaboration and maintenance more manageable. For connecting Git with R, we recommend the book Happy Git and GitHub for the useR (Bryan and Hester 2023).

9.5 Package management with {renv}

Ensuring reproducibility involves not only version control of code, but also managing the versions of packages. If two people run the same code but use different versions of a package, the results might differ because of changes in those packages. For example, this book currently uses a version of the {srvyr} package from GitHub and not from CRAN. This is because the version of {srvyr} on CRAN has some bugs (errors) that result in incorrect calculations. The version on GitHub has corrected these errors, so we have asked readers to install the GitHub version to obtain the same results.

One way to handle different package versions is with the {renv} package. This package allows researchers to set the versions for each package used and manage package dependencies. Specifically, {renv} creates isolated, project-specific environments that record the packages and their versions used in the code. When initiated by a new user, {renv} checks whether the installed packages are consistent with the recorded versions for the project. If not, it installs the appropriate versions so that others can replicate the project's environment, rerun the code, and obtain consistent results.
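The core {renv} workflow, as a sketch run from within the project directory:

# Initialize a project-specific library and lockfile (renv.lock)
renv::init()

# ... develop, installing or updating packages as needed ...

# Record the current package versions in renv.lock
renv::snapshot()

# Later, or on another machine: reinstall the recorded versions
renv::restore()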
9.6 R environments with Docker

Just as different versions of packages can introduce discrepancies or compatibility issues, the version of R itself can also hinder reproducibility. Tools such as Docker can help with this potential issue by creating isolated environments that define the version of R being used, along with other dependencies and configurations. The entire environment is bundled in a container. The container, defined by a Dockerfile, can be shared so anybody, regardless of their local setup, can run the R code in the same environment.

9.7 Workflow management with {targets}

With complex studies involving multiple code files and dependencies, it is important to ensure each step is executed in the intended sequence. We can do this manually, e.g., by numbering files to indicate the run order or providing detailed documentation on the order. Alternatively, we can automate the process so the code flows sequentially. Making sure that the code runs in the correct order helps ensure that the research is reproducible. Anyone should be able to pick up the set of scripts and get the same results by following the workflow.

The {targets} package is a popular workflow manager that documents, automates, and executes complex data workflows with multiple steps and dependencies. With this package, we first define the order of execution for our code, and it then consistently executes the code in that order each time the pipeline is run. One beneficial feature of {targets} is that when we change a script, only the affected code and its downstream targets (i.e., the subsequent steps that depend on it) are re-executed. The {targets} package also provides interactive progress monitoring and reporting, allowing us to track the status and progress of our analysis pipeline.

9.8 Documentation with Quarto and R Markdown

Tools like Quarto and R Markdown aid in reproducibility by creating documents that weave together code, text, and results. We can present analysis results alongside the report's narrative, so there's no need to copy and paste code output into the final documentation. By eliminating manual steps, we can reduce the chances of errors in the final output.

Quarto and R Markdown documents also allow users to re-execute the underlying code when needed. Another analyst can see the steps we took, follow the scripts, and recreate the report. We can include details about our work in one place thanks to the combination of text and code, making our work transparent and easier to verify.

9.8.1 Parameterization

Another useful feature of Quarto and R Markdown is the ability to reduce repetitive code by parameterizing the files. Parameters can control various aspects of the analysis, such as dates, geography, or other analysis variables. We can define and modify these parameters to explore different scenarios or inputs. For example, suppose we start by creating a document that provides survey analysis results for North Carolina but then later decide we want to look at another state. In that case, we can define a state parameter and rerun the same analysis for a state like Washington without having to edit the code throughout the document. Parameters can be defined in the header or code chunks of our Quarto or R Markdown documents and can easily be modified and documented. This reduces the errors that may occur from manually editing code throughout the script and offers a flexible way for others to replicate the analysis and explore variations; a sketch follows.
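A sketch of rendering a parameterized R Markdown report; the file name, parameter, and data objects are hypothetical. The document's YAML header would declare the parameter (params: followed by state: "North Carolina"), and code chunks would reference it as params$state, e.g., filter(anes_clean, State == params$state). We can then render the same report for another state:

rmarkdown::render(
  "state_report.Rmd",                    # hypothetical parameterized report
  params = list(state = "Washington"),   # override the default parameter
  output_file = "state_report_WA.html"
)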
9.9 Other tips for reproducibility

9.9.1 Random number seeds

Some tasks in survey analysis require randomness, such as imputation, model training, or creating random samples. By default, the random numbers generated by R change each time we rerun the code, making it difficult to reproduce the same results. By "setting the seed," we can control the randomness and ensure that the random numbers remain consistent whenever we rerun the code. Others can use the same seed value to reproduce our random numbers and achieve the same results.

In R, we can use the set.seed() function to control the randomness in our code. Set a seed value by providing an integer to the function:

set.seed(999)
runif(5)

The runif() function generates five random numbers from a uniform distribution. Because the seed is set to 999, rerunning this code will always produce the same sequence:

[1] 0.38907138 0.58306072 0.09466569 0.85263123 0.78674676

The choice of the seed number is up to the analyst. For example, this could be the date (20240102) or time of day (1056) when the analysis was first conducted, a phone number (8675309), or the first few numbers that come to mind (369). As long as the seed is set for a given analysis, the actual number is up to the analyst to decide. It is important to note that set.seed() should be used before random number generation. Run it once per program, and the seed is applied to the entire script. We recommend setting the seed at the beginning of a script, where libraries are loaded.

9.9.2 Descriptive names and labels

Using descriptive variable names or labeling data can also assist with reproducible research. For example, in the ANES data, the variable names in the raw data all start with V20 and are a string of numbers. To make things easier to reproduce, we opted to change the variable names to be more descriptive of what they contained (e.g., Age). This can also be done with the data values themselves. One way to accomplish this is by creating factors for categorical data, which can ensure that we know that a value of 1 really means Female, for example. There are other ways of handling this, such as attaching labels to the data instead of recoding variables to be descriptive (see Chapter 11). As with random number seeds, the exact method is up to the analyst, but providing this information can help ensure our research is reproducible.

9.10 Summary

We can promote accuracy and verification of results by making our analysis reproducible. There are various tools and guides available to help achieve reproducibility, a few of which were described in this chapter. Here are additional resources to explore:

R for Data Science chapter on project-based workflows: https://r4ds.hadley.nz/workflow-scripts.html#projects
Building reproducible analytical pipelines with R by Bruno Rodrigues: https://raps-with-r.dev/
Posit Solutions Site page on reproducible environments: https://solutions.posit.co/envs-pkgs/environments/

References

Bryan, Jenny, and Jim Hester. 2023. Happy Git and GitHub for the useR. https://happygitwithr.com/.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd Edition. O'Reilly Media. https://r4ds.hadley.nz/.
"],["c10-specifying-sample-designs.html", "Chapter 10 Specifying sample designs and replicate weights in {srvyr} 10.1 Introduction 10.2 Common sampling designs 10.3 Combining sampling methods 10.4 Replicate weights 10.5 Exercises", " Chapter 10 Specifying sample designs and replicate weights in {srvyr} Prerequisites For this chapter, load the following packages: library(tidyverse) library(survey) library(srvyr) library(srvyrexploR) To help explain the different types of sample designs, this chapter will use the api and scd data that are included in the {survey} package: data(api) data(scd) This chapter uses data from the Residential Energy Consumption Survey (RECS) - both 2015 and 2020, so we will use the following code to load the RECS data from the {srvyr.data} package: data(recs_2015) data(recs_2020) 10.1 Introduction The primary reason for using packages like {survey} and {srvyr} is to account for the sampling design or replicate weights into estimates. By incorporating the sampling design or replicate weights, precision estimates (e.g., standard errors and confidence intervals) are appropriately calculated. In this chapter, we will introduce common sampling designs and common types of replicate weights, the mathematical methods for calculating estimates and standard errors for a given sampling design, and the R syntax to specify the sampling design or replicate weights. While we will show the math behind the estimates, the functions in these packages will do the calculation. To deeply understand the math and the derivation, refer to Penn State (2019), Särndal, Swensson, and Wretman (2003), Wolter (2007), or Fuller (2011) (these are listed in order of increasing statistical rigorousness). The general process for estimation in the {srvyr} package is to: Create a tbl_svy object (a survey object) using: as_survey_design() or as_survey_rep() Subset data (if needed) using filter() (subpopulations) Specify domains of analysis using group_by() Within summarize(), specify variables to calculate, including means, totals, proportions, quantiles, and more This chapter includes details on the first step - creating the survey object. Once this survey object is created, it can be used in the other steps (detailed in chapters 5 through 7) to account for the complex survey design. 10.2 Common sampling designs A sampling design is the method used to draw a sample. Both logistical and statistical elements are considered when developing a sampling design. When specifying a sampling design in R, the levels of sampling are specified along with the weights. The weight for each record is constructed so that the particular record represents that many units in the population. For example, in a survey of 6th-grade students in the United States, the weight associated with each responding student reflects how many 6th grade students across the country that record represents. Generally, the weights represent the inverse of the probability of selection such that the sum of the weights corresponds to the total population size, although some studies may have the sum of the weights equal to the number of respondent records. 
Some common terminology across the designs:

sample size, generally denoted as \(n\), is the number of units selected to be sampled
population size, generally denoted as \(N\), is the number of units in the target population
sampling frame, the list of units from which the sample is drawn (see Chapter 2 for more information)

10.2.1 Simple random sample without replacement

The simple random sample (SRS) without replacement is a sampling design in which a fixed sample size is selected from a sampling frame, and every possible subsample has an equal probability of selection. "Without replacement" refers to the fact that once a sampling unit has been selected, it is removed from the sample frame and cannot be selected again.

Requirements: The sampling frame must include the entire population.
Advantages: SRS requires no information about the units apart from contact information.
Disadvantages: The sampling frame may not be available for the entire population.
Example: Randomly select students in a university from a roster provided by the registrar's office.

The math

The estimate for the population mean of variable \(y\) is:

\[\bar{y}=\frac{1}{n}\sum_{i=1}^n y_i\]

where \(\bar{y}\) represents the sample mean, \(n\) is the total number of respondents (or observations), and \(y_i\) is each individual value of \(y\). The estimate of the standard error of the mean is:

\[se(\bar{y})=\sqrt{\frac{s^2}{n}\left( 1-\frac{n}{N} \right)}\]

where

\[s^2=\frac{1}{n-1}\sum_{i=1}^n\left(y_i-\bar{y}\right)^2\]

and \(N\) is the population size. This standard error estimate might look very similar to equations in other applications except for the term on the right side of the equation: \(1-\frac{n}{N}\). This is called the finite population correction (FPC) factor. If the size of the frame, \(N\), is very large in comparison to the sample, the FPC is negligible, so it is often ignored. A common guideline is that if the sample is less than 10% of the population, the FPC is negligible.

To estimate proportions, we define \(x_i\) as an indicator of whether the outcome is observed. That is, \(x_i=1\) if the outcome is observed, and \(x_i=0\) if the outcome is not observed for respondent \(i\). Then the estimated proportion from an SRS design is:

\[\hat{p}=\frac{1}{n}\sum_{i=1}^n x_i\]

and the estimated standard error of the proportion is:

\[se(\hat{p})=\sqrt{\frac{\hat{p}(1-\hat{p})}{n-1}\left(1-\frac{n}{N}\right)}\]

The syntax

If a sample was drawn through SRS and had no nonresponse or other weighting adjustments, we specify this design in R as:

srs1_des <- dat %>%
  as_survey_design(fpc = fpcvar)

where dat is a tibble or data.frame with the survey data, and fpcvar is a variable in the data indicating the sampling frame's size (this variable has the same value for all cases in an SRS design). If the frame is very large, sometimes the frame size is not provided. In that case, the FPC is not needed, and we specify the design as:

srs2_des <- dat %>%
  as_survey_design()

If some post-survey adjustments were implemented and the weights are not all equal, we specify the design as:

srs3_des <- dat %>%
  as_survey_design(weights = wtvar, fpc = fpcvar)

where wtvar is a variable in the data indicating the weight for each case. Again, the FPC can be omitted if it is unnecessary because the frame is large compared to the sample size.

Example

The {survey} package in R provides some example datasets that we will use throughout this chapter.
The documentation provides detailed information about the variables. One of the example datasets we will use is from the Academic Performance Index (API). The API was a program administered by the California Department of Education, and the {survey} package includes a population file (sample frame) of all schools with at least 100 students and several different samples pulled from that data using different sampling methods. For this first example, we use the apisrs dataset, which contains an SRS of 200 schools. For printing purposes, we create a new dataset called apisrs_slim, which sorts the data by school district and school ID and subsets the data to only a few columns. The SRS sample data are illustrated below:

apisrs_slim <- apisrs %>%
  as_tibble() %>%
  arrange(dnum, snum) %>%
  select(cds, dnum, snum, dname, sname, fpc, pw)

apisrs_slim
## # A tibble: 200 × 7
## cds dnum snum dname sname fpc pw
## <chr> <int> <dbl> <chr> <chr> <dbl> <dbl>
## 1 19642126061220 1 1121 ABC Unified Haske… 6194 31.0
## 2 19642126066716 1 1124 ABC Unified Stowe… 6194 31.0
## 3 36675876035174 5 3895 Adelanto Elementary Adela… 6194 31.0
## 4 33669776031512 19 3347 Alvord Unified Arlan… 6194 31.0
## 5 33669776031595 19 3352 Alvord Unified Wells… 6194 31.0
## 6 31667876031033 39 3271 Auburn Union Elementary Cain … 6194 31.0
## 7 19642876011407 42 1169 Baldwin Park Unified Deanz… 6194 31.0
## 8 19642876011464 42 1175 Baldwin Park Unified Heath… 6194 31.0
## 9 19642956011589 48 1187 Bassett Unified Erwin… 6194 31.0
## 10 41688586043392 49 4948 Bayshore Elementary Baysh… 6194 31.0
## # ℹ 190 more rows

Table 10.1 provides details on all the variables in this dataset.

TABLE 10.1: Overview of Variables in api Data

Variable Name   Description
cds             Unique identifier for each school
dnum            School district identifier within county
snum            School identifier within district
dname           District Name
sname           School Name
fpc             Finite population correction factor (FPC)
pw              Weight

To create the tbl_svy object for these SRS data, the design should be specified as follows:

apisrs_des <- apisrs_slim %>%
  as_survey_design(weights = pw, fpc = fpc)

apisrs_des
## Independent Sampling design
## Called via srvyr
## Sampling variables:
## - ids: `1`
## - fpc: fpc
## - weights: pw
## Data variables:
## - cds (chr), dnum (int), snum (dbl), dname (chr), sname (chr), fpc
## (dbl), pw (dbl)

In the printed design object above, the design is described as an "Independent Sampling design," which is another term for SRS. The ids are specified as 1, which means there is no clustering (a topic described in Section 10.2.4), the FPC variable is indicated, and the weights are indicated. We can also look at the summary of the design object and see the distribution of the probabilities (inverse of the weights) along with the population size and a list of the variables in the dataset.

summary(apisrs_des)
## Independent Sampling design
## Called via srvyr
## Probabilities:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0323 0.0323 0.0323 0.0323 0.0323 0.0323
## Population size (PSUs): 6194
## Data variables:
## [1] "cds" "dnum" "snum" "dname" "sname" "fpc" "pw"
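Although apisrs_slim keeps only identifier and design variables, the full apisrs data include outcome measures such as the school's year-2000 Academic Performance Index score (api00). As a brief preview of how the design object feeds into the estimation steps covered in Chapters 5 through 7, a sketch:

# Estimate the mean 2000 API score and its standard error, accounting
# for the SRS design
apisrs %>%
  as_survey_design(weights = pw, fpc = fpc) %>%
  summarize(api00_mean = survey_mean(api00))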
10.2.2 Simple random sample with replacement

Similar to the SRS design, the simple random sample with replacement (SRSWR) design randomly selects the sample from the entire sampling frame. However, while SRS removes sampled units before selecting again, SRSWR instead replaces each sampled unit before drawing again, so units can be selected more than once.

Requirements: The sampling frame must include the entire population.
Advantages: SRSWR requires no information about the units apart from contact information.
Disadvantages: The sampling frame may not be available for the entire population. Units can be selected more than once, resulting in a smaller realized sample size, because receiving duplicate information from a single respondent does not provide additional information. For small populations, SRSWR has larger standard errors than SRS designs.
Example: A professor puts all students' names on paper slips and selects them randomly to ask students questions, but the professor replaces the paper after calling on the student, so a student can be selected again at any time.

In general for surveys, using an SRS design (without replacement) is preferred, as we do not want respondents to answer a survey more than once.

The math

The estimate for the population mean of variable \(y\) is:

\[\bar{y}=\frac{1}{n}\sum_{i=1}^n y_i\]

and the estimate of the standard error of the mean is:

\[se(\bar{y})=\sqrt{\frac{s^2}{n}}\]

where

\[s^2=\frac{1}{n-1}\sum_{i=1}^n\left(y_i-\bar{y}\right)^2\]

To calculate the estimated proportion, we define \(x_i\) as the indicator that the outcome is observed (as we did with SRS):

\[\hat{p}=\frac{1}{n}\sum_{i=1}^n x_i\]

and the estimated standard error of the proportion is:

\[se(\hat{p})=\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]

The syntax

If we had a sample that was drawn through SRSWR and had no nonresponse or other weighting adjustments, we specify this design in R as:

srswr1_des <- dat %>%
  as_survey_design()

where dat is a tibble or data.frame containing our survey data. This syntax is the same as for an SRS design, except that a finite population correction (FPC) is not included. This is because when we select a sample with replacement, the population pool to select from is effectively no longer finite, so a correction is not needed. Therefore, with large populations where the FPC is negligible, the underlying formulas for SRS and SRSWR designs are the same.

If some post-survey adjustments were implemented and the weights are not all equal, we specify the design as:

srswr2_des <- dat %>%
  as_survey_design(weights = wtvar)

where wtvar is the weight variable on the data.

Example

The {survey} package does not include an example of SRSWR, so to illustrate this design, we need to create one. We use the api population data provided by the {survey} package (apipop) and select a sample of 200 cases using the slice_sample() function from the tidyverse. One of the arguments in the slice_sample() function is replace; if replace = TRUE, we are conducting SRSWR. We then calculate selection weights as the inverse of the probability of selection and call this new dataset apisrswr.
set.seed(409963)

apisrswr <- apipop %>%
  as_tibble() %>%
  slice_sample(n = 200, replace = TRUE) %>%
  select(cds, dnum, snum, dname, sname) %>%
  mutate(weight = nrow(apipop) / 200)

head(apisrswr)
## # A tibble: 6 × 6
## cds dnum snum dname sname weight
## <chr> <int> <dbl> <chr> <chr> <dbl>
## 1 43696416060065 533 5348 Palo Alto Unified Jordan (Da… 31.0
## 2 07618046005060 650 509 San Ramon Valley Unified Alamo Elem… 31.0
## 3 19648086085674 457 2134 Montebello Unified La Merced … 31.0
## 4 07617056003719 346 377 Knightsen Elementary Knightsen … 31.0
## 5 19650606023022 744 2351 Torrance Unified Carr (Evel… 31.0
## 6 01611196090120 6 13 Alameda City Unified Paden (Wil… 31.0

Because this is an SRSWR design, there can be duplicates in the data. It is important to keep the duplicates in the data for proper estimation, but for reference, we can view the duplicates in the example data we just created.

apisrswr %>%
  group_by(cds) %>%
  filter(n() > 1) %>%
  arrange(cds)
## # A tibble: 4 × 6
## # Groups: cds [2]
## cds dnum snum dname sname weight
## <chr> <int> <dbl> <chr> <chr> <dbl>
## 1 15633216008841 41 869 Bakersfield City Elem Chipman Junio… 31.0
## 2 15633216008841 41 869 Bakersfield City Elem Chipman Junio… 31.0
## 3 39686766042782 716 4880 Stockton City Unified Tyler Skills … 31.0
## 4 39686766042782 716 4880 Stockton City Unified Tyler Skills … 31.0

We created a weight variable in this example data, which is the inverse of the probability of selection. To specify the sampling design for apisrswr, the following syntax should be used:

apisrswr_des <- apisrswr %>%
  as_survey_design(weights = weight)

apisrswr_des
## Independent Sampling design (with replacement)
## Called via srvyr
## Sampling variables:
## - ids: `1`
## - weights: weight
## Data variables:
## - cds (chr), dnum (int), snum (dbl), dname (chr), sname (chr), weight
## (dbl)

summary(apisrswr_des)
## Independent Sampling design (with replacement)
## Called via srvyr
## Probabilities:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0323 0.0323 0.0323 0.0323 0.0323 0.0323
## Data variables:
## [1] "cds" "dnum" "snum" "dname" "sname" "weight"

In the output above, the design object and its summary are shown. Both note that the sampling is done "with replacement" because no FPC was specified. The probabilities, which are derived from the weights, are summarized in the summary output.
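As a quick numerical check of the formulas in "The math" sections above, here is a by-hand sketch contrasting the SRS and SRSWR standard errors on a small toy vector (all values hypothetical):

y <- c(12, 15, 9, 20, 14)             # toy outcome values
n <- length(y)
N <- 100                              # hypothetical frame size
ybar <- mean(y)
s2 <- sum((y - ybar)^2) / (n - 1)     # sample variance, equivalent to var(y)
se_srs <- sqrt(s2 / n * (1 - n / N))  # SRS standard error, with the FPC
se_srswr <- sqrt(s2 / n)              # SRSWR standard error, no FPC

The SRSWR standard error is always at least as large as the SRS one; here, with \(n/N\) of 5%, the FPC shrinks the standard error only slightly.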
A SRS of agencies from each of the three types is then selected independently to ensure all three types of agencies are represented. The math Let \\(\\bar{y}_h\\) be the sample mean for stratum \\(h\\), \\(N_h\\) be the population size of stratum \\(h\\), and \\(n_h\\) be the sample size of stratum \\(h\\). Then the estimate for the population mean under stratified SRS sampling is: \\[\\bar{y}=\\frac{1}{N}\\sum_{h=1}^H N_h\\bar{y}_h\\] and the estimate of the standard error of \\(\\bar{y}\\) is: \\[se(\\bar{y})=\\sqrt{\\frac{1}{N^2} \\sum_{h=1}^H N_h^2 \\frac{s_h^2}{n_h}\\left(1-\\frac{n_h}{N_h}\\right)} \\] where \\[s_h^2=\\frac{1}{n_h-1}\\sum_{i=1}^{n_h}\\left(y_{i,h}-\\bar{y}_h\\right)^2.\\] For estimates of proportions, let \\(\\hat{p}_h\\) be the estimated proportion in stratum \\(h\\). Then the population proportion estimate is: \\[\\hat{p}= \\frac{1}{N}\\sum_{h=1}^H N_h \\hat{p}_h\\] where \\(H\\) is the total number of strata. The standard error of the proportion is: \\[se(\\hat{p}) = \\frac{1}{N} \\sqrt{ \\sum_{h=1}^H N_h^2 \\frac{\\hat{p}_h(1-\\hat{p}_h)}{n_h-1} \\left(1-\\frac{n_h}{N_h}\\right)}\\] The syntax In addition to the fpc and weights arguments discussed in the types above, stratified designs require the addition of the strata argument. For example, to specify a stratified SRS design in {srvyr} when using the FPC, that is, where the population sizes of the strata are known and not so large that the FPC is negligible, specify the design as: stsrs1_des <- dat %>% as_survey_design(fpc = fpcvar, strata = stratvar) where fpcvar is a variable on our data that indicates \\(N_h\\) for each row, and stratvar is a variable indicating the stratum for each row. You can omit the FPC if it is not applicable. Additionally, we can indicate the weight variable, if present, where wtvar is a variable on our data with a numeric weight: stsrs2_des <- dat %>% as_survey_design(weights = wtvar, strata = stratvar) Example In the example API data, apistrat is a stratified random sample, stratified by school type (stype) with three levels: E for elementary school, M for middle school, and H for high school. As with the SRS example above, we sort and select specific variables for use in printing. The data are illustrated below, including a count of the number of cases per stratum: apistrat_slim <- apistrat %>% as_tibble() %>% arrange(dnum, snum) %>% select(cds, dnum, snum, dname, sname, stype, fpc, pw) apistrat_slim %>% count(stype, fpc) ## # A tibble: 3 × 3 ## stype fpc n ## <fct> <dbl> <int> ## 1 E 4421 100 ## 2 H 755 50 ## 3 M 1018 50 The FPC is the same for each case within each stratum. This output also shows that 100 elementary schools, 50 middle schools, and 50 high schools were sampled. It is common for the number of units sampled from each stratum to differ, based on the goals of the project or to mirror the relative sizes of the strata in the population. This design should be specified as follows: apistrat_des <- apistrat_slim %>% as_survey_design(strata = stype, weights = pw, fpc = fpc) apistrat_des ## Stratified Independent Sampling design ## Called via srvyr ## Sampling variables: ## - ids: `1` ## - strata: stype ## - fpc: fpc ## - weights: pw ## Data variables: ## - cds (chr), dnum (int), snum (dbl), dname (chr), sname (chr), stype ## (fct), fpc (dbl), pw (dbl) summary(apistrat_des) ## Stratified Independent Sampling design ## Called via srvyr ## Probabilities: ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.0226 0.0226 0.0359 0.0401 0.0534 0.0662 ## Stratum Sizes: ## E H M ## obs 100 50 50 ## design.PSU 100 50 50 ## actual.PSU 100 50 50 ## Population stratum sizes (PSUs): ## E H M ## 4421 755 1018 ## Data variables: ## [1] "cds" "dnum" "snum" "dname" "sname" "stype" "fpc" "pw" When printing the object, it is specified as a “Stratified Independent Sampling design,” also known as a stratified SRS, and the strata variable is included. Printing the summary, we see a distribution of probabilities, as we saw with SRS, but we also see the sample and population sizes by stratum. 10.2.4 Clustered sampling Clustered sampling occurs when a population is divided into mutually exclusive subgroups called clusters or primary sampling units (PSUs). A random selection of PSUs is sampled, and then another level of sampling is done within these clusters. There can be multiple levels of this selection. Clustered sampling is often used when a list of the entire population is not available or when data collection involves interviewers needing direct contact with respondents. Requirements: There must be a way to divide the population into clusters. Clusters are commonly structural, such as institutions (e.g., schools, prisons) or geography (e.g., states, counties). Advantages: Clustered sampling is advantageous when data collection is done in person, so interviewers are sent to specific sampled areas rather than completely at random across a country. With clustered sampling, a list of the entire population is not necessary. For example, if sampling students, we do not need a list of all students but only a list of all schools. Once the schools are sampled, lists of students can be obtained within the sampled schools. Disadvantages: Compared to a simple random sample of the same size, clustered samples generally have larger standard errors of estimates. Examples: Example 1: Consider a study needing a sample of 6th-grade students in the United States; no list of all these students likely exists. However, it is more feasible to obtain a list of schools that have 6th graders, so a study design could select a random sample of schools that have 6th graders. The selected schools can then provide a list of students for a second stage of sampling, where 6th-grade students are randomly sampled within each of the sampled schools. This is a one-stage sample design (the “one” referring to the single level of clustering, here the schools) and is the type of design we discuss in the formulas below. Example 2: Consider a study sending interviewers to households for a survey. This is a more complicated example that requires two levels of clustering (a two-stage sample design) to efficiently use interviewers in geographic clusters. First, in the U.S., counties could be selected as the PSU, then Census block groups within counties could be selected as the secondary sampling unit (SSU). Households could then be randomly sampled within the block groups. This type of design is popular for in-person surveys as it reduces the travel necessary for interviewers. The math Consider a survey where a sample of \\(a\\) clusters is sampled from a population of \\(A\\) clusters via SRS. Units within each sampled cluster are sampled via SRS as well. Within each sampled cluster \\(i\\), there are \\(B_i\\) units, of which \\(b_i\\) units are sampled via SRS. Let \\(\\bar{y}_{i}\\) be the sample mean of cluster \\(i\\).
Then, a ratio estimator of the population mean is: \\[\\bar{y}=\\frac{\\sum_{i=1}^a B_i \\bar{y}_{i}}{ \\sum_{i=1}^a B_i}\\] Note this is a consistent but biased estimator. Often the population size is not known, so this is a method to estimate a mean without knowing the population size. The estimated standard error of the mean is: \\[se(\\bar{y})= \\frac{1}{\\hat{N}}\\sqrt{\\left(1-\\frac{a}{A}\\right)\\frac{s_a^2}{a} + \\frac{A}{a} \\sum_{i=1}^a \\left(1-\\frac{b_i}{B_i}\\right) \\frac{s_i^2}{b_i} }\\] where \\(\\hat{N}\\) is the estimated population size, \\(s_a^2\\) is the between-cluster variance, and \\(s_i^2\\) is the within-cluster variance. The formula for the between-cluster variance (\\(s_a^2\\)) is: \\[s_a^2=\\frac{1}{a-1}\\sum_{i=1}^a \\left( \\hat{y}_i - \\frac{\\sum_{i=1}^a \\hat{y}_{i} }{a}\\right)^2\\] where \\(\\hat{y}_i =B_i\\bar{y_i}\\). The formula for the within-cluster variance (\\(s_i^2\\)) is: \\[s_i^2=\\frac{1}{a(b_i-1)} \\sum_{j=1}^{b_i} \\left(y_{ij}-\\bar{y}_i\\right)^2\\] where \\(y_{ij}\\) is the outcome for sampled unit \\(j\\) within cluster \\(i\\). The syntax Clustered sampling designs require the addition of the ids argument, which specifies which variables are the cluster levels. To specify a two-stage clustered design without replacement, use the following syntax: clus2_des <- dat %>% as_survey_design(weights = wtvar, ids = c(PSU, SSU), fpc = c(A, B)) where PSU and SSU are the variables indicating the PSU and SSU identifiers, and A and B are the variables indicating the population sizes for each level (i.e., A is the number of clusters, and B is the number of units within each cluster). Note that A will be the same for all records (within a stratum), and B will be the same for all records within the same cluster. If clusters were sampled with replacement or from a very large population, an FPC is unnecessary. Additionally, only the first stage of selection is necessary, regardless of whether the units were selected with replacement at any stage. The subsequent stages of selection are ignored in computation as their contribution to the variance is overpowered by the first stage (see Särndal, Swensson, and Wretman (2003) or Wolter (2007) for a more in-depth discussion). Therefore, the syntax below will yield the same estimates in the end: clus2wra_des <- dat %>% as_survey_design(weights = wtvar, ids = c(PSU, SSU)) clus2wrb_des <- dat %>% as_survey_design(weights = wtvar, ids = PSU) Note that there is one additional argument that is sometimes necessary: nest = TRUE. This option relabels cluster IDs to enforce nesting within strata. For example, there may be a cluster 1 and a cluster 2 within each stratum, but these are actually different clusters. This option indicates that the repeated use of numbering does not mean it is the same cluster. If this option is not used and there are repeated cluster IDs across different strata, an error is generated. Example The {survey} package includes a two-stage cluster sample, apiclus2, in which school districts were sampled, and then a random sample of five schools was selected within each district. For districts with fewer than five schools, all schools were sampled. School districts are identified by dnum, and schools are identified by snum. The variable fpc1 indicates how many districts there are in California (A), and fpc2 indicates how many schools were in a given district with at least 100 students (B). The data has a row for each school.
In the data printed below, there are 757 school districts in California, as indicated by fpc1, and there are nine schools in District 731, one school in District 742, two schools in District 768, and so on, as indicated by fpc2. For illustration purposes, the object apiclus2_slim has been created from apiclus2, which subsets the data to only the necessary columns and sorts the data. apiclus2_slim <- apiclus2 %>% as_tibble() %>% arrange(desc(dnum), snum) %>% select(cds, dnum, snum, fpc1, fpc2, pw) apiclus2_slim ## # A tibble: 126 × 6 ## cds dnum snum fpc1 fpc2 pw ## <chr> <int> <dbl> <dbl> <int[1d]> <dbl> ## 1 47704826050942 795 5552 757 1 18.9 ## 2 07618126005169 781 530 757 6 22.7 ## 3 07618126005177 781 531 757 6 22.7 ## 4 07618126005185 781 532 757 6 22.7 ## 5 07618126005193 781 533 757 6 22.7 ## 6 07618126005243 781 535 757 6 22.7 ## 7 19650786023337 768 2371 757 2 18.9 ## 8 19650786023345 768 2372 757 2 18.9 ## 9 54722076054423 742 5898 757 1 18.9 ## 10 50712906053086 731 5781 757 9 34.1 ## # ℹ 116 more rows To specify this design in R, the following syntax should be used: apiclus2_des <- apiclus2_slim %>% as_survey_design(ids = c(dnum, snum), fpc = c(fpc1, fpc2), weights = pw) apiclus2_des ## 2 - level Cluster Sampling design ## With (40, 126) clusters. ## Called via srvyr ## Sampling variables: ## - ids: `dnum + snum` ## - fpc: `fpc1 + fpc2` ## - weights: pw ## Data variables: ## - cds (chr), dnum (int), snum (dbl), fpc1 (dbl), fpc2 (int[1d]), pw ## (dbl) summary(apiclus2_des) ## 2 - level Cluster Sampling design ## With (40, 126) clusters. ## Called via srvyr ## Probabilities: ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.00367 0.03774 0.05284 0.04239 0.05284 0.05284 ## Population size (PSUs): 757 ## Data variables: ## [1] "cds" "dnum" "snum" "fpc1" "fpc2" "pw" The design objects are described as “2 - level Cluster Sampling design” and include the ids (cluster), FPC, and weight variables. The summary notes that the sample includes 40 first-level clusters (PSUs), which are school districts, and 126 second-level clusters (SSUs), which are schools. Additionally, the summary includes a numeric summary of the probabilities of selection and the population size (number of PSUs) as 757. 10.3 Combining sampling methods SRS, stratified, and clustered designs are the backbone of sampling designs, and their features are often combined in one design. Additionally, rather than using SRS for selection, other sampling mechanisms are commonly used, such as probability proportional to size (PPS), systematic sampling, or selection with unequal probabilities, which are briefly described here. In PPS sampling, a size measure is constructed for each unit (e.g., the population of the PSU or the number of occupied housing units), and then units with larger size measures are more likely to be sampled. Systematic sampling is commonly used to ensure representation across a population: units are sorted by a feature, and then every \\(k^{th}\\) unit is selected from a random start point so the sample is spread across the population (a minimal sketch of this selection appears at the end of this paragraph). In addition to PPS, other unequal probabilities of selection may be used. For example, in a study of establishments (e.g., businesses or public institutions) that conducts a survey every year, an establishment that recently participated (e.g., participated last year) may have a reduced chance of selection in a subsequent round to reduce the burden on the establishment. To learn more about sampling designs, refer to Valliant, Dever, and Kreuter (2013), Cox et al. (2011), Cochran (1977), and Deming (1991).
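To make the systematic selection described above concrete, here is a minimal sketch (our own illustration; the function name and arguments are hypothetical, and real selections would typically be done with specialized sampling software or packages):

systematic_sample <- function(frame, n) {
  # frame: a data.frame sorted by the feature of interest; n: sample size
  N <- nrow(frame)
  k <- N / n                # sampling interval (need not be an integer)
  start <- runif(1, 0, k)   # random start point within the first interval
  frame[ceiling(start + k * (0:(n - 1))), ]
}

# e.g., a hypothetical systematic sample of 200 schools from the api
# population frame, sorted by API score:
# apipop %>% arrange(api00) %>% systematic_sample(200)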
A common method of sampling is to stratify PSUs, select PSUs within each stratum using PPS selection, and then select units within the PSUs either with SRS or PPS. Reading survey documentation is an important first step in survey analysis: good documentation highlights the design of the survey we are using and the variables necessary to specify that design. This information is often found in User’s Guides, methodology reports, analysis guides, or technical documentation (see Chapter 3 for more details). Example For example, the 2017-2019 National Survey of Family Growth (NSFG)32 had a stratified multi-stage area probability sample: 1. In the first stage, PSUs are counties or collections of counties and are stratified by Census region/division, size (population), and MSA status. Within each stratum, PSUs were selected via PPS. 2. In the second stage, neighborhoods were selected within the sampled PSUs using PPS selection. 3. In the third stage, housing units were selected within the sampled neighborhoods. 4. In the fourth stage, a person was randomly chosen within the selected housing units among eligible persons using unequal probabilities based on the person’s age and sex. The public use file does not include all these levels of selection and instead has pseudo-strata and pseudo-clusters, which are the variables used in R to specify the design. As specified on page 4 of the documentation, the stratum variable is SEST, the cluster variable is SECU, and the weight variable is WGT2017_2019. Thus, to specify this design in R, use the following syntax: nsfg_des <- nsfgdata %>% as_survey_design(ids = SECU, strata = SEST, weights = WGT2017_2019) 10.4 Replicate weights Replicate weights are often included on analysis files instead of, or in addition to, the design variables (strata and PSUs). Replicate weights are another method used to estimate variability. Researchers often choose to use replicate weights to avoid publishing design variables (strata or clustering variables) as a measure to reduce the risk of disclosure. There are several types of replicate weights, including balanced repeated replication (BRR), Fay’s BRR, jackknife, and bootstrap methods. An overview of the process for using replicate weights is as follows: (1) divide the sample into subsample replicates that mirror the design of the sample; (2) calculate weights for each replicate using the same procedures as for the full-sample weight (i.e., nonresponse and post-stratification adjustments); (3) calculate estimates for each replicate using the same method as the full-sample estimate; and (4) calculate the estimated variance, which is proportional to the variance of the replicate estimates. The different types of replicate weights largely differ in step 1 (how the sample is divided into subsamples) and step 4 (which multiplication factors, or scales, are used in the variance formula). The general format for the standard error is: \\[ \\sqrt{\\alpha \\sum_{r=1}^R \\alpha_r (\\hat{\\theta}_r - \\hat{\\theta})^2 }\\] where \\(R\\) is the number of replicates, \\(\\alpha\\) is a constant that depends on the replication method, \\(\\alpha_r\\) is a factor associated with each replicate, \\(\\hat{\\theta}\\) is the weighted estimate based on the full sample, and \\(\\hat{\\theta}_r\\) is the weighted estimate of \\(\\theta\\) based on the \\(r^{\\text{th}}\\) replicate.
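This generic formula translates directly to R. Below is a minimal sketch (our own illustration, not code from the {survey} or {srvyr} packages, which handle this internally), where theta_hat, theta_reps, alpha, and alpha_r are hypothetical inputs corresponding to the symbols above:

replicate_se <- function(theta_hat, theta_reps, alpha, alpha_r = 1) {
  # theta_hat:  full-sample weighted estimate
  # theta_reps: vector of the R replicate estimates
  # alpha:      constant depending on the replication method
  # alpha_r:    per-replicate factor(s), recycled if a single value
  sqrt(alpha * sum(alpha_r * (theta_reps - theta_hat)^2))
}

# For example, for BRR (described in the next section), alpha = 1/R and
# alpha_r = 1:
# replicate_se(theta_hat, theta_reps, alpha = 1 / length(theta_reps))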
To create the design object for surveys with replicate weights, we use as_survey_rep() instead of the as_survey_design() function that we use for the common sampling designs in the sections above. 10.4.1 Balanced Repeated Replication (BRR) method The BRR method requires a stratified sample design with two PSUs in each stratum. Each replicate is constructed by deleting one PSU per stratum using a Hadamard matrix. For the PSU that is included, the weight is generally multiplied by two but may have other adjustments, such as post-stratification. A Hadamard matrix is a special square matrix with entries of +1 or -1 and mutually orthogonal rows. Hadamard matrices must have one row, two rows, or a multiple of four rows. The size of the Hadamard matrix is determined by the first multiple of 4 greater than or equal to the number of strata. For example, if a survey had 7 strata, the Hadamard matrix would be an \\(8\\times8\\) matrix. Additionally, a survey with 8 strata would also have an \\(8\\times8\\) Hadamard matrix. The columns in the matrix specify the strata and the rows specify the replicates. In each replicate (row), a +1 means to use the first PSU and a -1 means to use the second PSU in the estimate. For example, here is a \\(4\\times4\\) Hadamard matrix: \\[ \\begin{array}{rrrr} +1 &+1 &+1 &+1\\\\ +1&-1&+1&-1\\\\ +1&+1&-1&-1\\\\ +1 &-1&-1&+1 \\end{array} \\] In the first replicate (row), all the values are +1, so in each stratum, the first PSU would be used in the estimate. In the second replicate, the first PSU would be used in strata 1 and 3, while the second PSU would be used in strata 2 and 4. In the third replicate, the first PSU would be used in strata 1 and 2, while the second PSU would be used in strata 3 and 4. Finally, in the fourth replicate, the first PSU would be used in strata 1 and 4, while the second PSU would be used in strata 2 and 3. For more information about Hadamard matrices, see Wolter (2007). Note that supplied BRR weights from a data provider will already incorporate this adjustment, and the {survey} package generates the Hadamard matrix if necessary for calculating BRR weights, so an analyst does not need to provide the matrix. The math A weighted estimate for the full sample is calculated as \\(\\hat{\\theta}\\), and then a weighted estimate for each replicate is calculated as \\(\\hat{\\theta}_r\\) for \\(R\\) replicates. Using the generic notation above, \\(\\alpha=\\frac{1}{R}\\) and \\(\\alpha_r=1\\) for each \\(r\\). The standard error of the estimate is calculated as follows: \\[se(\\hat{\\theta})=\\sqrt{\\frac{1}{R} \\sum_{r=1}^R \\left( \\hat{\\theta}_r-\\hat{\\theta}\\right)^2}\\] Specifying replicate weights in R requires specifying the type of replicate weights, the main weight variable, the replicate weight variables, and other options. One of the key options is for the mean squared error (MSE). If mse = TRUE, variances are computed around the point estimate \\((\\hat{\\theta})\\), whereas if mse = FALSE, variances are computed around the mean of the replicates \\((\\bar{\\theta})\\) instead, which looks like this: \\[se(\\hat{\\theta})=\\sqrt{\\frac{1}{R} \\sum_{r=1}^R \\left( \\hat{\\theta}_r-\\bar{\\theta}\\right)^2}\\] where \\[\\bar{\\theta}=\\frac{1}{R}\\sum_{r=1}^R \\hat{\\theta}_r\\] The default for mse is the global option “survey.replicates.mse”, which is initially set to FALSE unless a user changes it. To determine if mse should be set to TRUE or FALSE, read the survey documentation.
If there is no indication in the survey documentation, for BRR, we recommend setting mse to TRUE, as this is the default in other software (e.g., SAS, SUDAAN). The syntax Replicate weights generally come in groups and are sequentially numbered, such as PWGTP1, PWGTP2, …, PWGTP80 for the person weights in the American Community Survey (ACS) (U.S. Census Bureau 2021) or BRRWT1, BRRWT2, …, BRRWT96 in the 2015 Residential Energy Consumption Survey (RECS) (U.S. Energy Information Administration 2017). This makes it easy to use some of the tidy selection33 functions in R. To specify a BRR design, we need to specify the weight variable (weights), the replicate weight variables (repweights), that the type of replicate weights is BRR (type = "BRR"), and whether the mean squared error should be used (mse = TRUE) or not (mse = FALSE). For example, if a dataset had WT0 for the main weight and 20 BRR weights indicated by WT1, WT2, …, WT20, we can use the following syntax (both are equivalent): brr_des <- dat %>% as_survey_rep(weights = WT0, repweights = all_of(str_c("WT", 1:20)), type = "BRR", mse = TRUE) brr_des <- dat %>% as_survey_rep(weights = WT0, repweights = num_range("WT", 1:20), type = "BRR", mse = TRUE) If a dataset had WT for the main weight and 20 BRR weights indicated by REPWT1, REPWT2, …, REPWT20, the following syntax could be used (both are equivalent): brr_des <- dat %>% as_survey_rep(weights = WT, repweights = all_of(str_c("REPWT", 1:20)), type = "BRR", mse = TRUE) brr_des <- dat %>% as_survey_rep(weights = WT, repweights = starts_with("REPWT"), type = "BRR", mse = TRUE) If the replicate weight variables are consecutive in the file, the following syntax can also be used: brr_des <- dat %>% as_survey_rep(weights = WT, repweights = REPWT1:REPWT20, type = "BRR", mse = TRUE) Typically, each replicate weight sums to a value similar to the main weight, as both the replicate weights and the main weight are supposed to provide population estimates. Rarely, an alternative method is used in which the replicate weights have values of 0 or 2 (in the case of BRR weights). This would be indicated in the documentation (see Chapter 3 for more information on how to understand the provided documentation). In this case, the replicate weights are not combined, and the option combined_weights = FALSE should be indicated, as the default value for this argument is TRUE. This specific syntax is shown below: brr_des <- dat %>% as_survey_rep(weights = WT, repweights = starts_with("REPWT"), type = "BRR", combined_weights = FALSE, mse = TRUE) Example The {survey} package includes a data example from Section 12.2 of Levy and Lemeshow (2013). In this fictional data, two out of five ambulance stations were sampled from each of three emergency service areas (ESAs); thus, BRR weights are appropriate, with two PSUs (stations) sampled in each stratum (ESA). In the code below, BRR weights are created as was done by Levy and Lemeshow (2013).
scdbrr <- scd %>% as_tibble() %>% mutate(wt = 5 / 2, rep1 = 2 * c(1, 0, 1, 0, 1, 0), rep2 = 2 * c(1, 0, 0, 1, 0, 1), rep3 = 2 * c(0, 1, 1, 0, 0, 1), rep4 = 2 * c(0, 1, 0, 1, 1, 0)) scdbrr ## # A tibble: 6 × 9 ## ESA ambulance arrests alive wt rep1 rep2 rep3 rep4 ## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 1 120 25 2.5 2 2 0 0 ## 2 1 2 78 24 2.5 0 0 2 2 ## 3 2 1 185 30 2.5 2 0 2 0 ## 4 2 2 228 49 2.5 0 2 0 2 ## 5 3 1 670 80 2.5 2 0 0 2 ## 6 3 2 530 70 2.5 0 2 2 0 To specify the BRR weights, the following syntax is used: scdbrr_des <- scdbrr %>% as_survey_rep(type = "BRR", repweights = starts_with("rep"), combined_weights = FALSE, weight = wt) scdbrr_des ## Call: Called via srvyr ## Balanced Repeated Replicates with 4 replicates. ## Sampling variables: ## - repweights: `rep1 + rep2 + rep3 + rep4` ## - weights: wt ## Data variables: ## - ESA (int), ambulance (int), arrests (dbl), alive (dbl), wt (dbl), ## rep1 (dbl), rep2 (dbl), rep3 (dbl), rep4 (dbl) summary(scdbrr_des) ## Call: Called via srvyr ## Balanced Repeated Replicates with 4 replicates. ## Sampling variables: ## - repweights: `rep1 + rep2 + rep3 + rep4` ## - weights: wt ## Data variables: ## - ESA (int), ambulance (int), arrests (dbl), alive (dbl), wt (dbl), ## rep1 (dbl), rep2 (dbl), rep3 (dbl), rep4 (dbl) ## Variables: ## [1] "ESA" "ambulance" "arrests" "alive" "wt" ## [6] "rep1" "rep2" "rep3" "rep4" Note that combined_weights was specified as FALSE because these weights are simply specified as 0 and 2 and do not incorporate the overall weight. When printing the object, the type of replication is noted as Balanced Repeated Replicates, and the replicate weights and the weight variable are specified. Additionally, the summary lists the variables included. 10.4.2 Fay’s BRR method Fay’s BRR method for replicate weights is similar to the BRR method in that it uses a Hadamard matrix to construct replicate weights. However, rather than deleting PSUs for each replicate, with Fay’s BRR half of the PSUs have a replicate weight that is the main weight multiplied by \\(\\rho\\), and the other half have the main weight multiplied by \\((2-\\rho)\\), where \\(0 \\le \\rho < 1\\). Note that when \\(\\rho=0\\), this is equivalent to the standard BRR weights, and as \\(\\rho\\) becomes closer to 1, this method is more similar to the jackknife, discussed in the next section. To obtain the value of \\(\\rho\\), it is necessary to read the survey documentation (see Chapter 3). The math The standard error estimate for \\(\\hat{\\theta}\\) is slightly different than for BRR, due to the addition of the multiplier \\(\\rho\\). Using the generic notation above, \\(\\alpha=\\frac{1}{R \\left(1-\\rho\\right)^2}\\) and \\(\\alpha_r=1 \\text{ for all } r\\). The standard error is calculated as: \\[se(\\hat{\\theta})=\\sqrt{\\frac{1}{R (1-\\rho)^2} \\sum_{r=1}^R \\left( \\hat{\\theta}_r-\\hat{\\theta}\\right)^2}\\] The syntax The syntax is very similar for BRR and Fay’s BRR. To specify a Fay’s BRR design, we need to specify the weight variable (weights), the replicate weight variables (repweights), that the type of replicate weights is Fay’s BRR (type = "Fay"), whether the mean squared error should be used (mse = TRUE) or not (mse = FALSE), and Fay’s multiplier (rho).
For example, if a dataset had WT0 for the main weight and 20 BRR weights indicated by WT1, WT2, …, WT20, and Fay’s multiplier is 0.3, use the following syntax: fay_des <- dat %>% as_survey_rep(weights = WT0, repweights = num_range("WT", 1:20), type = "Fay", mse = TRUE, rho = 0.3) Example The 2015 RECS (U.S. Energy Information Administration 2017) uses Fay’s BRR weights with the final weight as NWEIGHT and replicate weights as BRRWT1 - BRRWT96, and the documentation specifies a Fay’s multiplier of 0.5. On the file, DOEID is a unique identifier for each respondent, TOTALDOL is the total cost of energy, TOTSQFT_EN is the total square footage of the residence, and REGIONC is the Census region. We have already pulled in the 2015 RECS data from the {srvyrexploR} package that provides data for this book. To specify the design for the recs_2015 data, use the following syntax: recs_2015_des <- recs_2015 %>% as_survey_rep(weights = NWEIGHT, repweights = BRRWT1:BRRWT96, type = "Fay", rho = 0.5, mse = TRUE, variables = c(DOEID, TOTALDOL, TOTSQFT_EN, REGIONC)) recs_2015_des ## Call: Called via srvyr ## Fay's variance method (rho= 0.5 ) with 96 replicates and MSE variances. ## Sampling variables: ## - repweights: `BRRWT1 + BRRWT2 + BRRWT3 + BRRWT4 + BRRWT5 + BRRWT6 + ## BRRWT7 + BRRWT8 + BRRWT9 + BRRWT10 + BRRWT11 + BRRWT12 + BRRWT13 + ## BRRWT14 + BRRWT15 + BRRWT16 + BRRWT17 + BRRWT18 + BRRWT19 + BRRWT20 ## + BRRWT21 + BRRWT22 + BRRWT23 + BRRWT24 + BRRWT25 + BRRWT26 + ## BRRWT27 + BRRWT28 + BRRWT29 + BRRWT30 + BRRWT31 + BRRWT32 + BRRWT33 ## + BRRWT34 + BRRWT35 + BRRWT36 + BRRWT37 + BRRWT38 + BRRWT39 + ## BRRWT40 + BRRWT41 + BRRWT42 + BRRWT43 + BRRWT44 + BRRWT45 + BRRWT46 ## + BRRWT47 + BRRWT48 + BRRWT49 + BRRWT50 + BRRWT51 + BRRWT52 + ## BRRWT53 + BRRWT54 + BRRWT55 + BRRWT56 + BRRWT57 + BRRWT58 + BRRWT59 ## + BRRWT60 + BRRWT61 + BRRWT62 + BRRWT63 + BRRWT64 + BRRWT65 + ## BRRWT66 + BRRWT67 + BRRWT68 + BRRWT69 + BRRWT70 + BRRWT71 + BRRWT72 ## + BRRWT73 + BRRWT74 + BRRWT75 + BRRWT76 + BRRWT77 + BRRWT78 + ## BRRWT79 + BRRWT80 + BRRWT81 + BRRWT82 + BRRWT83 + BRRWT84 + BRRWT85 ## + BRRWT86 + BRRWT87 + BRRWT88 + BRRWT89 + BRRWT90 + BRRWT91 + ## BRRWT92 + BRRWT93 + BRRWT94 + BRRWT95 + BRRWT96` ## - weights: NWEIGHT ## Data variables: ## - DOEID (dbl), TOTALDOL (dbl), TOTSQFT_EN (dbl), REGIONC (dbl) summary(recs_2015_des) ## Call: Called via srvyr ## Fay's variance method (rho= 0.5 ) with 96 replicates and MSE variances. ## Sampling variables: ## - repweights: `BRRWT1 + BRRWT2 + BRRWT3 + BRRWT4 + BRRWT5 + BRRWT6 + ## BRRWT7 + BRRWT8 + BRRWT9 + BRRWT10 + BRRWT11 + BRRWT12 + BRRWT13 + ## BRRWT14 + BRRWT15 + BRRWT16 + BRRWT17 + BRRWT18 + BRRWT19 + BRRWT20 ## + BRRWT21 + BRRWT22 + BRRWT23 + BRRWT24 + BRRWT25 + BRRWT26 + ## BRRWT27 + BRRWT28 + BRRWT29 + BRRWT30 + BRRWT31 + BRRWT32 + BRRWT33 ## + BRRWT34 + BRRWT35 + BRRWT36 + BRRWT37 + BRRWT38 + BRRWT39 + ## BRRWT40 + BRRWT41 + BRRWT42 + BRRWT43 + BRRWT44 + BRRWT45 + BRRWT46 ## + BRRWT47 + BRRWT48 + BRRWT49 + BRRWT50 + BRRWT51 + BRRWT52 + ## BRRWT53 + BRRWT54 + BRRWT55 + BRRWT56 + BRRWT57 + BRRWT58 + BRRWT59 ## + BRRWT60 + BRRWT61 + BRRWT62 + BRRWT63 + BRRWT64 + BRRWT65 + ## BRRWT66 + BRRWT67 + BRRWT68 + BRRWT69 + BRRWT70 + BRRWT71 + BRRWT72 ## + BRRWT73 + BRRWT74 + BRRWT75 + BRRWT76 + BRRWT77 + BRRWT78 + ## BRRWT79 + BRRWT80 + BRRWT81 + BRRWT82 + BRRWT83 + BRRWT84 + BRRWT85 ## + BRRWT86 + BRRWT87 + BRRWT88 + BRRWT89 + BRRWT90 + BRRWT91 + ## BRRWT92 + BRRWT93 + BRRWT94 + BRRWT95 + BRRWT96` ## - weights: NWEIGHT ## Data variables: ## - DOEID (dbl), TOTALDOL (dbl), TOTSQFT_EN (dbl), REGIONC (dbl) ## Variables: ## [1] "DOEID" "TOTALDOL" "TOTSQFT_EN" "REGIONC" In specifying the design, the variables option was also used to indicate which variables will be used in analyses. This is optional but can make our object smaller and easier to work with. When printing the design object or looking at the summary, the replicate weight type is re-iterated as Fay's variance method (rho= 0.5) with 96 replicates and MSE variances, and the variables are included. No weight or probability summary is included in this output as we have seen in some other design objects. 10.4.3 Jackknife method There are three jackknife estimators implemented in {srvyr}: jackknife 1 (JK1), jackknife n (JKn), and jackknife 2 (JK2). The JK1 method can be used for unstratified designs, and replicates are created by removing one PSU at a time, so the number of replicates is the same as the number of PSUs. If there is no clustering, then the PSU is the ultimate sampling unit (i.e., each individual sampled unit). The JKn method is used for stratified designs and requires two or more PSUs per stratum. In this case, each replicate is created by deleting one PSU from a single stratum, so the number of replicates is the total number of PSUs across all strata. The JK2 method is a special case of JKn when exactly two PSUs are sampled per stratum. For variance estimation, scaling constants must also be specified. The math Using the generic notation above, \\(\\alpha=\\frac{R-1}{R}\\) and \\(\\alpha_r=1 \\text{ for all } r\\). For the JK1 method, the standard error estimate for \\(\\hat{\\theta}\\) is calculated as: \\[se(\\hat{\\theta})=\\sqrt{\\frac{R-1}{R} \\sum_{r=1}^R \\left( \\hat{\\theta}_r-\\hat{\\theta}\\right)^2}\\] The JKn method is a bit more complex, but the coefficients are generally provided with restricted and public-use files. For each replicate, one stratum has a PSU removed, and the weights are adjusted by \\(n_h/(n_h-1)\\), where \\(n_h\\) is the number of PSUs in stratum \\(h\\). The coefficients in other strata are set to 1.
Denote the coefficient that results from this process for replicate \\(r\\) as \\(\\alpha_r\\); then the standard error estimate for \\(\\hat{\\theta}\\) is calculated as: \\[se(\\hat{\\theta})=\\sqrt{\\sum_{r=1}^R \\alpha_r \\left( \\hat{\\theta}_r-\\hat{\\theta}\\right)^2}\\] The syntax To specify the jackknife method, we use the survey documentation to understand the type of jackknife (1, n, or 2) and the multiplier. In the syntax, we need to specify the weight variable (weights), the replicate weight variables (repweights), the type of replicate weights as jackknife 1 (type = "JK1"), n (type = "JKN"), or 2 (type = "JK2"), whether the mean squared error should be used (mse = TRUE) or not (mse = FALSE), and the multiplier (scale). For example, if the survey uses a jackknife 1 method with a multiplier of \\(\\alpha_r=(R-1)/R=19/20=0.95\\), and the dataset has WT0 for the main weight and 20 replicate weights indicated by WT1, WT2, …, WT20, use the following syntax: jk1_des <- dat %>% as_survey_rep(weights = WT0, repweights = num_range("WT", 1:20), type = "JK1", mse = TRUE, scale = 0.95) For a jackknife n method, we need to specify the multiplier for all replicates. In this case, we use the rscales argument to specify each one. The documentation will provide details on what the multipliers (\\(\\alpha_r\\)) are, and they may be the same for all replicates. For example, consider a case where \\(\\alpha_r=0.1\\) for all replicates and the dataset has WT0 for the main weight and 20 replicate weights indicated by WT1, WT2, …, WT20. We specify the type as type = "JKN" and the multipliers as rscales = rep(0.1, 20): jkn_des <- dat %>% as_survey_rep(weights = WT0, repweights = num_range("WT", 1:20), type = "JKN", mse = TRUE, rscales = rep(0.1, 20)) Example The 2020 RECS (U.S. Energy Information Administration 2023b) uses jackknife weights with the final weight as NWEIGHT and replicate weights as NWEIGHT1 - NWEIGHT60, with a scale of \\((R-1)/R=59/60\\). On the file, DOEID is a unique identifier for each respondent, TOTALDOL is the total cost of energy, TOTSQFT_EN is the total square footage of the residence, and REGIONC is the Census region. We have already read in the RECS data and created a dataset called recs_2020 above in the prerequisites. To specify this design, use the following syntax: recs_des <- recs_2020 %>% as_survey_rep( weights = NWEIGHT, repweights = NWEIGHT1:NWEIGHT60, type = "JK1", scale = 59/60, mse = TRUE, variables = c(DOEID, TOTALDOL, TOTSQFT_EN, REGIONC) ) recs_des ## Call: Called via srvyr ## Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances.
## Sampling variables: ## - repweights: `NWEIGHT1 + NWEIGHT2 + NWEIGHT3 + NWEIGHT4 + NWEIGHT5 + ## NWEIGHT6 + NWEIGHT7 + NWEIGHT8 + NWEIGHT9 + NWEIGHT10 + NWEIGHT11 + ## NWEIGHT12 + NWEIGHT13 + NWEIGHT14 + NWEIGHT15 + NWEIGHT16 + ## NWEIGHT17 + NWEIGHT18 + NWEIGHT19 + NWEIGHT20 + NWEIGHT21 + ## NWEIGHT22 + NWEIGHT23 + NWEIGHT24 + NWEIGHT25 + NWEIGHT26 + ## NWEIGHT27 + NWEIGHT28 + NWEIGHT29 + NWEIGHT30 + NWEIGHT31 + ## NWEIGHT32 + NWEIGHT33 + NWEIGHT34 + NWEIGHT35 + NWEIGHT36 + ## NWEIGHT37 + NWEIGHT38 + NWEIGHT39 + NWEIGHT40 + NWEIGHT41 + ## NWEIGHT42 + NWEIGHT43 + NWEIGHT44 + NWEIGHT45 + NWEIGHT46 + ## NWEIGHT47 + NWEIGHT48 + NWEIGHT49 + NWEIGHT50 + NWEIGHT51 + ## NWEIGHT52 + NWEIGHT53 + NWEIGHT54 + NWEIGHT55 + NWEIGHT56 + ## NWEIGHT57 + NWEIGHT58 + NWEIGHT59 + NWEIGHT60` ## - weights: NWEIGHT ## Data variables: ## - DOEID (dbl), TOTALDOL (dbl), TOTSQFT_EN (dbl), REGIONC (chr) summary(recs_des) ## Call: Called via srvyr ## Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances. ## Sampling variables: ## - repweights: `NWEIGHT1 + NWEIGHT2 + NWEIGHT3 + NWEIGHT4 + NWEIGHT5 + ## NWEIGHT6 + NWEIGHT7 + NWEIGHT8 + NWEIGHT9 + NWEIGHT10 + NWEIGHT11 + ## NWEIGHT12 + NWEIGHT13 + NWEIGHT14 + NWEIGHT15 + NWEIGHT16 + ## NWEIGHT17 + NWEIGHT18 + NWEIGHT19 + NWEIGHT20 + NWEIGHT21 + ## NWEIGHT22 + NWEIGHT23 + NWEIGHT24 + NWEIGHT25 + NWEIGHT26 + ## NWEIGHT27 + NWEIGHT28 + NWEIGHT29 + NWEIGHT30 + NWEIGHT31 + ## NWEIGHT32 + NWEIGHT33 + NWEIGHT34 + NWEIGHT35 + NWEIGHT36 + ## NWEIGHT37 + NWEIGHT38 + NWEIGHT39 + NWEIGHT40 + NWEIGHT41 + ## NWEIGHT42 + NWEIGHT43 + NWEIGHT44 + NWEIGHT45 + NWEIGHT46 + ## NWEIGHT47 + NWEIGHT48 + NWEIGHT49 + NWEIGHT50 + NWEIGHT51 + ## NWEIGHT52 + NWEIGHT53 + NWEIGHT54 + NWEIGHT55 + NWEIGHT56 + ## NWEIGHT57 + NWEIGHT58 + NWEIGHT59 + NWEIGHT60` ## - weights: NWEIGHT ## Data variables: ## - DOEID (dbl), TOTALDOL (dbl), TOTSQFT_EN (dbl), REGIONC (chr) ## Variables: ## [1] "DOEID" "TOTALDOL" "TOTSQFT_EN" "REGIONC" When printing the design object or looking at the summary, the replicate weight type is re-iterated as Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances, and the variables are included. No weight or probability summary is included. 10.4.4 Bootstrap method In bootstrap resampling, replicates are created by selecting random samples of the PSUs with replacement (SRSWR). If there are \\(M\\) PSUs in the sample, then each replicate will be created by selecting a random sample of \\(M\\) PSUs with replacement. Each replicate is created independently, and the weights for each replicate are adjusted to reflect the population, generally using the same method as how the analysis weight was adjusted. The math A weighted estimate for the full sample is calculated as \\(\\hat{\\theta}\\), and then a weighted estimate for each replicate is calculated as \\(\\hat{\\theta}_r\\) for \\(R\\) replicates. Then the standard error of the estimate is calculated as follows: \\[se(\\hat{\\theta})=\\sqrt{\\alpha \\sum_{r=1}^R \\left( \\hat{\\theta}_r-\\hat{\\theta}\\right)^2}\\] where \\(\\alpha\\) is the scaling constant. Note that the scaling constant (\\(\\alpha\\)) is provided in the survey documentation as there are many types of bootstrap methods which generate custom scaling constants. 
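To make the resampling idea concrete, here is a minimal sketch (our own illustration; the function and argument names are hypothetical, and it omits the replicate weight adjustments and scaling that real implementations, such as the bootweights() function from the {survey} package used in the example below, handle) of constructing a single bootstrap replicate weight:

one_bootstrap_replicate <- function(dat, psu_var, wt_var) {
  psus <- unique(dat[[psu_var]])
  # resample the M sampled PSUs with replacement
  draws <- sample(psus, size = length(psus), replace = TRUE)
  # count how many times each PSU was drawn
  times_drawn <- table(factor(draws, levels = psus))
  # scale each record's analysis weight by its PSU's number of draws
  dat[[wt_var]] * as.numeric(times_drawn[as.character(dat[[psu_var]])])
}

# e.g., one hypothetical replicate for the apiclus1 data:
# one_bootstrap_replicate(apiclus1, "dnum", "pw")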
The syntax To specify a bootstrap method, we need to specify the weight variable (weights), the replicate weight variables (repweights), the type of replicate weights as bootstrap (type = "bootstrap"), whether the mean squared error should be used (mse = TRUE) or not (mse = FALSE), and the multiplier (scale). For example, if a dataset had WT0 for the main weight, 20 bootstrap weights indicated by WT1, WT2, …, WT20, and a multiplier of \\(\\alpha=.02\\), use the following syntax: bs_des <- dat %>% as_survey_rep(weights = WT0, repweights = num_range("WT", 1:20), type = "bootstrap", mse = TRUE, scale = .02) Example Returning to the api example, we are going to create a dataset with bootstrap weights to use as an example. In this example, we construct a one-stage cluster design with fifty replicate weights.34 apiclus1_slim <- apiclus1 %>% as_tibble() %>% arrange(dnum) %>% select(cds, dnum, fpc, pw) set.seed(662152) apibw <- bootweights(psu = apiclus1_slim$dnum, strata = rep(1, nrow(apiclus1_slim)), fpc = apiclus1_slim$fpc, replicates = 50) bwmata <- apibw$repweights$weights[apibw$repweights$index,] * apiclus1_slim$pw apiclus1_slim <- bwmata %>% as.data.frame() %>% set_names(str_c("pw", 1:50)) %>% cbind(apiclus1_slim) %>% as_tibble() %>% select(cds, dnum, fpc, pw, everything()) apiclus1_slim ## # A tibble: 183 × 54 ## cds dnum fpc pw pw1 pw2 pw3 pw4 pw5 pw6 pw7 ## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 43693776… 61 757 33.8 33.8 0 0 33.8 0 33.8 0 ## 2 43693776… 61 757 33.8 33.8 0 0 33.8 0 33.8 0 ## 3 43693776… 61 757 33.8 33.8 0 0 33.8 0 33.8 0 ## 4 43693776… 61 757 33.8 33.8 0 0 33.8 0 33.8 0 ## 5 43693776… 61 757 33.8 33.8 0 0 33.8 0 33.8 0 ## 6 43693776… 61 757 33.8 33.8 0 0 33.8 0 33.8 0 ## 7 43693776… 61 757 33.8 33.8 0 0 33.8 0 33.8 0 ## 8 43693776… 61 757 33.8 33.8 0 0 33.8 0 33.8 0 ## 9 43693776… 61 757 33.8 33.8 0 0 33.8 0 33.8 0 ## 10 43693776… 61 757 33.8 33.8 0 0 33.8 0 33.8 0 ## # ℹ 173 more rows ## # ℹ 43 more variables: pw8 <dbl>, pw9 <dbl>, pw10 <dbl>, pw11 <dbl>, ## # pw12 <dbl>, pw13 <dbl>, pw14 <dbl>, pw15 <dbl>, pw16 <dbl>, ## # pw17 <dbl>, pw18 <dbl>, pw19 <dbl>, pw20 <dbl>, pw21 <dbl>, ## # pw22 <dbl>, pw23 <dbl>, pw24 <dbl>, pw25 <dbl>, pw26 <dbl>, ## # pw27 <dbl>, pw28 <dbl>, pw29 <dbl>, pw30 <dbl>, pw31 <dbl>, ## # pw32 <dbl>, pw33 <dbl>, pw34 <dbl>, pw35 <dbl>, pw36 <dbl>, … The output of apiclus1_slim includes the same variables we have seen in other api examples (see Table 10.1), but now additionally includes the bootstrap weights pw1, …, pw50. When creating the survey design object, we use the bootstrap weights as the replicate weights. Additionally, with replicate weights we need to include the scale (\\(\\alpha\\)). For the example we created, \\[\\alpha=\\frac{M}{(M-1)(R-1)}=\\frac{15}{(15-1)(50-1)}=0.02186589\\] where \\(M\\) is the average number of PSUs per stratum and \\(R\\) is the number of replicates. There is only one stratum and the number of clusters/PSUs is 15, so \\(M=15\\). api1_bs_des <- apiclus1_slim %>% as_survey_rep(weights = pw, repweights = pw1:pw50, type = "bootstrap", scale = 0.02186589, mse = TRUE) api1_bs_des ## Call: Called via srvyr ## Survey bootstrap with 50 replicates and MSE variances.
## Sampling variables: ## - repweights: `pw1 + pw2 + pw3 + pw4 + pw5 + pw6 + pw7 + pw8 + pw9 + ## pw10 + pw11 + pw12 + pw13 + pw14 + pw15 + pw16 + pw17 + pw18 + pw19 ## + pw20 + pw21 + pw22 + pw23 + pw24 + pw25 + pw26 + pw27 + pw28 + ## pw29 + pw30 + pw31 + pw32 + pw33 + pw34 + pw35 + pw36 + pw37 + pw38 ## + pw39 + pw40 + pw41 + pw42 + pw43 + pw44 + pw45 + pw46 + pw47 + ## pw48 + pw49 + pw50` ## - weights: pw ## Data variables: ## - cds (chr), dnum (int), fpc (dbl), pw (dbl), pw1 (dbl), pw2 (dbl), ## pw3 (dbl), pw4 (dbl), pw5 (dbl), pw6 (dbl), pw7 (dbl), pw8 (dbl), ## pw9 (dbl), pw10 (dbl), pw11 (dbl), pw12 (dbl), pw13 (dbl), pw14 ## (dbl), pw15 (dbl), pw16 (dbl), pw17 (dbl), pw18 (dbl), pw19 (dbl), ## pw20 (dbl), pw21 (dbl), pw22 (dbl), pw23 (dbl), pw24 (dbl), pw25 ## (dbl), pw26 (dbl), pw27 (dbl), pw28 (dbl), pw29 (dbl), pw30 (dbl), ## pw31 (dbl), pw32 (dbl), pw33 (dbl), pw34 (dbl), pw35 (dbl), pw36 ## (dbl), pw37 (dbl), pw38 (dbl), pw39 (dbl), pw40 (dbl), pw41 (dbl), ## pw42 (dbl), pw43 (dbl), pw44 (dbl), pw45 (dbl), pw46 (dbl), pw47 ## (dbl), pw48 (dbl), pw49 (dbl), pw50 (dbl) summary(api1_bs_des) ## Call: Called via srvyr ## Survey bootstrap with 50 replicates and MSE variances. ## Sampling variables: ## - repweights: `pw1 + pw2 + pw3 + pw4 + pw5 + pw6 + pw7 + pw8 + pw9 + ## pw10 + pw11 + pw12 + pw13 + pw14 + pw15 + pw16 + pw17 + pw18 + pw19 ## + pw20 + pw21 + pw22 + pw23 + pw24 + pw25 + pw26 + pw27 + pw28 + ## pw29 + pw30 + pw31 + pw32 + pw33 + pw34 + pw35 + pw36 + pw37 + pw38 ## + pw39 + pw40 + pw41 + pw42 + pw43 + pw44 + pw45 + pw46 + pw47 + ## pw48 + pw49 + pw50` ## - weights: pw ## Data variables: ## - cds (chr), dnum (int), fpc (dbl), pw (dbl), pw1 (dbl), pw2 (dbl), ## pw3 (dbl), pw4 (dbl), pw5 (dbl), pw6 (dbl), pw7 (dbl), pw8 (dbl), ## pw9 (dbl), pw10 (dbl), pw11 (dbl), pw12 (dbl), pw13 (dbl), pw14 ## (dbl), pw15 (dbl), pw16 (dbl), pw17 (dbl), pw18 (dbl), pw19 (dbl), ## pw20 (dbl), pw21 (dbl), pw22 (dbl), pw23 (dbl), pw24 (dbl), pw25 ## (dbl), pw26 (dbl), pw27 (dbl), pw28 (dbl), pw29 (dbl), pw30 (dbl), ## pw31 (dbl), pw32 (dbl), pw33 (dbl), pw34 (dbl), pw35 (dbl), pw36 ## (dbl), pw37 (dbl), pw38 (dbl), pw39 (dbl), pw40 (dbl), pw41 (dbl), ## pw42 (dbl), pw43 (dbl), pw44 (dbl), pw45 (dbl), pw46 (dbl), pw47 ## (dbl), pw48 (dbl), pw49 (dbl), pw50 (dbl) ## Variables: ## [1] "cds" "dnum" "fpc" "pw" "pw1" "pw2" "pw3" "pw4" "pw5" ## [10] "pw6" "pw7" "pw8" "pw9" "pw10" "pw11" "pw12" "pw13" "pw14" ## [19] "pw15" "pw16" "pw17" "pw18" "pw19" "pw20" "pw21" "pw22" "pw23" ## [28] "pw24" "pw25" "pw26" "pw27" "pw28" "pw29" "pw30" "pw31" "pw32" ## [37] "pw33" "pw34" "pw35" "pw36" "pw37" "pw38" "pw39" "pw40" "pw41" ## [46] "pw42" "pw43" "pw44" "pw45" "pw46" "pw47" "pw48" "pw49" "pw50" As with other replicate design objects, when printing the object or looking at the summary, the replicate weights are provided along with the data variables. 10.5 Exercises The National Health Interview Survey (NHIS) is an annual household survey conducted by the National Center for Health Statistics (NCHS). The NHIS includes a wide variety of health topics for adults including health status and conditions, functioning and disability, health care access and health service utilization, health-related behaviors, health promotion, mental health, barriers to care, and community engagement. Like many national in-person surveys, the sampling design is a stratified clustered design with details included in the Survey Description35. 
The Survey Description provides information on setting up syntax in SUDAAN, Stata, SPSS, SAS, and R ({survey} package implementation). How would you specify the design in {srvyr}, using either as_survey_design() or as_survey_rep()? nhis_adult_des <- nhis_adult_data %>% as_survey_design(ids = PPSU, strata = PSTRAT, nest = TRUE, weights = WTFA_A) The General Social Survey (GSS) is a survey that has been administered since 1972 on social, behavioral, and attitudinal topics. The 2016-2020 GSS Panel codebook36 provides examples of setting up syntax in SAS and Stata, but not R. How would you specify the design in R? gss_des <- gss_data %>% as_survey_design(ids = VPSU_2, strata = VSTRAT_2, weights = WTSSNR_2) References Cochran, William G. 1977. Sampling Techniques. John Wiley & Sons. Cox, Brenda G, David A Binder, B Nanjamma Chinnappa, Anders Christianson, Michael J Colledge, and Phillip S Kott. 2011. Business Survey Methods. John Wiley & Sons. Deming, W Edwards. 1991. Sample Design in Business Research. Vol. 23. John Wiley & Sons. Fuller, Wayne A. 2011. Sampling Statistics. John Wiley & Sons. Levy, Paul S, and Stanley Lemeshow. 2013. Sampling of Populations: Methods and Applications. John Wiley & Sons. Penn State. 2019. “STAT 506: Sampling Theory and Methods [Online Course].” https://online.stat.psu.edu/stat506/. Särndal, Carl-Erik, Bengt Swensson, and Jan Wretman. 2003. Model Assisted Survey Sampling. Springer Science & Business Media. U.S. Census Bureau. 2021. “Understanding and Using the American Community Survey Public Use Microdata Sample Files What Data Users Need to Know.” U.S. Government Printing Office; https://www.census.gov/content/dam/Census/library/publications/2021/acs/acs_pums_handbook_2021.pdf. U.S. Energy Information Administration. 2017. “Residential Energy Consumption Survey (RECS): Using the 2015 microdata file to compute estimates and standard errors (RSEs).” https://www.eia.gov/consumption/residential/data/2015/pdf/microdata_v3.pdf. ———. 2023b. “2020 Residential Energy Consumption Survey: Using the microdata file to compute estimates and relative standard errors (RSEs).” https://www.eia.gov/consumption/residential/data/2020/pdf/microdata-guide.pdf. Valliant, Richard, Jill A Dever, and Frauke Kreuter. 2013. Practical Tools for Designing and Weighting Survey Samples. Vol. 1. Springer. Wolter, Kirk M. 2007. Introduction to Variance Estimation. Vol. 53. Springer. 2017-2019 National Survey of Family Growth (NSFG): Sample Design Documentation - https://www.cdc.gov/nchs/data/nsfg/NSFG-2017-2019-Sample-Design-Documentation-508.pdf↩︎ dplyr documentation on tidy-select: https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html↩︎ We provide the code here for you to replicate this example, but are not focusing on the creation of the weights as that is outside the scope of this book.
We recommend you reference Wolter (2007) for more information on creating bootstrap weights.↩︎ 2022 National Health Interview Survey (NHIS) Survey Description: https://www.cdc.gov/nchs/nhis/2022nhis.htm↩︎ 2016-2020 GSS Panel Codebook Release 1a: https://gss.norc.org/Documents/codebook/2016-2020%20GSS%20Panel%20Codebook%20-%20R1a.pdf↩︎ Chapter 11 Missing data Prerequisites For this chapter, load the following packages: library(tidyverse) library(survey) library(srvyr) library(srvyrexploR) library(naniar) library(haven) library(gt) We will be using data from ANES and RECS. Here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter 3 for more information). targetpop <- 231592693 data(anes_2020) anes_adjwgt <- anes_2020 %>% mutate(Weight = Weight / sum(Weight) * targetpop) anes_des <- anes_adjwgt %>% as_survey_design( weights = Weight, strata = Stratum, ids = VarUnit, nest = TRUE ) For RECS, details are included in the RECS documentation and Chapter 10. data(recs_2020) recs_des <- recs_2020 %>% as_survey_rep( weights = NWEIGHT, repweights = NWEIGHT1:NWEIGHT60, type = "JK1", scale = 59/60, mse = TRUE ) 11.1 Introduction Missing data in surveys refers to situations where participants do not provide complete responses to survey questions. Respondents may not have seen a question by design, or they may not respond to a question for various other reasons, such as not wanting to answer a particular question, not understanding the question, or simply forgetting to answer. Missing data is important to consider and account for, as it can introduce bias and reduce the representativeness of the data. This chapter provides an overview of the types of missing data, how to assess missing data in surveys, and how to conduct analysis when missing data is present. Understanding this complex topic can help ensure accurate reporting of survey results and can provide insight into potential changes to the survey design for the future. 11.2 Missing data mechanisms There are two main categories that missing data typically fall into: missing by design and unintentional missing data. Missing by design is part of the survey plan and can be more easily incorporated into weights and analyses. Unintentional missing data, on the other hand, can lead to bias in survey estimates if not correctly accounted for. Below we provide more information on the types of missing data. Missing by design/questionnaire skip logic: This type of missingness occurs when certain respondents are intentionally directed to skip specific questions based on their previous responses or characteristics. For example, in a survey about employment, if a respondent indicates that they are not employed, they may be directed to skip questions related to their job responsibilities. Additionally, some surveys randomize questions or modules so that not all participants respond to all questions. In these instances, respondents would have missing data for the modules not randomly assigned to them. Unintentional missing data: This type of missingness occurs when researchers do not intend for there to be missing data on a particular question, for example, if respondents did not finish the survey or refused to answer individual questions.
There are three main types of unintentional missing data, each of which should be considered and handled differently (Mack, Su, and Westreich 2018; Schafer and Graham 2002): Missing completely at random (MCAR): The missing data is unrelated to both observed and unobserved data, and the probability of being missing is the same across all cases. For example, a respondent missed a question because they had to leave the survey early due to an emergency. Missing at random (MAR): The missing data is related to observed data but not unobserved data, and the probability of being missing is the same within groups. For example, older respondents choose not to answer specific questions but younger respondents do answer them, and we know the respondent’s age. Missing not at random (MNAR): The missing data is related to unobserved data, and the probability of being missing varies for reasons we are not measuring. For example, respondents with depression do not answer a question about depression severity. 11.3 Assessing missing data Before beginning analysis, we should explore the data to determine if there is missing data and what types of missing data are present. Conducting this descriptive analysis can help with analysis and reporting of survey data (see Chapter 12), and can inform the survey design in future studies. For example, large amounts of unexpected missing data may indicate the questions were unclear or difficult to recall. There are several ways to explore missing data, which we walk through below. When assessing the missing data, we recommend using a data.frame object and not the survey object, as most of the analysis is about patterns of records and weights are not necessary. 11.3.1 Summarize data A very rudimentary first exploration is to use the summary() function to summarize the data, which will illuminate NA values in the data. Let’s look at a few analytic variables on the ANES 2020 data using summary(): anes_2020 %>% select(V202051:EarlyVote2020) %>% summary() ## V202051 V202066 V202072 VotedPres2020 ## Min. :-9.000 Min. :-9.0 Min. :-9.000 Yes :5952 ## 1st Qu.:-1.000 1st Qu.: 4.0 1st Qu.: 1.000 No : 77 ## Median :-1.000 Median : 4.0 Median : 1.000 NA's:1424 ## Mean :-0.726 Mean : 3.4 Mean : 0.623 ## 3rd Qu.:-1.000 3rd Qu.: 4.0 3rd Qu.: 1.000 ## Max. : 3.000 Max. : 4.0 Max. : 2.000 ## V202073 V202109x V202110x ## Min. :-9.000 Min. :-2.000 Min. :-9.00 ## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.00 ## Median : 1.000 Median : 1.000 Median : 1.00 ## Mean : 0.942 Mean : 0.858 Mean : 0.99 ## 3rd Qu.: 2.000 3rd Qu.: 1.000 3rd Qu.: 2.00 ## Max. :12.000 Max. : 1.000 Max. : 5.00 ## VotedPres2020_selection EarlyVote2020 ## Biden:3509 Yes : 371 ## Trump:2567 No :5949 ## Other: 158 NA's:1133 ## NA's :1219 ## ## We see that there are NA values in several of the derived variables (those not beginning with “V”) and negative values in the original variables (those beginning with “V”). We can also use the count() function to get an understanding of the different types of missing data on the original variables. For example, let’s look at the count of data for V202072, which corresponds to our VotedPres2020 variable. anes_2020 %>% count(VotedPres2020, V202072) ## # A tibble: 5 × 3 ## VotedPres2020 V202072 n ## <fct> <dbl+lbl> <int> ## 1 Yes 1 [1. Yes, voted for President] 5952 ## 2 No 2 [2. No, didn't vote for President] 77 ## 3 <NA> -9 [-9. Refused] 2 ## 4 <NA> -6 [-6. No post-election interview] 4 ## 5 <NA> -1 [-1. Inapplicable] 1418 Here we can see that there are three types of missing data, and that the majority of them fall under the “Inapplicable” category. This is usually a term associated with data missing due to skip patterns and is considered to be missing data by design. Based on the documentation from ANES (DeBell 2010), we can see that this question was only asked of respondents who voted in the election. 11.3.2 Visualization of missing data It can be challenging to look at tables for every variable; instead, it may be more efficient to view missing data in a graphical format to help narrow in on patterns or unique variables. The {naniar} package is very useful for exploring missing data visually. It provides quick graphics to explore the missingness patterns in the data. We can use the vis_miss() function, available in both the {visdat} and {naniar} packages, to view the amount of missing data by variable. anes_2020_derived <- anes_2020 %>% select(!starts_with("V2"), -CaseID, -InterviewMode, -Weight, -Stratum, -VarUnit) anes_2020_derived %>% vis_miss(cluster = TRUE, show_perc = FALSE) + scale_fill_manual(values = book_colors[c(3, 1)], labels = c("Present", "Missing"), name = "") FIGURE 11.1: Visual depiction of missing data in the ANES 2020 data From this visualization, we can start to get a picture of which questions may be related to each other in terms of missing data. Even if we did not have the informative variable names, we would be able to deduce that VotedPres2020, VotedPres2020_selection, and EarlyVote2020 are likely related since their missing data patterns are similar. Additionally, we can also look at VotedPres2016_selection and see that there is a lot of missing data in that variable. Most likely this is due to a skip pattern, and we can look at further graphics to see how it might be related to other variables. The {naniar} package has multiple visualization functions that can help dive deeper, such as the gg_miss_fct() function, which looks at missing data for all variables by levels of another variable. anes_2020_derived %>% gg_miss_fct(VotedPres2016) + scale_fill_gradientn( guide = "colorbar", name = "% Miss", colors = book_colors[c(3, 2, 1)] ) + ylab("Variable") + xlab("Voted for President in 2016") ## Scale for fill is already present. ## Adding another scale for fill, which will replace the existing scale. FIGURE 11.2: Missingness in variables for each level of VotedPres2016 in the ANES 2020 data In this case, we can see that if respondents did not vote for president in 2016 or did not answer that question, then they were not asked about who they voted for in 2016 (the percentage of missing data is 100%). Additionally, we can see with this graphic that there is more missing data across all questions if they did not provide an answer to VotedPres2016. There are other graphics that work well with numeric data. For example, in the RECS 2020 data, we can plot two continuous variables and the missing data associated with them to see if there are any patterns in the missingness. To do this, we can use the bind_shadow() function from the {naniar} package. This creates a nabular (a combination of “na” with “tabular”), which features the original columns followed by the same number of columns with a specific NA format. These NA columns are indicators of whether the value in the original data is missing or not. The example printed below shows that most levels of HeatingBehavior are flagged as not missing (!NA) in the shadow variable HeatingBehavior_NA, while those missing in HeatingBehavior are also flagged as missing (NA) in HeatingBehavior_NA.
recs_2020_shadow <- recs_2020 %>% bind_shadow() ncol(recs_2020) ## [1] 118 ncol(recs_2020_shadow) ## [1] 236 recs_2020_shadow %>% count(HeatingBehavior,HeatingBehavior_NA) ## # A tibble: 7 × 3 ## HeatingBehavior HeatingBehavior_NA n ## <fct> <fct> <int> ## 1 Set one temp and leave it !NA 7806 ## 2 Manually adjust at night/no one home !NA 4654 ## 3 Programmable or smart thermostat automatical… !NA 3310 ## 4 Turn on or off as needed !NA 1491 ## 5 No control !NA 438 ## 6 Other !NA 46 ## 7 <NA> NA 751 We can then use these new variables to plot the missing data alongside the actual data. For example, let’s plot a histogram of the total energy cost grouped by whether the heating behavior response is missing. recs_2020_shadow %>% filter(TOTALDOL < 5000) %>% ggplot(aes(x=TOTALDOL,fill=HeatingBehavior_NA)) + geom_histogram() + scale_fill_manual(values = book_colors[c(3, 1)], labels = c("Present", "Missing"), name = "Heating Behavior") + theme_minimal() + xlab("Total Energy Cost (Truncated at $5000)") + ylab("Number of Households") + labs(title = "Histogram of Energy Cost by Heating Behavior Missing Data") ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. FIGURE 11.3: Histogram of Energy Cost by Heating Behavior Missing Data This plot indicates that respondents who did not provide a response to the heating behavior question may have a different distribution of total energy cost compared to respondents who did provide a response. This view of the raw data and missingness could indicate some bias in the data. Researchers take these aspects of bias into account when calculating weights, and we need to make sure that the weights are incorporated when analyzing the data. There are many other visualizations that can be helpful in reviewing the data, and we recommend reviewing the {naniar} documentation for more information (Tierney and Cook 2023). 11.4 Analysis with missing data Once we understand the types of missingness, we can begin the analysis of the data. Different missingness types may be handled in different ways. In most publicly available datasets, researchers will have already calculated weights and imputed missing values if deemed necessary. For those interested in learning more about how to calculate weights and impute data for different missing data mechanisms, we recommend Kim and Shao (2021) and Valliant and Dever (2018). Even with weights and imputation, missing data will most likely still exist in the data and need to be accounted for in analysis. This section provides an overview of how to recode missing data in R and how to account for skip patterns in analysis. 11.4.1 Recoding missing data Even within a variable, there can be different reasons for missing data. In publicly released data, negative values are often used to encode different reasons for missingness. For example, the ANES 2020 data uses the following negative values to represent different types of missing data: * -9: Refused * -8: Don’t Know * -7: No post-election data, deleted due to incomplete interview * -6: No post-election interview * -5: Interview breakoff (sufficient partial IW) * -4: Technical error * -3: Restricted * -2: Other missing reason (question specific) * -1: Inapplicable When we created the derived variables for use in this book, we coded all negative values as NA and proceeded to analyze the data. In most cases this is an appropriate approach, as long as you filter the data appropriately to account for skip patterns (see the next section).
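As a minimal sketch of that recoding step (an illustration against the raw ANES variables, not the exact code used to build the {srvyrexploR} datasets; the object name anes_2020_na is hypothetical), all negative codes could be converted to NA with {dplyr}:

library(dplyr)

anes_2020_na <- anes_2020 %>%
  # illustration only: treat any negative code (e.g., -9 Refused, -1 Inapplicable)
  # as missing across all numeric columns
  mutate(across(where(is.numeric), ~ case_when(.x < 0 ~ NA, TRUE ~ .x)))

A blanket recode like this loses the distinction between the different reasons for missingness.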
However, the {naniar} package does have the option to code special missing values. For example, if we wanted two NA values, one indicating the question was missing by design (e.g., due to skip patterns) and one for the other missing categories, we can use the nabular format to incorporate these with the recode_shadow() function. anes_2020_shadow<-anes_2020 %>% select(starts_with("V2")) %>% mutate(across(everything(),~case_when(.x < -1 ~ NA, TRUE~.x))) %>% bind_shadow() %>% recode_shadow(V201103 = .where(V201103==-1~"skip")) anes_2020_shadow %>% count(V201103,V201103_NA) ## # A tibble: 5 × 3 ## V201103 V201103_NA n ## <dbl+lbl> <fct> <int> ## 1 -1 [-1. Inapplicable] NA_skip 1643 ## 2 1 [1. Hillary Clinton] !NA 2911 ## 3 2 [2. Donald Trump] !NA 2466 ## 4 5 [5. Other {SPECIFY}] !NA 390 ## 5 NA NA 43 However, it is important to note that, at the time of publication, there is no easy way to apply recode_shadow() to multiple variables at once (e.g., we cannot use the tidyverse feature across()). The example code above only implements this for a single variable, so this would have to be done for all variables of interest manually or in a loop. 11.4.2 Accounting for skip patterns When questions are skipped by design in a survey, the resulting missingness is meaningful. For example, the RECS survey asks people how they control the heat in their home in the winter (HeatingBehavior). This question is only asked of those who have heat in their home (SpaceHeatingUsed). If there is no heating equipment used, the value of HeatingBehavior is missing. There are several choices when analyzing this data, including 1) only including those with a valid value of HeatingBehavior and specifying the universe as those with heat, or 2) including those who do not have heat. It is important to specify what population an analysis generalizes to. Here is example code where we only include those with a valid value of HeatingBehavior (choice 1). Note that we use the design object (recs_des) and then filter to those that are not missing on HeatingBehavior. heat_cntl_1 <- recs_des %>% filter(!is.na(HeatingBehavior)) %>% group_by(HeatingBehavior) %>% summarize( p=survey_prop() ) heat_cntl_1 ## # A tibble: 6 × 3 ## HeatingBehavior p p_se ## <fct> <dbl> <dbl> ## 1 Set one temp and leave it 0.430 4.69e-3 ## 2 Manually adjust at night/no one home 0.264 4.54e-3 ## 3 Programmable or smart thermostat automatically adjust… 0.168 3.12e-3 ## 4 Turn on or off as needed 0.102 2.89e-3 ## 5 No control 0.0333 1.70e-3 ## 6 Other 0.00208 3.59e-4 Here is example code where we include those that do not have heat (choice 2). To help understand what we are looking at, we have included the output showing both variables SpaceHeatingUsed and HeatingBehavior. heat_cntl_2 <- recs_des %>% group_by(interact(SpaceHeatingUsed, HeatingBehavior)) %>% summarize( p=survey_prop() ) heat_cntl_2 ## # A tibble: 7 × 4 ## SpaceHeatingUsed HeatingBehavior p p_se ## <lgl> <fct> <dbl> <dbl> ## 1 FALSE <NA> 0.0469 2.07e-3 ## 2 TRUE Set one temp and leave it 0.410 4.60e-3 ## 3 TRUE Manually adjust at night/no one home 0.251 4.36e-3 ## 4 TRUE Programmable or smart thermostat aut… 0.160 2.95e-3 ## 5 TRUE Turn on or off as needed 0.0976 2.79e-3 ## 6 TRUE No control 0.0317 1.62e-3 ## 7 TRUE Other 0.00198 3.41e-4 If we ran the first analysis, we would say that 16.8% of households with heat use a programmable or smart thermostat for the heating of their home.
If we instead used the results from the second analysis, we would say that 16.0% of households use a programmable or smart thermostat for the heating of their home. The distinction between the two statements is the universe: households with heat versus all households. Skip patterns often change the universe that we are talking about and need to be carefully examined. Filtering to the correct universe is important when handling these types of missing data. The nabular we created above can also help with this. If we have NA_skip values in the shadow, we can make sure that we filter out all of these values and only include the relevant missing values. To do this with survey data, we could first create the nabular, then create the design object on that data, and then use the shadow variables to assist with filtering the data. Let’s use the nabular we created above for ANES 2020 (anes_2020_shadow) to create the design object. anes_adjwgt_shadow <- anes_2020_shadow %>% mutate(V200010b = V200010b/sum(V200010b)*targetpop) anes_des_shadow <- anes_adjwgt_shadow %>% as_survey_design( weights = V200010b, strata = V200010d, ids = V200010c, nest = TRUE ) Then we can use this design object to look at the percent of the population that voted for each candidate in 2016 (V201103). First, let’s look at the percentages without removing any cases: pres16_select1<-anes_des_shadow %>% group_by(V201103) %>% summarize( All_Missing=survey_prop() ) pres16_select1 ## # A tibble: 5 × 3 ## V201103 All_Missing All_Missing_se ## <dbl+lbl> <dbl> <dbl> ## 1 -1 [-1. Inapplicable] 0.324 0.00933 ## 2 1 [1. Hillary Clinton] 0.330 0.00728 ## 3 2 [2. Donald Trump] 0.299 0.00728 ## 4 5 [5. Other {SPECIFY}] 0.0409 0.00230 ## 5 NA 0.00627 0.00121 Next, we will look at the percentages removing only those that were missing due to skip patterns (i.e., they did not receive this question). pres16_select2<-anes_des_shadow %>% filter(V201103_NA!="NA_skip") %>% group_by(V201103) %>% summarize( No_Skip_Missing=survey_prop() ) pres16_select2 ## # A tibble: 4 × 3 ## V201103 No_Skip_Missing No_Skip_Missing_se ## <dbl+lbl> <dbl> <dbl> ## 1 1 [1. Hillary Clinton] 0.488 0.00870 ## 2 2 [2. Donald Trump] 0.443 0.00856 ## 3 5 [5. Other {SPECIFY}] 0.0606 0.00330 ## 4 NA 0.00928 0.00178 Finally, we will look at the percentages removing all missing values, both due to skip patterns and due to those who refused to answer the question. pres16_select3<-anes_des_shadow %>% filter(V201103_NA=="!NA") %>% group_by(V201103) %>% summarize( No_Missing=survey_prop() ) pres16_select3 ## # A tibble: 3 × 3 ## V201103 No_Missing No_Missing_se ## <dbl+lbl> <dbl> <dbl> ## 1 1 [1. Hillary Clinton] 0.492 0.00875 ## 2 2 [2. Donald Trump] 0.447 0.00861 ## 3 5 [5.
Other {SPECIFY}] 0.0611 0.00332
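Table 11.1 below places these three sets of estimates side by side. As a hedged sketch of how such a comparison might be assembled (the actual table-formatting code is not shown in this chapter; pres16_compare is a hypothetical name), the three summaries can be joined by candidate:

pres16_compare <- pres16_select1 %>%
  # join the three summaries so each row holds all three estimates per candidate
  full_join(pres16_select2, by = "V201103") %>%
  full_join(pres16_select3, by = "V201103")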
TABLE 11.1: Percentage of Votes by Candidate for Different Missing Data Inclusions (estimates shown as % with s.e. (%) in parentheses)
Candidate | Including All Missing Data | Removing Skip Patterns Only | Removing All Missing Data
Did not Vote for President in 2016 | 32.4% (0.9%) | NA | NA
Hillary Clinton | 33.0% (0.7%) | 48.8% (0.9%) | 49.2% (0.9%)
Donald Trump | 29.9% (0.7%) | 44.3% (0.9%) | 44.7% (0.9%)
Other Candidate | 4.1% (0.2%) | 6.1% (0.3%) | 6.1% (0.3%)
Missing | 0.6% (0.1%) | 0.9% (0.2%) | NA
As Table 11.1 shows, the results can vary greatly depending on which types of missing data are removed. If we remove only the skip-pattern missingness, the margin between Clinton and Trump is 4.5 percentage points, but if we include all the data, even those who did not vote in 2016, the margin is 3.1 percentage points. How we handle the different types of missing values is important for the interpretation of the data. References DeBell, Matthew. 2010. “How to Analyze ANES Survey Data.” ANES Technical Report Series nes012492. Palo Alto, CA: Stanford University; Ann Arbor, MI: University of Michigan. https://electionstudies.org/wp-content/uploads/2018/05/HowToAnalyzeANESData.pdf. Kim, Jae Kwang, and Jun Shao. 2021. Statistical Methods for Handling Incomplete Data. Chapman & Hall/CRC Press. Mack, Christina, Zhaohui Su, and Daniel Westreich. 2018. “Types of Missing Data.” In Managing Missing Data in Patient Registries: Addendum to Registries for Evaluating Patient Outcomes: A User’s Guide, Third Edition. Rockville, MD: Agency for Healthcare Research and Quality (US). https://www.ncbi.nlm.nih.gov/books/NBK493614/. Schafer, Joseph L., and John W. Graham. 2002. “Missing Data: Our View of the State of the Art.” Psychological Methods 7: 147–77. https://doi.org/10.1037//1082-989X.7.2.147. Tierney, Nicholas, and Dianne Cook. 2023. “Expanding Tidy Data Principles to Facilitate Missing Data Exploration, Visualization and Assessment of Imputations.” Journal of Statistical Software 105 (7): 1–31. https://doi.org/10.18637/jss.v105.i07. Valliant, Richard, and Jill A. Dever. 2018. Survey Weights: A Step-by-Step Guide to Calculation. Stata Press.
"],["c12-pitfalls.html", "Chapter 12 Common pitfalls", " Chapter 12 Common pitfalls "],["c13-ncvs-vignette.html", "Chapter 13 National Crime Victimization Survey Vignette 13.1 Introduction 13.2 Data Structure 13.3 Survey Notation 13.4 Data File Preparation 13.5 Survey Design Objects 13.6 Calculating Estimates 13.7 Statistical testing 13.8 Exercises", " Chapter 13 National Crime Victimization Survey Vignette Prerequisites For this chapter, load the following packages: library(tidyverse) library(survey) library(srvyr) library(srvyrexploR) library(gt) We will use data from the United States National Crime Victimization Survey (NCVS). Here is the code to read in the three datasets from the {srvyrexploR} package: data(ncvs_2021_incident) data(ncvs_2021_household) data(ncvs_2021_person) 13.1 Introduction The NCVS is a household survey sponsored by the Bureau of Justice Statistics (BJS), which collects data on criminal victimization, including characteristics of the crimes, offenders, and victims. Crime types include both household and personal crimes, as well as violent and non-violent crimes. The target population of this survey is all people in the United States age 12 and older living in housing units and noninstitutional group quarters. The NCVS has been ongoing since 1992. An earlier survey, the National Crime Survey, was run from 1972 to 1991 (Bureau of Justice Statistics 2017). The survey is administered using a rotating panel. When an address enters the sample, the residents of that address are interviewed every six months for a total of seven interviews. If the initial residents move away from the address during the period, the new residents are included in the survey, as people are not followed when they move. NCVS data is publicly available and distributed by Inter-university Consortium for Political and Social Research (ICPSR)37, with data going back to 1992. The vignette in this book will include data from 2021 (United States. Bureau of Justice Statistics 2022). The NCVS data structure is complicated, and the User’s Guide contains examples for analysis in SAS, SUDAAN, SPSS, and Stata, but not R (Shook-Sa, Bonnie, Couzens, G. Lance, and Berzofsky, Marcus 2015). This vignette will adapt those examples for R. 13.2 Data Structure The data from ICPSR is distributed with five files, each having its unique identifier indicated: Address Record - YEARQ, IDHH Household Record - YEARQ, IDHH Person Record - YEARQ, IDHH, IDPER Incident Record - YEARQ, IDHH, IDPER 2021 Collection Year Incident - YEARQ, IDHH, IDPER We will focus on the household, person, and incident files. From these files, we selected a subset of columns for examples to use in this vignette. We have included data in the {srvyexploR} package with a subset of columns, but you can download the complete files at ICPSR38. 13.3 Survey Notation The NCVS User Guide (Shook-Sa, Bonnie, Couzens, G. Lance, and Berzofsky, Marcus 2015) uses the following notation: \\(i\\) represents NCVS households, identified on the household-level file with the household identification number IDHH. \\(j\\) represents NCVS individual respondents within households \\(i\\), identified on the person-level file with the person identification number IDPER. \\(k\\) represents reporting periods (i.e., YEARQ) for households \\(i\\) and individual respondent \\(j\\). \\(l\\) represents victimization records for respondent \\(j\\) in household \\(i\\) and reporting period \\(k\\). 
Each record on the NCVS incident-level file is associated with a victimization record \\(l\\). \\(D\\) represents one or more domain characteristics of interest in the calculation of NCVS estimates. For victimization totals and proportions, domains can be defined on the basis of crime types (e.g., violent crimes, property crimes), characteristics of victims (e.g., age, sex, household income), or characteristics of the victimizations (e.g., victimizations reported to police, victimizations committed with a weapon present). Domains could also be a combination of all of these types of characteristics. For example, in the calculation of victimization rates, domains are defined on the basis of the characteristics of the victims. \\(A_a\\) represents the level \\(a\\) of covariate \\(A\\). Covariate \\(A\\) is defined in the calculation of victimization proportions and represents the characteristic for which the analyst wants to obtain the distribution of victimizations in domain \\(D\\). \\(C\\) represents the personal or property crime for which we want to obtain a victimization rate. In this vignette, we will discuss four estimates: Victimization totals estimate the number of criminal victimizations with a given characteristic. As demonstrated below, these can be calculated from any of the data files. The estimated victimization total, \\(\\hat{t}_D\\) for domain \\(D\\) is estimated as \\[ \\hat{t}_D = \\sum_{ijkl \\in D} v_{ijkl}\\] where \\(v_{ijkl}\\) is the series-adjusted victimization weight for household \\(i\\), respondent \\(j\\), reporting period \\(k\\), and victimization \\(l\\), that is WGTVICCY. Victimization proportions estimate characteristics among victimizations or victims. Victimization proportions are calculated using the incident data file. The estimated victimization proportion for domain \\(D\\) across level \\(a\\) of covariate \\(A\\), \\(\\hat{p}_{A_a,D}\\) is \\[ \\hat{p}_{A_a,D} =\\frac{\\sum_{ijkl \\in A_a, D} v_{ijkl}}{\\sum_{ijkl \\in D} v_{ijkl}}.\\] The numerator is the number of incidents with a particular characteristic in a domain, and the denominator is the number of incidents in a domain. Victimization rates are estimates of the number of victimizations per 1,000 persons or households in the population. Victimization rates are calculated using the household or person-level data files. The estimated victimization rate for crime \\(C\\) in domain \\(D\\) is \\[\\hat{VR}_{C,D}= \\frac{\\sum_{ijkl \\in C,D} v_{ijkl}}{\\sum_{ijk \\in D} w_{ijk}}\\times 1000\\] where \\(w_{ijk}\\) is the person weight (WGTPERCY) or household weight (WGTHHCY) for personal and household crimes, respectively. The numerator is the number of incidents in a domain, and the denominator is the number of persons or households in a domain. Notice that the weights in the numerator and denominator are different; this is important, and in the syntax and examples below, we will discuss how to make an estimate that involves two weights. Prevalence rates are estimates of the percentage of the population (persons or households) who are victims of a crime. These are estimated using the household or person-level data files. The estimated prevalence rate for crime \\(C\\) in domain \\(D\\) is \\[ \\hat{PR}_{C, D}= \\frac{\\sum_{ijk \\in {C,D}} I_{ij}w_{ijk}}{\\sum_{ijk \\in D} w_{ijk}} \\times 100\\] where \\(I_{ij}\\) is an indicator that a person or household in domain \\(D\\) was a victim of crime \\(C\\) at any time in the year.
The numerator is the number of victims in domain \\(D\\) for crime \\(C\\), and the denominator is the number of people or households in the population. 13.4 Data File Preparation Some work is necessary to prepare the files before analysis. The design variables indicating pseudostratum (V2117) and half-sample code (V2118) are only included on the household file, so they must be added to the person and incident files for any analysis. For victimization rates, we need to know the victimization status for both victims and non-victims. Therefore, the incident file must be summarized and merged onto the household or person files for household-level and person-level crimes, respectively. We begin this vignette by discussing how to create these incident summary files. This follows Section 2.2 of the NCVS User’s Guide (Shook-Sa, Bonnie, Couzens, G. Lance, and Berzofsky, Marcus 2015). 13.4.1 Preparing Files for Estimation of Victimization Rates Each record on the incident file represents one victimization, which is not the same as one incident. Some victimizations comprise several instances whose details are difficult for the victim to differentiate; these are labeled “series crimes”. Appendix A of the User’s Guide indicates how to calculate the series weight in other statistical languages. Here, we adapt that code for R. Essentially, if a victimization is a series crime, its series weight is top-coded at 10 based on the number of actual victimizations; that is, even if the crime occurred more than 10 times, it is counted as 10 times to reduce the influence of extreme outliers. If an incident is a series crime, but the number of occurrences is unknown, the series weight is set to 6. A description of the variables used to create the series indicators and the associated weights is included in Table 13.1.
TABLE 13.1: Codebook for incident variables related to the series weight
V4016 How many times incident occur last 6 mos: 1-996 Number of times; 997 Don’t know
V4017 How many incidents: 1 1-5 incidents (not a “series”); 2 6 or more incidents; 8 Residue (invalid data)
V4018 Incidents similar in detail: 1 Similar; 2 Different (not in a “series”); 8 Residue (invalid data)
V4019 Enough detail to distinguish incidents: 1 Yes (not a “series”); 2 No (is a “series”); 8 Residue (invalid data)
WGTVICCY Adjusted victimization weight: Numeric
We want to create four variables to indicate if an incident is a series crime. First, we create a variable called series using V4017, V4018, and V4019, where an incident is considered a series crime if there are 6 or more incidents (V4017), the incidents are similar in detail (V4018), or there is not enough detail to distinguish the incidents (V4019). Next, we top-code the number of incidents (V4016) by creating a variable n10v4016, which is set to 10 if V4016 > 10. Finally, we create the series weight using our new top-coded variable and the existing weight. inc_series <- ncvs_2021_incident %>% mutate( series = case_when(V4017 %in% c(1, 8) ~ 1, V4018 %in% c(2, 8) ~ 1, V4019 %in% c(1, 8) ~ 1, TRUE ~ 2 ), n10v4016 = case_when(V4016 %in% c(997, 998) ~ NA_real_, V4016 > 10 ~ 10, TRUE ~ V4016), serieswgt = case_when(series == 2 & is.na(n10v4016) ~ 6, series == 2 ~ n10v4016, TRUE ~ 1), NEWWGT = WGTVICCY * serieswgt ) The next step in preparing the files for estimation is to create indicators on the victimization file for characteristics of interest.
Almost all BJS publications limit the analysis to records where the victimization occurred in the United States (i.e., where V4022 is not equal to 1), and we will do this for all estimates as well. A brief codebook of the variables for this task is located in Table 13.2.
TABLE 13.2: Codebook for incident variables - crime type indicators and characteristics
V4022 In what city/town/village: 1 Outside U.S.; 2 Not inside a city/town/village; 3 Same city/town/village as present residence; 4 Different city/town/village as present residence; 5 Don’t know; 6 Don’t know if 2, 4, or 5
V4049 Did offender have weapon: 1 Yes; 2 No; 3 Don’t know
V4050 What was weapon: 1 At least one good entry; 3 Indicates “Yes-Type Weapon-NA”; 7 Indicates “Gun Type Unknown”; 8 No good entry
V4051 Hand gun: 0 No; 1 Yes
V4052 Other gun: 0 No; 1 Yes
V4053 Knife: 0 No; 1 Yes
V4399 Reported to police: 1 Yes; 2 No; 3 Don’t know
V4529 Type of crime code: 01 Completed rape; 02 Attempted rape; 03 Sexual attack with serious assault; 04 Sexual attack with minor assault; 05 Completed robbery with injury from serious assault; 06 Completed robbery with injury from minor assault; 07 Completed robbery without injury from minor assault; 08 Attempted robbery with injury from serious assault; 09 Attempted robbery with injury from minor assault; 10 Attempted robbery without injury; 11 Completed aggravated assault with injury; 12 Attempted aggravated assault with weapon; 13 Threatened assault with weapon; 14 Simple assault completed with injury; 15 Sexual assault without injury; 16 Unwanted sexual contact without force; 17 Assault without weapon without injury; 18 Verbal threat of rape; 19 Verbal threat of sexual assault; 20 Verbal threat of assault; 21 Completed purse snatching; 22 Attempted purse snatching; 23 Pocket picking (completed only); 31 Completed burglary, forcible entry; 32 Completed burglary, unlawful entry without force; 33 Attempted forcible entry; 40 Completed motor vehicle theft; 41 Attempted motor vehicle theft; 54 Completed theft less than $10; 55 Completed theft $10 to $49; 56 Completed theft $50 to $249; 57 Completed theft $250 or greater; 58 Completed theft value NA; 59 Attempted theft
Using these variables, we will create the following indicators:
Property crime: V4529 >= 31 (Variable: Property)
Violent crime: V4529 <= 20 (Variable: Violent)
Property crime reported to the police: V4529 >= 31 and V4399 = 1 (Variable: Property_ReportPolice)
Violent crime reported to the police: V4529 <= 20 and V4399 = 1 (Variable: Violent_ReportPolice)
Aggravated assault without a weapon: V4529 in 11:13 and V4049 = 2 (Variable: AAST_NoWeap)
Aggravated assault with a firearm: V4529 in 11:13 and V4049 = 1 and (V4051 = 1 or V4052 = 1 or V4050 = 7) (Variable: AAST_Firearm)
Aggravated assault with a knife or sharp object: V4529 in 11:13 and V4049 = 1 and (V4053 = 1 or V4054 = 1) (Variable: AAST_Knife)
Aggravated assault with another type of weapon: V4529 in 11:13 and V4049 = 1 and V4050 = 1 and not firearm or knife (Variable: AAST_Other)
inc_ind <- inc_series %>% filter(V4022 != 1) %>% mutate( WeapCat = case_when( is.na(V4049) ~ NA_character_, V4049 == 2 ~ "NoWeap", V4049 == 3 ~ "UnkWeapUse", V4050 == 3 ~ "Other", V4051 == 1 | V4052 == 1 | V4050 == 7 ~ "Firearm", V4053 == 1 | V4054 == 1 ~ "Knife", TRUE ~ "Other" ), V4529_num = parse_number(as.character(V4529)), ReportPolice = V4399 == 1, Property = V4529_num >= 31, Violent = V4529_num <= 20, Property_ReportPolice = Property & ReportPolice, Violent_ReportPolice = Violent & ReportPolice, AAST = V4529_num %in% 11:13, AAST_NoWeap = AAST & WeapCat == "NoWeap", AAST_Firearm =
AAST & WeapCat == "Firearm", AAST_Knife = AAST & WeapCat == "Knife", AAST_Other = AAST & WeapCat == "Other" ) This is a good point to pause to look at the output of crosswalks between an original variable and a derived one to check that the logic was programmed correctly and that everything ends up in the expected category. inc_series %>% count(V4022) ## # A tibble: 6 × 2 ## V4022 n ## <fct> <int> ## 1 1 34 ## 2 2 65 ## 3 3 7697 ## 4 4 1143 ## 5 5 39 ## 6 8 4 inc_ind %>% count(V4022) ## # A tibble: 5 × 2 ## V4022 n ## <fct> <int> ## 1 2 65 ## 2 3 7697 ## 3 4 1143 ## 4 5 39 ## 5 8 4 inc_ind %>% count(WeapCat, V4049, V4050, V4051, V4052, V4052, V4053, V4054) ## # A tibble: 13 × 8 ## WeapCat V4049 V4050 V4051 V4052 V4053 V4054 n ## <chr> <fct> <fct> <fct> <fct> <fct> <fct> <int> ## 1 Firearm 1 1 0 1 0 0 15 ## 2 Firearm 1 1 0 1 1 1 1 ## 3 Firearm 1 1 1 0 0 0 125 ## 4 Firearm 1 1 1 0 1 0 2 ## 5 Firearm 1 1 1 1 0 0 3 ## 6 Firearm 1 7 0 0 0 0 3 ## 7 Knife 1 1 0 0 0 1 14 ## 8 Knife 1 1 0 0 1 0 71 ## 9 NoWeap 2 <NA> <NA> <NA> <NA> <NA> 1794 ## 10 Other 1 1 0 0 0 0 147 ## 11 Other 1 3 0 0 0 0 26 ## 12 UnkWeapUse 3 <NA> <NA> <NA> <NA> <NA> 519 ## 13 <NA> <NA> <NA> <NA> <NA> <NA> <NA> 6228 inc_ind %>% count(V4529, Property, Violent, AAST) %>% print(n = 40) ## # A tibble: 34 × 5 ## V4529 Property Violent AAST n ## <fct> <lgl> <lgl> <lgl> <int> ## 1 1 FALSE TRUE FALSE 45 ## 2 2 FALSE TRUE FALSE 20 ## 3 3 FALSE TRUE FALSE 11 ## 4 4 FALSE TRUE FALSE 3 ## 5 5 FALSE TRUE FALSE 24 ## 6 6 FALSE TRUE FALSE 26 ## 7 7 FALSE TRUE FALSE 59 ## 8 8 FALSE TRUE FALSE 5 ## 9 9 FALSE TRUE FALSE 7 ## 10 10 FALSE TRUE FALSE 57 ## 11 11 FALSE TRUE TRUE 97 ## 12 12 FALSE TRUE TRUE 91 ## 13 13 FALSE TRUE TRUE 163 ## 14 14 FALSE TRUE FALSE 165 ## 15 15 FALSE TRUE FALSE 24 ## 16 16 FALSE TRUE FALSE 12 ## 17 17 FALSE TRUE FALSE 357 ## 18 18 FALSE TRUE FALSE 14 ## 19 19 FALSE TRUE FALSE 3 ## 20 20 FALSE TRUE FALSE 607 ## 21 21 FALSE FALSE FALSE 2 ## 22 22 FALSE FALSE FALSE 2 ## 23 23 FALSE FALSE FALSE 19 ## 24 31 TRUE FALSE FALSE 248 ## 25 32 TRUE FALSE FALSE 634 ## 26 33 TRUE FALSE FALSE 188 ## 27 40 TRUE FALSE FALSE 256 ## 28 41 TRUE FALSE FALSE 97 ## 29 54 TRUE FALSE FALSE 407 ## 30 55 TRUE FALSE FALSE 1006 ## 31 56 TRUE FALSE FALSE 1686 ## 32 57 TRUE FALSE FALSE 1420 ## 33 58 TRUE FALSE FALSE 798 ## 34 59 TRUE FALSE FALSE 395 inc_ind %>% count(ReportPolice, V4399) ## # A tibble: 4 × 3 ## ReportPolice V4399 n ## <lgl> <fct> <int> ## 1 FALSE 2 5670 ## 2 FALSE 3 103 ## 3 FALSE 8 12 ## 4 TRUE 1 3163 inc_ind %>% count(AAST, WeapCat, AAST_NoWeap, AAST_Firearm, AAST_Knife, AAST_Other) ## # A tibble: 11 × 7 ## AAST WeapCat AAST_NoWeap AAST_Firearm AAST_Knife AAST_Other n ## <lgl> <chr> <lgl> <lgl> <lgl> <lgl> <int> ## 1 FALSE Firearm FALSE FALSE FALSE FALSE 34 ## 2 FALSE Knife FALSE FALSE FALSE FALSE 23 ## 3 FALSE NoWeap FALSE FALSE FALSE FALSE 1769 ## 4 FALSE Other FALSE FALSE FALSE FALSE 27 ## 5 FALSE UnkWeapUse FALSE FALSE FALSE FALSE 516 ## 6 FALSE <NA> FALSE FALSE FALSE FALSE 6228 ## 7 TRUE Firearm FALSE TRUE FALSE FALSE 115 ## 8 TRUE Knife FALSE FALSE TRUE FALSE 62 ## 9 TRUE NoWeap TRUE FALSE FALSE FALSE 25 ## 10 TRUE Other FALSE FALSE FALSE TRUE 146 ## 11 TRUE UnkWeapUse FALSE FALSE FALSE FALSE 3 After creating indicators of victimization types and characteristics, the file is summarized, and crimes are summed across persons or households by YEARQ. 
Property crimes (i.e., crimes committed against households, such as household burglary or motor vehicle theft) are summed across households, and personal crimes (i.e., crimes committed against an individual, such as assault, robbery, and personal theft) are summed across persons. The indicators are summed using the serieswgt, and the variable WGTVICCY needs to be retained for later analysis. inc_hh_sums <- inc_ind %>% filter(V4529_num > 23) %>% # restrict to household crimes group_by(YEARQ, IDHH) %>% summarize(WGTVICCY = WGTVICCY[1], across(starts_with("Property"), ~ sum(. * serieswgt), .names = "{.col}"), .groups = "drop") inc_pers_sums <- inc_ind %>% filter(V4529_num <= 23) %>% # restrict to person crimes group_by(YEARQ, IDHH, IDPER) %>% summarize(WGTVICCY = WGTVICCY[1], across(c(starts_with("Violent"), starts_with("AAST")), ~ sum(. * serieswgt), .names = "{.col}"), .groups = "drop") Now, we merge the victimization summary files into the appropriate files. For any record on the household or person file that is not on the victimization file, the victimization counts are set to 0 after merging. In this step, we will also create the victimization adjustment factor. See Section 2.2.4 in the User’s Guide for details of why this adjustment is created (Shook-Sa, Bonnie, Couzens, G. Lance, and Berzofsky, Marcus 2015). It is calculated as follows: \\[ A_{ijk}=\\frac{v_{ijk}}{w_{ijk}}\\] where \\(w_{ijk}\\) is the person weight (WGTPERCY) for personal crimes or the household weight (WGTHHCY) for household crimes, and \\(v_{ijk}\\) is the victimization weight (WGTVICCY) for household \\(i\\), respondent \\(j\\), in reporting period \\(k\\). The adjustment factor is set to 0 if no incidents are reported. # Set up a list of 0s for each crime type/characteristic to replace NAs hh_z_list <- rep(0, ncol(inc_hh_sums) - 3) %>% as.list() %>% setNames(names(inc_hh_sums)[-(1:3)]) pers_z_list <- rep(0, ncol(inc_pers_sums) - 4) %>% as.list() %>% setNames(names(inc_pers_sums)[-(1:4)]) hh_vsum <- ncvs_2021_household %>% full_join(inc_hh_sums, by = c("YEARQ", "IDHH")) %>% replace_na(hh_z_list) %>% mutate(ADJINC_WT = if_else(is.na(WGTVICCY), 0, WGTVICCY / WGTHHCY)) pers_vsum <- ncvs_2021_person %>% full_join(inc_pers_sums, by = c("YEARQ", "IDHH", "IDPER")) %>% replace_na(pers_z_list) %>% mutate(ADJINC_WT = if_else(is.na(WGTVICCY), 0, WGTVICCY / WGTPERCY)) 13.4.2 Derived Demographic Variables A final step in file preparation is creating any derived variables on the household and person files, such as income or age categories, for subgroup analysis. We can do this step before or after merging the victimization counts. 13.4.2.1 Household Variables For the household file, we create categories for tenure (rental status), urbanicity, income, place size, and region. A codebook of the household variables is located in Table 13.3.
TABLE 13.3: Codebook for household variables
V2015 Tenure: 1 Owned or being bought; 2 Rented for cash; 3 No cash rent
SC214A Household Income: 01 Less than $5,000; 02 $5,000 to $7,499; 03 $7,500 to $9,999; 04 $10,000 to $12,499; 05 $12,500 to $14,999; 06 $15,000 to $17,499; 07 $17,500 to $19,999; 08 $20,000 to $24,999; 09 $25,000 to $29,999; 10 $30,000 to $34,999; 11 $35,000 to $39,999; 12 $40,000 to $49,999; 13 $50,000 to $74,999; 15 $75,000 to $99,999; 16 $100,000-$149,999; 17 $150,000-$199,999; 18 $200,000 or more
V2126B Place Size Code: 00 Not in a place; 13 Under 10,000; 16 10,000-49,999; 17 50,000-99,999; 18 100,000-249,999; 19 250,000-499,999; 20 500,000-999,999; 21 1,000,000-2,499,999; 22 2,500,000-4,999,999; 23 5,000,000 or more
V2127B Region: 1 Northeast; 2 Midwest; 3 South; 4 West
V2143 Urbanicity: 1 Urban; 2 Suburban; 3 Rural
hh_vsum_der <- hh_vsum %>% mutate( Tenure = factor(case_when(V2015 == 1 ~ "Owned", !is.na(V2015) ~ "Rented"), levels = c("Owned", "Rented")), Urbanicity = factor(case_when(V2143 == 1 ~ "Urban", V2143 == 2 ~ "Suburban", V2143 == 3 ~ "Rural"), levels = c("Urban", "Suburban", "Rural")), SC214A_num = as.numeric(as.character(SC214A)), Income = case_when(SC214A_num <= 8 ~ "Less than $25,000", SC214A_num <= 12 ~ "$25,000-49,999", SC214A_num <= 15 ~ "$50,000-99,999", SC214A_num <= 17 ~ "$100,000-199,999", SC214A_num <= 18 ~ "$200,000 or more"), Income = fct_reorder(Income, SC214A_num, .na_rm = FALSE), PlaceSize = case_match(as.numeric(as.character(V2126B)), 0 ~ "Not in a place", 13 ~ "Under 10,000", 16 ~ "10,000-49,999", 17 ~ "50,000-99,999", 18 ~ "100,000-249,999", 19 ~ "250,000-499,999", 20 ~ "500,000-999,999", c(21, 22, 23) ~ "1,000,000 or more"), PlaceSize = fct_reorder(PlaceSize, as.numeric(V2126B)), Region = case_match(as.numeric(V2127B), 1 ~ "Northeast", 2 ~ "Midwest", 3 ~ "South", 4 ~ "West"), Region = fct_reorder(Region, as.numeric(V2127B)) ) As before, we want to check to make sure the recoded variables we create match the existing data as expected.
hh_vsum_der %>% count(Tenure, V2015) ## # A tibble: 4 × 3 ## Tenure V2015 n ## <fct> <fct> <int> ## 1 Owned 1 101944 ## 2 Rented 2 46269 ## 3 Rented 3 1925 ## 4 <NA> <NA> 106322 hh_vsum_der %>% count(Urbanicity, V2143) ## # A tibble: 3 × 3 ## Urbanicity V2143 n ## <fct> <fct> <int> ## 1 Urban 1 26878 ## 2 Suburban 2 173491 ## 3 Rural 3 56091 hh_vsum_der %>% count(Income, SC214A) ## # A tibble: 18 × 3 ## Income SC214A n ## <fct> <fct> <int> ## 1 Less than $25,000 1 7841 ## 2 Less than $25,000 2 2626 ## 3 Less than $25,000 3 3949 ## 4 Less than $25,000 4 5546 ## 5 Less than $25,000 5 5445 ## 6 Less than $25,000 6 4821 ## 7 Less than $25,000 7 5038 ## 8 Less than $25,000 8 11887 ## 9 $25,000-49,999 9 11550 ## 10 $25,000-49,999 10 13689 ## 11 $25,000-49,999 11 13655 ## 12 $25,000-49,999 12 23282 ## 13 $50,000-99,999 13 44601 ## 14 $50,000-99,999 15 33353 ## 15 $100,000-199,999 16 34287 ## 16 $100,000-199,999 17 15317 ## 17 $200,000 or more 18 16892 ## 18 <NA> <NA> 2681 hh_vsum_der %>% count(PlaceSize, V2126B) ## # A tibble: 10 × 3 ## PlaceSize V2126B n ## <fct> <fct> <int> ## 1 Not in a place 0 69484 ## 2 Under 10,000 13 39873 ## 3 10,000-49,999 16 53002 ## 4 50,000-99,999 17 27205 ## 5 100,000-249,999 18 24461 ## 6 250,000-499,999 19 13111 ## 7 500,000-999,999 20 15194 ## 8 1,000,000 or more 21 6167 ## 9 1,000,000 or more 22 3857 ## 10 1,000,000 or more 23 4106 hh_vsum_der %>% count(Region, V2127B) ## # A tibble: 4 × 3 ## Region V2127B n ## <fct> <fct> <int> ## 1 Northeast 1 41585 ## 2 Midwest 2 74666 ## 3 South 3 87783 ## 4 West 4 52426 13.4.2.2 Person Variables For the person file, we create categories for sex, race/Hispanic origin, age, and marital status. A codebook of the person variables is located in Table 13.4. We also merge the household demographics and the design variables (V2117 and V2118) onto the person file.
TABLE 13.4: Codebook for person variables
V3014 Age: 12 through 90
V3015 Current Marital Status: 1 Married; 2 Widowed; 3 Divorced; 4 Separated; 5 Never married
V3018 Sex: 1 Male; 2 Female
V3023A Race: 01 White only; 02 Black only; 03 American Indian, Alaska native only; 04 Asian only; 05 Hawaiian/Pacific Islander only; 06 White-Black; 07 White-American Indian; 08 White-Asian; 09 White-Hawaiian; 10 Black-American Indian; 11 Black-Asian; 12 Black-Hawaiian/Pacific Islander; 13 American Indian-Asian; 14 Asian-Hawaiian/Pacific Islander; 15 White-Black-American Indian; 16 White-Black-Asian; 17 White-American Indian-Asian; 18 White-Asian-Hawaiian; 19 2 or 3 races; 20 4 or 5 races
V3024 Hispanic Origin: 1 Yes; 2 No
# Set label for usage later NHOPI <- "Native Hawaiian or Other Pacific Islander" pers_vsum_der <- pers_vsum %>% mutate( Sex = factor(case_when(V3018 == 1 ~ "Male", V3018 == 2 ~ "Female")), RaceHispOrigin = factor(case_when(V3024 == 1 ~ "Hispanic", V3023A == 1 ~ "White", V3023A == 2 ~ "Black", V3023A == 4 ~ "Asian", V3023A == 5 ~ NHOPI, TRUE ~ "Other"), levels = c("White", "Black", "Hispanic", "Asian", NHOPI, "Other")), V3014_num = as.numeric(as.character(V3014)), AgeGroup = case_when(V3014_num <= 17 ~ "12-17", V3014_num <= 24 ~ "18-24", V3014_num <= 34 ~ "25-34", V3014_num <= 49 ~ "35-49", V3014_num <= 64 ~ "50-64", V3014_num <= 90 ~ "65 or older"), AgeGroup = fct_reorder(AgeGroup, V3014_num), MaritalStatus = factor(case_when(V3015 == 1 ~ "Married", V3015 == 2 ~ "Widowed", V3015 == 3 ~ "Divorced", V3015 == 4 ~ "Separated", V3015 == 5 ~ "Never married"), levels = c("Never married", "Married", "Widowed","Divorced", "Separated")) ) %>% left_join(hh_vsum_der %>% select(YEARQ, IDHH, V2117, V2118, Tenure:Region), by = c("YEARQ", "IDHH")) As before, we want to check to make sure the recoded variables we create match the existing data as expected.
pers_vsum_der %>% count(Sex, V3018) ## # A tibble: 2 × 3 ## Sex V3018 n ## <fct> <fct> <int> ## 1 Female 2 150956 ## 2 Male 1 140922 pers_vsum_der %>% count(RaceHispOrigin, V3024) ## # A tibble: 11 × 3 ## RaceHispOrigin V3024 n ## <fct> <fct> <int> ## 1 White 2 197292 ## 2 White 8 883 ## 3 Black 2 29947 ## 4 Black 8 120 ## 5 Hispanic 1 41450 ## 6 Asian 2 16015 ## 7 Asian 8 61 ## 8 Native Hawaiian or Other Pacific Islander 2 891 ## 9 Native Hawaiian or Other Pacific Islander 8 9 ## 10 Other 2 5161 ## 11 Other 8 49 pers_vsum_der %>% filter(RaceHispOrigin != "Hispanic" | is.na(RaceHispOrigin)) %>% count(RaceHispOrigin, V3023A) ## # A tibble: 20 × 3 ## RaceHispOrigin V3023A n ## <fct> <fct> <int> ## 1 White 1 198175 ## 2 Black 2 30067 ## 3 Asian 4 16076 ## 4 Native Hawaiian or Other Pacific Islander 5 900 ## 5 Other 3 1319 ## 6 Other 6 1217 ## 7 Other 7 1025 ## 8 Other 8 837 ## 9 Other 9 184 ## 10 Other 10 178 ## 11 Other 11 87 ## 12 Other 12 27 ## 13 Other 13 13 ## 14 Other 14 53 ## 15 Other 15 136 ## 16 Other 16 45 ## 17 Other 17 11 ## 18 Other 18 33 ## 19 Other 19 22 ## 20 Other 20 23 pers_vsum_der %>% group_by(AgeGroup) %>% summarize(minAge = min(V3014), maxAge = max(V3014), .groups = "drop") ## # A tibble: 6 × 3 ## AgeGroup minAge maxAge ## <fct> <dbl> <dbl> ## 1 12-17 12 17 ## 2 18-24 18 24 ## 3 25-34 25 34 ## 4 35-49 35 49 ## 5 50-64 50 64 ## 6 65 or older 65 90 pers_vsum_der %>% count(MaritalStatus, V3015) ## # A tibble: 6 × 3 ## MaritalStatus V3015 n ## <fct> <fct> <int> ## 1 Never married 5 90425 ## 2 Married 1 148131 ## 3 Widowed 2 17668 ## 4 Divorced 3 28596 ## 5 Separated 4 4524 ## 6 <NA> 8 2534 We then create tibbles that contain only the variables we need, which makes the analyses easier. hh_vsum_slim <- hh_vsum_der %>% select(YEARQ:V2118, WGTVICCY:ADJINC_WT, Tenure, Urbanicity, Income, PlaceSize, Region) pers_vsum_slim <- pers_vsum_der %>% select(YEARQ:WGTPERCY, WGTVICCY:ADJINC_WT, Sex:Region) To calculate estimates about types of crime, such as what percentage of violent crimes are reported to the police, we must use the incident file. The incident file is not guaranteed to have every pseudostratum and half-sample code, so dummy records are created and appended before estimation. Finally, we merge demographic variables onto the incident tibble. dummy_records <- hh_vsum_slim %>% distinct(V2117, V2118) %>% mutate(Dummy = 1, WGTVICCY = 1, NEWWGT = 1) inc_analysis <- inc_ind %>% mutate(Dummy = 0) %>% left_join(select(pers_vsum_slim, YEARQ, IDHH, IDPER, Sex:Region), by = c("YEARQ", "IDHH", "IDPER")) %>% bind_rows(dummy_records) %>% select(YEARQ:IDPER, WGTVICCY, NEWWGT, V4529, WeapCat, ReportPolice, Property:Region) The tibbles hh_vsum_slim, pers_vsum_slim, and inc_analysis can now be used to create design objects and calculate crime rate estimates. 13.5 Survey Design Objects All the data prep above is necessary to prepare the data for survey analysis. At this point, we can create the design objects and finally begin analysis. We will create three design objects for different types of analysis, as they depend on which type of estimate we are creating. For the incident data, the weight for analysis is NEWWGT, which we constructed previously. The household and person-level data use WGTHHCY and WGTPERCY, respectively. For all analyses, V2117 is the strata variable, and V2118 is the cluster/PSU variable for analysis.
inc_des <- inc_analysis %>% as_survey( weight = NEWWGT, strata = V2117, ids = V2118, nest = TRUE ) hh_des <- hh_vsum_slim %>% as_survey( weight = WGTHHCY, strata = V2117, ids = V2118, nest = TRUE ) pers_des <- pers_vsum_slim %>% as_survey( weight = WGTPERCY, strata = V2117, ids = V2118, nest = TRUE ) 13.6 Calculating Estimates Now that we have prepared our data and created the design objects, we can calculate our estimates. As a reminder, those are: Victimization totals estimate the number of criminal victimizations with a given characteristic. Victimization proportions estimate characteristics among victimizations or victims. Victimization rates are estimates of the number of victimizations per 1,000 persons or households in the population. Prevalence rates are estimates of the percentage of the population (persons or households) who are victims of a crime. 13.6.1 Estimation 1: Victimization Totals There are two ways to calculate victimization totals. Using the incident design object (inc_des) is the most straightforward method, but the person (pers_des) and household (hh_des) design objects can be used as well if the adjustment factor (ADJINC_WT) is incorporated. In the example below, the total number of property and violent victimizations is first calculated using the incident file and then using the household and person design objects. The incident file is smaller, and thus, estimation is faster using that file, but the estimates will be the same, as illustrated below: vt1 <- inc_des %>% summarize(Property_Vzn = survey_total(Property, na.rm = TRUE), Violent_Vzn = survey_total(Violent, na.rm = TRUE)) vt2a <- hh_des %>% summarize(Property_Vzn = survey_total(Property * ADJINC_WT, na.rm = TRUE)) vt2b <- pers_des %>% summarize(Violent_Vzn = survey_total(Violent * ADJINC_WT, na.rm = TRUE)) vt1 ## # A tibble: 1 × 4 ## Property_Vzn Property_Vzn_se Violent_Vzn Violent_Vzn_se ## <dbl> <dbl> <dbl> <dbl> ## 1 11682056. 263844. 4598306. 198115. vt2a ## # A tibble: 1 × 2 ## Property_Vzn Property_Vzn_se ## <dbl> <dbl> ## 1 11682056. 263844. vt2b ## # A tibble: 1 × 2 ## Violent_Vzn Violent_Vzn_se ## <dbl> <dbl> ## 1 4598306. 198115. The number of victimizations estimated using the incident file is equivalent to the person and household file method. There are 11,682,056 property incidents and 4,598,306 violent incidents in a six-month period. 13.6.2 Estimation 2: Victimization Proportions Victimization proportions are proportions describing features of a victimization. The key here is that these are questions among victimizations, not among the population. These types of estimates can only be calculated using the incident design object (inc_des). For example, we could be interested in the percentage of property victimizations reported to the police, as shown in the following code with an estimate, the standard error, and a 95% confidence interval: prop1 <- inc_des %>% filter(Property) %>% summarize(Pct = survey_mean(ReportPolice, na.rm = TRUE, proportion=TRUE, vartype=c("se", "ci")) * 100) prop1 ## # A tibble: 1 × 4 ## Pct Pct_se Pct_low Pct_upp ## <dbl> <dbl> <dbl> <dbl> ## 1 30.8 0.798 29.2 32.4 Or, the percentage of violent victimizations that are in urban areas: prop2 <- inc_des %>% filter(Violent) %>% summarize(Pct = survey_mean(Urbanicity=="Urban", na.rm = TRUE) * 100) prop2 ## # A tibble: 1 × 2 ## Pct Pct_se ## <dbl> <dbl> ## 1 18.1 1.49 In 2021, we estimate that 30.8% of property crimes were reported to the police and 18.1% of violent crimes occurred in urban areas.
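The same pattern extends directly to other crime types. As a brief sketch (using only objects defined above; the name prop3 is hypothetical), the share of violent victimizations reported to the police could be estimated with a confidence interval in the same way:

prop3 <- inc_des %>%
  # same structure as prop1, but restricted to violent victimizations
  filter(Violent) %>%
  summarize(Pct = survey_mean(ReportPolice, na.rm = TRUE,
                              proportion = TRUE,
                              vartype = c("se", "ci")) * 100)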
13.6.3 Estimation 3: Victimization Rates Victimization rates measure the number of victimizations per 1,000 persons or households in the population. They are not an estimate of the proportion of households or persons who are victimized, which is a prevalence rate, described in section 13.6.4. Victimization rates are estimated using the household (hh_des) or person (pers_des) design objects depending on the type of crime, and the adjustment factor (ADJINC_WT) must be incorporated. We return to the example of property and violent victimizations used for victimization totals (section 13.6.1). In the following example, the property victimization total is calculated as above, along with the property victimization rate (using survey_mean()) and the population size (using survey_total()). As mentioned in the introduction, victimization rates use the incident weight in the numerator and the person or household weight in the denominator. This is accomplished by calculating the rates with the weight adjustment (ADJINC_WT) multiplied by the estimate of interest. Let’s look at an example of property victimization. vr_prop <- hh_des %>% summarize( Property_Vzn = survey_total(Property * ADJINC_WT, na.rm = TRUE), Property_Rate = survey_mean(Property * ADJINC_WT * 1000, na.rm = TRUE), PopSize = survey_total(1, vartype = NULL) ) vr_prop ## # A tibble: 1 × 5 ## Property_Vzn Property_Vzn_se Property_Rate Property_Rate_se PopSize ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 11682056. 263844. 90.3 1.95 129319232. In the output above, we see the estimate for the property victimization rate in 2021 was 90.3 per 1,000 households, which is consistent with calculating it as the number of victimizations per 1,000 households, as demonstrated in the next chunk: vr_prop %>% select(-ends_with("se")) %>% mutate(Property_Rate_manual=Property_Vzn/PopSize*1000) ## # A tibble: 1 × 4 ## Property_Vzn Property_Rate PopSize Property_Rate_manual ## <dbl> <dbl> <dbl> <dbl> ## 1 11682056. 90.3 129319232. 90.3 Victimization rates can also be calculated for particular characteristics of the victimization. In the following example, we calculate the rate of aggravated assault with no weapon, with a firearm, with a knife, and with another type of weapon. pers_des %>% summarize(across( starts_with("AAST_"), ~ survey_mean(. * ADJINC_WT * 1000, na.rm = TRUE) )) ## # A tibble: 1 × 8 ## AAST_NoWeap AAST_NoWeap_se AAST_Firearm AAST_Firearm_se AAST_Knife ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.249 0.0595 0.860 0.101 0.455 ## # ℹ 3 more variables: AAST_Knife_se <dbl>, AAST_Other <dbl>, ## # AAST_Other_se <dbl> A common desire is to calculate victimization rates by several characteristics. For example, we may want to calculate the violent victimization rate and aggravated assault rate by sex, race/Hispanic origin, age group, marital status, and household income. This requires a separate group_by() statement for each categorization. Thus, we make a function to do this and then use map() from the {purrr} package (part of the tidyverse) to loop through the variables. This function takes a demographic variable as its input (byvar) and calculates the violent and aggravated assault victimization rate for each level. It then creates columns with the variable name, the level of the variable, and a numeric version of the level (LevelNum) for sorting later. The function is run across multiple variables using map(), and the results are stacked into a single output using bind_rows().
pers_est_by <- function(byvar) { pers_des %>% rename(Level := {{byvar}}) %>% filter(!is.na(Level)) %>% group_by(Level) %>% summarize( Violent = survey_mean(Violent * ADJINC_WT * 1000, na.rm = TRUE), AAST = survey_mean(AAST * ADJINC_WT * 1000, na.rm = TRUE) ) %>% mutate( Variable = byvar, LevelNum = as.numeric(Level), Level = as.character(Level) ) %>% select(Variable, Level, LevelNum, everything()) } pers_est_df <- c("Sex", "RaceHispOrigin", "AgeGroup", "MaritalStatus", "Income") %>% map(pers_est_by) %>% bind_rows() The output from all the estimates is cleaned to create better labels, such as going from “RaceHispOrigin” to “Race/Hispanic origin”. Finally, the {gt} package is used to make a publishable table (Table 13.5). Using the functions from the {gt} package, column labels and footnotes are added, and estimates are presented to the first decimal place. vr_gt<-pers_est_df %>% mutate( Variable = case_when( Variable == "RaceHispOrigin" ~ "Race/Hispanic origin", Variable == "MaritalStatus" ~ "Marital status", Variable == "AgeGroup" ~ "Age", TRUE ~ Variable ) ) %>% select(-LevelNum) %>% group_by(Variable) %>% gt(rowname_col = "Level") %>% tab_spanner( label = "Violent crime", id = "viol_span", columns = c("Violent", "Violent_se") ) %>% tab_spanner(label = "Aggravated assault", columns = c("AAST", "AAST_se")) %>% cols_label( Violent = "Rate", Violent_se = "SE", AAST = "Rate", AAST_se = "SE", ) %>% fmt_number( columns = c("Violent", "Violent_se", "AAST", "AAST_se"), decimals = 1 ) %>% tab_footnote( footnote = "Includes rape or sexual assault, robbery, aggravated assault, and simple assault.", locations = cells_column_spanners(spanners = "viol_span") ) %>% tab_footnote( footnote = "Excludes persons of Hispanic origin", locations = cells_stub(rows = Level %in% c("White", "Black", "Asian", NHOPI, "Other"))) %>% tab_footnote( footnote = "Includes persons who identified as Native Hawaiian or Other Pacific Islander only.", locations = cells_stub(rows = Level == NHOPI) ) %>% tab_footnote( footnote = "Includes persons who identified as American Indian or Alaska Native only or as two or more races.", locations = cells_stub(rows = Level == "Other") ) %>% tab_source_note( source_note = "Note: Rates per 1,000 persons age 12 or older.") %>% tab_source_note(source_note = "Source: Bureau of Justice Statistics, National Crime Victimization Survey, 2021.") %>% tab_stubhead(label = "Victim demographic") %>% tab_caption("Rate and standard error of violent victimization, by type of crime and demographic characteristics, 2021") vr_gt
13.6.4 Estimation 4: Prevalence Rates

Prevalence rates differ from victimization rates because the numerator is the number of people or households victimized rather than the number of victimizations. To calculate prevalence rates, we run another summary of the data, first deriving an indicator for whether a person or household was a victim of a particular crime at any point in the year. Below is an example of calculating the indicator and then the prevalence rates of violent crime and aggravated assault.

pers_prev_des <- pers_vsum_slim %>%
  mutate(Year = floor(YEARQ)) %>%
  mutate(
    Violent_Ind = sum(Violent) > 0,
    AAST_Ind = sum(AAST) > 0,
    .by = c("Year", "IDHH", "IDPER")
  ) %>%
  as_survey(
    weight = WGTPERCY,
    strata = V2117,
    ids = V2118,
    nest = TRUE
  )

pers_prev_ests <- pers_prev_des %>%
  summarize(
    Violent_Prev = survey_mean(Violent_Ind * 100),
    AAST_Prev = survey_mean(AAST_Ind * 100)
  )

pers_prev_ests

## # A tibble: 1 × 4
##   Violent_Prev Violent_Prev_se AAST_Prev AAST_Prev_se
##          <dbl>           <dbl>     <dbl>        <dbl>
## 1        0.980          0.0349     0.215       0.0143

In the example above, the indicator is multiplied by 100 to return a percentage rather than a proportion. In 2021, we estimate that 0.98% of people aged 12 and older were victims of violent crime in the United States, and 0.22% were victims of aggravated assault.
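The same design object supports subgroup prevalence estimates. As a quick sketch (not part of the original analysis, and assuming demographic variables such as Sex were retained in pers_vsum_slim), violent-crime prevalence by sex could be estimated as:

pers_prev_des %>%
  group_by(Sex) %>%                               # subgroup of interest
  summarize(Violent_Prev = survey_mean(Violent_Ind * 100))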
13.7 Statistical testing

For any of the types of estimates discussed, we can also perform statistical testing. For example, we could test whether property victimization rates differ between properties that are owned versus rented. First, we calculate the point estimates.

prop_tenure <- hh_des %>%
  group_by(Tenure) %>%
  summarize(
    Property_Rate = survey_mean(Property * ADJINC_WT * 1000,
                                na.rm = TRUE, vartype = "ci")
  )

prop_tenure

## # A tibble: 3 × 4
##   Tenure Property_Rate Property_Rate_low Property_Rate_upp
##   <fct>          <dbl>             <dbl>             <dbl>
## 1 Owned           68.2              64.3              72.1
## 2 Rented         130.              123.              137.
## 3 <NA>           NaN               NaN               NaN

The property victimization rate for rented households is 129.8 per 1,000 households, while the rate for owned households is 68.2. These appear very different, especially given the non-overlapping confidence intervals. However, survey data are inherently non-independent, so statistical testing cannot be done by comparing confidence intervals. To conduct the statistical test, we first create the variable to compare, which incorporates the adjusted incident weight (ADJINC_WT); the test can then be conducted as discussed in Chapter 6.

prop_tenure_test <- hh_des %>%
  mutate(
    Prop_Adj = Property * ADJINC_WT * 1000
  ) %>%
  svyttest(
    formula = Prop_Adj ~ Tenure,
    design = .,
    na.rm = TRUE
  ) %>%
  broom::tidy()

prop_tenure_test

## # A tibble: 1 × 8
##   estimate statistic  p.value parameter conf.low conf.high method
##      <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl> <chr>
## 1     61.6      16.0 8.91e-36       169     54.0      69.2 Design-based…
## # ℹ 1 more variable: alternative <chr>

The output of the statistical test shows the same difference of 61.6 between the property victimization rates of renters and owners, and the test is highly significant with a p-value below 0.0001.

13.8 Exercises

1. What proportion of completed motor vehicle thefts are not reported to the police? Hint: Use the codebook to look at the definition of Type of Crime (V4529).

ans1 <- inc_des %>%
  filter(str_detect(V4529, "40|41")) %>%
  summarize(Pct = survey_mean(ReportPolice, na.rm = TRUE) * 100)

ans1

## # A tibble: 1 × 2
##     Pct Pct_se
##   <dbl>  <dbl>
## 1  76.9   2.60

2. How many violent crimes occur in each region?

inc_des %>%
  filter(Violent) %>%
  survey_count(Region)

## # A tibble: 4 × 3
##   Region           n    n_se
##   <fct>        <dbl>   <dbl>
## 1 Northeast  698406.  82419.
## 2 Midwest   1144407.  95860.
## 3 South     1394214. 107505.
## 4 West      1361278. 109479.

3. What is the property victimization rate among each income level?

hh_des %>%
  group_by(Income) %>%
  summarize(Property_Rate = survey_mean(Property * ADJINC_WT * 1000,
                                        na.rm = TRUE))

## # A tibble: 6 × 3
##   Income            Property_Rate Property_Rate_se
##   <fct>                     <dbl>            <dbl>
## 1 Less than $25,000         111.              4.97
## 2 $25,000-49,999             89.5             3.42
## 3 $50,000-99,999             87.8             3.30
## 4 $100,000-199,999           76.5             3.49
## 5 $200,000 or more           91.8             5.69
## 6 <NA>                      NaN              NaN

4. What is the difference in the violent victimization rate between males and females? Is it statistically different?

pers_des %>%
  group_by(Sex) %>%
  summarize(
    Violent_rate = survey_mean(Violent * ADJINC_WT * 1000, na.rm = TRUE)
  )

## # A tibble: 2 × 3
##   Sex    Violent_rate Violent_rate_se
##   <fct>         <dbl>           <dbl>
## 1 Female         15.5           0.873
## 2 Male           17.5           1.11

pers_des %>%
  mutate(
    Violent_Adj = Violent * ADJINC_WT * 1000
  ) %>%
  svyttest(
    formula = Violent_Adj ~ Sex,
    design = .,
    na.rm = TRUE
  ) %>%
  broom::tidy()

## Warning in summary.glm(g): observations with zero weight not used for
## calculating dispersion
## Warning in summary.glm(glm.object): observations with zero weight not
## used for calculating dispersion
## # A tibble: 1 × 8
##   estimate statistic p.value parameter conf.low conf.high method
##      <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl> <chr>
## 1     1.93      1.43   0.156       169   -0.745      4.61 Design-based …
## # ℹ 1 more variable: alternative <chr>

References

Bureau of Justice Statistics. 2017. “National Crime Victimization Survey, 2016: Technical Documentation.” https://bjs.ojp.gov/sites/g/files/xyckuh236/files/media/document/ncvstd16.pdf.

Shook-Sa, Bonnie, G. Lance Couzens, and Marcus Berzofsky. 2015. “Users’ Guide to the National Crime Victimization Survey (NCVS) Direct Variance Estimation.” Bureau of Justice Statistics. https://bjs.ojp.gov/sites/g/files/xyckuh236/files/media/document/ncvs_variance_user_guide_11.06.14.pdf.

United States. Bureau of Justice Statistics. 2022. “National Crime Victimization Survey, [United States], 2021.” Inter-university Consortium for Political and Social Research [distributor]. https://doi.org/10.3886/ICPSR38429.v1.
Footnotes:
1. https://www.icpsr.umich.edu/web/ICPSR/series/95
2. https://www.icpsr.umich.edu/web/NACJD/studies/38429
3. BJS publishes victimization rates per 1,000, which are also presented in these examples.

Chapter 14 AmericasBarometer Vignette

Prerequisites

For this chapter, load the following packages:

library(tidyverse)
library(survey)
library(srvyr)
library(sf)
library(rnaturalearth)
library(rnaturalearthdata)
library(gt)
library(ggpattern)

In this vignette, we use a subset of data from the 2021 AmericasBarometer survey. Download the raw files, available on the LAPOP website. We work with version 1.2 of the data, and there are separate files for each of the 22 countries. To read all of the files into R while ignoring the Stata labels, we recommend running code like this:

# {haven} provides read_stata(), zap_labels(), and zap_label();
# {here} builds the file paths
library(haven)
library(here)

stata_files <- list.files(here("RawData", "LAPOP_2021"), "*.dta")

read_stata_unlabeled <- function(file) {
  read_stata(file) %>%
    zap_labels() %>%
    zap_label()
}

ambarom_in <- here("RawData", "LAPOP_2021", stata_files) %>%
  map_df(read_stata_unlabeled) %>%
  select(pais, strata, upm, weight1500, core_a_core_b, q2, q1tb,
         covid2at, a4, idio2, idio2cov, it1, jc13, m1, mil10a, mil10e,
         ccch1, ccch3, ccus1, ccus3, edr, ocup4a, q14, q11n, q12c,
         q12bn, starts_with("covidedu1"), gi0n, r15, r18n, r18)

The code above reads all of the .dta files and combines them into one tibble.

14.1 Introduction

The AmericasBarometer surveys, conducted by the LAPOP Lab (LAPOP 2023b), are public opinion surveys of the Americas focused on democracy. The study was launched in 2004/2005 with 11 countries. Though the set of countries has grown and fluctuated over time, the AmericasBarometer maintains a consistent methodology across many countries. In 2021, the study included 22 countries, ranging from Canada in the north to Chile and Argentina in the south (LAPOP 2023a).

Historically, surveys were administered through in-person household interviews, but the COVID-19 pandemic changed the study significantly. Now, random-digit dialing (RDD) of mobile phones is used in all countries except the United States and Canada (LAPOP 2021c). In Canada, LAPOP collaborated with the Environics Institute to collect data from a panel of Canadians using a web survey (LAPOP 2021a). In the United States, YouGov conducted a web survey among its panelists on behalf of LAPOP (LAPOP 2021b).

The survey includes a core set of questions for all countries, but not every question is asked in each country. Additionally, some questions are only posed to half of the respondents in a country, with different sections randomized to respondents (LAPOP 2021d).

14.2 Data structure

Each country and year has its own file available in Stata format (.dta). In this vignette, we download and combine all the data from the 22 participating countries in 2021. We subset the data to a smaller set of columns, as noted in the prerequisites box.
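Before going further, it can be worth confirming that the combined tibble really contains all 22 countries. A quick sketch (not part of the original workflow):

ambarom_in %>%
  distinct(pais) %>%   # one row per country code
  nrow()               # should return 22 if every file was read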
Review the core questionnaire to understand the common variables across the countries (LAPOP 2021d).

14.3 Preparing files

Many of the variables are coded as numeric and do not have intuitive variable names, so the next step is to create derived variables and wrangle the data for analysis. Using the core questionnaire as a codebook, we reference the factor descriptions to create derived variables with informative names:

ambarom <- ambarom_in %>%
  mutate(
    Country = factor(
      case_match(pais,
                 1 ~ "Mexico",
                 2 ~ "Guatemala",
                 3 ~ "El Salvador",
                 4 ~ "Honduras",
                 5 ~ "Nicaragua",
                 6 ~ "Costa Rica",
                 7 ~ "Panama",
                 8 ~ "Colombia",
                 9 ~ "Ecuador",
                 10 ~ "Bolivia",
                 11 ~ "Peru",
                 12 ~ "Paraguay",
                 13 ~ "Chile",
                 14 ~ "Uruguay",
                 15 ~ "Brazil",
                 17 ~ "Argentina",
                 21 ~ "Dominican Republic",
                 22 ~ "Haiti",
                 23 ~ "Jamaica",
                 24 ~ "Guyana",
                 40 ~ "United States",
                 41 ~ "Canada")),
    CovidWorry = fct_reorder(
      case_match(covid2at,
                 1 ~ "Very worried",
                 2 ~ "Somewhat worried",
                 3 ~ "A little worried",
                 4 ~ "Not worried at all"),
      covid2at,
      .na_rm = FALSE)
  ) %>%
  rename(
    Educ_NotInSchool = covidedu1_1,
    Educ_NormalSchool = covidedu1_2,
    Educ_VirtualSchool = covidedu1_3,
    Educ_Hybrid = covidedu1_4,
    Educ_NoSchool = covidedu1_5,
    BroadbandInternet = r18n,
    Internet = r18
  )

At this point, it is a good time to check the cross-tabs between the original and newly derived variables. These tables help us confirm that we have correctly matched the numeric data from the original dataset to the renamed factor data in the new dataset. For instance, let’s check the original variable pais and the derived variable Country. We can consult the questionnaire or codebook to confirm that Argentina is coded as 17, Bolivia as 10, and so on. Similarly, for CovidWorry and covid2at, we can verify that “Very worried” is coded as 1, and so on for the other levels.

ambarom %>%
  count(Country, pais) %>%
  print(n = 22)

## # A tibble: 22 × 3
##    Country             pais     n
##    <fct>              <dbl> <int>
##  1 Argentina             17  3011
##  2 Bolivia               10  3002
##  3 Brazil                15  3016
##  4 Canada                41  2201
##  5 Chile                 13  2954
##  6 Colombia               8  2993
##  7 Costa Rica             6  2977
##  8 Dominican Republic    21  3000
##  9 Ecuador                9  3005
## 10 El Salvador            3  3245
## 11 Guatemala              2  3000
## 12 Guyana                24  3011
## 13 Haiti                 22  3088
## 14 Honduras               4  2999
## 15 Jamaica               23  3121
## 16 Mexico                 1  2998
## 17 Nicaragua              5  2997
## 18 Panama                 7  3183
## 19 Paraguay              12  3004
## 20 Peru                  11  3038
## 21 United States         40  1500
## 22 Uruguay               14  3009

ambarom %>%
  count(CovidWorry, covid2at)

## # A tibble: 5 × 3
##   CovidWorry         covid2at     n
##   <fct>                 <dbl> <int>
## 1 Very worried              1 24327
## 2 Somewhat worried          2 13233
## 3 A little worried          3 11478
## 4 Not worried at all        4  8628
## 5 <NA>                     NA  6686

14.4 Survey design objects

The technical report is the best reference for understanding how to specify the sampling design in R (LAPOP 2021c). The data include two weights: wt and weight1500. The first weight variable is specific to each country and sums to the sample size, but it is calibrated to reflect each country’s demographics. The second weight variable sums to 1500 for each country and is recommended for multi-country analyses. Although not explicitly stated in the documentation, the Stata syntax example (svyset upm [pw=weight1500], strata(strata)) indicates that the variable upm is a clustering variable and strata is the strata variable.
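As a sanity check on that description, we can compute the weighted count per country; each total should be approximately 1,500. This is a quick sketch, not part of the original workflow:

ambarom %>%
  count(Country, wt = weight1500)  # weighted counts should be ~1500 each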
With the weights and design variables identified, the design object is created in R as follows:

ambarom_des <- ambarom %>%
  as_survey_design(
    ids = upm,
    strata = strata,
    weight = weight1500
  )

One interesting thing to note is that these weight variables can provide estimates for comparing countries but not for multi-country estimates. The reason is that the weights do not account for the different sizes of countries. For example, Canada has about 10% of the population of the United States, but an estimate that uses records from both countries would weigh them equally.

14.5 Calculating estimates

When calculating estimates from the data, we use the survey design object ambarom_des and then apply the survey_mean() function. The next sections walk through a few examples.

14.5.1 Example: Worried about COVID

This survey was administered between March and August of 2021, with the specific timing varying by country. Given the state of the pandemic at that time, several questions about COVID were included. The first question about COVID asked:

How worried are you about the possibility that you or someone in your household will get sick from coronavirus in the next 3 months?
  - Very worried
  - Somewhat worried
  - A little worried
  - Not worried at all

If we are interested in those who are very worried or somewhat worried, we can create a new variable (CovidWorry_bin) that groups levels of the original question using the fct_collapse() function from the {forcats} package. We then use the survey_count() function to understand how responses are distributed across each category of the original variable (CovidWorry) and the new variable (CovidWorry_bin).

covid_worry_collapse <- ambarom_des %>%
  mutate(CovidWorry_bin = fct_collapse(
    CovidWorry,
    WorriedHi = c("Very worried", "Somewhat worried"),
    WorriedLo = c("A little worried", "Not worried at all")
  ))

covid_worry_collapse %>%
  survey_count(CovidWorry_bin, CovidWorry)

## # A tibble: 5 × 4
##   CovidWorry_bin CovidWorry              n  n_se
##   <fct>          <fct>               <dbl> <dbl>
## 1 WorriedHi      Very worried       12369.  83.6
## 2 WorriedHi      Somewhat worried    6378.  63.4
## 3 WorriedLo      A little worried    5896.  62.6
## 4 WorriedLo      Not worried at all  4840.  59.7
## 5 <NA>           <NA>                3518.  42.2

With this new variable, we can now use survey_mean() to calculate the percentage of people in each country who are either very or somewhat worried about COVID. There are missing data, as indicated in the survey_count() output above, so we need to use na.rm = TRUE in the survey_mean() function to handle the missing values.

covid_worry_country_ests <- covid_worry_collapse %>%
  group_by(Country) %>%
  summarize(p = survey_mean(CovidWorry_bin == "WorriedHi",
                            na.rm = TRUE) * 100)

covid_worry_country_ests

## # A tibble: 22 × 3
##    Country                p  p_se
##    <fct>              <dbl> <dbl>
##  1 Argentina           65.8 1.08
##  2 Bolivia             71.6 0.960
##  3 Brazil              83.5 0.962
##  4 Canada              48.9 1.34
##  5 Chile               81.8 0.828
##  6 Colombia            67.9 1.12
##  7 Costa Rica          72.6 0.952
##  8 Dominican Republic  50.1 1.13
##  9 Ecuador             71.7 0.967
## 10 El Salvador         52.5 1.02
## # ℹ 12 more rows
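Before building a formal table, a one-line sketch (not in the original analysis) can quickly show where worry was highest:

covid_worry_country_ests %>%
  arrange(desc(p))  # countries sorted from most to least worried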
To view the results for all countries, we can use the {gt} package to create Table 14.1.

covid_worry_country_ests_gt <- covid_worry_country_ests %>%
  gt(rowname_col = "Country") %>%
  cols_label(p = "Percent", p_se = "SE") %>%
  fmt_number(decimals = 1) %>%
  tab_source_note("AmericasBarometer Surveys, 2021")

covid_worry_country_ests_gt
TABLE 14.1: Percentage worried about the possibility that they or someone in their household will get sick from coronavirus in the next 3 months

                      Percent    SE
Argentina                65.8   1.1
Bolivia                  71.6   1.0
Brazil                   83.5   1.0
Canada                   48.9   1.3
Chile                    81.8   0.8
Colombia                 67.9   1.1
Costa Rica               72.6   1.0
Dominican Republic       50.1   1.1
Ecuador                  71.7   1.0
El Salvador              52.5   1.0
Guatemala                69.3   1.0
Guyana                   60.0   1.6
Haiti                    54.4   1.8
Honduras                 64.6   1.1
Jamaica                  28.4   0.9
Mexico                   63.6   1.0
Nicaragua                80.0   1.0
Panama                   70.2   1.0
Paraguay                 61.5   1.1
Peru                     77.1   2.5
United States            46.6   1.7
Uruguay                  60.9   1.1

AmericasBarometer Surveys, 2021

14.5.2 Example: Education affected by COVID

Respondents were also asked a question about how the pandemic affected education. This question was asked of households with children under the age of 13, and respondents could select more than one option, as follows:

Did any of these children have their school education affected due to the pandemic?
  - No, because they are not yet school age or because they do not attend school for another reason
  - No, their classes continued normally
  - Yes, they went to virtual or remote classes
  - Yes, they switched to a combination of virtual and in-person classes
  - Yes, they cut all ties with the school

Working with multiple-choice questions can be both challenging and interesting. Let’s walk through how to analyze this question. If we are interested in the impact on education, we should focus on the data from those whose children are attending school. This means we need to exclude those who selected the first response option: “No, because they are not yet school age or because they do not attend school for another reason.” To do this, we use the Educ_NotInSchool variable in the dataset, which has values of 0 and 1. A value of 1 indicates that the respondent chose the first response option (none of the children are in school), and a value of 0 means that at least one of their children is in school.
By filtering the data to those with a value of 0 (that is, those with at least one child in school), we can consider only respondents with at least one child attending school. Now, let’s review the data for those who selected one of the next three response options:

  - No, their classes continued normally: Educ_NormalSchool
  - Yes, they went to virtual or remote classes: Educ_VirtualSchool
  - Yes, they switched to a combination of virtual and in-person classes: Educ_Hybrid

The unweighted cross-tab for these responses is included below. It reveals a wide range of impacts, where many combinations of effects on education are possible.

ambarom %>%
  filter(Educ_NotInSchool == 0) %>%
  count(Educ_NormalSchool, Educ_VirtualSchool, Educ_Hybrid)

## # A tibble: 8 × 4
##   Educ_NormalSchool Educ_VirtualSchool Educ_Hybrid     n
##               <dbl>              <dbl>       <dbl> <int>
## 1                 0                  0           0   861
## 2                 0                  0           1  1192
## 3                 0                  1           0  7554
## 4                 0                  1           1   280
## 5                 1                  0           0   833
## 6                 1                  0           1    18
## 7                 1                  1           0    72
## 8                 1                  1           1     7

In reviewing the survey question, we might be interested in knowing the answers to the following:

  - What percentage of households indicated that school continued as normal with no virtual or hybrid option?
  - What percentage of households indicated that the education medium was changed to either virtual or hybrid?
  - What percentage of households indicated that they cut ties with their school?

To find the answers, we create indicators for the first two questions, make national estimates for all three questions, and then construct a summary table for easy viewing. First, we create and inspect the indicators and their distributions using survey_count().

ambarom_des_educ <- ambarom_des %>%
  filter(Educ_NotInSchool == 0) %>%
  mutate(
    Educ_OnlyNormal = (Educ_NormalSchool == 1 &
                         Educ_VirtualSchool == 0 &
                         Educ_Hybrid == 0),
    Educ_MediumChange = (Educ_VirtualSchool == 1 | Educ_Hybrid == 1)
  )

ambarom_des_educ %>%
  survey_count(Educ_OnlyNormal, Educ_NormalSchool,
               Educ_VirtualSchool, Educ_Hybrid)

## # A tibble: 8 × 6
##   Educ_OnlyNormal Educ_NormalSchool Educ_VirtualSchool Educ_Hybrid
##   <lgl>                       <dbl>              <dbl>       <dbl>
## 1 FALSE                           0                  0           0
## 2 FALSE                           0                  0           1
## 3 FALSE                           0                  1           0
## 4 FALSE                           0                  1           1
## 5 FALSE                           1                  0           1
## 6 FALSE                           1                  1           0
## 7 FALSE                           1                  1           1
## 8 TRUE                            1                  0           0
## # ℹ 2 more variables: n <dbl>, n_se <dbl>

ambarom_des_educ %>%
  survey_count(Educ_MediumChange, Educ_VirtualSchool, Educ_Hybrid)

## # A tibble: 4 × 5
##   Educ_MediumChange Educ_VirtualSchool Educ_Hybrid     n  n_se
##   <lgl>                          <dbl>       <dbl> <dbl> <dbl>
## 1 FALSE                              0           0  880. 26.1
## 2 TRUE                               0           1  561. 19.2
## 3 TRUE                               1           0 3812. 49.4
## 4 TRUE                               1           1  136.  9.86

Next, we group the data by country and calculate the population estimates for our three questions.
covid_educ_ests <- ambarom_des_educ %>%
  group_by(Country) %>%
  summarize(
    p_onlynormal = survey_mean(Educ_OnlyNormal, na.rm = TRUE) * 100,
    p_mediumchange = survey_mean(Educ_MediumChange, na.rm = TRUE) * 100,
    p_noschool = survey_mean(Educ_NoSchool, na.rm = TRUE) * 100
  )

covid_educ_ests

## # A tibble: 16 × 7
##    Country p_onlynormal p_onlynormal_se p_mediumchange p_mediumchange_se
##    <fct>          <dbl>           <dbl>          <dbl>             <dbl>
##  1 Argent…        5.39            1.14            87.1              1.72
##  2 Brazil         4.28            1.17            81.5              2.33
##  3 Chile          0.715           0.267           96.2              0.962
##  4 Colomb…        2.84            0.727           90.3              1.40
##  5 Domini…        3.75            0.793           87.4              1.45
##  6 Ecuador        5.18            0.963           87.5              1.39
##  7 El Sal…        2.92            0.680           85.8              1.53
##  8 Guatem…        3.00            0.727           82.2              1.73
##  9 Guyana         3.34            0.702           85.3              1.67
## 10 Haiti         81.1             2.25             7.25             1.48
## 11 Hondur…        3.68            0.882           80.7              1.72
## 12 Jamaica        5.42            0.950           88.1              1.43
## 13 Panama         7.20            1.18            89.4              1.42
## 14 Paragu…        4.66            0.939           90.7              1.37
## 15 Peru           2.04            0.604           91.8              1.20
## 16 Uruguay        8.60            1.40            84.3              2.02
## # ℹ 2 more variables: p_noschool <dbl>, p_noschool_se <dbl>

Finally, to view the results for all countries, we can use the {gt} package to construct Table 14.2.

covid_educ_ests_gt <- covid_educ_ests %>%
  gt(rowname_col = "Country") %>%
  cols_label(
    p_onlynormal = "%",
    p_onlynormal_se = "SE",
    p_mediumchange = "%",
    p_mediumchange_se = "SE",
    p_noschool = "%",
    p_noschool_se = "SE"
  ) %>%
  tab_spanner(label = "Normal school only",
              columns = c("p_onlynormal", "p_onlynormal_se")) %>%
  tab_spanner(label = "Medium change",
              columns = c("p_mediumchange", "p_mediumchange_se")) %>%
  tab_spanner(label = "Cut ties with school",
              columns = c("p_noschool", "p_noschool_se")) %>%
  fmt_number(decimals = 1) %>%
  tab_source_note("AmericasBarometer Surveys, 2021")

covid_educ_ests_gt
TABLE 14.2: Impact on education in households with children under the age of 13 who had children that would generally attend school

                     Normal school only    Medium change    Cut ties with school
                          %     SE            %     SE            %     SE
Argentina               5.4    1.1          87.1   1.7           9.9   1.6
Brazil                  4.3    1.2          81.5   2.3          22.1   2.5
Chile                   0.7    0.3          96.2   1.0           4.0   1.0
Colombia                2.8    0.7          90.3   1.4           7.5   1.3
Dominican Republic      3.8    0.8          87.4   1.5          10.5   1.4
Ecuador                 5.2    1.0          87.5   1.4           7.9   1.1
El Salvador             2.9    0.7          85.8   1.5          11.8   1.4
Guatemala               3.0    0.7          82.2   1.7          17.7   1.8
Guyana                  3.3    0.7          85.3   1.7          13.0   1.6
Haiti                  81.1    2.3           7.2   1.5          11.7   1.8
Honduras                3.7    0.9          80.7   1.7          16.9   1.6
Jamaica                 5.4    0.9          88.1   1.4           7.5   1.2
Panama                  7.2    1.2          89.4   1.4           3.8   0.9
Paraguay                4.7    0.9          90.7   1.4           6.4   1.2
Peru                    2.0    0.6          91.8   1.2           6.8   1.1
Uruguay                 8.6    1.4          84.3   2.0           8.0   1.6

AmericasBarometer Surveys, 2021

In the countries where this question was asked, many households experienced a change in their child’s education medium. In Haiti, however, only 7.2% of households with children in school switched to virtual or hybrid learning.

14.6 Mapping survey data

While the table effectively presents the data, a map could also be insightful. To generate maps of the countries, we can use the {rnaturalearth} package and subset North and South America with the ne_countries() function. The function returns an sf (simple features) object with many columns, but most importantly, sovereignt (sovereignty), geounit (country or territory), and geometry (the shape). As an example of the difference between sovereignty and country/territory, the United States, Puerto Rico, and the US Virgin Islands are all separate units with the same sovereignty. A map without data is plotted in Figure 14.1.

country_shape <- ne_countries(
  scale = "medium",
  returnclass = "sf",
  continent = c("North America", "South America")
)

country_shape %>%
  ggplot() +
  geom_sf()

FIGURE 14.1: Map of North and South America

The map in Figure 14.1 appears very wide due to the Aleutian Islands in Alaska extending into the Eastern Hemisphere. We can crop the shapefile to include only the Western Hemisphere, which removes some of the trailing islands of Alaska.

country_shape_crop <- country_shape %>%
  st_crop(c(xmin = -180, xmax = 0, ymin = -90, ymax = 90))
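To confirm the crop behaves as expected, we could plot the cropped object the same way. A quick sketch, not shown in the original text:

country_shape_crop %>%
  ggplot() +        # the Aleutian tail should now be gone
  geom_sf()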
Now that we have the necessary shapefiles, our next step is to match our survey data to the map. Countries can be named differently (e.g., “U.S.”, “U.S.A.”, “United States”). To make sure we can visualize our survey data on the map, we need to match the country names in both the survey data and the map data. To do this, we can use the anti_join() function to identify the countries in the survey data that aren’t in the map data. For example, as shown below, the United States is referred to as “United States” in the survey data but “United States of America” in the map data. Table 14.3 shows the countries in the survey data but not the map data, and Table 14.4 shows the countries in the map data but not the survey data.

survey_country_list <- ambarom %>%
  distinct(Country)

survey_country_list_gt <- survey_country_list %>%
  anti_join(country_shape_crop, by = c("Country" = "geounit")) %>%
  gt()

survey_country_list_gt
border-left-width: 2px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 2px; border-right-color: #D3D3D3; } #zpnruhcqur .gt_footnote { margin: 0px; font-size: 90%; padding-top: 4px; padding-bottom: 4px; padding-left: 5px; padding-right: 5px; } #zpnruhcqur .gt_sourcenotes { color: #333333; background-color: #FFFFFF; border-bottom-style: none; border-bottom-width: 2px; border-bottom-color: #D3D3D3; border-left-style: none; border-left-width: 2px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 2px; border-right-color: #D3D3D3; } #zpnruhcqur .gt_sourcenote { font-size: 90%; padding-top: 4px; padding-bottom: 4px; padding-left: 5px; padding-right: 5px; } #zpnruhcqur .gt_left { text-align: left; } #zpnruhcqur .gt_center { text-align: center; } #zpnruhcqur .gt_right { text-align: right; font-variant-numeric: tabular-nums; } #zpnruhcqur .gt_font_normal { font-weight: normal; } #zpnruhcqur .gt_font_bold { font-weight: bold; } #zpnruhcqur .gt_font_italic { font-style: italic; } #zpnruhcqur .gt_super { font-size: 65%; } #zpnruhcqur .gt_footnote_marks { font-size: 75%; vertical-align: 0.4em; position: initial; } #zpnruhcqur .gt_asterisk { font-size: 100%; vertical-align: 0; } #zpnruhcqur .gt_indent_1 { text-indent: 5px; } #zpnruhcqur .gt_indent_2 { text-indent: 10px; } #zpnruhcqur .gt_indent_3 { text-indent: 15px; } #zpnruhcqur .gt_indent_4 { text-indent: 20px; } #zpnruhcqur .gt_indent_5 { text-indent: 25px; } TABLE 14.3: Countries in the survey data but not the map data Country United States map_country_list_gt<-country_shape_crop %>% as_tibble() %>% select(geounit, sovereignt) %>% anti_join(survey_country_list, by = c("geounit" = "Country")) %>% arrange(geounit) %>% gt() map_country_list_gt #sgxskozkog table { font-family: system-ui, 'Segoe UI', Roboto, Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol', 'Noto Color Emoji'; -webkit-font-smoothing: antialiased; -moz-osx-font-smoothing: grayscale; } #sgxskozkog thead, #sgxskozkog tbody, #sgxskozkog tfoot, #sgxskozkog tr, #sgxskozkog td, #sgxskozkog th { border-style: none; } #sgxskozkog p { margin: 0; padding: 0; } #sgxskozkog .gt_table { display: table; border-collapse: collapse; line-height: normal; margin-left: auto; margin-right: auto; color: #333333; font-size: 16px; font-weight: normal; font-style: normal; background-color: #FFFFFF; width: auto; border-top-style: solid; border-top-width: 2px; border-top-color: #A8A8A8; border-right-style: none; border-right-width: 2px; border-right-color: #D3D3D3; border-bottom-style: solid; border-bottom-width: 2px; border-bottom-color: #A8A8A8; border-left-style: none; border-left-width: 2px; border-left-color: #D3D3D3; } #sgxskozkog .gt_caption { padding-top: 4px; padding-bottom: 4px; } #sgxskozkog .gt_title { color: #333333; font-size: 125%; font-weight: initial; padding-top: 4px; padding-bottom: 4px; padding-left: 5px; padding-right: 5px; border-bottom-color: #FFFFFF; border-bottom-width: 0; } #sgxskozkog .gt_subtitle { color: #333333; font-size: 85%; font-weight: initial; padding-top: 3px; padding-bottom: 5px; padding-left: 5px; padding-right: 5px; border-top-color: #FFFFFF; border-top-width: 0; } #sgxskozkog .gt_heading { background-color: #FFFFFF; text-align: center; border-bottom-color: #FFFFFF; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; } #sgxskozkog .gt_bottom_border { border-bottom-style: solid; 
border-bottom-width: 2px; border-bottom-color: #D3D3D3; } #sgxskozkog .gt_col_headings { border-top-style: solid; border-top-width: 2px; border-top-color: #D3D3D3; border-bottom-style: solid; border-bottom-width: 2px; border-bottom-color: #D3D3D3; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; } #sgxskozkog .gt_col_heading { color: #333333; background-color: #FFFFFF; font-size: 100%; font-weight: normal; text-transform: inherit; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 6px; padding-left: 5px; padding-right: 5px; overflow-x: hidden; } #sgxskozkog .gt_column_spanner_outer { color: #333333; background-color: #FFFFFF; font-size: 100%; font-weight: normal; text-transform: inherit; padding-top: 0; padding-bottom: 0; padding-left: 4px; padding-right: 4px; } #sgxskozkog .gt_column_spanner_outer:first-child { padding-left: 0; } #sgxskozkog .gt_column_spanner_outer:last-child { padding-right: 0; } #sgxskozkog .gt_column_spanner { border-bottom-style: solid; border-bottom-width: 2px; border-bottom-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 5px; overflow-x: hidden; display: inline-block; width: 100%; } #sgxskozkog .gt_spanner_row { border-bottom-style: hidden; } #sgxskozkog .gt_group_heading { padding-top: 8px; padding-bottom: 8px; padding-left: 5px; padding-right: 5px; color: #333333; background-color: #FFFFFF; font-size: 100%; font-weight: initial; text-transform: inherit; border-top-style: solid; border-top-width: 2px; border-top-color: #D3D3D3; border-bottom-style: solid; border-bottom-width: 2px; border-bottom-color: #D3D3D3; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: middle; text-align: left; } #sgxskozkog .gt_empty_group_heading { padding: 0.5px; color: #333333; background-color: #FFFFFF; font-size: 100%; font-weight: initial; border-top-style: solid; border-top-width: 2px; border-top-color: #D3D3D3; border-bottom-style: solid; border-bottom-width: 2px; border-bottom-color: #D3D3D3; vertical-align: middle; } #sgxskozkog .gt_from_md > :first-child { margin-top: 0; } #sgxskozkog .gt_from_md > :last-child { margin-bottom: 0; } #sgxskozkog .gt_row { padding-top: 8px; padding-bottom: 8px; padding-left: 5px; padding-right: 5px; margin: 10px; border-top-style: solid; border-top-width: 1px; border-top-color: #D3D3D3; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: middle; overflow-x: hidden; } #sgxskozkog .gt_stub { color: #333333; background-color: #FFFFFF; font-size: 100%; font-weight: initial; text-transform: inherit; border-right-style: solid; border-right-width: 2px; border-right-color: #D3D3D3; padding-left: 5px; padding-right: 5px; } #sgxskozkog .gt_stub_row_group { color: #333333; background-color: #FFFFFF; font-size: 100%; font-weight: initial; text-transform: inherit; border-right-style: solid; border-right-width: 2px; border-right-color: #D3D3D3; padding-left: 5px; padding-right: 5px; vertical-align: top; } #sgxskozkog .gt_row_group_first td { border-top-width: 2px; } #sgxskozkog .gt_row_group_first th { border-top-width: 2px; } 
#sgxskozkog .gt_summary_row { color: #333333; background-color: #FFFFFF; text-transform: inherit; padding-top: 8px; padding-bottom: 8px; padding-left: 5px; padding-right: 5px; } #sgxskozkog .gt_first_summary_row { border-top-style: solid; border-top-color: #D3D3D3; } #sgxskozkog .gt_first_summary_row.thick { border-top-width: 2px; } #sgxskozkog .gt_last_summary_row { padding-top: 8px; padding-bottom: 8px; padding-left: 5px; padding-right: 5px; border-bottom-style: solid; border-bottom-width: 2px; border-bottom-color: #D3D3D3; } #sgxskozkog .gt_grand_summary_row { color: #333333; background-color: #FFFFFF; text-transform: inherit; padding-top: 8px; padding-bottom: 8px; padding-left: 5px; padding-right: 5px; } #sgxskozkog .gt_first_grand_summary_row { padding-top: 8px; padding-bottom: 8px; padding-left: 5px; padding-right: 5px; border-top-style: double; border-top-width: 6px; border-top-color: #D3D3D3; } #sgxskozkog .gt_last_grand_summary_row_top { padding-top: 8px; padding-bottom: 8px; padding-left: 5px; padding-right: 5px; border-bottom-style: double; border-bottom-width: 6px; border-bottom-color: #D3D3D3; } #sgxskozkog .gt_striped { background-color: rgba(128, 128, 128, 0.05); } #sgxskozkog .gt_table_body { border-top-style: solid; border-top-width: 2px; border-top-color: #D3D3D3; border-bottom-style: solid; border-bottom-width: 2px; border-bottom-color: #D3D3D3; } #sgxskozkog .gt_footnotes { color: #333333; background-color: #FFFFFF; border-bottom-style: none; border-bottom-width: 2px; border-bottom-color: #D3D3D3; border-left-style: none; border-left-width: 2px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 2px; border-right-color: #D3D3D3; } #sgxskozkog .gt_footnote { margin: 0px; font-size: 90%; padding-top: 4px; padding-bottom: 4px; padding-left: 5px; padding-right: 5px; } #sgxskozkog .gt_sourcenotes { color: #333333; background-color: #FFFFFF; border-bottom-style: none; border-bottom-width: 2px; border-bottom-color: #D3D3D3; border-left-style: none; border-left-width: 2px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 2px; border-right-color: #D3D3D3; } #sgxskozkog .gt_sourcenote { font-size: 90%; padding-top: 4px; padding-bottom: 4px; padding-left: 5px; padding-right: 5px; } #sgxskozkog .gt_left { text-align: left; } #sgxskozkog .gt_center { text-align: center; } #sgxskozkog .gt_right { text-align: right; font-variant-numeric: tabular-nums; } #sgxskozkog .gt_font_normal { font-weight: normal; } #sgxskozkog .gt_font_bold { font-weight: bold; } #sgxskozkog .gt_font_italic { font-style: italic; } #sgxskozkog .gt_super { font-size: 65%; } #sgxskozkog .gt_footnote_marks { font-size: 75%; vertical-align: 0.4em; position: initial; } #sgxskozkog .gt_asterisk { font-size: 100%; vertical-align: 0; } #sgxskozkog .gt_indent_1 { text-indent: 5px; } #sgxskozkog .gt_indent_2 { text-indent: 10px; } #sgxskozkog .gt_indent_3 { text-indent: 15px; } #sgxskozkog .gt_indent_4 { text-indent: 20px; } #sgxskozkog .gt_indent_5 { text-indent: 25px; } TABLE 14.4: Countries in the map data but not the survey data geounit sovereignt Anguilla United Kingdom Antigua and Barbuda Antigua and Barbuda Aruba Netherlands Barbados Barbados Belize Belize Bermuda United Kingdom British Virgin Islands United Kingdom Cayman Islands United Kingdom Cuba Cuba Curaçao Netherlands Dominica Dominica Falkland Islands United Kingdom Greenland Denmark Grenada Grenada Montserrat United Kingdom Puerto Rico United States of America Saint Barthelemy France Saint Kitts and Nevis 
| Saint Lucia | Saint Lucia |
| Saint Martin | France |
| Saint Pierre and Miquelon | France |
| Saint Vincent and the Grenadines | Saint Vincent and the Grenadines |
| Sint Maarten | Netherlands |
| Suriname | Suriname |
| The Bahamas | The Bahamas |
| Trinidad and Tobago | Trinidad and Tobago |
| Turks and Caicos Islands | United Kingdom |
| United States Virgin Islands | United States of America |
| United States of America | United States of America |
| Venezuela | Venezuela |

There are several ways to fix the mismatched names for a successful join. The simplest solution is to rename the data in the shape object before merging. Since only one country name in the survey data differs from the map data, we rename the map data accordingly.

```r
country_shape_upd <- country_shape_crop %>%
  mutate(geounit = if_else(geounit == "United States of America",
                           "United States", geounit))
```

Now that the country names match, we can merge the survey and map data and then plot the results. We begin with the map file and merge it with the survey estimates generated in Section 14.5 (covid_worry_country_ests and covid_educ_ests). We use the tidyverse function full_join(), which joins the rows in the map data and the survey estimates based on the columns geounit and Country. A full join keeps all the rows from both datasets, matching rows when possible. For any rows without matches, the function fills in an NA for the missing value.

```r
covid_sf <- country_shape_upd %>%
  full_join(covid_worry_country_ests, by = c("geounit" = "Country")) %>%
  full_join(covid_educ_ests, by = c("geounit" = "Country"))
```

After the merge, we create two figures that display the population estimates for the percentage of people worried about COVID (Figure 14.2) and the percentage of households with at least one child participating in virtual or hybrid learning (Figure 14.3).

```r
ggplot() +
  geom_sf(data = covid_sf,
          aes(fill = p, geometry = geometry),
          color = "darkgray") +
  scale_fill_gradientn(
    guide = "colorbar",
    name = "Percent",
    labels = scales::comma,
    colors = c("#BFD7EA", "#087e8b", "#0B3954"),
    na.value = NA
  ) +
  geom_sf_pattern(
    data = filter(covid_sf, is.na(p)),
    pattern = "crosshatch",
    pattern_fill = "lightgray",
    pattern_color = "lightgray",
    fill = NA,
    color = "darkgray"
  ) +
  theme_minimal()
```

FIGURE 14.2: Percent of households worried someone in their household will get COVID-19 in the next 3 months by country

```r
ggplot() +
  geom_sf(
    data = covid_sf,
    aes(fill = p_mediumchange, geometry = geometry),
    color = "darkgray"
  ) +
  scale_fill_gradientn(
    guide = "colorbar",
    name = "Percent",
    labels = scales::comma,
    colors = c("#BFD7EA", "#087e8b", "#0B3954"),
    na.value = NA
  ) +
  geom_sf_pattern(
    data = filter(covid_sf, is.na(p_mediumchange)),
    pattern = "crosshatch",
    pattern_fill = "lightgray",
    pattern_color = "lightgray",
    fill = NA,
    color = "darkgray"
  ) +
  theme_minimal()
```

FIGURE 14.3: Percent of households who had at least one child participate in virtual or hybrid learning

In Figure 14.3, we observe missing data (represented by the crosshatch pattern) for Canada, Mexico, and the United States. The questionnaires indicate that these three countries did not include the education question in the survey.
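The NA values behind those crosshatched countries are exactly what full_join() produces for rows without a match on the other side. A small illustrative sketch (toy tables and made-up values, not from the book's analysis) makes the behavior concrete:

```r
library(dplyr)

# Toy stand-ins for the map and estimate tables (values are illustrative only)
map_toy <- tibble(geounit = c("Haiti", "Cuba"))
est_toy <- tibble(Country = c("Haiti", "Chile"), p = c(12, 63))

full_join(map_toy, est_toy, by = c("geounit" = "Country"))
## # A tibble: 3 × 2
##   geounit     p
##   <chr>   <dbl>
## 1 Haiti      12
## 2 Cuba       NA
## 3 Chile      63
```

Cuba appears only in the toy map table, so its estimate is filled with NA (in the real data, such a country is drawn with the crosshatch pattern); Chile appears only in the estimates, so in the real merge such a row would carry an empty geometry and not be drawn at all.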
To focus on countries with available data, we can remove North America from the map and show only Central and South America. We do this below by restricting the shape files to Latin America and the Caribbean, as depicted in Figure 14.4.

```r
covid_c_s <- covid_sf %>%
  filter(region_wb == "Latin America & Caribbean")

ggplot() +
  geom_sf(
    data = covid_c_s,
    aes(fill = p_mediumchange, geometry = geometry),
    color = "darkgray"
  ) +
  scale_fill_gradientn(
    guide = "colorbar",
    name = "Percent",
    labels = scales::comma,
    colors = c("#BFD7EA", "#087e8b", "#0B3954"),
    na.value = NA
  ) +
  geom_sf_pattern(
    data = filter(covid_c_s, is.na(p_mediumchange)),
    pattern = "crosshatch",
    pattern_fill = "lightgray",
    pattern_color = "lightgray",
    fill = NA,
    color = "darkgray"
  ) +
  theme_minimal()
```

FIGURE 14.4: Percent of households who had at least one child participate in virtual or hybrid learning, Central and South America

In Figure 14.4, we can see that most countries with available data have similar percentages (reflected in their similar shades). However, Haiti stands out with a lighter shade, indicating a considerably lower percentage of households with at least one child participating in virtual or hybrid learning.

14.7 Exercises

1. Calculate the percentage of households with broadband internet and those with any internet at home, including from a phone or tablet. Hint: if you come across countries with 0% internet usage, you may want to filter by something first.

```r
int_ests <- ambarom_des %>%
  filter(!is.na(Internet) | !is.na(BroadbandInternet)) %>%
  group_by(Country) %>%
  summarize(
    p_broadband = survey_mean(BroadbandInternet, na.rm = TRUE) * 100,
    p_internet = survey_mean(Internet, na.rm = TRUE) * 100
  )

int_ests %>%
  print(n = 30)
## # A tibble: 20 × 5
##    Country           p_broadband p_broadband_se p_internet p_internet_se
##    <fct>                   <dbl>          <dbl>      <dbl>         <dbl>
##  1 Argentina                62.3          1.13        86.2         0.871
##  2 Bolivia                  41.4          1.03        77.2         0.956
##  3 Brazil                   68.3          1.25        88.9         0.879
##  4 Chile                    63.1          1.06        93.5         0.550
##  5 Colombia                 45.7          1.15        68.7         1.09
##  6 Costa Rica               49.6          1.07        84.4         0.798
##  7 Dominican Republ…        37.1          1.04        73.7         1.05
##  8 Ecuador                  59.7          1.06        79.9         0.898
##  9 El Salvador              30.2          0.906       63.9         0.985
## 10 Guatemala                33.4          0.993       61.5         1.08
## 11 Guyana                   63.7          1.09        86.8         0.781
## 12 Haiti                    11.8          0.791       58.5         1.25
## 13 Honduras                 28.2          0.968       60.7         1.11
## 14 Jamaica                  64.2          0.986       91.5         0.602
## 15 Mexico                   44.9          1.05        70.9         1.05
## 16 Nicaragua                39.1          1.12        76.3         1.09
## 17 Panama                   43.4          1.02        73.1         0.976
## 18 Paraguay                 33.3          0.971       72.9         1.01
## 19 Peru                     42.4          1.07        71.1         1.07
## 20 Uruguay                  62.7          1.08        90.6         0.699
```

2. Create a faceted map showing both broadband internet and any internet usage.

```r
internet_sf <- country_shape_upd %>%
  full_join(select(int_ests, p = p_internet, geounit = Country),
            by = "geounit") %>%
  mutate(Type = "Internet")

broadband_sf <- country_shape_upd %>%
  full_join(select(int_ests, p = p_broadband, geounit = Country),
            by = "geounit") %>%
  mutate(Type = "Broadband")

b_int_sf <- internet_sf %>%
  bind_rows(broadband_sf) %>%
  filter(region_wb == "Latin America & Caribbean")

b_int_sf %>%
  ggplot(aes(fill = p), color = "darkgray") +
  geom_sf() +
  facet_wrap(~Type) +
  scale_fill_gradientn(
    guide = "colorbar",
    name = "Percent",
    labels = scales::comma,
    colors = c("#BFD7EA", "#087E8B", "#0B3954"),
    na.value = NA
  ) +
  geom_sf_pattern(
    data = filter(b_int_sf, is.na(p)),
    pattern = "crosshatch",
    pattern_fill = "lightgray",
    pattern_color = "lightgray",
    fill = NA,
    color = "darkgray"
  ) +
  theme_minimal()
```

FIGURE 14.5: Percent of broadband internet and any internet usage, Central and South America
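A design note on the faceted map above: binding two modified copies of the shape data is simple and works well. An equivalent alternative (a hedged sketch, not from the book) reshapes the estimates to long format with tidyr::pivot_longer() first and joins the shape data once; the column names here match the exercise solution above.

```r
library(dplyr)
library(tidyr)

# Reshape the two estimate columns into one Type/p pair per country, then
# join the shape data a single time; each geounit gets one row per Type.
int_long <- int_ests %>%
  select(Country, Internet = p_internet, Broadband = p_broadband) %>%
  pivot_longer(-Country, names_to = "Type", values_to = "p")

b_int_sf_alt <- country_shape_upd %>%
  full_join(int_long, by = c("geounit" = "Country")) %>%
  filter(region_wb == "Latin America & Caribbean")
# b_int_sf_alt can be passed to the same ggplot()/facet_wrap(~Type) call.
```

Either way, each geounit ends up with one row per Type, which is what facet_wrap(~Type) needs.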
References

LAPOP. 2021a. “AmericasBarometer 2021 - Canada: Technical Information.” Vanderbilt University. http://datasets.americasbarometer.org/database/files/ABCAN2021-Technical-Report-v1.0-FINAL-eng-110921.pdf.

———. 2021b. “AmericasBarometer 2021 - U.S.: Technical Information.” Vanderbilt University. http://datasets.americasbarometer.org/database/files/ABUSA2021-Technical-Report-v1.0-FINAL-eng-110921.pdf.

———. 2021c. “AmericasBarometer 2021: Technical Information.” Vanderbilt University. https://www.vanderbilt.edu/lapop/ab2021/AB2021-Technical-Report-v1.0-FINAL-eng-030722.pdf.

———. 2021d. “Core Questionnaire.” https://www.vanderbilt.edu/lapop/ab2021/AB2021-Core-Questionnaire-v17.5-Eng-210514-W-v2.pdf.

———. 2023a. “About the AmericasBarometer.” https://www.vanderbilt.edu/lapop/about-americasbarometer.php.

———. 2023b. “The AmericasBarometer by the LAPOP Lab.” https://www.vanderbilt.edu/lapop.

Footnote: See Table 2 in LAPOP (2021c) for fieldwork dates by country.

A ANES Derived Variable Codebook

The full codebook with the original variables is available at https://electionstudies.org/wp-content/uploads/2022/02/anes_timeseries_2020_userguidecodebook_20220210.pdf
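Each entry below lists the variable, its description, question wording where applicable, its class, and its unweighted distribution. As a hedged sketch (assuming the {srvyrexploR} package that accompanies this book is installed), any categorical entry can be spot-checked against the data directly:

```r
library(dplyr)
library(srvyrexploR)

# Reproduce an entry's n and Unweighted Freq columns, e.g., for InterviewMode
anes_2020 %>%
  count(InterviewMode) %>%
  mutate(unweighted_freq = round(n / sum(n), 3))
```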
A.1 ADMIN

V200001
Description: 2020 Case ID
Variable class: numeric

CaseID
Description: 2020 Case ID
Variable class: numeric

V200002
Description: Mode of interview: pre-election interview
Variable class: haven_labelled, vctrs_vctr, double
n (Unweighted Freq): 1 = Video: 274 (0.037); 2 = Telephone: 115 (0.015); 3 = Web: 7064 (0.948); Total: 7453 (1.000)

InterviewMode
Description: same as V200002
Variable class: factor
n (Unweighted Freq): Video: 274 (0.037); Telephone: 115 (0.015); Web: 7064 (0.948); Total: 7453 (1.000)

A.2 WEIGHTS

V200010b
Description: Full sample post-election weight
Variable class: numeric
N Missing: 0; Minimum: 0.0083; Median: 0.6863; Maximum: 6.651

Weight
Description: same as V200010b
Variable class: numeric
N Missing: 0; Minimum: 0.0083; Median: 0.6863; Maximum: 6.651

V200010c
Description: Full sample variance unit
Variable class: numeric
N Missing: 0; Minimum: 1; Median: 2; Maximum: 3

VarUnit
Description: same as V200010c
Variable class: factor
n (Unweighted Freq): 1: 3689 (0.495); 2: 3750 (0.503); 3: 14 (0.002); Total: 7453 (1.000)

V200010d
Description: Full sample variance stratum
Variable class: numeric
N Missing: 0; Minimum: 1; Median: 24; Maximum: 50

Stratum
Description: same as V200010d
Variable class: factor
n (Unweighted Freq): 1: 167 (0.022); 2: 148 (0.020); 3: 158 (0.021); 4: 151 (0.020); 5: 147 (0.020); 6: 172 (0.023); 7: 163 (0.022); 8: 159 (0.021); 9: 160 (0.021); 10: 159 (0.021); 11: 137 (0.018); 12: 179 (0.024); 13: 148 (0.020); 14: 160 (0.021); 15: 159 (0.021); 16: 148 (0.020); 17: 158 (0.021); 18: 156 (0.021); 19: 154 (0.021); 20: 144 (0.019); 21: 170 (0.023); 22: 146 (0.020); 23: 165 (0.022); 24: 147 (0.020); 25: 169 (0.023); 26: 165 (0.022); 27: 172 (0.023); 28: 133 (0.018); 29: 157 (0.021); 30: 167 (0.022); 31: 154 (0.021); 32: 143 (0.019); 33: 143 (0.019); 34: 124 (0.017); 35: 138 (0.019); 36: 130 (0.017); 37: 136 (0.018); 38: 145 (0.019); 39: 140 (0.019); 40: 125 (0.017); 41: 158 (0.021); 42: 146 (0.020); 43: 130 (0.017); 44: 126 (0.017); 45: 126 (0.017); 46: 135 (0.018); 47: 133 (0.018); 48: 140 (0.019); 49: 133 (0.018); 50: 130 (0.017); Total: 7453 (1.000)

A.3 PRE-ELECTION SURVEY QUESTIONNAIRE

V201006
Description: PRE: How interested in following campaigns
Question: Some people don’t pay much attention to political campaigns. How about you? Would you say that you have been very much interested, somewhat interested or not much interested in the political campaigns so far this year?
Variable class: haven_labelled, vctrs_vctr, double
n (Unweighted Freq): -9 = Refused: 1 (0.000); 1 = Very much interested: 3940 (0.529); 2 = Somewhat interested: 2569 (0.345); 3 = Not much interested: 943 (0.127); Total: 7453 (1.000)

CampaignInterest
Description and question: same as V201006
Variable class: factor
n (Unweighted Freq): Very much interested: 3940 (0.529); Somewhat interested: 2569 (0.345); Not much interested: 943 (0.127); NA: 1 (0.000); Total: 7453 (1.000)

V201024
Description: PRE: In what manner did R vote
Question: Which one of the following best describes how you voted?
Variable class: haven_labelled, vctrs_vctr, double
n (Unweighted Freq): -9 = Refused: 1 (0.000); -1 = Inapplicable: 7078 (0.950); 1 = Definitely voted in person at a polling place before election day: 101 (0.014); 2 = Definitely voted by mailing a ballot to elections officials before election day: 242 (0.032); 3 = Definitely voted in some other way: 28 (0.004); 4 = Not completely sure whether you voted or not: 3 (0.000); Total: 7453 (1.000)

V201025x
Description: PRE: SUMMARY: Registration and early vote status
Variable class: haven_labelled, vctrs_vctr, double
n (Unweighted Freq): -4 = Technical error: 1 (0.000); 1 = Not registered (or DK/RF), does not intend to register (or DK/RF intent): 339 (0.045); 2 = Not registered (or DK/RF), intends to register: 290 (0.039); 3 = Registered but did not vote early (or DK/RF): 6452 (0.866); 4 = Registered and voted early: 371 (0.050); Total: 7453 (1.000)

V201029
Description: PRE: For whom did R vote for President
Question: Who did you vote for? [Joe Biden, Donald Trump/Donald Trump, Joe Biden], Jo Jorgensen, Howie Hawkins, or someone else?
Variable class: haven_labelled, vctrs_vctr, double
n (Unweighted Freq): -9 = Refused: 10 (0.001); -1 = Inapplicable: 7092 (0.952); 1 = Joe Biden: 239 (0.032); 2 = Donald Trump: 103 (0.014); 3 = Jo Jorgensen: 2 (0.000); 4 = Howie Hawkins: 1 (0.000); 5 = Other candidate {SPECIFY}: 4 (0.001); 12 = Specified as refused: 2 (0.000); Total: 7453 (1.000)

V201101
Description: PRE: Did R vote for President in 2016 [revised]
Question: Four years ago, in 2016, Hillary Clinton ran on the Democratic ticket against Donald Trump for the Republicans. We talk to many people who tell us they did not vote. And we talk to a few people who tell us they did vote, who really did not. We can tell they did not vote by checking with official government records. What about you? If we check the official government voter records, will they show that you voted in the 2016 presidential election, or that you did not vote in that election?
Variable class: haven_labelled, vctrs_vctr, double
n (Unweighted Freq): -9 = Refused: 13 (0.002); -8 = Don’t know: 1 (0.000); -1 = Inapplicable: 3780 (0.507); 1 = Yes, voted: 2780 (0.373); 2 = No, didn’t vote: 879 (0.118); Total: 7453 (1.000)

V201102
Description: PRE: Did R vote for President in 2016
Question: Four years ago, in 2016, Hillary Clinton ran on the Democratic ticket against Donald Trump for the Republicans. Do you remember for sure whether or not you voted in that election?
Variable class: haven_labelled, vctrs_vctr, double
n (Unweighted Freq): -9 = Refused: 6 (0.001); -8 = Don’t know: 1 (0.000); -1 = Inapplicable: 3673 (0.493); 1 = Yes, voted: 3030 (0.407); 2 = No, didn’t vote: 743 (0.100); Total: 7453 (1.000)
VotedPres2016
Description: PRE: Did R vote for President in 2016
Question: Derived from V201102, V201101
Variable class: factor
n (Unweighted Freq): Yes: 5810 (0.780); No: 1622 (0.218); NA: 21 (0.003); Total: 7453 (1.000)

V201103
Description: PRE: Recall of last (2016) Presidential vote choice
Question: Which one did you vote for?
Variable class: haven_labelled, vctrs_vctr, double
n (Unweighted Freq): -9 = Refused: 41 (0.006); -8 = Don’t know: 2 (0.000); -1 = Inapplicable: 1643 (0.220); 1 = Hillary Clinton: 2911 (0.391); 2 = Donald Trump: 2466 (0.331); 5 = Other {SPECIFY}: 390 (0.052); Total: 7453 (1.000)

VotedPres2016_selection
Description and question: same as V201103
Variable class: factor
n (Unweighted Freq): Clinton: 2911 (0.391); Trump: 2466 (0.331); Other: 390 (0.052); NA: 1686 (0.226); Total: 7453 (1.000)

V201228
Description: PRE: Party ID: Does R think of self as Democrat, Republican, or Independent
Question: Generally speaking, do you usually think of yourself as [a Democrat, a Republican / a Republican, a Democrat], an independent, or what?
Variable class: haven_labelled, vctrs_vctr, double
n (Unweighted Freq): -9 = Refused: 37 (0.005); -8 = Don’t know: 4 (0.001); -4 = Technical error: 1 (0.000); 0 = No preference {VOL - video/phone only}: 6 (0.001); 1 = Democrat: 2589 (0.347); 2 = Republican: 2304 (0.309); 3 = Independent: 2277 (0.306); 5 = Other party {SPECIFY}: 235 (0.032); Total: 7453 (1.000)

V201229
Description: PRE: Party Identification strong - Democrat Republican
Question: Would you call yourself a strong [Democrat / Republican] or a not very strong [Democrat / Republican]?
Variable class: haven_labelled, vctrs_vctr, double
n (Unweighted Freq): -9 = Refused: 4 (0.001); -1 = Inapplicable: 2560 (0.343); 1 = Strong: 3341 (0.448); 2 = Not very strong: 1548 (0.208); Total: 7453 (1.000)

V201230
Description: PRE: No Party Identification - closer to Democratic Party or Republican Party
Question: Do you think of yourself as closer to the Republican Party or to the Democratic Party?
Variable class: haven_labelled, vctrs_vctr, double
n (Unweighted Freq): -9 = Refused: 19 (0.003); -8 = Don’t know: 2 (0.000); -1 = Inapplicable: 4893 (0.657); 1 = Closer to Republican: 782 (0.105); 2 = Neither {VOL in video and phone}: 876 (0.118); 3 = Closer to Democratic: 881 (0.118); Total: 7453 (1.000)

V201231x
Description: PRE: SUMMARY: Party ID
Question: Derived from V201228, V201229, and PTYID_LEANPTY
Variable class: haven_labelled, vctrs_vctr, double
n (Unweighted Freq): -9 = Refused: 23 (0.003); -8 = Don’t know: 2 (0.000); 1 = Strong Democrat: 1796 (0.241); 2 = Not very strong Democrat: 790 (0.106); 3 = Independent-Democrat: 881 (0.118); 4 = Independent: 876 (0.118); 5 = Independent-Republican: 782 (0.105); 6 = Not very strong Republican: 758 (0.102); 7 = Strong Republican: 1545 (0.207); Total: 7453 (1.000)

PartyID
Description and question: same as V201231x
Variable class: factor
n (Unweighted Freq): Strong democrat: 1796 (0.241); Not very strong democrat: 790 (0.106); Independent-democrat: 881 (0.118); Independent: 876 (0.118); Independent-republican: 782 (0.105); Not very strong republican: 758 (0.102); Strong republican: 1545 (0.207); NA: 25 (0.003); Total: 7453 (1.000)
V201233
Description: PRE: How often trust government in Washington to do what is right [revised]
Question: How often can you trust the federal government in Washington to do what is right?
Variable class: haven_labelled, vctrs_vctr, double
n (Unweighted Freq): -9 = Refused: 26 (0.003); -8 = Don’t know: 3 (0.000); 1 = Always: 80 (0.011); 2 = Most of the time: 1016 (0.136); 3 = About half the time: 2313 (0.310); 4 = Some of the time: 3313 (0.445); 5 = Never: 702 (0.094); Total: 7453 (1.000)

TrustGovernment
Description and question: same as V201233
Variable class: factor
n (Unweighted Freq): Always: 80 (0.011); Most of the time: 1016 (0.136); About half the time: 2313 (0.310); Some of the time: 3313 (0.445); Never: 702 (0.094); NA: 29 (0.004); Total: 7453 (1.000)

V201237
Description: PRE: How often can people be trusted
Question: Generally speaking, how often can you trust other people?
Variable class: haven_labelled, vctrs_vctr, double
n (Unweighted Freq): -9 = Refused: 12 (0.002); -8 = Don’t know: 1 (0.000); 1 = Always: 48 (0.006); 2 = Most of the time: 3511 (0.471); 3 = About half the time: 2020 (0.271); 4 = Some of the time: 1597 (0.214); 5 = Never: 264 (0.035); Total: 7453 (1.000)

TrustPeople
Description and question: same as V201237
Variable class: factor
n (Unweighted Freq): Always: 48 (0.006); Most of the time: 3511 (0.471); About half the time: 2020 (0.271); Some of the time: 1597 (0.214); Never: 264 (0.035); NA: 13 (0.002); Total: 7453 (1.000)

V201507x
Description: PRE: SUMMARY: Respondent age
Question: Derived from birth month, day and year
Variable class: haven_labelled, vctrs_vctr, double
N Missing: 0; N Refused (-9): 294; Minimum: 18; Median: 53; Maximum: 80

Age
Description and question: same as V201507x
Variable class: numeric
N Missing: 294; Minimum: 18; Median: 53; Maximum: 80

AgeGroup
Description and question: same as V201507x
Variable class: factor
n (Unweighted Freq): 18-29: 871 (0.117); 30-39: 1241 (0.167); 40-49: 1081 (0.145); 50-59: 1200 (0.161); 60-69: 1436 (0.193); 70 or older: 1330 (0.178); NA: 294 (0.039); Total: 7453 (1.000)

V201510
Description: PRE: Highest level of Education
Question: What is the highest level of school you have completed or the highest degree you have received?
Variable class: haven_labelled, vctrs_vctr, double
n (Unweighted Freq): -9 = Refused: 25 (0.003); -8 = Don’t know: 1 (0.000); 1 = Less than high school credential: 312 (0.042); 2 = High school graduate - High school diploma or equivalent (e.g. GED): 1160 (0.156); 3 = Some college but no degree: 1519 (0.204); 4 = Associate degree in college - occupational/vocational: 550 (0.074); 5 = Associate degree in college - academic: 445 (0.060); 6 = Bachelor’s degree (e.g. BA, AB, BS): 1877 (0.252); 7 = Master’s degree (e.g. MA, MS, MEng, MEd, MSW, MBA): 1092 (0.147); 8 = Professional school degree (e.g. MD, DDS, DVM, LLB, JD)/Doctoral degree (e.g. PHD, EDD): 382 (0.051); 95 = Other {SPECIFY}: 90 (0.012); Total: 7453 (1.000)

Education
Description and question: same as V201510
Variable class: factor
n (Unweighted Freq): Less than HS: 312 (0.042); High school: 1160 (0.156); Post HS: 2514 (0.337); Bachelor’s: 1877 (0.252); Graduate: 1474 (0.198); NA: 116 (0.016); Total: 7453 (1.000)

V201546
Description: PRE: R: Are you Spanish, Hispanic, or Latino
Question: Are you of Hispanic, Latino, or Spanish origin?
Variable class: haven_labelled, vctrs_vctr, double
n (Unweighted Freq): -9 = Refused: 45 (0.006); -8 = Don’t know: 3 (0.000); 1 = Yes: 662 (0.089); 2 = No: 6743 (0.905); Total: 7453 (1.000)
V201547a through V201547z (RESTRICTED: PRE: Race of R)
All six race items share the preamble: I am going to read you a list of five race categories. You may choose one or more races. For this survey, Hispanic origin is not a race. Each is haven_labelled, vctrs_vctr, double, and each is fully restricted: -3 = Restricted: 7453 (1); Total: 7453 (1).

V201547a (White [mention]): Are you White?
V201547b (Black or African-American [mention]): Are you Black or African American?
V201547c (Asian [mention]): Are you Asian?
V201547d (Native Hawaiian or Pacific Islander [mention]): Are you White; Black or African American; American Indian or Alaska Native; Asian; or Native Hawaiian or Other Pacific Islander?
V201547e (Native American or Alaska Native [mention]): Are you American Indian or Alaska Native?
V201547z (other specify): Reported other

V201549x
Description: PRE: SUMMARY: R self-identified race/ethnicity
Question: Derived from V201546, V201547a-V201547e, and V201547z
Variable class: haven_labelled, vctrs_vctr, double
n (Unweighted Freq): -9 = Refused: 75 (0.010); -8 = Don’t know: 6 (0.001); 1 = White, non-Hispanic: 5420 (0.727); 2 = Black, non-Hispanic: 650 (0.087); 3 = Hispanic: 662 (0.089); 4 = Asian or Native Hawaiian/other Pacific Islander, non-Hispanic alone: 248 (0.033); 5 = Native American/Alaska Native or other race, non-Hispanic alone: 155 (0.021); 6 = Multiple races, non-Hispanic: 237 (0.032); Total: 7453 (1.000)

RaceEth
Description and question: same as V201549x
Variable class: factor
n (Unweighted Freq): White: 5420 (0.727); Black: 650 (0.087); Hispanic: 662 (0.089); Asian, NH/PI: 248 (0.033); AI/AN: 155 (0.021); Other/multiple race: 237 (0.032); NA: 81 (0.011); Total: 7453 (1.000)

V201600
Description: PRE: What is your (R) sex? [revised]
Question: What is your sex?
Variable class: haven_labelled, vctrs_vctr, double
n (Unweighted Freq): -9 = Refused: 51 (0.007); 1 = Male: 3375 (0.453); 2 = Female: 4027 (0.540); Total: 7453 (1.000)
Gender
Description and question: same as V201600
Variable class: factor
n (Unweighted Freq): Male: 3375 (0.453); Female: 4027 (0.540); NA: 51 (0.007); Total: 7453 (1.000)

V201607
Description: RESTRICTED: PRE: Total income amount - revised
Question: The next question is about [the total combined income of all members of your family / your total income] during the past 12 months. This includes money from jobs, net income from business, farm or rent, pensions, dividends, interest, Social Security payments, and any other money income received by members of your family who are 15 years of age or older. What was the total income of your family during the past 12 months? TYPE THE NUMBER. YOUR BEST GUESS IS FINE.
Variable class: haven_labelled, vctrs_vctr, double
n (Unweighted Freq): -3 = Restricted: 7453 (1); Total: 7453 (1)

V201610 (categories lt 20K), V201611 (categories 20-40K), V201613 (categories 40-70K), V201615 (categories 70-100K), V201616 (categories 100+K)
Description: RESTRICTED: PRE: Income amt missing, for the category range given
Question (identical for all five): Please choose the answer that includes the income of all members of your family during the past 12 months before taxes.
Variable class: haven_labelled, vctrs_vctr, double
n (Unweighted Freq), each: -3 = Restricted: 7453 (1); Total: 7453 (1)
V201617x
Description: PRE: SUMMARY: Total (family) income
Question: Derived from V201607, V201610, V201611, V201613, V201615, V201616
Variable class: haven_labelled, vctrs_vctr, double
n (Unweighted Freq): -9 = Refused: 502 (0.067); -5 = Interview breakoff (sufficient partial IW): 15 (0.002); 1 = Under $9,999: 647 (0.087); 2 = $10,000-14,999: 244 (0.033); 3 = $15,000-19,999: 185 (0.025); 4 = $20,000-24,999: 301 (0.040); 5 = $25,000-29,999: 228 (0.031); 6 = $30,000-34,999: 296 (0.040); 7 = $35,000-39,999: 226 (0.030); 8 = $40,000-44,999: 286 (0.038); 9 = $45,000-49,999: 213 (0.029); 10 = $50,000-59,999: 485 (0.065); 11 = $60,000-64,999: 294 (0.039); 12 = $65,000-69,999: 168 (0.023); 13 = $70,000-74,999: 243 (0.033); 14 = $75,000-79,999: 215 (0.029); 15 = $80,000-89,999: 383 (0.051); 16 = $90,000-99,999: 291 (0.039); 17 = $100,000-109,999: 451 (0.061); 18 = $110,000-124,999: 312 (0.042); 19 = $125,000-149,999: 323 (0.043); 20 = $150,000-174,999: 366 (0.049); 21 = $175,000-249,999: 374 (0.050); 22 = $250,000 or more: 405 (0.054); Total: 7453 (1.000)

Income
Description and question: same as V201617x
Variable class: factor
n (Unweighted Freq): the same 22 income categories and counts as V201617x above, with refusals and breakoffs collapsed to NA: 517 (0.069); Total: 7453 (1.000)

Income7
Description and question: same as V201617x
Variable class: factor
n (Unweighted Freq): Under $20k: 1076 (0.144); $20-40k: 1051 (0.141); $40-60k: 984 (0.132); $60-80k: 920 (0.123); $80-100k: 674 (0.090); $100-125k: 763 (0.102); $125k or more: 1468 (0.197); NA: 517 (0.069); Total: 7453 (1.000)

A.4 POST-ELECTION SURVEY QUESTIONNAIRE

V202051
Description: POST: R registered to vote (post-election)
Question: Now on a different topic. Are you registered to vote at [Respondent’s preloaded address], registered at a different address, or not currently registered?
Variable class: haven_labelled, vctrs_vctr, double
n (Unweighted Freq): -9 = Refused: 4 (0.001); -6 = No post-election interview: 4 (0.001); -1 = Inapplicable: 6820 (0.915); 1 = Registered at this address: 173 (0.023); 2 = Registered at a different address: 59 (0.008); 3 = Not currently registered: 393 (0.053); Total: 7453 (1.000)

V202066
Description: POST: Did R vote in November 2020 election
Question: In talking to people about elections, we often find that a lot of people were not able to vote because they weren’t registered, they were sick, or they just didn’t have time. Which of the following statements best describes you:
Variable class: haven_labelled, vctrs_vctr, double
n (Unweighted Freq): -9 = Refused: 7 (0.001); -6 = No post-election interview: 4 (0.001); -1 = Inapplicable: 372 (0.050); 1 = I did not vote (in the election this November): 582 (0.078); 2 = I thought about voting this time, but didn’t: 265 (0.036); 3 = I usually vote, but didn’t this time: 192 (0.026); 4 = I am sure I voted: 6031 (0.809); Total: 7453 (1.000)

V202072
Description: POST: Did R vote for President
Question: How about the election for President? Did you vote for a candidate for President?
Variable class: haven_labelled, vctrs_vctr, double
n (Unweighted Freq): -9 = Refused: 2 (0.000); -6 = No post-election interview: 4 (0.001); -1 = Inapplicable: 1418 (0.190); 1 = Yes, voted for President: 5952 (0.799); 2 = No, didn’t vote for President: 77 (0.010); Total: 7453 (1.000)
VotedPres2020
Description and question: same as V202072
Variable class: factor
n (Unweighted Freq): Yes: 5952 (0.799); No: 77 (0.010); NA: 1424 (0.191); Total: 7453 (1.000)

V202073
Description: POST: For whom did R vote for President
Question: Who did you vote for? [Joe Biden, Donald Trump/Donald Trump, Joe Biden], Jo Jorgensen, Howie Hawkins, or someone else?
Variable class: haven_labelled, vctrs_vctr, double
n (Unweighted Freq): -9 = Refused: 53 (0.007); -6 = No post-election interview: 4 (0.001); -1 = Inapplicable: 1497 (0.201); 1 = Joe Biden: 3267 (0.438); 2 = Donald Trump: 2462 (0.330); 3 = Jo Jorgensen: 69 (0.009); 4 = Howie Hawkins: 23 (0.003); 5 = Other candidate {SPECIFY}: 56 (0.008); 7 = Specified as Republican candidate: 1 (0.000); 8 = Specified as Libertarian candidate: 3 (0.000); 11 = Specified as don’t know: 2 (0.000); 12 = Specified as refused: 16 (0.002); Total: 7453 (1.000)

V202109x
Description: PRE-POST: SUMMARY: Voter turnout in 2020
Question: Derived from V201024, V202066, V202051
Variable class: haven_labelled, vctrs_vctr, double
n (Unweighted Freq): -2 = Not reported: 7 (0.001); 0 = Did not vote: 1039 (0.139); 1 = Voted: 6407 (0.860); Total: 7453 (1.000)

V202110x
Description: PRE-POST: SUMMARY: 2020 Presidential vote
Question: Derived from V201029, V202073
Variable class: haven_labelled, vctrs_vctr, double
n (Unweighted Freq): -9 = Refused: 81 (0.011); -8 = Don’t know: 2 (0.000); -1 = Inapplicable: 1136 (0.152); 1 = Joe Biden: 3509 (0.471); 2 = Donald Trump: 2567 (0.344); 3 = Jo Jorgensen: 74 (0.010); 4 = Howie Hawkins: 24 (0.003); 5 = Other candidate {SPECIFY}: 60 (0.008); Total: 7453 (1.000)

VotedPres2020_selection
Description and question: same as V202110x
Variable class: factor
n (Unweighted Freq): Biden: 3509 (0.471); Trump: 2567 (0.344); Other: 158 (0.021); NA: 1219 (0.164); Total: 7453 (1.000)

EarlyVote2020
Description: PRE-POST: Voted early for president
Question: Derived from V201025x, VotedPres2020
Variable class: factor
n (Unweighted Freq): Yes: 371 (0.050); No: 5949 (0.798); NA: 1133 (0.152); Total: 7453 (1.000)

B RECS Derived Variable Codebook

The full codebook with the original variables is available at https://www.eia.gov/consumption/residential/data/2020/index.php?view=microdata under “Variable and response codebook”. This codebook includes the variables on the dataset included for download along with this book.
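As with the ANES codebook, the entries that follow can be spot-checked against the data. A hedged sketch (again assuming the recs_2020 dataset shipped with {srvyrexploR}), here reproducing the numeric summary line for TOTSQFT_EN in Section B.4 below:

```r
library(dplyr)
library(srvyrexploR)

# Reproduce a codebook line of the form: N Missing, Minimum, Median, Maximum
recs_2020 %>%
  summarize(
    n_missing = sum(is.na(TOTSQFT_EN)),
    minimum   = min(TOTSQFT_EN, na.rm = TRUE),
    median    = median(TOTSQFT_EN, na.rm = TRUE),
    maximum   = max(TOTSQFT_EN, na.rm = TRUE)
  )
```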
B.1 ADMIN

DOEID
Description: Unique identifier for each respondent

ClimateRegion_BA
Description: Building America Climate Zone
n (Unweighted Freq): Mixed-Dry: 142 (0.008); Mixed-Humid: 5579 (0.302); Hot-Humid: 2545 (0.138); Hot-Dry: 1577 (0.085); Very-Cold: 572 (0.031); Cold: 7116 (0.385); Marine: 911 (0.049); Subarctic: 54 (0.003); Total: 18496 (1.000)

Urbanicity
Description: 2010 Census Urban Type Code
n (Unweighted Freq): Urban Area: 12395 (0.670); Urban Cluster: 2020 (0.109); Rural: 4081 (0.221); Total: 18496 (1.000)

B.2 GEOGRAPHY

Region
Description: Census Region
n (Unweighted Freq): Northeast: 3657 (0.198); Midwest: 3832 (0.207); South: 6426 (0.347); West: 4581 (0.248); Total: 18496 (1.000)

REGIONC
Description: Census Region
n (Unweighted Freq): MIDWEST: 3832 (0.207); NORTHEAST: 3657 (0.198); SOUTH: 6426 (0.347); WEST: 4581 (0.248); Total: 18496 (1.000)

Division
Description: Census Division; the Mountain Division is divided into North and South for RECS purposes
n (Unweighted Freq): New England: 1680 (0.091); Middle Atlantic: 1977 (0.107); East North Central: 2014 (0.109); West North Central: 1818 (0.098); South Atlantic: 3256 (0.176); East South Central: 1343 (0.073); West South Central: 1827 (0.099); Mountain North: 1180 (0.064); Mountain South: 904 (0.049); Pacific: 2497 (0.135); Total: 18496 (1.000)

STATE_FIPS
Description: State Federal Information Processing System Code
n (Unweighted Freq): 01: 242 (0.013); 02: 311 (0.017); 04: 495 (0.027); 05: 268 (0.014); 06: 1152 (0.062); 08: 360 (0.019); 09: 294 (0.016); 10: 143 (0.008); 11: 221 (0.012); 12: 655 (0.035); 13: 417 (0.023); 15: 282 (0.015); 16: 270 (0.015); 17: 530 (0.029); 18: 400 (0.022); 19: 286 (0.015); 20: 208 (0.011); 21: 428 (0.023); 22: 311 (0.017); 23: 223 (0.012); 24: 359 (0.019); 25: 552 (0.030); 26: 388 (0.021); 27: 325 (0.018); 28: 168 (0.009); 29: 296 (0.016); 30: 172 (0.009); 31: 189 (0.010); 32: 231 (0.012); 33: 175 (0.009); 34: 456 (0.025); 35: 178 (0.010); 36: 904 (0.049); 37: 479 (0.026); 38: 331 (0.018); 39: 339 (0.018); 40: 232 (0.013); 41: 313 (0.017); 42: 617 (0.033); 44: 191 (0.010); 45: 334 (0.018); 46: 183 (0.010); 47: 505 (0.027); 48: 1016 (0.055); 49: 188 (0.010); 50: 245 (0.013); 51: 451 (0.024); 53: 439 (0.024); 54: 197 (0.011); 55: 357 (0.019); 56: 190 (0.010); Total: 18496 (1.000)

state_postal
Description: State Postal Code
n (Unweighted Freq): the same counts as STATE_FIPS above, keyed by two-letter postal code in the same state order (AL: 242 (0.013) through WY: 190 (0.010)); Total: 18496 (1.000)

state_name
Description: State Name
n (Unweighted Freq): the same counts as STATE_FIPS above, keyed by full state name in the same state order (Alabama: 242 (0.013) through Wyoming: 190 (0.010)); Total: 18496 (1.000)
B.3 WEATHER

HDD65
Description: Heating degree days in 2020, base temperature 65F; derived from the weighted temperatures of nearby weather stations
N Missing: 0; Minimum: 0; Median: 4396; Maximum: 17383

CDD65
Description: Cooling degree days in 2020, base temperature 65F; derived from the weighted temperatures of nearby weather stations
N Missing: 0; Minimum: 0; Median: 1179; Maximum: 5534

HDD30YR
Description: Heating degree days, 30-year average 1981-2010, base temperature 65F; taken from the nearest weather station, inoculated with random errors
N Missing: 0; Minimum: 0; Median: 4825; Maximum: 16071

CDD30YR
Description: Cooling degree days, 30-year average 1981-2010, base temperature 65F; taken from the nearest weather station, inoculated with random errors
N Missing: 0; Minimum: 0; Median: 1020; Maximum: 4905

B.4 YOUR HOME

HousingUnitType
Description: Type of housing unit
Question: Which best describes your home?
n (Unweighted Freq): Mobile home: 974 (0.053); Single-family detached: 12319 (0.666); Single-family attached: 1751 (0.095); Apartment: 2-4 Units: 1013 (0.055); Apartment: 5 or more units: 2439 (0.132); Total: 18496 (1.000)

YearMade
Description: Range when housing unit was built
Question: Derived from: In what year was your home built? AND Although you do not know the exact year your home was built, it is helpful to have an estimate. About when was your home built?
n (Unweighted Freq): Before 1950: 2721 (0.147); 1950-1959: 1685 (0.091); 1960-1969: 1867 (0.101); 1970-1979: 2817 (0.152); 1980-1989: 2435 (0.132); 1990-1999: 2451 (0.133); 2000-2009: 2748 (0.149); 2010-2015: 989 (0.053); 2016-2020: 783 (0.042); Total: 18496 (1.000)

TOTSQFT_EN
Description: Total energy-consuming area (square footage) of the housing unit. Includes all main living areas; all basements; heated, cooled, or finished attics; and heated or cooled garages. For single-family housing units this is derived using the respondent-reported square footage (SQFTEST) and adjusted using the “include” variables (e.g., SQFTINCB), where applicable. For apartments and mobile homes this is the respondent-reported square footage. A derived variable rounded to the nearest 10.
N Missing: 0; Minimum: 200; Median: 1700; Maximum: 15000

TOTHSQFT
Description: Square footage of the housing unit that is heated by space heating equipment. A derived variable rounded to the nearest 10.
N Missing: 0; Minimum: 0; Median: 1520; Maximum: 15000

TOTCSQFT
Description: Square footage of the housing unit that is cooled by air-conditioning equipment or an evaporative cooler. A derived variable rounded to the nearest 10.
N Missing: 0; Minimum: 0; Median: 1200; Maximum: 14600

ZTOTSQFT_EN
Description: Imputation indicator for SQFTEST
n (Unweighted Freq): Not imputed: 11930 (0.645); Imputed: 6566 (0.355); Total: 18496 (1.000)

ZYearMade
Description: Imputation indicator for YEARMADERANGE
n (Unweighted Freq): Not imputed: 18176 (0.983); Imputed: 320 (0.017); Total: 18496 (1.000)

ZHousingUnitType
Description: Imputation indicator for TYPEHUQ
n (Unweighted Freq): Not imputed: 18496 (1); Total: 18496 (1)

B.5 SPACE HEATING

SpaceHeatingUsed
Description: Space heating equipment used
Question: Is your home heated during the winter?
n (Unweighted Freq): FALSE: 751 (0.041); TRUE: 17745 (0.959); Total: 18496 (1.000)
ZSpaceHeatingUsed
Description: Imputation indicator for HEATHOME
n (Unweighted Freq): Not imputed: 18474 (0.999); Imputed: 22 (0.001); Total: 18496 (1.000)

B.6 AIR CONDITIONING

ACUsed
Description: Air conditioning equipment used
Question: Is any air conditioning equipment used in your home?
n (Unweighted Freq): FALSE: 2325 (0.126); TRUE: 16171 (0.874); Total: 18496 (1.000)

ZACUsed
Description: Imputation indicator for AIRCOND
n (Unweighted Freq): Not imputed: 18448 (0.997); Imputed: 48 (0.003); Total: 18496 (1.000)

ZACBehavior
Description: Imputation indicator for COOLCNTL
n (Unweighted Freq): Not imputed: 15819 (0.855); Imputed: 352 (0.019); Not applicable: 2325 (0.126); Total: 18496 (1.000)

B.7 THERMOSTAT

HeatingBehavior
Description: Winter temperature control method
Question: Which of the following best describes how your household controls the indoor temperature during the winter?
n (Unweighted Freq): Set one temp and leave it: 7806 (0.422); Manually adjust at night/no one home: 4654 (0.252); Programmable or smart thermostat automatically adjusts the temperature: 3310 (0.179); Turn on or off as needed: 1491 (0.081); No control: 438 (0.024); Other: 46 (0.002); NA: 751 (0.041); Total: 18496 (1.000)

WinterTempDay
Description: Winter thermostat setting or temperature in home when someone is home during the day
Question: During the winter, what is your home’s typical indoor temperature when someone is home during the day?
N Missing: 751; Minimum: 50; Median: 70; Maximum: 90

WinterTempAway
Description: Winter thermostat setting or temperature in home when no one is home during the day
Question: During the winter, what is your home’s typical indoor temperature when no one is inside your home during the day?
N Missing: 751; Minimum: 50; Median: 68; Maximum: 90

WinterTempNight
Description: Winter thermostat setting or temperature in home at night
Question: During the winter, what is your home’s typical indoor temperature inside your home at night?
N Missing: 751; Minimum: 50; Median: 68; Maximum: 90

ACBehavior
Description: Summer temperature control method
Question: Which of the following best describes how your household controls the indoor temperature during the summer?
n (Unweighted Freq): Set one temp and leave it: 6738 (0.364); Manually adjust at night/no one home: 3637 (0.197); Programmable or smart thermostat automatically adjusts the temperature: 2638 (0.143); Turn on or off as needed: 2746 (0.148); No control: 409 (0.022); Other: 3 (0.000); NA: 2325 (0.126); Total: 18496 (1.000)

SummerTempDay
Description: Summer thermostat setting or temperature in home when someone is home during the day
Question: During the summer, what is your home’s typical indoor temperature when someone is home during the day?
N Missing: 2325; Minimum: 50; Median: 72; Maximum: 90

SummerTempAway
Description: Summer thermostat setting or temperature in home when no one is home during the day
Question: During the summer, what is your home’s typical indoor temperature when no one is inside your home during the day?
N Missing: 2325; Minimum: 50; Median: 74; Maximum: 90

SummerTempNight
Description: Summer thermostat setting or temperature in home at night
Question: During the summer, what is your home’s typical indoor temperature inside your home at night?
N Missing: 2325; Minimum: 50; Median: 72; Maximum: 90
ZHeatingBehavior
Description: Imputation indicator for HEATCNTL
n (Unweighted Freq): Not imputed: 17395 (0.940); Imputed: 350 (0.019); Not applicable: 751 (0.041); Total: 18496 (1.000)

ZWinterTempAway
Description: Imputation indicator for TEMPGONE
n (Unweighted Freq): Not imputed: 16840 (0.910); Imputed: 905 (0.049); Not applicable: 751 (0.041); Total: 18496 (1.000)

ZSummerTempAway
Description: Imputation indicator for TEMPGONEAC
n (Unweighted Freq): Not imputed: 15240 (0.824); Imputed: 931 (0.050); Not applicable: 2325 (0.126); Total: 18496 (1.000)

ZWinterTempDay
Description: Imputation indicator for TEMPHOME
n (Unweighted Freq): Not imputed: 17382 (0.940); Imputed: 363 (0.020); Not applicable: 751 (0.041); Total: 18496 (1.000)

ZSummerTempDay
Description: Imputation indicator for TEMPHOMEAC
n (Unweighted Freq): Not imputed: 15658 (0.847); Imputed: 513 (0.028); Not applicable: 2325 (0.126); Total: 18496 (1.000)

ZWinterTempNight
Description: Imputation indicator for TEMPNITE
n (Unweighted Freq): Not imputed: 17207 (0.930); Imputed: 538 (0.029); Not applicable: 751 (0.041); Total: 18496 (1.000)

ZSummerTempNight
Description: Imputation indicator for TEMPNITEAC
n (Unweighted Freq): Not imputed: 15497 (0.838); Imputed: 674 (0.036); Not applicable: 2325 (0.126); Total: 18496 (1.000)

B.8 WEIGHTS

NWEIGHT
Description: Final Analysis Weight
N Missing: 0; Minimum: 437.9; Median: 6119; Maximum: 29279

NWEIGHT1 and higher are the Final Analysis Weights for the corresponding replicates. Each replicate weight listed below has N Missing: 0 and Minimum: 0; medians and maxima are as follows.

| Replicate weight | Median | Maximum |
|---|---|---|
| NWEIGHT1 | 6136 | 30015 |
| NWEIGHT2 | 6151 | 29422 |
| NWEIGHT3 | 6151 | 29431 |
| NWEIGHT4 | 6153 | 29494 |
| NWEIGHT5 | 6134 | 30039 |
| NWEIGHT6 | 6147 | 29419 |
| NWEIGHT7 | 6135 | 29586 |
| NWEIGHT8 | 6151 | 29499 |
| NWEIGHT9 | 6139 | 29845 |
| NWEIGHT10 | 6163 | 29635 |
| NWEIGHT11 | 6140 | 29681 |
| NWEIGHT12 | 6160 | 29849 |
| NWEIGHT13 | 6142 | 29843 |
| NWEIGHT14 | 6154 | 30184 |
| NWEIGHT15 | 6145 | 29970 |
| NWEIGHT16 | 6133 | 29825 |
| NWEIGHT17 | 6126 | 30606 |
| NWEIGHT18 | 6155 | 29689 |
| NWEIGHT19 | 6153 | 29336 |
Weight for replicate 20 N Missing Minimum Median Maximum 0 0 6139 30274 NWEIGHT21 Description: Final Analysis Weight for replicate 21 N Missing Minimum Median Maximum 0 0 6135 29766 NWEIGHT22 Description: Final Analysis Weight for replicate 22 N Missing Minimum Median Maximum 0 0 6149 29791 NWEIGHT23 Description: Final Analysis Weight for replicate 23 N Missing Minimum Median Maximum 0 0 6148 30126 NWEIGHT24 Description: Final Analysis Weight for replicate 24 N Missing Minimum Median Maximum 0 0 6136 29946 NWEIGHT25 Description: Final Analysis Weight for replicate 25 N Missing Minimum Median Maximum 0 0 6150 30445 NWEIGHT26 Description: Final Analysis Weight for replicate 26 N Missing Minimum Median Maximum 0 0 6136 29893 NWEIGHT27 Description: Final Analysis Weight for replicate 27 N Missing Minimum Median Maximum 0 0 6125 30030 NWEIGHT28 Description: Final Analysis Weight for replicate 28 N Missing Minimum Median Maximum 0 0 6149 29599 NWEIGHT29 Description: Final Analysis Weight for replicate 29 N Missing Minimum Median Maximum 0 0 6146 30136 NWEIGHT30 Description: Final Analysis Weight for replicate 30 N Missing Minimum Median Maximum 0 0 6149 29895 NWEIGHT31 Description: Final Analysis Weight for replicate 31 N Missing Minimum Median Maximum 0 0 6144 29604 NWEIGHT32 Description: Final Analysis Weight for replicate 32 N Missing Minimum Median Maximum 0 0 6159 29310 NWEIGHT33 Description: Final Analysis Weight for replicate 33 N Missing Minimum Median Maximum 0 0 6148 29408 NWEIGHT34 Description: Final Analysis Weight for replicate 34 N Missing Minimum Median Maximum 0 0 6139 29564 NWEIGHT35 Description: Final Analysis Weight for replicate 35 N Missing Minimum Median Maximum 0 0 6141 30437 NWEIGHT36 Description: Final Analysis Weight for replicate 36 N Missing Minimum Median Maximum 0 0 6149 27896 NWEIGHT37 Description: Final Analysis Weight for replicate 37 N Missing Minimum Median Maximum 0 0 6133 30596 NWEIGHT38 Description: Final Analysis Weight for replicate 38 N Missing Minimum Median Maximum 0 0 6139 30130 NWEIGHT39 Description: Final Analysis Weight for replicate 39 N Missing Minimum Median Maximum 0 0 6147 29262 NWEIGHT40 Description: Final Analysis Weight for replicate 40 N Missing Minimum Median Maximum 0 0 6144 30344 NWEIGHT41 Description: Final Analysis Weight for replicate 41 N Missing Minimum Median Maximum 0 0 6153 29594 NWEIGHT42 Description: Final Analysis Weight for replicate 42 N Missing Minimum Median Maximum 0 0 6137 29938 NWEIGHT43 Description: Final Analysis Weight for replicate 43 N Missing Minimum Median Maximum 0 0 6157 29878 NWEIGHT44 Description: Final Analysis Weight for replicate 44 N Missing Minimum Median Maximum 0 0 6148 29896 NWEIGHT45 Description: Final Analysis Weight for replicate 45 N Missing Minimum Median Maximum 0 0 6149 29729 NWEIGHT46 Description: Final Analysis Weight for replicate 46 N Missing Minimum Median Maximum 0 0 6152 29103 NWEIGHT47 Description: Final Analysis Weight for replicate 47 N Missing Minimum Median Maximum 0 0 6150 30070 NWEIGHT48 Description: Final Analysis Weight for replicate 48 N Missing Minimum Median Maximum 0 0 6139 29343 NWEIGHT49 Description: Final Analysis Weight for replicate 49 N Missing Minimum Median Maximum 0 0 6146 29590 NWEIGHT50 Description: Final Analysis Weight for replicate 50 N Missing Minimum Median Maximum 0 0 6159 30027 NWEIGHT51 Description: Final Analysis Weight for replicate 51 N Missing Minimum Median Maximum 0 0 6150 29247 NWEIGHT52 Description: Final Analysis Weight for replicate 52 N Missing 
Minimum Median Maximum 0 0 6154 29445 NWEIGHT53 Description: Final Analysis Weight for replicate 53 N Missing Minimum Median Maximum 0 0 6156 30131 NWEIGHT54 Description: Final Analysis Weight for replicate 54 N Missing Minimum Median Maximum 0 0 6151 29439 NWEIGHT55 Description: Final Analysis Weight for replicate 55 N Missing Minimum Median Maximum 0 0 6143 29216 NWEIGHT56 Description: Final Analysis Weight for replicate 56 N Missing Minimum Median Maximum 0 0 6153 29203 NWEIGHT57 Description: Final Analysis Weight for replicate 57 N Missing Minimum Median Maximum 0 0 6138 29819 NWEIGHT58 Description: Final Analysis Weight for replicate 58 N Missing Minimum Median Maximum 0 0 6137 29818 NWEIGHT59 Description: Final Analysis Weight for replicate 59 N Missing Minimum Median Maximum 0 0 6144 29606 NWEIGHT60 Description: Final Analysis Weight for replicate 60 N Missing Minimum Median Maximum 0 0 6140 29818 B.9 CONSUMPTION AND EXPENDITURE BTUEL Description: Total electricity use, in thousand Btu, 2020, including self-generation of solar power N Missing Minimum Median Maximum 0 143.3 31890 628155 DOLLAREL Description: Total electricity cost, in dollars, 2020 N Missing Minimum Median Maximum 0 -889.5 1258 15680 ZBTUEL Description: Imputation flag for total electricity use ZBTUEL n Unweighted Freq Not imputed 15965 0.863 Imputed amount and cost 2138 0.116 Imputed only amount for SOLAR=1 cases 393 0.021 Total 18496 1.000 BTUNG Description: Total natural gas use, in thousand Btu, 2020 N Missing Minimum Median Maximum 0 0 22012 1134709 DOLLARNG Description: Total natural gas cost, in dollars, 2020 N Missing Minimum Median Maximum 0 0 313.9 8155 ZBTUNG Description: Imputation flag for total natural gas use ZBTUNG n Unweighted Freq Not imputed 8823 0.477 Imputed 2331 0.126 Not applicable 7342 0.397 Total 18496 1.000 BTULP Description: Total propane use, in thousand Btu, 2020 N Missing Minimum Median Maximum 0 0 0 364215 DOLLARLP Description: Total propane cost, in dollars, 2020 N Missing Minimum Median Maximum 0 0 0 6621 ZBTULP Description: Imputation flag for total propane use ZBTULP n Unweighted Freq Not imputed 896 0.048 Imputed 1103 0.060 Not applicable 16497 0.892 Total 18496 1.000 BTUFO Description: Total fuel oil/kerosene use, in thousand Btu, 2020 N Missing Minimum Median Maximum 0 0 0 426268 DOLLARFO Description: Total fuel oil/kerosene cost, in dollars, 2020 N Missing Minimum Median Maximum 0 0 0 7004 ZBTUFO Description: Imputation flag for total fuel oil/kerosene use ZBTUFO n Unweighted Freq Not imputed 626 0.034 Imputed 607 0.033 Not applicable 17263 0.933 Total 18496 1.000 BTUWOOD Description: Total wood use, in thousand Btu, 2020 N Missing Minimum Median Maximum 0 0 0 5e+05 ZBTUWOOD Description: Imputation flag for total wood use ZBTUWOOD n Unweighted Freq Not imputed 1730 0.094 Imputed 244 0.013 Not applicable 16522 0.893 Total 18496 1.000 TOTALBTU Description: Total usage including electricity, natural gas, propane, and fuel oil, in thousand Btu, 2020 N Missing Minimum Median Maximum 0 1182 74180 1367548 TOTALDOL Description: Total cost including electricity, natural gas, propane, and fuel oil, in dollars, 2020 N Missing Minimum Median Maximum 0 -150.5 1793 20043 "],["importing-survey-data-into-r.html", "C Importing survey data into R C.1 Importing delimiter-separated files into R C.2 Loading Excel files into R C.3 Importing Stata, SAS, and SPSS files into R C.4 Importing data from APIs into R C.5 Accessing databases in R C.6 Importing data from other formats", " C Importing survey 
To analyze a survey, we need to import the survey data into R. This process is often referred to as importing, loading, or reading in data. Survey files come in different formats depending on the software used to create them. One of the many advantages of R is its flexibility in handling various data formats, regardless of their file extensions. Here are examples of common public-use survey file formats we may encounter:

Delimiter-separated text files
Excel spreadsheets in .xls or .xlsx format
R native .rda files
Stata datasets in .dta format
SAS datasets in .sas format
SPSS datasets in .sav format
Application Programming Interfaces (APIs), often in JSON format
Data stored in databases

This appendix guides analysts through the process of importing these various types of survey data into R.

C.1 Importing delimiter-separated files into R

Delimiter-separated files use specific characters, known as delimiters, to separate values within the file. For example, CSV (Comma-Separated Values) files use commas as delimiters, while TSV (Tab-Separated Values) files use tabs. These file formats are widely used because of their simplicity and compatibility with various software applications. The {readr} package, part of the tidyverse ecosystem, offers efficient ways to import delimiter-separated files into R. It provides several advantages, including automatic data type detection and flexible handling of missing values, depending on one’s survey research needs. The {readr} package includes functions for:

read_csv(): specifically designed to read CSV files
read_tsv(): for Tab-Separated Values (TSV) files
read_delim(): handles a broader range of delimiter-separated files, including CSV and TSV; specify the delimiter using the delim argument
read_fwf(): useful for importing Fixed-Width Files, where columns have predetermined widths and values are aligned in specific positions
read_table(): for whitespace-separated files, such as those with spaces or multiple spaces as delimiters
read_log(): reads and parses web log files

The syntax for read_csv() is:

read_csv(
  file,
  col_names = TRUE,
  col_types = NULL,
  col_select = NULL,
  id = NULL,
  locale = default_locale(),
  na = c("", "NA"),
  comment = "",
  trim_ws = TRUE,
  skip = 0,
  n_max = Inf,
  guess_max = min(1000, n_max),
  name_repair = "unique",
  num_threads = readr_threads(),
  progress = show_progress(),
  show_col_types = should_show_types(),
  skip_empty_rows = TRUE,
  lazy = should_read_lazy()
)

The arguments are:

file: the path to the CSV file to import
col_names: a value of TRUE imports the first row of the file as column names, which are then not included in the data frame; a value of FALSE creates automated column names; alternatively, we can provide a vector of column names
col_types: by default, R infers the column variable types; we can also provide a column specification using list() or cols() (for example, use col_types = cols(.default = "c") to read all the columns as characters), or use a string to specify the variable types for each column
col_select: the columns to include in the results
id: a column for storing the file path, useful for keeping track of the input file when importing multiple CSVs at a time
locale: the location-specific defaults for the file
na: a character vector of values to interpret as missing
comment: a character vector of values to interpret as comments
trim_ws: a value of TRUE trims leading and trailing white space
skip: the number of lines to skip before importing the data
n_max: the maximum number of lines to read
guess_max: the maximum number of lines used for guessing column types
name_repair: whether to check that column names are valid; by default, column names must be unique
num_threads: the number of processing threads to use for initial parsing and lazy reading of data
progress: a value of TRUE displays a progress bar
show_col_types: a value of TRUE displays the column types
skip_empty_rows: a value of TRUE ignores blank rows
lazy: a value of TRUE reads values lazily

The other functions share a similar syntax with read_csv(). To find more details, run ?? followed by the function name. For example, run ??read_delim in the Console for additional information.

In the example below, we use {readr} to load a CSV file named ‘anes_timeseries_2020_csv_20220210.csv’ into an R object called anes_csv. The read_csv() function imports the file and stores the data in the anes_csv object. We can then use this object for further analysis.

library(readr)

anes_csv <- read_csv("data/anes_timeseries_2020_csv_20220210.csv")
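These arguments can be combined to control the import more tightly. Below is a minimal sketch, not part of the original example, that reads only two columns from the same ANES CSV file and forces every column to be read as character; it assumes the columns V200001 and V200010b exist in the file, and the extra missing-value code is purely illustrative.

library(readr)

# Read only two columns, forcing all selected columns to character;
# treating "-9" as missing is an assumption for illustration only
anes_subset <- read_csv(
  "data/anes_timeseries_2020_csv_20220210.csv",
  col_select = c(V200001, V200010b),
  col_types = cols(.default = "c"),
  na = c("", "NA", "-9")
)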
C.2 Loading Excel files into R

Excel, a widely used spreadsheet software program created by Microsoft, is a common file format in survey research. We can load Excel spreadsheets into the R environment using the {readxl} package. The package supports both the legacy .xls files and the modern .xlsx format. To load Excel data into R, we can use the read_excel() function from the {readxl} package. This function offers a range of customizable options for the import process. Let’s explore the syntax:

read_excel(
  path,
  sheet = NULL,
  range = NULL,
  col_names = TRUE,
  col_types = NULL,
  na = "",
  trim_ws = TRUE,
  skip = 0,
  n_max = Inf,
  guess_max = min(1000, n_max),
  progress = readxl_progress(),
  .name_repair = "unique"
)

The arguments are:

path: the path to the Excel file to import
sheet: the name or index of the sheet (sometimes called a tab) within the Excel file
range: the range of cells to import (for example, "P15:T87")
col_names: indicates whether the first row of the dataset contains column names
col_types: specifies the data types of columns
na: defines the representation of missing values (for example, NULL)
trim_ws: controls whether leading and trailing whitespace should be trimmed
skip and n_max: enable skipping rows and limiting the number of rows imported
guess_max: sets the maximum number of rows used for data type guessing
progress: specifies a progress bar for large imports
.name_repair: determines how column names are repaired if they are not valid

In the code example below, we load an Excel spreadsheet named ‘anes_timeseries_2020_csv_20220210.xlsx’ into R. The resulting data is saved as a tibble in the anes_excel object, ready for further analysis.

library(readxl)

anes_excel <- read_excel(path = "data/anes_timeseries_2020_csv_20220210.xlsx")
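When a workbook contains multiple sheets, the sheet and range arguments control exactly what is read. Here is a minimal sketch, assuming a hypothetical workbook with a sheet named "data" and headers in its first row; the file name, sheet name, and cell range are placeholders, not part of the book’s data.

library(readxl)

# Hypothetical workbook: read cells A1:C100 from the sheet named "data"
survey_excel <- read_excel(
  path = "data/example_workbook.xlsx",
  sheet = "data",
  range = "A1:C100"
)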
C.3 Importing Stata, SAS, and SPSS files into R

The {haven} package, also from the tidyverse ecosystem, imports various proprietary data formats: Stata .dta files, SPSS .sav files, and SAS .sas7bdat and .sas7bcat files. One of the notable strengths of the {haven} package is its ability to handle multiple proprietary formats within a unified framework. It offers dedicated functions for each supported proprietary format, making it straightforward to import data regardless of the originating program. Here, we introduce read_dta() for Stata files, read_sav() for SPSS files, and read_sas() for SAS files.

C.3.1 Syntax

Let’s explore the syntax for importing Stata .dta files using haven::read_dta():

read_dta(
  file,
  encoding = NULL,
  col_select = NULL,
  skip = 0,
  n_max = Inf,
  .name_repair = "unique"
)

The arguments are:

file: the path to the proprietary data file to import
encoding: specifies the character encoding of the data file
col_select: selects specific columns for import
skip and n_max: control the number of rows skipped and the maximum number of rows imported
.name_repair: determines how column names are repaired if they are not valid

The syntax for read_sav() is similar to read_dta():

read_sav(
  file,
  encoding = NULL,
  user_na = FALSE,
  col_select = NULL,
  skip = 0,
  n_max = Inf,
  .name_repair = "unique"
)

The arguments are:

file: the path to the proprietary data file to import
encoding: specifies the character encoding of the data file
user_na: a value of TRUE reads variables with user-defined missing labels into labelled_spss() objects
col_select: selects specific columns for import
skip and n_max: control the number of rows skipped and the maximum number of rows imported
.name_repair: determines how column names are repaired if they are not valid

The syntax for importing SAS files with read_sas() is as follows:

read_sas(
  data_file,
  catalog_file = NULL,
  encoding = NULL,
  catalog_encoding = encoding,
  col_select = NULL,
  skip = 0L,
  n_max = Inf,
  .name_repair = "unique"
)

The arguments are:

data_file: the path to the proprietary data file to import
catalog_file: the path to the catalog file to import
encoding: specifies the character encoding of the data file
catalog_encoding: specifies the character encoding of the catalog file
col_select: selects specific columns for import
skip and n_max: control the number of rows skipped and the maximum number of rows imported
.name_repair: determines how column names are repaired if they are not valid

In the code examples below, we demonstrate how to load Stata, SPSS, and SAS files into R using the respective {haven} functions. The resulting data are stored in the anes_dta, anes_sav, and anes_sas objects as tibbles, ready for use in R.

Stata:

library(haven)

anes_dta <- read_dta(system.file("extdata",
                                 "anes_2020_stata_example.dta",
                                 package = "srvyrexploR"))

SPSS:

library(haven)

anes_sav <- read_sav(file = "data/anes_timeseries_2020_spss_20220210.sav")

SAS:

library(haven)

anes_sas <- read_sas(data_file = "data/anes_timeseries_2020_sas_20220210.sas7bdat")
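The col_select and n_max arguments work the same way across these functions. A minimal sketch, again assuming the example Stata file shipped with {srvyrexploR} and the ANES columns V200001 and V200002 seen above; the row limit is arbitrary and for illustration only.

library(haven)

# Read only two columns and the first 100 rows of the example Stata file
anes_dta_small <- read_dta(
  system.file("extdata", "anes_2020_stata_example.dta",
              package = "srvyrexploR"),
  col_select = c(V200001, V200002),
  n_max = 100
)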
C.3.2 Working with labeled data

Stata, SPSS, and SAS files often contain labeled variables and values. These labels provide descriptive information about categorical data, making it easier to understand and analyze. When importing data from Stata, SPSS, or SAS, preserving these labels is essential for maintaining data fidelity. Consider a variable like ‘Education Level’ with coded values (e.g., 1, 2, 3). Without labels, these codes can be cryptic. However, with labels (‘High School Graduate,’ ‘Bachelor’s Degree,’ ‘Master’s Degree’), the data becomes more informative and easier to work with. With the {haven} package, we can import and work with labeled data from Stata, SPSS, and SAS files. The package uses a special class of data, haven_labelled, to store labeled variables. When a dataset label is defined in Stata, it is stored in the ‘label’ attribute of the tibble when imported, ensuring that the information is not lost. We can use functions like select(), glimpse(), and is.labelled() to inspect the imported data and verify whether variables are labeled. Take a look at the ANES Stata file. Notice that categorical variables are marked with a type of <dbl+lbl>. This notation indicates that these variables are labeled.

library(dplyr)

anes_dta %>%
  select(1:6) %>%
  glimpse()
## Rows: 7,453
## Columns: 6
## $ V200001  <dbl> 200015, 200022, 200039, 200046, 200053, 200060, 20008…
## $ V200002  <dbl+lbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ V200010b <dbl> 1.0057, 1.1635, 0.7687, 0.5210, 0.9658, 0.2347, 0.440…
## $ V200010d <dbl> 9, 26, 41, 29, 23, 37, 7, 37, 32, 41, 22, 7, 38, 21, …
## $ V200010c <dbl> 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1,…
## $ V201006  <dbl+lbl> 2, 3, 2, 3, 2, 1, 2, 3, 2, 2, 2, 2, 2, 1, 2, 1, 1…

We can confirm this label status using the haven::is.labelled() function.

haven::is.labelled(anes_dta$V200002)
## [1] TRUE

To explore the labels further, we can use the attributes() function. This function provides insights into both the variable labels ($label) and the associated value labels ($labels).

attributes(anes_dta$V200002)
## $label
## [1] "Mode of interview: pre-election interview"
##
## $format.stata
## [1] "%10.0g"
##
## $class
## [1] "haven_labelled" "vctrs_vctr"     "double"
##
## $labels
##     1. Video 2. Telephone       3. Web
##            1            2            3

When we import a labeled dataset using {haven}, the result is a tibble containing both the data and the label information. However, this is meant to be an intermediary data structure, not the final data format for analysis. Instead, we should convert it into a regular R data frame before continuing our data workflow. There are two primary methods to achieve this conversion: (1) convert to factors or (2) remove the labels.

Option 1: Convert the vector into a factor

Factors are R’s native data type for working with categorical data. They consist of integer values that correspond to character values, known as levels. Below is a dummy example of factors. Printing the factors shows the four different levels in the data: strongly agree, agree, disagree, and strongly disagree.

response <- c("strongly agree", "agree", "agree", "disagree")

response_levels <- c("strongly agree", "agree", "disagree", "strongly disagree")

factors <- factor(response, levels = response_levels)

factors
## [1] strongly agree agree          agree          disagree
## Levels: strongly agree agree disagree strongly disagree

Factors are integer vectors, though they may look like character strings. We can confirm this by looking at the vector’s structure:

glimpse(factors)
## Factor w/ 4 levels "strongly agree",..: 1 2 2 3

R’s factors differ from the labeled vectors of Stata, SPSS, and SAS. However, we can convert labeled variables into factors using the as_factor() function.

anes_dta %>%
  transmute(V200002 = as_factor(V200002))
## # A tibble: 7,453 × 1
##    V200002
##    <fct>
##  1 3. Web
##  2 3. Web
##  3 3. Web
##  4 3. Web
##  5 3. Web
##  6 3. Web
##  7 3. Web
##  8 3. Web
##  9 3. Web
## 10 3. Web
## # ℹ 7,443 more rows

The as_factor() function can be applied to all columns in a data frame or to individual ones. Below, we convert all <dbl+lbl> columns into factors.
anes_dta_factor <- anes_dta %>%
  as_factor()

anes_dta_factor %>%
  select(1:6) %>%
  glimpse()
## Rows: 7,453
## Columns: 6
## $ V200001  <dbl> 200015, 200022, 200039, 200046, 200053, 200060, 20008…
## $ V200002  <fct> 3. Web, 3. Web, 3. Web, 3. Web, 3. Web, 3. Web, 3. We…
## $ V200010b <dbl> 1.0057, 1.1635, 0.7687, 0.5210, 0.9658, 0.2347, 0.440…
## $ V200010d <dbl> 9, 26, 41, 29, 23, 37, 7, 37, 32, 41, 22, 7, 38, 21, …
## $ V200010c <dbl> 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1,…
## $ V201006  <fct> 2. Somewhat interested, 3. Not much interested, 2. So…

Option 2: Strip the labels

The second option is to remove the labels altogether, converting the labeled data into a regular R data frame. To remove, or ‘zap,’ the labels from our tibble, we can use the {haven} package’s zap_label() and zap_labels() functions. This approach removes the labels but retains the data values in their original form. The columns of the ANES Stata file contain variable labels. Using purrr’s map(), we can review the labels with attr(). In the example below, we list the first two variables and their labels. For instance, the label for V200002 is “Mode of interview: pre-election interview”.

purrr::map(anes_dta, ~ attr(.x, "label")) %>%
  head(2)
## $V200001
## [1] "2020 Case ID"
##
## $V200002
## [1] "Mode of interview: pre-election interview"

Use zap_label() to remove the variable labels but retain the value labels. Notice that the variable labels now return as NULL.

zap_label(anes_dta) %>%
  purrr::map(~ attr(.x, "label")) %>%
  head(2)
## $V200001
## NULL
##
## $V200002
##     1. Video 2. Telephone       3. Web
##            1            2            3

To remove the value labels, use zap_labels(). Notice that the previous <dbl+lbl> columns are now <dbl>.

zap_labels(anes_dta) %>%
  select(1:6) %>%
  glimpse()
## Rows: 7,453
## Columns: 6
## $ V200001  <dbl> 200015, 200022, 200039, 200046, 200053, 200060, 20008…
## $ V200002  <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ V200010b <dbl> 1.0057, 1.1635, 0.7687, 0.5210, 0.9658, 0.2347, 0.440…
## $ V200010d <dbl> 9, 26, 41, 29, 23, 37, 7, 37, 32, 41, 22, 7, 38, 21, …
## $ V200010c <dbl> 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1,…
## $ V201006  <dbl> 2, 3, 2, 3, 2, 1, 2, 3, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1,…

While it is important to convert labeled datasets into regular R data frames for working in R, the labels themselves often contain valuable information that provides context and meaning to the survey variables. To aid with interpretability and documentation, consider creating a data dictionary from the labeled dataset. A data dictionary is a reference document that provides detailed information about the variables and values of a survey. The {labelled} package offers a convenient function, generate_dictionary(), that creates data dictionaries directly from a labeled dataset. This function extracts variable labels, value labels, and other metadata and organizes them into a structured document that we can browse and reference throughout our analysis. Let’s create a data dictionary from the ANES Stata dataset as an example:

library(labelled)

dictionary <- generate_dictionary(anes_dta)

Once we’ve generated the data dictionary, we can take a look at the V200002 variable and see the label, column type, number of missing entries, and associated values.

dictionary %>%
  filter(variable == "V200002")
## pos variable label                          col_type missing
## 2   V200002  Mode of interview: pre-electi~ dbl+lbl  0
##
## values
## [1] 1. Video
## [2] 2. Telephone
## [3] 3. Web
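Because the dictionary is returned as a tibble, it can be saved alongside the analysis files for reference. A minimal sketch, assuming the dictionary object created above, the scalar column names shown in its printed output (variable, label, col_type, missing), and {readr} for writing; the output file name is hypothetical.

library(dplyr)
library(readr)

# Keep only the scalar columns shown above and save them for reference;
# the file name is illustrative
dictionary %>%
  select(variable, label, col_type, missing) %>%
  write_csv("anes_dictionary.csv")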
C.3.3 Labeled missing data values

In survey data analysis, dealing with missing values is a crucial aspect of data preparation. Stata, SPSS, and SAS files each have their own methods for handling missing values.

Stata has “extended” missing values, .A through .Z.
SAS has “special” missing values, .A through .Z and ._.
SPSS has per-column “user” missing values. Each column can declare up to three distinct values or a range of values (plus one distinct value) that should be treated as missing.

SAS and Stata use a concept known as ‘tagged’ missing values, which extends R’s regular NA. A ‘tagged’ missing value is essentially an NA with an additional single-character label. These values behave identically to regular NA in standard R operations while preserving the informative tag associated with the missing value. Here is an example from NORC at the University of Chicago’s 2018 General Social Survey.

head(gss_dta$HEALTH)
#> <labelled<double>[6]>: condition of health
#> [1]     2     1 NA(i) NA(i)     1     2
#>
#> Labels:
#>  value     label
#>      1 excellent
#>      2      good
#>      3      fair
#>      4      poor
#>  NA(d)        DK
#>  NA(i)       IAP
#>  NA(n)        NA

In contrast, SPSS uses a different approach, called ‘user-defined values,’ to denote missing values. Each column in an SPSS dataset can have up to three distinct values designated as missing or a specified range of missing values. To model these additional user-defined missing values, {haven} provides the labelled_spss() subclass of labelled(). When we import SPSS data using {haven}, it ensures that user-defined missing values are correctly handled. We can work with this data in R while preserving the unique missing value conventions from SPSS. Here is what the GSS SPSS data look like when loaded with {haven}.

head(gss_sps$HEALTH)
#> <labelled_spss<double>[6]>: Condition of health
#> [1] 2 1 0 0 1 2
#> Missing values: 0, 8, 9
#>
#> Labels:
#>  value     label
#>      0       IAP
#>      1 EXCELLENT
#>      2      GOOD
#>      3      FAIR
#>      4      POOR
#>      8        DK
#>      9        NA
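Tagged missing values can also be created directly, which is a handy way to see how they behave. Below is a minimal sketch using {haven}’s tagged_na() and na_tag() functions on a made-up vector; the expected results are shown as comments under the assumption of standard {haven} behavior.

library(haven)

# A made-up vector with one regular NA and two tagged NAs
x <- c(1, 2, NA, tagged_na("d"), tagged_na("i"))

# All three missing values count as NA in ordinary R operations
is.na(x)
#> [1] FALSE FALSE  TRUE  TRUE  TRUE

# But the single-character tags are preserved and can be recovered
na_tag(x)
#> [1] NA  NA  NA  "d" "i"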
C.4 Importing data from APIs into R

In addition to working with data saved as files, we may also need to retrieve data through Application Programming Interfaces (APIs). APIs provide a structured way to access data hosted on external servers and import it directly into R for analysis. To access this data, we need to understand how to construct API requests. Each API has unique endpoints, parameters, and authentication requirements. Pay attention to:

Endpoints: the URLs that point to specific data or services
Parameters: information passed to the API to customize the request (e.g., date ranges, filters)
Authentication: APIs may require API keys or tokens for access
Rate limits: APIs may have usage limits, so be aware of any rate limits or quotas

Typically, we begin by making a GET request to an API endpoint. The {httr2} package allows us to generate and process HTTP requests. We can make the GET request by pointing to the URL that contains the data we would like:

library(httr2)

api_url <- "https://api.example.com/survey-data"

response <- request(api_url) %>%
  req_perform()

Once we make the request, we obtain the data as the response. The data often come in JSON format. We can extract and parse the data using the {jsonlite} package, allowing us to work with it in R. The fromJSON() function, shown below, converts JSON data to an R object.

library(jsonlite)

survey_data <- fromJSON(resp_body_string(response))

Note that these are dummy examples. Please review the documentation to understand how to make requests from your specific API.

R offers several packages that simplify API access by providing ready-to-use functions for popular APIs. These packages are called “wrappers,” as they “wrap” the API to make it easier to use. For example, the {tidycensus} package used in this book simplifies access to U.S. Census data, allowing us to retrieve data with R commands instead of writing complex API requests. If we are interested in the population (B01003_001) of each census tract in North Carolina from the 2020 ACS, we would use the get_acs() function as in the code below. Behind the scenes, get_acs() makes a GET request to the Census API, and the {tidycensus} functions convert the response into an R-friendly format.

library(tidycensus)

census_data <- get_acs(
  geography = "tract",
  variables = "B01003_001",
  year = 2020,
  state = "NC"
)

To discover whether there is an R package that directly interfaces with a specific survey or data source, search for “[survey] R wrapper” or “[data source] R package” online.

C.5 Accessing databases in R

Databases provide a secure and organized solution as the volume and complexity of data grow. We can access, manage, and update data stored in databases in a systematic way. Because of how the data are organized, teams can draw from the same source and obtain any metadata that would be helpful for analysis. There are various ways of working with databases in RStudio. We can connect to different databases through the Connections Pane in the top right of the IDE. We can also use packages like {DBI} and {odbc} to access database tables in R files. Here is an example script connecting to a database:

con <- DBI::dbConnect(
  odbc::odbc(),
  Driver    = "[your driver's name]",
  Server    = "[your server's path]",
  UID       = rstudioapi::askForPassword("Database user"),
  PWD       = rstudioapi::askForPassword("Database password"),
  Database  = "[your database's name]",
  Warehouse = "[your warehouse's name]",
  Schema    = "[your schema's name]"
)

The {dbplyr} and {dplyr} packages allow us to make queries and run data analysis entirely using {dplyr} syntax. All of the code can be written in R, so we do not have to switch between R and SQL to explore the data. Here is some sample code:

q1 <- tbl(con, "bank") %>%
  group_by(month_idx, year, month) %>%
  summarise(
    subscribe = sum(ifelse(term_deposit == "yes", 1, 0)),
    total = n()
  )

show_query(q1)

Be sure to check the documentation to configure a database connection.

C.6 Importing data from other formats

R also offers dedicated packages, such as {googlesheets4} for Google Sheets and {qualtRics} for Qualtrics. With less common or proprietary file formats, the broader data science community can often provide guidance. Online resources like Stack Overflow and dedicated forums like Posit Community are valuable sources of information for importing data into R.
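As one example of such a dedicated package, here is a minimal sketch of reading a Google Sheet with {googlesheets4}; the sheet URL is a placeholder, and gs4_deauth() assumes the sheet is shared publicly so no authentication is needed.

library(googlesheets4)

# For a publicly shared sheet, skip the authentication step
gs4_deauth()

# Placeholder URL; replace with a real sheet address
survey_sheet <- read_sheet("https://docs.google.com/spreadsheets/d/PLACEHOLDER")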
References

American National Election Studies. 2021. “ANES 2020 Time Series Study: Pre-Election and Post-Election Survey Questionnaires.” https://electionstudies.org/wp-content/uploads/2021/07/anes_timeseries_2020_questionnaire_20210719.pdf.
———. 2022. “ANES 2020 Time Series Study Full Release: User Guide and Codebook.” https://electionstudies.org/wp-content/uploads/2022/02/anes_timeseries_2020_userguidecodebook_20220210.pdf.
Biemer, Paul P. 2010. “Total Survey Error: Design, Implementation, and Evaluation.” Public Opinion Quarterly 74 (5): 817–48. https://doi.org/10.1093/poq/nfq058.
Biemer, Paul P., and Lars E. Lyberg. 2003. Introduction to Survey Quality. John Wiley & Sons.
Biemer, Paul P., Joe Murphy, Stephanie Zimmer, Chip Berry, Grace Deng, and Katie Lewis. 2017. “Using Bonus Monetary Incentives to Encourage Web Response in Mixed-Mode Household Surveys.” Journal of Survey Statistics and Methodology 6 (2): 240–61. https://doi.org/10.1093/jssam/smx015.
Bollen, Kenneth A., Paul P. Biemer, Alan F. Karr, Stephen Tueller, and Marcus E. Berzofsky. 2016. “Are Survey Weights Needed? A Review of Diagnostic Tests in Regression Analysis.” Annual Review of Statistics and Its Application 3 (1): 375–92. https://doi.org/10.1146/annurev-statistics-011516-012958.
Bradburn, Norman M., Seymour Sudman, and Brian Wansink. 2004. Asking Questions: The Definitive Guide to Questionnaire Design. 2nd Edition. Jossey-Bass.
Bryan, Jenny, and Jim Hester. 2023. Happy Git and GitHub for the useR. https://happygitwithr.com/.
Bureau of Justice Statistics. 2017. “National Crime Victimization Survey, 2016: Technical Documentation.” https://bjs.ojp.gov/sites/g/files/xyckuh236/files/media/document/ncvstd16.pdf.
Centers for Disease Control and Prevention (CDC). 2021. “Behavioral Risk Factor Surveillance System Survey Questionnaire.” U.S. Department of Health and Human Services, Centers for Disease Control and Prevention. https://www.cdc.gov/brfss/questionnaires/pdf-ques/2021-BRFSS-Questionnaire-1-19-2022-508.pdf.
Cochran, William G. 1977. Sampling Techniques. John Wiley & Sons.
Cox, Brenda G., David A. Binder, B. Nanjamma Chinnappa, Anders Christianson, Michael J. Colledge, and Phillip S. Kott. 2011. Business Survey Methods. John Wiley & Sons.
DeBell, Matthew. 2010. “How to Analyze ANES Survey Data.” ANES Technical Report Series nes012492. Palo Alto, CA: Stanford University; Ann Arbor, MI: University of Michigan. https://electionstudies.org/wp-content/uploads/2018/05/HowToAnalyzeANESData.pdf.
DeBell, Matthew, Michelle Amsbary, Ted Brader, Shelley Brock, Cindy Good, Justin Kamens, Natalya Maisel, and Sarah Pinto. 2022. “Methodology Report for the ANES 2020 Time Series Study.” https://electionstudies.org/wp-content/uploads/2022/08/anes_timeseries_2020_methodology_report.pdf.
DeLeeuw, Edith D. 2005. “To Mix or Not to Mix Data Collection Modes in Surveys.” Journal of Official Statistics 21: 233–55.
———. 2018. “Mixed-Mode: Past, Present, and Future.” Survey Research Methods 12 (2): 75–89. https://doi.org/10.18148/srm/2018.v12i2.7402.
Deming, W. Edwards. 1991. Sample Design in Business Research. Vol. 23. John Wiley & Sons.
Dillman, Don A., Jolene D. Smyth, and Leah Melani Christian. 2014. Internet, Phone, Mail, and Mixed-Mode Surveys: The Tailored Design Method. John Wiley & Sons.
Fowler, Floyd J., and Thomas W. Mangione. 1989. Standardized Survey Interviewing. SAGE.
Fuller, Wayne A. 2011. Sampling Statistics. John Wiley & Sons.
Gelman, Andrew. 2007. “Struggles with Survey Weighting and Regression Modeling.” Statistical Science 22 (2): 153–64. https://doi.org/10.1214/088342306000000691.
Groves, Robert M., Floyd J. Fowler Jr., Mick P. Couper, James M. Lepkowski, Eleanor Singer, and Roger Tourangeau. 2009. Survey Methodology. John Wiley & Sons.
Harter, Rachel, Michael P. Battaglia, Trent D. Buskirk, Don A. Dillman, Ned English, Mansour Fahimi, Martin R. Frankel, et al. 2016. “Address-Based Sampling.” Task force report. American Association for Public Opinion Research. https://aapor.org/wp-content/uploads/2022/11/AAPOR_Report_1_7_16_CLEAN-COPY-FINAL-2.pdf.
Kim, Jae Kwang, and Jun Shao. 2021. Statistical Methods for Handling Incomplete Data. Chapman & Hall/CRC Press.
LAPOP. 2021a. “AmericasBarometer 2021 - Canada: Technical Information.” Vanderbilt University. http://datasets.americasbarometer.org/database/files/ABCAN2021-Technical-Report-v1.0-FINAL-eng-110921.pdf.
———. 2021b. “AmericasBarometer 2021 - U.S.: Technical Information.” Vanderbilt University. http://datasets.americasbarometer.org/database/files/ABUSA2021-Technical-Report-v1.0-FINAL-eng-110921.pdf.
———. 2021c. “AmericasBarometer 2021: Technical Information.” Vanderbilt University. https://www.vanderbilt.edu/lapop/ab2021/AB2021-Technical-Report-v1.0-FINAL-eng-030722.pdf.
———. 2021d. “Core Questionnaire.” https://www.vanderbilt.edu/lapop/ab2021/AB2021-Core-Questionnaire-v17.5-Eng-210514-W-v2.pdf.
———. 2023a. “About the AmericasBarometer.” https://www.vanderbilt.edu/lapop/about-americasbarometer.php.
———. 2023b. “The AmericasBarometer by the LAPOP Lab.” www.vanderbilt.edu/lapop.
Levy, Paul S., and Stanley Lemeshow. 2013. Sampling of Populations: Methods and Applications. John Wiley & Sons.
Lumley, Thomas. 2010. Complex Surveys: A Guide to Analysis Using R. John Wiley & Sons.
———. 2023. Survey: Analysis of Complex Survey Samples. http://r-survey.r-forge.r-project.org/survey/.
Mack, Christina, Zhaohui Su, and Daniel Westreich. 2018. “Types of Missing Data.” In Managing Missing Data in Patient Registries: Addendum to Registries for Evaluating Patient Outcomes: A User’s Guide, Third Edition [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US). https://www.ncbi.nlm.nih.gov/books/NBK493614/.
Penn State. 2019. “STAT 506: Sampling Theory and Methods [Online Course].” https://online.stat.psu.edu/stat506/.
Särndal, Carl-Erik, Bengt Swensson, and Jan Wretman. 2003. Model Assisted Survey Sampling. Springer Science & Business Media.
Schafer, Joseph L., and John W. Graham. 2002. “Missing Data: Our View of the State of the Art.” Psychological Methods 7: 147–77. https://doi.org/10.1037//1082-989X.7.2.147.
Schouten, Barry, Andy Peytchev, and James Wagner. 2018. Adaptive Survey Design. Chapman & Hall/CRC Press.
Scott, Alastair. 2007. Rao-Scott Corrections and Their Impact. Section on Survey Research Methods, ASA. http://www.asasrms.org/Proceedings/y2007/Files/JSM2007-000874.pdf.
Shah, Babubhai V., and Akhil K. Vaish. 2006. “Confidence Intervals for Quantile Estimation from Complex Survey Data.” In Proceedings of the Section on Survey Research Methods.
Shook-Sa, Bonnie, G. Lance Couzens, and Marcus Berzofsky. 2015. “Users’ Guide to the National Crime Victimization Survey (NCVS) Direct Variance Estimation.” Bureau of Justice Statistics. https://bjs.ojp.gov/sites/g/files/xyckuh236/files/media/document/ncvs_variance_user_guide_11.06.14.pdf.
Skinner, Chris. 2009. “Chapter 15: Statistical Disclosure Control for Survey Data.” In Handbook of Statistics: Sample Surveys: Design, Methods and Applications, edited by C. R. Rao, 381–96. Elsevier B.V.
Tierney, Nicholas, and Dianne Cook. 2023. “Expanding Tidy Data Principles to Facilitate Missing Data Exploration, Visualization and Assessment of Imputations.” Journal of Statistical Software 105 (7): 1–31. https://doi.org/10.18637/jss.v105.i07.
Tourangeau, Roger, Mick P. Couper, and Frederick Conrad. 2004. “Spacing, Position, and Order: Interpretive Heuristics for Visual Features of Survey Questions.” Public Opinion Quarterly 68: 368–93.
Tourangeau, Roger, Lance J. Rips, and Kenneth Rasinski. 2000. Psychology of Survey Response. Cambridge University Press.
United States. Bureau of Justice Statistics. 2022. “National Crime Victimization Survey, [United States], 2021.” Inter-university Consortium for Political and Social Research [distributor]. https://doi.org/10.3886/ICPSR38429.v1.
U.S. Census Bureau. 2021. “Understanding and Using the American Community Survey Public Use Microdata Sample Files: What Data Users Need to Know.” U.S. Government Printing Office. https://www.census.gov/content/dam/Census/library/publications/2021/acs/acs_pums_handbook_2021.pdf.
U.S. Energy Information Administration. 2017. “Residential Energy Consumption Survey (RECS): Using the 2015 microdata file to compute estimates and standard errors (RSEs).” https://www.eia.gov/consumption/residential/data/2015/pdf/microdata_v3.pdf.
———. 2023a. “2020 Residential Energy Consumption Survey: Household Characteristics Technical Documentation Summary.” https://www.eia.gov/consumption/residential/data/2020/pdf/2020%20RECS_Methodology%20Report.pdf.
———. 2023b. “2020 Residential Energy Consumption Survey: Using the microdata file to compute estimates and relative standard errors (RSEs).” https://www.eia.gov/consumption/residential/data/2020/pdf/microdata-guide.pdf.
Valliant, Richard, and Jill A. Dever. 2018. Survey Weights: A Step-by-Step Guide to Calculation. Stata Press.
Valliant, Richard, Jill A. Dever, and Frauke Kreuter. 2013. Practical Tools for Designing and Weighting Survey Samples. Vol. 1. Springer.
Wickham, Hadley. 2019. Advanced R. CRC Press. https://adv-r.hadley.nz/.
———. 2023. ggplot2: Elegant Graphics for Data Analysis. 3rd Edition. Springer. https://ggplot2-book.org/.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd Edition. O’Reilly Media. https://r4ds.hadley.nz/.
Wolter, Kirk M. 2007. Introduction to Variance Estimation. Vol. 53. Springer.

Exploring Complex Survey Data Analysis in R: A Tidy Introduction with srvyr
Stephanie Zimmer, Rebecca J. Powell, and Isabella Velásquez
2024-03-11

Preface

Chapter 1 Introduction

Surveys are used to gather information about a population. They are frequently used by researchers, governments, and businesses to better understand public opinion and behavior. For example, a non-profit group might be interested in public opinion on a given topic, government agencies may be interested in behaviors to inform policy, or companies may survey potential consumers about what they want from their products. Developing and fielding a survey is a method to gather information about topics that interest us. This book focuses on how to analyze the data collected from a survey. We assume that you have conducted a survey or obtained a microdata file. Microdata, also known as respondent-level or row-level data, contains individual survey responses, analysis weights, and design variables (as opposed to summarized data in tables). For the purposes of this book, you need the weights and design variables for your survey data. These are required to accurately calculate unbiased estimates1.
Understanding the concepts and techniques discussed in this book will help you extract meaningful insights from your survey data. To account for the weights and study design, researchers rely on statistical software such as SAS, Stata, SUDAAN, and R. In this book, we use R to provide an overview of survey analysis. Our goal is to provide a comprehensive guide for individuals who are new to survey analysis but have some statistics and R programming background. We use a combination of both the {survey} and {srvyr} packages and present the code following best practices from the tidyverse. In 2003, the {survey} package was released on CRAN and has been continuously developed over time2. This package, primarily developed by Thomas Lumley, is extensive and includes the following features:

Calculation of point estimates and their associated variances, including means, totals, ratios, quantiles, and proportions
Estimation of regression models, including generalized linear models, log-linear models, and survival curves
Variances by Taylor linearization or by replicate weights (balanced repeated replication, jackknife, bootstrap, multistage bootstrap, or user-supplied)
Hypothesis testing for means, proportions, and more

The {srvyr} package in R builds on the {survey} package. It provides wrappers for functions that align with the tidyverse philosophy, which is our motivation for using and recommending this package. We find that the {srvyr} package is user-friendly for those familiar with tidyverse packages in R. For example, while many functions in the {survey} package use variables as formulas, the {srvyr} package uses tidy selection to pass variable names3 (a common feature in the tidyverse). Users of the tidyverse are most likely familiar with the magrittr pipe (%>%), which works seamlessly with functions from the {srvyr} package. Moreover, several common functions from {dplyr}, such as filter(), mutate(), and summarize(), can be applied to survey objects. Users can streamline their analysis workflow and capitalize on the benefits of both the {srvyr} and tidyverse packages. There is one limitation to the {srvyr} package: it does not fully incorporate the modeling capabilities of the {survey} package into its tidy versions. This book uses the {survey} package when discussing modeling and hypothesis testing, and we guide you on how to apply the pipe to these functions to ensure clarity in your analyses.

1.1 What to expect

This book covers many aspects of survey design and analysis, from understanding how to create design effects to conducting descriptive analysis, statistical tests, and models. Additionally, we emphasize best practices in coding and presenting results. Throughout this book, we use real-world data and present practical examples to help you gain proficiency in survey analysis. While we provide a brief overview of survey methodology and statistical theory, this book is not intended to be the sole resource for these topics. We reference other materials throughout the book and encourage readers to seek those out for more information. Below is a summary of each chapter:

Chapter 2: An overview of surveys and the process of designing surveys. This is only an overview, and we include many references for more in-depth knowledge.
Chapter 3: Understanding survey documentation. How to read the various components of survey documentation, working with missing data, and finding the documentation.
Chapter 4: TO-DO
Chapter 5: Descriptive analyses. Calculating point estimates along with their standard errors, confidence intervals, and design effects.
Chapter 6: Statistical testing. Testing for differences between groups, including comparisons of means and proportions as well as goodness-of-fit tests, tests of independence, and tests of homogeneity.
Chapter 7: Modeling. Linear regression, ANOVA, and logistic regression.
Chapter 8: Communicating results. Describing results, reproducibility, making publishable tables and graphs, and helpful functions.
Chapter 9: TO-DO
Chapter 10: Specifying sampling designs. Descriptions of common sampling designs, when they are used, the math behind the mean and standard error estimates, how to specify the designs in R, and examples using real data.
Chapter 11: TO-DO
Chapter 12: TO-DO
Chapter 13: National Crime Victimization Survey Vignette. A vignette on how to analyze data from the NCVS, a survey in the U.S. that collects information on crimes and their characteristics. This illustrates an analysis that requires multiple files to calculate victimization rates.
Chapter 14: AmericasBarometer Vignette. A vignette on how to analyze data from the AmericasBarometer, a survey of attitudes, evaluations, experiences, and behavior in countries in the Western Hemisphere. This includes how to make choropleth maps with survey estimates.

1.2 Datasets used in this book

We work with two key datasets throughout the book: the Residential Energy Consumption Survey (RECS – U.S. Energy Information Administration 2023a) and the American National Election Studies (ANES – DeBell 2010). To ensure that all readers can follow the examples, we have provided analytic datasets in an R package, {srvyrexploR}. Install the package from GitHub using the {remotes} package:

remotes::install_github("tidy-survey-r/srvyrexploR")

To explore the provided datasets in the package, access the documentation using the help() command:

help(package = "srvyrexploR")

To load the RECS and ANES datasets, start by running library(srvyrexploR) to load the package. Then, use the data() command to load the datasets into the environment.

library(tidyverse)
library(survey)
library(srvyr)
library(srvyrexploR)

data(recs_2020)
data(anes_2020)

RECS is a study that provides energy consumption and expenditure data for American households. The Energy Information Administration funds RECS, which has been fielded 15 times between 1950 and 2020. The survey has two components: the household survey and the energy supplier survey. In 2020, the household survey was collected by web and paper questionnaires and included questions about appliances, electronics, heating, air conditioning (A/C), temperatures, water heating, lighting, respondent demographics, and energy assistance. The energy supplier survey consists of components relating to energy consumption and energy expenditure.
Below is an overview of the recs_2020 data:

recs_2020 %>%
  select(-starts_with("NWEIGHT"))
## # A tibble: 18,496 × 57
##     DOEID ClimateRegion_BA Urbanicity Region REGIONC Division STATE_FIPS
##     <dbl> <fct>            <fct>      <fct>  <chr>   <fct>    <chr>
##  1 100001 Mixed-Dry        Urban Area West   WEST    Mountai… 35
##  2 100002 Mixed-Humid      Urban Area South  SOUTH   West So… 05
##  3 100003 Mixed-Dry        Urban Area West   WEST    Mountai… 35
##  4 100004 Mixed-Humid      Urban Area South  SOUTH   South A… 45
##  5 100005 Mixed-Humid      Urban Area North… NORTHE… Middle … 34
##  6 100006 Hot-Humid        Urban Area South  SOUTH   West So… 48
##  7 100007 Mixed-Humid      Urban Area South  SOUTH   West So… 40
##  8 100008 Mixed-Humid      Urban Clu… South  SOUTH   East So… 28
##  9 100009 Mixed-Humid      Urban Area South  SOUTH   South A… 11
## 10 100010 Hot-Dry          Urban Area West   WEST    Mountai… 04
## # ℹ 18,486 more rows
## # ℹ 50 more variables: state_postal <fct>, state_name <fct>,
## #   HDD65 <dbl>, CDD65 <dbl>, HDD30YR <dbl>, CDD30YR <dbl>,
## #   HousingUnitType <fct>, YearMade <ord>, TOTSQFT_EN <dbl>,
## #   TOTHSQFT <dbl>, TOTCSQFT <dbl>, ZTOTSQFT_EN <fct>, ZYearMade <fct>,
## #   ZHousingUnitType <fct>, SpaceHeatingUsed <lgl>,
## #   ZSpaceHeatingUsed <fct>, ACUsed <lgl>, ZACUsed <fct>, …

recs_2020 %>%
  select(starts_with("NWEIGHT"))
## # A tibble: 18,496 × 61
##    NWEIGHT NWEIGHT1 NWEIGHT2 NWEIGHT3 NWEIGHT4 NWEIGHT5 NWEIGHT6
##      <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
##  1   3284.    3273.    3349.    3345.    3437.    3416.    3355.
##  2   9007.    9020.    9081.    9020.    9213.    9117.    9179.
##  3   5669.    5793.    5914.    5763.    5870.    5721.    5663.
##  4   5294.    5361.    5362.    5371.    5393.    5328.    5354.
##  5   9935.   10048.   10262.   10037.    9961.   10108.   10298.
##  6   7250.    7339.    7435.    7336.    7426.    7309.    7472.
##  7   5684.    5733.    5831.    5664.    5874.    5906.    5734.
##  8   9700.    9725.    9774.    9658.   10221.    9792.    9906.
##  9   1236.    1268.    1263.    1258.    1250.    1267.    1252.
## 10   7084.    7131.    7287.    7184.    7176.    7198.    7174.
## # ℹ 18,486 more rows
## # ℹ 54 more variables: NWEIGHT7 <dbl>, NWEIGHT8 <dbl>, NWEIGHT9 <dbl>,
## #   NWEIGHT10 <dbl>, NWEIGHT11 <dbl>, NWEIGHT12 <dbl>, NWEIGHT13 <dbl>,
## #   NWEIGHT14 <dbl>, NWEIGHT15 <dbl>, NWEIGHT16 <dbl>, NWEIGHT17 <dbl>,
## #   NWEIGHT18 <dbl>, NWEIGHT19 <dbl>, NWEIGHT20 <dbl>, NWEIGHT21 <dbl>,
## #   NWEIGHT22 <dbl>, NWEIGHT23 <dbl>, NWEIGHT24 <dbl>, NWEIGHT25 <dbl>,
## #   NWEIGHT26 <dbl>, NWEIGHT27 <dbl>, NWEIGHT28 <dbl>, …

From this output, we can see that there are 18,496 rows and 118 variables. There are variables containing an ID (DOEID), geographic information (e.g., Region, state_postal, Urbanicity), along with information about the house, including the type of house (HousingUnitType) and when the house was built (YearMade). Additionally, there is a long list of weighting variables that we will use in the analysis (e.g., NWEIGHT, NWEIGHT1, …, NWEIGHT60). We will discuss using these weighting variables in Chapter 10. For a more detailed codebook, see Appendix B.

The ANES is a series of studies that have collected data from election surveys since 1948. These surveys contain data on public opinion and voting behavior in U.S. presidential elections. The 2020 survey (the data we will be using) was fielded to individuals over the web, through live video interviewing, or with computer-assisted telephone interviewing (CATI). The survey includes questions on party affiliation, voting choice, and level of trust in the government. Here is an overview of the anes_2020 data. First, we show the variables starting with “V” followed by a number; these are the original variables.
Then, we show the remaining variables that we created based on the original data:

anes_2020 %>%
  select(matches("^V\\d"))
## # A tibble: 7,453 × 42
##    V200001 V200002   V200010b V200010c V200010d V201006 V201024 V201025x
##      <dbl> <hvn_lbl>    <dbl>    <dbl>    <dbl> <hvn_l> <hvn_l> <hvn_lb>
##  1  200015 3            1.01         2        9 2       -1      3
##  2  200022 3            1.16         2       26 3       -1      3
##  3  200039 3            0.769        1       41 2       -1      3
##  4  200046 3            0.521        2       29 3       -1      3
##  5  200053 3            0.966        1       23 2       -1      3
##  6  200060 3            0.235        2       37 1       -1      3
##  7  200084 3            0.441        1        7 2       -1      3
##  8  200091 3            0.769        2       37 3       -1      2
##  9  200107 3            1.42         2       32 2       2       4
## 10  200114 3            1.84         2       41 2       -1      3
## # ℹ 7,443 more rows
## # ℹ 34 more variables: V201029 <hvn_lbll>, V201101 <hvn_lbll>,
## #   V201102 <hvn_lbll>, V201103 <hvn_lbll>, V201228 <hvn_lbll>,
## #   V201229 <hvn_lbll>, V201230 <hvn_lbll>, V201231x <hvn_lbll>,
## #   V201233 <hvn_lbll>, V201237 <hvn_lbll>, V201507x <hvn_lbll>,
## #   V201510 <hvn_lbll>, V201546 <hvn_lbll>, V201547a <hvn_lbll>,
## #   V201547b <hvn_lbll>, V201547c <hvn_lbll>, V201547d <hvn_lbll>, …

anes_2020 %>%
  select(-matches("^V\\d"))
## # A tibble: 7,453 × 21
##    CaseID InterviewMode Weight VarUnit Stratum CampaignInterest
##     <dbl> <fct>          <dbl> <fct>   <fct>   <fct>
##  1 200015 Web            1.01  2       9       Somewhat interested
##  2 200022 Web            1.16  2       26      Not much interested
##  3 200039 Web            0.769 1       41      Somewhat interested
##  4 200046 Web            0.521 2       29      Not much interested
##  5 200053 Web            0.966 1       23      Somewhat interested
##  6 200060 Web            0.235 2       37      Very much interested
##  7 200084 Web            0.441 1       7       Somewhat interested
##  8 200091 Web            0.769 2       37      Not much interested
##  9 200107 Web            1.42  2       32      Somewhat interested
## 10 200114 Web            1.84  2       41      Somewhat interested
## # ℹ 7,443 more rows
## # ℹ 15 more variables: VotedPres2016 <fct>,
## #   VotedPres2016_selection <fct>, PartyID <fct>,
## #   TrustGovernment <fct>, TrustPeople <fct>, Age <dbl>,
## #   AgeGroup <fct>, Education <fct>, RaceEth <fct>, Gender <fct>,
## #   Income <fct>, Income7 <fct>, VotedPres2020 <fct>,
## #   VotedPres2020_selection <fct>, EarlyVote2020 <fct>

From this output, we can see that there are 7,453 rows and 63 variables. Most of the variables start with V20, so referencing the survey documentation will be crucial to avoid getting lost (see Chapter 3). We have created some more descriptive variables for you to use throughout this book, such as the age (Age) and gender (Gender) of the respondent, along with variables that represent their party affiliation (PartyID). Additionally, we need the variables Weight and Stratum to analyze this data accurately. We will discuss how to use these weighting variables in Chapters 3 and 10. For a more detailed codebook, see Appendix A.

In most chapters, you’ll find code that you can follow. Each of these chapters starts with a “setup” section. The setup section includes the code needed to load the necessary packages and datasets in the chapter. We then provide the main idea of the chapter and examples of how to use the functions. Most chapters end with exercises to work through. Solutions to the exercises can be found in the Appendix.

References

DeBell, Matthew. 2010. “How to Analyze ANES Survey Data.” ANES Technical Report Series nes012492. Palo Alto, CA: Stanford University; Ann Arbor, MI: University of Michigan. https://electionstudies.org/wp-content/uploads/2018/05/HowToAnalyzeANESData.pdf.
U.S. Energy Information Administration. 2023a. “2020 Residential Energy Consumption Survey: Household Characteristics Technical Documentation Summary.” https://www.eia.gov/consumption/residential/data/2020/pdf/2020%20RECS_Methodology%20Report.pdf.
Valliant, Richard, and Jill A. Dever. 2018. Survey Weights: A Step-by-Step Guide to Calculation. Stata Press.

1. If you do not already have weights created for the survey data you are using, we recommend reviewing other resources focused on weight creation, such as Valliant and Dever (2018).
2. https://cran.r-project.org/src/contrib/Archive/survey/
3. https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html

Chapter 2 Overview of Surveys

2.1 Introduction

Developing surveys to gather accurate information about populations involves a more intricate and time-intensive process than surveys that use non-random criteria for selecting samples. Researchers can spend months, or even years, developing the study design, questions, and other methods for a single survey to ensure high-quality data are collected. While this book focuses on the analysis methods of complex surveys, understanding the entire survey life cycle can provide better insight into what types of analyses should be conducted on the data. The survey life cycle consists of the necessary stages to execute a survey project successfully. Each stage influences the survey’s timing, costs, and feasibility, consequently impacting the data collected and how we should analyze it. The survey life cycle starts with a research topic or question of interest (e.g., what impact does childhood trauma have on health outcomes later in life?). Researchers typically review existing data sources to determine if data are already available that can answer this question, as drawing from available resources can result in a reduced burden on respondents, cheaper research costs, and faster research outcomes. However, if existing data cannot answer the nuances of the research question, a survey can be used to capture the exact data the researcher needs through a questionnaire, or a set of questions. To gain a deeper understanding of survey design and implementation, we recommend reviewing several pieces of existing literature in detail (e.g., Dillman, Smyth, and Christian 2014; Groves et al. 2009; Tourangeau, Rips, and Rasinski 2000; Bradburn, Sudman, and Wansink 2004; Valliant, Dever, and Kreuter 2013; Biemer and Lyberg 2003).

2.2 Searching for public-use survey data

Throughout this book, we use public-use datasets from different surveys, including the American National Election Survey (ANES), the Residential Energy Consumption Survey (RECS), the National Crime Victimization Survey (NCVS), and the AmericasBarometer surveys. As mentioned above, researchers should look for existing data that can provide insights into their research questions before embarking on a new survey. One of the greatest sources of data is the government. For example, in the U.S., we can get data directly from the various statistical agencies, as with RECS and NCVS. Other countries often have data available through official statistics offices, such as the Office for National Statistics in the United Kingdom. In addition to government data, many researchers make their data publicly available through repositories such as the Inter-university Consortium for Political and Social Research (ICPSR) variable search or the Odum Institute Data Archive.
Searching these repositories or other compiled lists (e.g., Analyze Survey Data for Free - asdfree.com) can be an efficient way to identify surveys with questions related to the researcher's topic of interest.
2.3 Pre-Survey Planning
There are multiple things to consider when starting a survey. Errors are the differences between the true values of the variables being studied and the values obtained through the survey. Each step and decision made before the launch of the survey impacts the types of errors that are introduced into the data, which in turn impact how to interpret the results. Generally, survey researchers consider there to be seven main sources of error that fall under either Representation or Measurement (Groves et al. 2009):
Representation
Coverage Error: A mismatch between the population of interest (also known as the target population or study population) and the sampling frame.
Sampling Error: Error produced when selecting a sample, the subset of the population, from the sampling frame, the list from which the sample is drawn (there is no sampling error if conducting a census). This error is due to randomization, and we discuss how to quantify this error in Chapter 10.
Nonresponse Error: Differences between those who responded and did not respond to the survey (unit nonresponse) or a given question (item nonresponse).
Adjustment Error: Error introduced during post-survey statistical adjustments.
Measurement
Validity: A mismatch between the topic of interest and the question(s) used to collect that information.
Measurement Error: A mismatch between what the researcher asked and how the respondent answered.
Processing Error: Edits by the researcher to responses provided by the respondent (e.g., adjustments to data based on illogical responses).
Almost every survey has errors. Researchers attempt to conduct a survey that reduces the total survey error, or the accumulation of all errors that may arise throughout the survey life cycle. By assessing these different types of errors together, researchers can seek strategies to maximize the overall survey quality and improve the reliability and validity of results (Biemer 2010). However, attempts to lower individual error sources (and therefore total survey error) come at the price of time, resources, and money. For example:
Coverage Error Tradeoff: Researchers can search for or create more accurate and updated sampling frames, but these can be difficult to construct or obtain.
Sampling Error Tradeoff: Researchers can increase the sample size to reduce sampling error; however, larger samples can be expensive and time-consuming to field.
Nonresponse Error Tradeoff: Researchers can increase or diversify efforts to improve survey participation, but this may be resource-intensive while not entirely removing nonresponse bias.
Adjustment Error Tradeoff: Weighting, a statistical technique used to adjust the contribution of individual survey responses to the final survey estimates, is typically done to make the sample more representative of the target population. However, if researchers do not carefully execute the adjustments or base them on inaccurate information, they can introduce new biases, leading to less accurate estimates.
Validity Error Tradeoff: Researchers can increase validity in a variety of ways, such as through extensive research, using established scales, or collaborating with a psychometrician during survey design. However, doing so lengthens the amount of time and resources needed to complete survey design.
Measurement Error Tradeoff: Researchers can use techniques such as questionnaire testing and cognitive interviewing to ensure respondents are answering questions as expected. However, these activities also require time and resources to complete.
Processing Error Tradeoff: Researchers can impose rigorous data cleaning and validation processes. However, this requires supervision, training, and time.
The challenge for survey researchers is to find the optimal tradeoffs among these errors. They must carefully consider ways to reduce each error source and total survey error while balancing their study's objectives and resources. For survey analysts, understanding the decisions that researchers made to minimize these error sources can impact how results are interpreted. The remainder of this chapter dives into critical considerations for survey development. We explore how to consider each of these sources of error and how these error sources can inform the interpretations of the data.
2.4 Study Design
From formulating methodologies to choosing an appropriate sampling frame, the study design phase is where the blueprint for a successful survey takes shape. Study design encompasses multiple parts of the survey life cycle, including decisions on the population of interest, survey mode (the format through which a survey is administered to respondents), timeline, and questionnaire design. Knowing who to survey and how to survey them depends on the study's goals and the feasibility of implementation. This section explores the strategic planning that lays the foundation for a survey.
2.4.1 Sampling Design
The set or group we want to survey is known as the population of interest. The population of interest could be broad, such as "all adults age 18+ living in the U.S.," or a specific population based on a particular characteristic or location. For example, we may want to know about "adults aged 18-24 who live in North Carolina" or "eligible voters living in Illinois." However, a sampling frame with contact information is needed to survey individuals in these populations of interest. If researchers are looking at eligible voters, the sampling frame could be the voting registry for a given state or area. For broader target populations, like all adults in the United States, the sampling frame is likely imperfect. In these cases, researchers may choose to use a sampling frame of mailing addresses and send the survey to households, or they may choose to use random digit dialing (RDD) and call random phone numbers (that may or may not be assigned, connected, and working). These imperfect sampling frames can result in coverage error, where there is a mismatch between the target population and the list of individuals researchers can select from. For example, if a researcher is looking to obtain estimates for "all adults aged 18+ living in the U.S.," a sampling frame of mailing addresses will miss specific types of individuals, such as the homeless, transient populations, and incarcerated individuals. Additionally, many households have more than one adult living there, so researchers would need to consider how to get a specific individual to fill out the survey (called within-household selection) or adjust the target population to report on "U.S. households" instead of "individuals." Once the researchers have selected the sampling frame, the next step is determining how to select individuals for the survey. In rare cases, researchers may conduct a census and survey everyone on the sampling frame.
However, few organizations have the resources to implement a questionnaire at that scale (e.g., government censuses). Instead, researchers typically choose to sample individuals and use weights to estimate numbers in the target population. They can use a variety of different sampling methods, and more information on these can be found in Chapter 10. The decision of which sampling method to use impacts sampling error and can be accounted for in weighting.
Example: Number of Pets in a Household
Let's use a simple example where a researcher is interested in the average number of pets in a household. Our researcher needs to consider the target population for this study. Specifically, are they interested in all households in a given country or households in a more local area (e.g., city or state)? Let's assume our researcher is interested in the number of pets in a U.S. household with at least one adult (18 years old or older). In this case, a sampling frame of mailing addresses would provide the least coverage error, as the frame would closely match our target population. Specifically, our researcher would likely want to use the Computerized Delivery Sequence File (CDSF), a file of mailing addresses that the United States Postal Service (USPS) creates and that covers nearly 100% of U.S. households (Harter et al. 2016). To sample these households, for simplicity, we use a stratified simple random sample design, where we randomly sample households within each state (i.e., we stratify by state). Throughout this chapter, we build on this example research question to plan a survey.
2.4.2 Data Collection Planning
With the sampling design decided, researchers can then decide how to survey these individuals. Specifically, the modes used for contacting and surveying the sample, how frequently to send reminders and follow-ups, and the overall timeline of the study are four of the major data collection determinations. Traditionally, researchers have considered four main modes4:
Computer Assisted Personal Interview (CAPI; also known as face-to-face or in-person interviewing)
Computer Assisted Telephone Interview (CATI; also known as phone or telephone interviewing)
Computer Assisted Web Interview (CAWI; also known as web or online interviewing)
Paper and Pencil Interview (PAPI)
Researchers can use a single mode to collect data or multiple modes (also called mixed modes). Using mixed modes can allow for broader reach and increase response rates depending on the target population (DeLeeuw 2005, 2018; Biemer et al. 2017). For example, researchers could both call households to conduct a CATI survey and send mail with a PAPI survey to the household. Using both modes, researchers could gain participation through the mail from individuals who do not pick up the phone for unknown numbers or through the phone from individuals who do not open all of their mail. However, mode effects (where responses differ based on the mode of response) can be present in the data and may need to be considered during analysis. When selecting which mode, or modes, to use, understanding the unique aspects of the chosen target population and sampling frame provides insight into how they can best be reached and engaged. For example, if we plan to survey adults aged 18-24 who live in North Carolina, asking them to complete a survey using CATI (i.e., over the phone) would likely not be as successful as other modes like the web.
This age group does not talk on the phone as much as other generations and often does not answer calls from unknown numbers. Additionally, the mode for contacting respondents relies on what information is available in the sampling frame. For example, if our sampling frame includes an email address, we could email our selected sample members to encourage them to complete the survey. Alternatively, if the sampling frame is a list of mailing addresses, we could contact sample members with a letter. It is important to note that there can be a difference between the contact mode and the survey mode. For example, if we have a sampling frame with addresses, we can send a letter to our sample members and provide information on completing a web survey. Another option is using mixed-mode surveys by sending sample members a paper and pencil survey with our letter and also asking them to complete the survey online. Combining different contact modes and different survey modes can be helpful in reducing unit nonresponse error–where the entire unit (e.g., a household) does not respond to the survey at all–as different sample members may respond better to different contact and survey modes. However, when considering which modes to use, it is important to make access to the survey as easy as possible for sample members to reduce burden and unit nonresponse. Another way to reduce unit nonresponse error is by varying the language of the contact materials (Dillman, Smyth, and Christian 2014). People are motivated by different things, so constantly repeating the same message may not be helpful. Instead, mixing up the messaging and the type of contact material the sample member receives can increase response rates and reduce unit nonresponse error. For example, instead of only sending standard letters, researchers could consider sending mailings that invoke "urgent" or "important" thoughts by sending priority letters or using other delivery services like FedEx, UPS, or DHL. A study timeline may also determine the number and types of contacts. If the timeline is long, there is ample time for follow-ups and diversified messages in contact materials. If the timeline is short, then fewer follow-ups can be implemented. Many studies start with the tailored design method put forth by Dillman, Smyth, and Christian (2014) and implement five contacts:
Prenotification (Prenotice) letting sample members know the survey is coming
Invitation to complete the survey
Reminder that also thanks respondents who may have already completed the survey
Reminder (with a replacement paper survey if needed)
Final reminder
This method is easily adaptable based on the study timeline and needs but provides a starting point for most studies.
Example: Number of Pets in a Household
Let's return to our example of a researcher who wants to know the average number of pets in a household. We are using a sampling frame of mailing addresses, so we recommend starting our data collection with letters mailed to households; later in data collection, we want to send interviewers to the house to conduct an in-person (or CAPI) interview to decrease unit nonresponse error. This means we have two contact modes (paper and in-person). As mentioned above, the survey mode does not have to be the same as the contact mode, so we recommend a mixed-mode study with both Web and CAPI modes.
Let's assume we have six months for data collection, so we may recommend the following protocol:
Protocol Example for 6-month Web and CAPI Data Collection
Week | Contact Mode | Contact Message | Survey Mode Offered
1 | Mail: Letter | Prenotice | —
2 | Mail: Letter | Invitation | Web
3 | Mail: Postcard | Thank You/Reminder | Web
6 | Mail: Letter in large envelope | Animal Welfare Discussion | Web
10 | Mail: Postcard | Inform Upcoming In-Person Visit | Web
14 | In-Person Visit | — | CAPI
16 | Mail: Letter | Reminder of In-Person Visit | Web, but includes a number to call to schedule CAPI
20 | In-Person Visit | — | CAPI
25 | Mail: Letter in large envelope | Survey Closing Notice | Web, but includes a number to call to schedule CAPI
This is just one possible protocol; it starts respondents with the web (typically done to reduce costs). However, researchers may want to begin in-person data collection earlier during the data collection period or ask their interviewers to attempt more than two visits with a household.
2.4.3 Questionnaire Design
When developing the questionnaire, it can be helpful to first outline the topics to be asked and note why each question or topic is important to the research question(s). This can help researchers better tailor the questionnaire and reduce the number of questions (and thus the burden on the respondent) if topics are deemed irrelevant to the research question. When making these decisions, researchers should also consider questions needed for weighting. While we would love to have everyone in our population of interest answer our survey, this is rarely the case. Thus, including questions about demographics in the survey can assist with weighting for nonresponse errors (both unit and item nonresponse). Knowing the details of the sampling plan and what may impact coverage error and sampling error can help researchers determine what types of demographics to include. Researchers can benefit from the work of others by using questions from other surveys. Demographic sections such as race, ethnicity, or education can borrow questions from a government census or other official surveys. Question banks such as the Inter-university Consortium for Political and Social Research (ICPSR) variable search can provide additional potential questions. If a question does not exist in a question bank, researchers can craft their own. When developing survey questions, researchers should start with the research topic and attempt to write questions that match the concept. The closer the question asked is to the overall concept, the better the validity. For example, if the researcher wants to know how people consume T.V. series and movies but only asks a question about how many T.V.s are in the house, then they would be missing other ways that people watch T.V. series and movies, such as on other devices or at places outside of the home. As mentioned above, researchers can employ techniques to increase the validity of their questionnaires. For example, questionnaire testing involves piloting the survey instrument to identify and fix potential issues before conducting the main survey. Additionally, researchers could conduct cognitive interviews – a technique where researchers walk through the survey with participants, encouraging them to speak their thoughts out loud to uncover how they interpret and understand survey questions. Researchers should also consider the mode for the survey when designing questions and adjust the language appropriately.
In self-administered surveys (e.g., web or mail), respondents can see all the questions and response options, but that is not the case in interviewer-administered surveys (e.g., CATI or CAPI). With interviewer-administered surveys, the response options must be read aloud to the respondents, so the question may need to be adjusted to create a better flow to the interview. Additionally, with self-administered surveys, because the respondents are viewing the questionnaire, the formatting of the questions is even more critical to ensure accurate measurement. Incorrect formatting or wording can result in measurement error, so following best practices or using existing validated questions can reduce error. There are multiple resources to help researchers draft questions for different modes (e.g., Dillman, Smyth, and Christian 2014; Fowler and Mangione 1989; Bradburn, Sudman, and Wansink 2004; Tourangeau, Couper, and Conrad 2004).
Example: Number of Pets in a Household
As part of our survey on the average number of pets in a household, researchers may want to know what animal most people prefer to have as a pet. Let's say we have the following question in our survey:
FIGURE 2.1: Example Question Asking Pet Preference Type
This question may have validity issues, as it only provides the options of "dogs" and "cats" to respondents, and the interpretation of the data could be incorrect. For example, if we had 100 respondents who answered the question and 50 selected dogs, then the results of this question cannot be "50% of the population prefers to have a dog as a pet," as only two response options were provided. If a respondent taking our survey prefers turtles, they could either be forced to choose a response between these two (i.e., interpret the question as "between dogs and cats, which do you prefer?," resulting in measurement error), or they may not answer the question (which results in item nonresponse error). Based on this, the interpretation of this question should be, "When given a choice between dogs and cats, 50% of respondents preferred to have a dog as a pet." To avoid this issue, researchers should consider these possibilities and adjust the question accordingly. One simple way could be to add an "other" response option to give respondents a chance to provide a different response. The "other" response option could then include a way for respondents to write in their other preference. For example, we could rewrite this question as:
FIGURE 2.2: Example Question Asking Pet Preference Type with Other Specify Option
Researchers can then code the responses from the open-ended box and get a better understanding of the respondent's choice of preferred pet. Interpreting this question becomes easier, as researchers no longer need to qualify the results with the choices provided. This is a simple example of how the presentation of the question and options can impact the findings. For more complex topics and questions, researchers must thoroughly consider how to mitigate any impacts from the presentation, formatting, wording, and other aspects. As survey analysts, reviewing not only the data but also the wording of the questions is crucial to ensure the results are presented in a manner consistent with the question asked. Chapter 3 provides further details on how to review existing survey documentation to inform our analyses.
2.5 Data Collection
Once data collection starts, researchers try to stick to the data collection protocol designed during pre-survey planning.
However, effective researchers adjust their plans and adapt as needed to the current progress of data collection (Schouten, Peytchev, and Wagner 2018). Some extreme examples could be natural disasters that prevent mail or interviewers from getting to the sample members. Others could be smaller, such as something newsworthy occurring that is connected to the survey, which researchers could choose to highlight in communication materials. In addition to these external factors, there could be factors unique to the survey, such as lower response rates for a specific subgroup, in which case the data collection protocol may need to be adjusted to improve response rates for that group.
2.6 Post-Survey Processing
After data collection, various activities need to be completed before we can analyze the survey. Multiple decisions made during this post-survey phase can assist researchers in reducing different error sources, such as weighting to account for the sample selection. Knowing the decisions researchers made in creating the final analytic data can impact how analysts use the data and interpret the results.
2.6.1 Data Cleaning and Imputation
Post-survey cleaning and imputation are among the first steps researchers take to get the survey responses into a dataset for use by analysts. Data cleaning can consist of correcting inconsistent data (e.g., resolving skip pattern errors or checking that related questions throughout the survey are consistent with each other), editing numeric entries or open-ended responses for grammar and consistency, or recoding open-ended questions into categories for analysis. There is no universal set of fixed rules that every project must adhere to. Instead, each project or research study should establish its own guidelines and procedures for handling various cleaning scenarios based on its specific objectives. Researchers should use their best judgment to ensure data integrity, and all decisions should be documented and available to those using the data in the analysis. Each decision a researcher makes impacts processing error, so researchers often have multiple people review these rules or recode open-ended data and adjudicate any differences in an attempt to reduce this error. Another crucial step in post-survey processing is imputation. Often, there is item nonresponse, where respondents do not answer specific questions. If the questions are crucial to analysis efforts or the research question, researchers may implement imputation to reduce item nonresponse error. Imputation is a technique for replacing missing or incomplete data values with estimated values. However, as imputation is a way of assigning a value to missing data based on an algorithm or model, it can also introduce processing error, so researchers should consider the overall implications of imputing data compared to having item nonresponse. There are multiple ways to impute data; we recommend reviewing other resources like Kim and Shao (2021) for more information.
Example: Number of Pets in a Household
Let's return to the question we created to ask about animal preference. The "other specify" option invites respondents to specify the type of animal they prefer to have as a pet. If respondents entered answers such as "puppy," "turtle," "rabit," "rabbit," "bunny," "ant farm," "snake," or "Mr. Purr," then researchers may wish to categorize these write-in responses to help with analysis. In this example, "puppy" could be assumed to be a reference to a "Dog" and could be recoded there.
The misspelling of "rabit" could be coded along with "rabbit" and "bunny" into a single category of "Bunny or Rabbit". These are relatively standard decisions that a researcher could make. The remaining write-in responses could be categorized in a few different ways. "Mr. Purr," which may be someone's reference to their own cat, could be recoded as "Cat", or it could remain as "Other" or in some category that is "Unknown". Depending on the number of responses related to each of the others, they could all be combined into a single "Other" category, or categories such as "Reptiles" or "Insects" could be created. Each of these decisions may impact the interpretation of the data, so our researchers should document the types of responses that fall into each of the new categories and any decisions made.
2.6.2 Weighting
We can address some of the error sources identified in the previous sections using weighting. For example, weights can address coverage, sampling, and nonresponse errors. Many published surveys include an "analysis weight" variable that combines these adjustments. However, weighting itself can also introduce adjustment error, so researchers need to balance which types of errors should be corrected with weighting. The construction of weights is outside the scope of this book, and researchers should reference other materials if interested in constructing their own (Valliant and Dever 2018). Instead, this book assumes the survey has been completed, weights are constructed, and data are available to users. We walk users through how to read the documentation (Chapter 3) and work with the data and analysis weights provided to analyze and interpret survey results correctly.
Example: Number of Pets in a Household
In the simple example of our survey, we decided to use a sample stratified by state to select our sample members. Knowing this sampling design, our researcher can include selection weights for analysis that account for how the sample members were selected for the survey. Additionally, the sampling frame may have the type of building associated with each address, so we could include the building type as a potential nonresponse weighting variable, along with some interviewer observations that may be related to our research topic of the average number of pets in a household. Combining these weights, we can create an analytic weight that researchers need to use when analyzing the data.
2.6.3 Disclosure
Before data are released publicly, researchers need to ensure that individual respondents cannot be identified by the data when confidentiality is required. A variety of different methods can be used, including data swapping, top or bottom coding, coarsening, and perturbation. In data swapping, researchers may swap specific data values across different respondents so that it does not impact insights from the data but ensures that specific individuals cannot be identified. We can use top and bottom coding to mask extreme values. For example, researchers may top-code income values such that households with income greater than $500,000 are coded into a single category of "$500,000 or more". Other disclosure methods may include aggregating response categories or location information to avoid having only a few respondents in a given group who could thus be identified. For example, researchers may use coarsening to display income in categories instead of as a continuous variable. We can also perturb the data by adding random noise.
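To make these disclosure techniques concrete, here is a minimal sketch in R of top-coding, coarsening, and perturbation, using a hypothetical data frame and made-up variable names; it illustrates the general ideas above, not the procedure of any specific survey:

library(dplyr)

# Hypothetical income values for illustration only
dat <- tibble(Income = c(42000, 510000, 88000, 760000))

dat %>%
  mutate(
    # Top-code: cap values above $500,000
    IncomeTopCoded = pmin(Income, 500000),
    # Coarsen: report income as a category instead of a continuous value
    IncomeCategory = if_else(Income >= 500000,
                             "$500,000 or more",
                             "Less than $500,000"),
    # Perturb: add a small amount of random noise
    IncomePerturbed = Income + rnorm(n(), mean = 0, sd = 1000)
  )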
There is as much art as there is science to the methods used for disclosure. In the survey documentation, researchers should provide only high-level comments about the disclosure methods and not specific details. This ensures that nobody can reverse the disclosure treatment and thus identify individuals. For more information on different disclosure methods, please see Skinner (2009) and the AAPOR Standards.
2.6.4 Documentation
Documentation is a critical step of the survey life cycle. Researchers systematically record all the details, decisions, procedures, and methodologies to ensure transparency, reproducibility, and the overall quality of survey research. Proper documentation allows analysts to understand, reproduce, and evaluate the study's methods and findings. Chapter 3 dives into how analysts should use survey data documentation.
2.7 Post-survey data analysis and reporting
After completing the survey life cycle, the data are ready for analysts to use. The rest of this book continues from this point. For more information on the survey life cycle, please explore the references cited throughout this chapter.
References
Biemer, Paul P. 2010. "Total Survey Error: Design, Implementation, and Evaluation." Public Opinion Quarterly 74 (5): 817–48. https://doi.org/10.1093/poq/nfq058.
Biemer, Paul P., and Lars E. Lyberg. 2003. Introduction to Survey Quality. John Wiley & Sons.
Biemer, Paul P., Joe Murphy, Stephanie Zimmer, Chip Berry, Grace Deng, and Katie Lewis. 2017. "Using Bonus Monetary Incentives to Encourage Web Response in Mixed-Mode Household Surveys." Journal of Survey Statistics and Methodology 6 (2): 240–61. https://doi.org/10.1093/jssam/smx015.
Bradburn, Norman M., Seymour Sudman, and Brian Wansink. 2004. Asking Questions: The Definitive Guide to Questionnaire Design. 2nd Edition. Jossey-Bass.
DeLeeuw, Edith D. 2005. "To Mix or Not to Mix Data Collection Modes in Surveys." Journal of Official Statistics 21: 233–55.
———. 2018. "Mixed-Mode: Past, Present, and Future." Survey Research Methods 12 (2): 75–89. https://doi.org/10.18148/srm/2018.v12i2.7402.
Dillman, Don A, Jolene D Smyth, and Leah Melani Christian. 2014. Internet, Phone, Mail, and Mixed-Mode Surveys: The Tailored Design Method. John Wiley & Sons.
Fowler, Floyd J, and Thomas W. Mangione. 1989. Standardized Survey Interviewing. SAGE.
Groves, Robert M, Floyd J Fowler Jr, Mick P Couper, James M Lepkowski, Eleanor Singer, and Roger Tourangeau. 2009. Survey Methodology. John Wiley & Sons.
Harter, Rachel, Michael P Battaglia, Trent D Buskirk, Don A Dillman, Ned English, Mansour Fahimi, Martin R Frankel, et al. 2016. "Address-Based Sampling." Task force report. American Association for Public Opinion Research. https://aapor.org/wp-content/uploads/2022/11/AAPOR_Report_1_7_16_CLEAN-COPY-FINAL-2.pdf.
Kim, Jae Kwang, and Jun Shao. 2021. Statistical Methods for Handling Incomplete Data. Chapman & Hall/CRC Press.
Schouten, Barry, Andy Peytchev, and James Wagner. 2018. Adaptive Survey Design. Chapman & Hall/CRC Press.
Skinner, Chris. 2009. "Chapter 15: Statistical Disclosure Control for Survey Data." In Handbook of Statistics: Sample Surveys: Design, Methods and Applications, edited by C. R. Rao, 381–96. Elsevier B.V.
Tourangeau, Roger, Mick P. Couper, and Frederick Conrad. 2004. "Spacing, Position, and Order: Interpretive Heuristics for Visual Features of Survey Questions." Public Opinion Quarterly 68: 368–93.
Tourangeau, Roger, Lance J. Rips, and Kenneth Rasinski. 2000. Psychology of Survey Response. Cambridge University Press.
Valliant, Richard, and Jill A. Dever. 2018. Survey Weights: A Step-by-Step Guide to Calculation. Stata Press.
Valliant, Richard, Jill A Dever, and Frauke Kreuter. 2013. Practical Tools for Designing and Weighting Survey Samples. Vol. 1. Springer.
Other modes, such as mobile apps or text messaging, can also be considered, but at the time of publication, they have a smaller reach or are better suited for longitudinal studies (i.e., surveying the same individuals over many time periods of a single study).↩︎
Chapter 3 Understanding Survey Data Documentation
3.1 Introduction
Survey documentation helps us prepare before we look at the actual survey data. The documentation includes technical guides, questionnaires, codebooks, errata, and other useful resources. By taking the time to review these materials, we can gain a comprehensive understanding of the survey data (including research and design decisions discussed in Chapters 2 and 10) and conduct our analysis more effectively. Survey documentation can vary in organization, type, and ease of use. The information may be stored in any format: PDFs, Excel spreadsheets, Word documents, and so on. Some surveys bundle documentation together, such as providing the codebook and questionnaire in a single document. Others keep them in separate files. Despite these variations, we can gain a general understanding of the documentation types and what aspects to focus on in each.
3.2 Types of survey documentation
3.2.1 Technical documentation
The technical documentation, also known as user guides or methodology/analysis guides, highlights the variables necessary to specify the survey design. We recommend concentrating on these key sections:
Introduction: The introduction orients us to the survey. This section provides the project's background, the study's purpose, and the main research questions.
Study design: The study design section describes how researchers prepared and administered the survey.
Sample: The sample section describes the sampling frame, any known sampling errors, and the limitations of the sample. This section can contain recommendations on how to use sampling weights. Look for weight information: whether the survey design contains strata, clusters/PSUs, or replicate weights. Also look for population sizes, finite population correction, or replicate weight scaling information. Additional detail on sample designs is available in Chapter 10.
Notes on fielding: Any additional notes on fielding, such as response rates, may be found in the technical documentation.
The technical documentation may include other helpful resources. Some technical documentation includes syntax for SAS, SUDAAN, Stata, and/or R, so we do not have to create this code from scratch.
3.2.2 Questionnaires
A questionnaire is a series of questions used to collect information from people in a survey. It can ask about opinions, behaviors, demographics, or even just numbers like the count of lightbulbs, square footage, or farm size.
Questionnaires can employ different types of questions, such as closed-ended (e.g., select one or check all that apply), open-ended (e.g., numeric or text), Likert scales (e.g., a 5- or 7-point scale specifying a respondent's level of agreement to a statement), or ranking questions (e.g., a list of options that a respondent ranks by preference). They may randomize the display order of responses or include instructions that help respondents understand the questions. A survey may have one questionnaire or multiple, depending on its scale and scope. The questionnaire is another important resource for understanding and interpreting the survey data (see Section 2.4.3), and we should use it alongside any analysis. It provides details about each of the questions asked in the survey, such as question name, question wording, response options, skip logic, randomizations, display specifications, mode differences, and the universe (the subset of respondents who were asked a question). Below, in Figure 3.1, we show an example from the ANES 2020 questionnaire (American National Election Studies 2021). The figure shows a question's name (POSTVOTE_RVOTE), description (Did R Vote?), full wording of the question and responses, response order, universe, question logic (this question was only asked if vote_pre = 0), and other specifications. The section also includes the variable name, which we can link to the codebook.
FIGURE 3.1: ANES 2020 Questionnaire Example
The content and structure of questionnaires vary depending on the specific survey. For instance, question names may be informative (like the ANES example above), sequential, or denoted by a code. In some cases, surveys may not use separate names for questions and variables. Figure 3.2 shows an example from the Behavioral Risk Factor Surveillance System (BRFSS) questionnaire with a sequential question number and a coded variable name (as opposed to a question name) (Centers for Disease Control and Prevention (CDC) 2021).
FIGURE 3.2: BRFSS 2021 Questionnaire Example
We should factor in the details of a survey when conducting our analyses. For example, surveys that use various modes (e.g., web and mail) may have differences in question wording or skip logic, as web surveys can include fills or automate skip logic. These variations could warrant separate analyses for each mode.
3.2.3 Codebooks
While a questionnaire provides information about the questions posed to respondents, the codebook explains how the survey data were coded and recorded. It lists details such as variable names, variable labels, variable meanings, codes for missing data, value labels, and value types (whether categorical, continuous, etc.). The codebook helps us understand and use the variables appropriately in our analysis. In particular, the codebook (as opposed to the questionnaire) often includes information on missing data. Note that the term data dictionary is sometimes used interchangeably with codebook, but a data dictionary may include more details on the structure and elements of the data. Figure 3.3 is a question from the ANES 2020 codebook (American National Election Studies 2022). This section indicates a particular variable's name (V202066), question wording, value labels, universe, and associated survey question (POSTVOTE_RVOTE).
FIGURE 3.3: ANES 2020 Codebook Example
Reviewing the questionnaires and codebooks in parallel can clarify how to interpret the variables (Figures 3.1 and 3.3), as questions and variables do not always correspond directly to each other in a one-to-one mapping. A single question may have multiple associated variables, or a single variable may summarize multiple questions.
3.2.4 Errata
An erratum (singular) or errata (plural) is a document that lists errors found in a publication or dataset. The purpose of an erratum is to correct or update inaccuracies in the original document. Examples of errata include:
Issuing a corrected data table after realizing a typo or mistake in a table cell
Reporting incorrectly programmed skips in an electronic survey, where questions were skipped by the respondent when they should not have been
The 2004 ANES dataset released an erratum notifying analysts to remove a specific row from the data file due to the inclusion of a respondent who should not have been part of the sample. Adhering to an issued erratum helps us increase the accuracy and reliability of analysis.
3.2.5 Additional resources
Survey documentation may include additional material, such as interviewer instructions or "show cards" provided to respondents during interviewer-administered surveys to help respondents answer questions. Explore the survey website to find out what resources were used and in what contexts.
3.3 Missing data coding
For some observations in a dataset, there may be missing data. This can be by design or from nonresponse, and these concepts are detailed in Chapter 11. In that chapter, we also discuss how to analyze data with missing values. In this section, we discuss how to understand documentation related to missing data. The survey documentation, often the codebook, represents the missing data with a code. The codebook may list different codes depending on why certain data are missing. In the example of variable V202066 from the ANES (Figure 3.3), -9 represents "Refused," -7 means that the response was deleted due to an incomplete interview, -6 means that there is no response because there was no follow-up interview, and -1 means "Inapplicable" (due to the designed skip pattern). As another example, there may be a summary variable that describes the missingness of a set of variables, particularly with "select all that apply" or "multiple response" questions. In the National Crime Victimization Survey (NCVS), respondents who are victims of a crime and saw the offender are asked if the offender had a weapon and, if so, what type of weapon it was. This part of the questionnaire from 2021 is shown in Figure 3.4.
FIGURE 3.4: Excerpt from the NCVS 2020-2021 Crime Incident Report - Weapon Type
For all multiple response variables, the NCVS codebook includes coding for a "lead-in" variable that summarizes the individual options. For question 23a on the weapon type, the lead-in variable is V4050, shown in Figure 3.5. This variable is then followed by a set of variables, one for each weapon type. An example of one of the individual variables from the codebook, the handgun, is shown in Figure 3.6. We dive further into this example in Chapter 11, where we discuss how to analyze this variable.
FIGURE 3.5: Excerpt from the NCVS 2021 Codebook for V4050 - LI WHAT WAS WEAPON
FIGURE 3.6: Excerpt from the NCVS 2021 Codebook for V4051 - C WEAPON: HAND GUN
When data are read into R, some values may be system missing; that is, they are coded as NA even if that is not evident in the codebook.
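As a concrete illustration, here is a minimal sketch (with made-up response values) of one way to recode ANES-style negative missing-data codes to NA using {dplyr}; the exact codes and their meanings should always be taken from the codebook:

library(dplyr)

# Hypothetical values for V202066; negative codes indicate missing data
dat <- tibble(V202066 = c(1, 2, -9, -7, -6, -1))

# Recode all negative missing-data codes to NA before analysis
dat %>%
  mutate(V202066 = if_else(V202066 < 0, NA_real_, V202066))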
We will discuss in Chapter 11 how to analyze data with NA values and review how R handles missing data in calculations.
3.4 Example: American National Election Studies (ANES) 2020 Survey Documentation
Let's look at the survey documentation for the American National Election Studies (ANES) 2020. The survey website is located at https://electionstudies.org/data-center/2020-time-series-study/. Navigating to "User Guide and Codebook" (American National Election Studies 2022), we can download the PDF that contains the survey documentation, titled "ANES 2020 Time Series Study Full Release: User Guide and Codebook". Do not be daunted by the 796-page PDF. We will focus on the most critical information.
Introduction
The first section in the User Guide explains that the ANES 2020 Time Series Study continues a series of election surveys conducted since 1948. These surveys contain data on public opinion and voting behavior in U.S. presidential elections. The introduction also includes information about the modes used for data collection (web, live video interviewing, or CATI). Additionally, there is a summary of the number of pre-election interviews (8,280) and post-election re-interviews (7,449).
Sample Design and Respondent Recruitment
The section "Sample Design and Respondent Recruitment" provides more detail about the survey's sequential mixed-mode design. All three modes were conducted one after another and not at the same time. Additionally, it indicates that for the 2020 survey, they resampled all respondents who participated in the 2016 ANES, along with a newly drawn cross-section:
The target population for the fresh cross-section was the 231 million non-institutional U.S. citizens aged 18 or older living in the 50 U.S. states or the District of Columbia.
The document continues with more details on the sample groups.
Data Analysis, Weights, and Variance Estimation
The section "Data Analysis, Weights, and Variance Estimation" includes information on weights and strata/cluster variables. Reading through, we can find the full sample weight variables:
For analysis of the complete set of cases using pre-election data only, including all cases and representative of the 2020 electorate, use the full sample pre-election weight, V200010a. For analysis including post-election data for the complete set of participants (i.e., analysis of post-election data only or a combination of pre- and post-election data), use the full sample post-election weight, V200010b. Additional weights are provided for analysis of subsets of the data…
The document provides more information about the variables, summarized in Table 3.1.
TABLE 3.1: Weight and variance information for ANES
For weight | Use variance unit/PSU/cluster | Use variance stratum
V200010a | V200010c | V200010d
V200010b | V200010c | V200010d
Methodology
The user guide mentions a supplemental document called "How to Analyze ANES Survey Data" (DeBell 2010) as a 'how-to guide' for analyzing the data. From this document, we learn more about the weights, including that they sum to the sample size and not the population. If our goal is to calculate estimates for the entire U.S. population instead of just the sample, we must adjust the weights to the U.S. population. To create accurate weights for the population, we need to determine the total population size at the time of the survey. Let's review the "Sample Design and Respondent Recruitment" section for more details: The target population for the fresh cross-section was the 231 million non-institutional U.S.
citizens aged 18 or older living in the 50 U.S. states or the District of Columbia.
The documentation suggests that the population should equal around 231 million, but this is a very imprecise count. Upon further investigation of the available resources, we can find the methodology file titled "Methodology Report for the ANES 2020 Time Series Study" (DeBell et al. 2022). This file states that we can use the population total from the Current Population Survey (CPS), a monthly survey sponsored by the U.S. Census Bureau and the U.S. Bureau of Labor Statistics. The CPS provides a more accurate population estimate for a specific month. Therefore, we can use the CPS to get the total population number for March 2020, when the ANES was conducted. Chapter 4 provides detailed instructions on how to calculate and adjust this value in the data.
References
American National Election Studies. 2021. "ANES 2020 Time Series Study: Pre-Election and Post-Election Survey Questionnaires." https://electionstudies.org/wp-content/uploads/2021/07/anes_timeseries_2020_questionnaire_20210719.pdf.
———. 2022. "ANES 2020 Time Series Study Full Release: User Guide and Codebook." https://electionstudies.org/wp-content/uploads/2022/02/anes_timeseries_2020_userguidecodebook_20220210.pdf.
Centers for Disease Control and Prevention (CDC). 2021. "Behavioral Risk Factor Surveillance System Survey Questionnaire." U.S. Department of Health and Human Services, Centers for Disease Control and Prevention. https://www.cdc.gov/brfss/questionnaires/pdf-ques/2021-BRFSS-Questionnaire-1-19-2022-508.pdf.
DeBell, Matthew. 2010. "How to Analyze ANES Survey Data." ANES Technical Report Series nes012492. Palo Alto, CA: Stanford University; Ann Arbor, MI: The University of Michigan. https://electionstudies.org/wp-content/uploads/2018/05/HowToAnalyzeANESData.pdf.
DeBell, Matthew, Michelle Amsbary, Ted Brader, Shelley Brock, Cindy Good, Justin Kamens, Natalya Maisel, and Sarah Pinto. 2022. "Methodology Report for the ANES 2020 Time Series Study." https://electionstudies.org/wp-content/uploads/2022/08/anes_timeseries_2020_methodology_report.pdf.
Chapter 4 Setup
This chapter provides an overview of the packages, data, and design objects we use throughout this book. For a streamlined learning experience, we recommend taking the time to walk through the code provided and making sure everything is installed. As mentioned in Chapter 2, understanding how a survey was conducted helps us make sense of the results and interpret findings. So, we provide background on the datasets used in examples and exercises. Finally, we walk through how to create the survey design objects necessary to begin analysis. If you have questions or face issues while going through the book, please report them in the book's GitHub repository: https://github.com/tidy-survey-r/tidy-survey-book.
4.1 Packages
We use several packages throughout the book, but let's install and load specific ones for this chapter. Many functions in the examples and exercises are from three packages: {tidyverse}, {survey}, and {srvyr}. If they are not already installed, use the code below. The {tidyverse} and {survey} packages can both be installed from the Comprehensive R Archive Network (CRAN).
We use the GitHub development version of {srvyr} because of its additional functionality compared to the version on CRAN. Install the package directly from GitHub using the {remotes} package:

install.packages(c("tidyverse", "survey"))
remotes::install_github("https://github.com/gergness/srvyr")

We bundled the datasets used in the book in an R package, {srvyrexploR}. Install it directly from GitHub using the {remotes} package:

remotes::install_github("https://github.com/tidy-survey-r/srvyrexploR")

After installing these packages, load them using the library() function:

library(tidyverse)
library(survey)
library(srvyr)
library(srvyrexploR)

The packages {broom}, {gt}, and {gtsummary} play a role in displaying output and creating formatted tables. Install them with the provided code5:

install.packages(c("gt", "gtsummary"))

After installing these packages, load them using the library() function:

library(broom)
library(gt)
library(gtsummary)

Install and load the {censusapi} package to access the Current Population Survey (CPS), which we use to ensure accurate weighting of a key dataset in the book. Run the code below to install {censusapi}:

install.packages("censusapi")

After installing this package, load it using the library() function:

library(censusapi)

Note that the {censusapi} package requires a Census API key, available for free from the U.S. Census Bureau website (refer to the package documentation for more information). It's recommended to include the Census API key in our R environment instead of directly in the code. After obtaining the API key, save it in your R environment by running Sys.setenv():

Sys.setenv(CENSUS_KEY="YOUR_API_KEY_HERE")

Then, restart the R session. Once the Census API key is stored, we can retrieve it in our R code with Sys.getenv("CENSUS_KEY"). There are other packages used throughout the book. We list them in the Prerequisite boxes at the beginning of each chapter. As we work through the book, make sure to check the Prerequisite box and install any missing packages before proceeding.
4.2 Data
As mentioned above, the {srvyrexploR} package contains the datasets used in the book. Once installed and loaded, explore the documentation using the help() function. Read the descriptions of the datasets to understand what they contain:

help(package = "srvyrexploR")

This book uses two main datasets: the American National Election Studies (ANES – DeBell 2010) and the Residential Energy Consumption Survey (RECS – U.S. Energy Information Administration 2023a). We can load these datasets individually with the data() function by specifying the dataset name as an argument. In the code below, we load the anes_2020 and recs_2020 datasets into objects with their respective names:

data(anes_2020)
data(recs_2020)

4.2.1 American National Election Studies (ANES) Data
The ANES is a study that collects data from election surveys dating back to 1948. These surveys contain information on public opinion and voting behavior in U.S. presidential elections. They cover topics such as party affiliation, voting choice, and level of trust in the government. The 2020 survey, the data we use in the book, was fielded online, through live video interviews, or via computer-assisted telephone interviews (CATI). When working with new survey data, analysts should review the survey documentation (see Chapter 3) to understand the data collection methods.
The original ANES data contains variables starting with V20 (DeBell 2010), so to assist with our analysis throughout the book, we created descriptive variable names. For example, the respondent's age is now in a variable called Age, and gender is in a variable called Gender. These descriptive variables are included in the {srvyrexploR} package, and Table 4.1 displays the list of these renamed variables. A complete overview of all variables can be found in Appendix A.
#D3D3D3; border-bottom-style: solid; border-bottom-width: 2px; border-bottom-color: #D3D3D3; } #hneprwpzzv .gt_footnotes { color: #333333; background-color: #FFFFFF; border-bottom-style: none; border-bottom-width: 2px; border-bottom-color: #D3D3D3; border-left-style: none; border-left-width: 2px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 2px; border-right-color: #D3D3D3; } #hneprwpzzv .gt_footnote { margin: 0px; font-size: 90%; padding-top: 4px; padding-bottom: 4px; padding-left: 5px; padding-right: 5px; } #hneprwpzzv .gt_sourcenotes { color: #333333; background-color: #FFFFFF; border-bottom-style: none; border-bottom-width: 2px; border-bottom-color: #D3D3D3; border-left-style: none; border-left-width: 2px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 2px; border-right-color: #D3D3D3; } #hneprwpzzv .gt_sourcenote { font-size: 90%; padding-top: 4px; padding-bottom: 4px; padding-left: 5px; padding-right: 5px; } #hneprwpzzv .gt_left { text-align: left; } #hneprwpzzv .gt_center { text-align: center; } #hneprwpzzv .gt_right { text-align: right; font-variant-numeric: tabular-nums; } #hneprwpzzv .gt_font_normal { font-weight: normal; } #hneprwpzzv .gt_font_bold { font-weight: bold; } #hneprwpzzv .gt_font_italic { font-style: italic; } #hneprwpzzv .gt_super { font-size: 65%; } #hneprwpzzv .gt_footnote_marks { font-size: 75%; vertical-align: 0.4em; position: initial; } #hneprwpzzv .gt_asterisk { font-size: 100%; vertical-align: 0; } #hneprwpzzv .gt_indent_1 { text-indent: 5px; } #hneprwpzzv .gt_indent_2 { text-indent: 10px; } #hneprwpzzv .gt_indent_3 { text-indent: 15px; } #hneprwpzzv .gt_indent_4 { text-indent: 20px; } #hneprwpzzv .gt_indent_5 { text-indent: 25px; } TABLE 4.1: List of created variables in the ANES Data Variable Name CaseID InterviewMode Weight VarUnit Stratum CampaignInterest VotedPres2016 VotedPres2016_selection PartyID TrustGovernment TrustPeople Age AgeGroup Education RaceEth Gender Income Income7 VotedPres2020 VotedPres2020_selection EarlyVote2020 Before beginning an analysis, it is useful to view the data to understand the available variables. The dplyr::glimpse() function produces a list of all variables, their types (e.g., function, double), and a few example values. 
Below, we remove the original variables (those whose names start with “V” followed by a digit) with select(-matches("^V\\d")) before using glimpse() to get a quick overview of the data with descriptive variable names:

anes_2020 %>%
  select(-matches("^V\\d")) %>%
  glimpse()

## Rows: 7,453
## Columns: 21
## $ CaseID                  <dbl> 200015, 200022, 200039, 200046, 200053…
## $ InterviewMode           <fct> Web, Web, Web, Web, Web, Web, Web, Web…
## $ Weight                  <dbl> 1.0057, 1.1635, 0.7687, 0.5210, 0.9658…
## $ VarUnit                 <fct> 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2,…
## $ Stratum                 <fct> 9, 26, 41, 29, 23, 37, 7, 37, 32, 41, …
## $ CampaignInterest        <fct> Somewhat interested, Not much interest…
## $ VotedPres2016           <fct> Yes, Yes, Yes, Yes, Yes, No, Yes, No, …
## $ VotedPres2016_selection <fct> Trump, Other, Clinton, Clinton, Trump,…
## $ PartyID                 <fct> Strong republican, Independent, Indepe…
## $ TrustGovernment         <fct> Never, Never, Some of the time, About …
## $ TrustPeople             <fct> About half the time, Some of the time,…
## $ Age                     <dbl> 46, 37, 40, 41, 72, 71, 37, 45, 70, 43…
## $ AgeGroup                <fct> 40-49, 30-39, 40-49, 40-49, 70 or olde…
## $ Education               <fct> Bachelor's, Post HS, High school, Post…
## $ RaceEth                 <fct> "Hispanic", "Asian, NH/PI", "White", "…
## $ Gender                  <fct> Male, Female, Female, Male, Male, Fema…
## $ Income                  <fct> "$175,000-249,999", "$70,000-74,999", …
## $ Income7                 <fct> $125k or more, $60-80k, $100-125k, $20…
## $ VotedPres2020           <fct> NA, Yes, Yes, Yes, Yes, Yes, Yes, NA, …
## $ VotedPres2020_selection <fct> NA, Other, Biden, Biden, Trump, Biden,…
## $ EarlyVote2020           <fct> NA, No, No, No, No, No, No, NA, Yes, N…

From the output, we can see there are 7,453 rows and 21 variables in the ANES data. This output also indicates that most of the variables are factors (e.g., InterviewMode), while a few variables are in double (numeric) format (e.g., Age).

4.2.2 Residential Energy Consumption Survey (RECS) Data

RECS is a study that measures energy consumption and expenditure in American households. Funded by the Energy Information Administration, the RECS data are collected through interviews with household members and energy suppliers. These interviews take place in person, over the phone, via mail, and on the web. The survey has been fielded 14 times between 1950 and 2020. It includes questions about appliances, electronics, heating, air conditioning (A/C), temperatures, water heating, lighting, energy bills, respondent demographics, and energy assistance.

As mentioned above, analysts should read the survey documentation (see Chapter 3) to understand how the data was collected and implemented. Table 4.2 displays the list of variables in the RECS data (not including the weights, which start with NWEIGHT and will be described in more detail in Chapter 10). An overview of all variables can be found in Appendix B.
TABLE 4.2: List of Variables in the RECS Data
Variable Name: DOEID, ClimateRegion_BA, Urbanicity, Region, REGIONC, Division, STATE_FIPS, state_postal, state_name, HDD65, CDD65, HDD30YR, CDD30YR, HousingUnitType, YearMade, TOTSQFT_EN, TOTHSQFT, TOTCSQFT, ZTOTSQFT_EN, ZYearMade, ZHousingUnitType, SpaceHeatingUsed, ZSpaceHeatingUsed, ACUsed, ZACUsed, ZACBehavior, HeatingBehavior, WinterTempDay, WinterTempAway, WinterTempNight, ACBehavior, SummerTempDay, SummerTempAway, SummerTempNight, ZHeatingBehavior, ZWinterTempAway, ZSummerTempAway, ZWinterTempDay, ZSummerTempDay, ZWinterTempNight, ZSummerTempNight, BTUEL, DOLLAREL, ZBTUEL, BTUNG, DOLLARNG, ZBTUNG, BTULP, DOLLARLP, ZBTULP, BTUFO, DOLLARFO, ZBTUFO, BTUWOOD, ZBTUWOOD, TOTALBTU, TOTALDOL

Before starting an analysis, we recommend viewing the data to understand the types of data and variables that are included. The dplyr::glimpse() function produces a list of all variables, the type of each variable (e.g., factor, double), and a few example values.
Below, we remove the weight variables with select(-matches("^NWEIGHT")) before using glimpse() to get a quick overview of the data:

recs_2020 %>%
  select(-matches("^NWEIGHT")) %>%
  glimpse()

## Rows: 18,496
## Columns: 57
## $ DOEID             <dbl> 1e+05, 1e+05, 1e+05, 1e+05, 1e+05, 1e+05, 1e…
## $ ClimateRegion_BA  <fct> Mixed-Dry, Mixed-Humid, Mixed-Dry, Mixed-Hum…
## $ Urbanicity        <fct> Urban Area, Urban Area, Urban Area, Urban Ar…
## $ Region            <fct> West, South, West, South, Northeast, South, …
## $ REGIONC           <chr> "WEST", "SOUTH", "WEST", "SOUTH", "NORTHEAST…
## $ Division          <fct> Mountain South, West South Central, Mountain…
## $ STATE_FIPS        <chr> "35", "05", "35", "45", "34", "48", "40", "2…
## $ state_postal      <fct> NM, AR, NM, SC, NJ, TX, OK, MS, DC, AZ, CA, …
## $ state_name        <fct> New Mexico, Arkansas, New Mexico, South Caro…
## $ HDD65             <dbl> 3844, 3766, 3819, 2614, 4219, 901, 3148, 182…
## $ CDD65             <dbl> 1679, 1458, 1696, 1718, 1363, 3558, 2128, 23…
## $ HDD30YR           <dbl> 4451, 4429, 4500, 3229, 4896, 1150, 3564, 26…
## $ CDD30YR           <dbl> 1027, 1305, 1010, 1653, 1059, 3588, 2043, 21…
## $ HousingUnitType   <fct> Single-family detached, Apartment: 5 or more…
## $ YearMade          <ord> 1970-1979, 1980-1989, 1960-1969, 1980-1989, …
## $ TOTSQFT_EN        <dbl> 2100, 590, 900, 2100, 800, 4520, 2100, 900, …
## $ TOTHSQFT          <dbl> 2100, 590, 900, 2100, 800, 3010, 1200, 900, …
## $ TOTCSQFT          <dbl> 2100, 590, 900, 2100, 800, 3010, 1200, 0, 50…
## $ ZTOTSQFT_EN       <fct> Not imputed, Not imputed, Not imputed, Not i…
## $ ZYearMade         <fct> Not imputed, Not imputed, Not imputed, Not i…
## $ ZHousingUnitType  <fct> Not imputed, Not imputed, Not imputed, Not i…
## $ SpaceHeatingUsed  <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
## $ ZSpaceHeatingUsed <fct> Not imputed, Not imputed, Not imputed, Not i…
## $ ACUsed            <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FA…
## $ ZACUsed           <fct> Not imputed, Not imputed, Not imputed, Not i…
## $ ZACBehavior       <fct> Not imputed, Imputed, Not imputed, Not imput…
## $ HeatingBehavior   <fct> Set one temp and leave it, Turn on or off as…
## $ WinterTempDay     <dbl> 70, 70, 69, 68, 68, 76, 74, 70, 68, 70, 72, …
## $ WinterTempAway    <dbl> 70, 65, 68, 68, 68, 76, 65, 70, 60, 70, 70, …
## $ WinterTempNight   <dbl> 68, 65, 67, 68, 68, 68, 74, 68, 62, 68, 72, …
## $ ACBehavior        <fct> Set one temp and leave it, Turn on or off as…
## $ SummerTempDay     <dbl> 71, 68, 70, 72, 72, 69, 68, NA, 72, 74, 77, …
## $ SummerTempAway    <dbl> 71, 68, 68, 72, 72, 74, 70, NA, 76, 74, 77, …
## $ SummerTempNight   <dbl> 71, 68, 68, 72, 72, 68, 70, NA, 68, 72, 77, …
## $ ZHeatingBehavior  <fct> Not imputed, Not imputed, Not imputed, Not i…
## $ ZWinterTempAway   <fct> Not imputed, Not imputed, Not imputed, Not i…
## $ ZSummerTempAway   <fct> Not imputed, Not imputed, Not imputed, Not i…
## $ ZWinterTempDay    <fct> Not imputed, Not imputed, Not imputed, Not i…
## $ ZSummerTempDay    <fct> Not imputed, Not imputed, Not imputed, Not i…
## $ ZWinterTempNight  <fct> Not imputed, Not imputed, Not imputed, Not i…
## $ ZSummerTempNight  <fct> Not imputed, Not imputed, Not imputed, Not i…
## $ BTUEL             <dbl> 42723, 17889, 8147, 31647, 20027, 48968, 494…
## $ DOLLAREL          <dbl> 1955.06, 713.27, 334.51, 1424.86, 1087.00, 1…
## $ ZBTUEL            <fct> Not imputed, Not imputed, Imputed amount and…
## $ BTUNG             <dbl> 101924.4, 10145.3, 22603.1, 55118.7, 39099.5…
## $ DOLLARNG          <dbl> 701.83, 261.73, 188.14, 636.91, 376.04, 439.…
## $ ZBTUNG            <fct> Not imputed, Not imputed, Imputed, Not imput…
## $ BTULP             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 17…
## $ DOLLARLP          <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,…
## $ ZBTULP            <fct> Not applicable, Not applicable, Not applicab…
## $ BTUFO             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 68…
## $ DOLLARFO          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 18…
## $ ZBTUFO            <fct> Not applicable, Not applicable, Not applicab…
## $ BTUWOOD           <dbl> 0, 0, 0, 0, 0, 3000, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ZBTUWOOD          <fct> Not applicable, Not applicable, Not applicab…
## $ TOTALBTU          <dbl> 144648, 28035, 30750, 86765, 59127, 85401, 1…
## $ TOTALDOL          <dbl> 2656.9, 975.0, 522.6, 2061.8, 1463.0, 2335.1…

From the output, we can see that there are 18,496 rows and 57 non-weight variables in the RECS data. This output also indicates that most of the variables are in double (numeric) format (e.g., TOTSQFT_EN), with some factor (e.g., Region), Boolean (e.g., ACUsed), character (e.g., REGIONC), and ordinal (e.g., YearMade) variables.

4.3 Design objects

The design object is the backbone for survey analysis. It is where we specify the sampling design, weights, and other necessary information to ensure we account for errors in the data. Before creating the design object, analysts should carefully review the survey documentation to understand how to create the design object for accurate analysis.

In this chapter, we provide details on how to code the design object for the ANES and RECS data used in the book. However, we only provide a high-level overview to get readers started. For a deeper understanding of creating these design objects for a variety of sampling designs, see Chapter 10.

While we recommend conducting exploratory data analysis on the original data before diving into complex survey analysis (see Chapter 12), the actual analysis and inference should be performed with the survey design objects instead of the original survey data. This ensures that we appropriately apply the details of the survey design to our calculations. For example, the ANES data is called anes_2020. If we create a survey design object called anes_des, our analyses should begin with anes_des and not anes_2020.

4.3.1 American National Election Studies (ANES) Design Object

The ANES documentation (DeBell 2010) details the sampling and weighting implications for analyzing the survey data. From this documentation and as noted in Chapter 3, the 2020 ANES data is weighted to the sample, not the population. To make generalizations about the population, we need to weight the data to the full population count. The ANES methodology recommends using the Current Population Survey (CPS) to determine the number of non-institutionalized U.S. citizens aged 18 or older living in the 50 U.S. states or D.C. in March of 2020.

We can use the {censusapi} package to obtain the information needed for the survey design object. The getCensus() function allows us to retrieve the CPS data for March (cps/basic/mar) in 2020 (vintage = 2020). Additionally, we extract several variables from the CPS:

- month (HRMONTH) and year (HRYEAR4) of the interview: to confirm the correct time period
- age (PRTAGE) of the respondent: to narrow the population to 18 and older (eligible age to vote)
- citizenship status (PRCITSHP) of the respondent: to narrow the population to only those eligible to vote
- final person-level weight (PWSSWGT)

Detailed information for these variables can be found in the CPS data dictionary (the URL is listed in the notes at the end of this chapter).
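One practical note before running the retrieval code: getCensus() reads an API key, and in the code below it is pulled from the CENSUS_KEY environment variable. The setup sketch below is ours, not from the original text, and the key value shown is a placeholder.

# One-time setup sketch (placeholder key, not a real one): store the
# Census API key where Sys.getenv("CENSUS_KEY") can find it.
Sys.setenv(CENSUS_KEY = "your-api-key-here")
# For persistence across sessions, add the line
#   CENSUS_KEY=your-api-key-here
# to your .Renviron file instead.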
cps_state_in <- getCensus(name = "cps/basic/mar",
                          vintage = 2020,
                          region = "state",
                          vars = c("HRMONTH", "HRYEAR4",
                                   "PRTAGE", "PRCITSHP", "PWSSWGT"),
                          key = Sys.getenv("CENSUS_KEY"))

cps_state <- cps_state_in %>%
  as_tibble() %>%
  mutate(across(.cols = everything(),
                .fns = as.numeric))

In the code above, we include region = "state". The default region type for the CPS data is at the state level. While it is not required to include this, it can be helpful for understanding the geographical context of the data.

Because we requested the March basic monthly file for 2020 (name = "cps/basic/mar" and vintage = 2020), we expect that all interviews within our output were conducted during that particular month and year. We can confirm that the data is from March of 2020 by running the code below:

cps_state %>%
  distinct(HRMONTH, HRYEAR4)
## # A tibble: 1 × 2
##   HRMONTH HRYEAR4
##     <dbl>   <dbl>
## 1       3    2020

We can narrow down the dataset using the age and citizenship variables to include only individuals who are 18 years or older (PRTAGE >= 18) and have U.S. citizenship (PRCITSHP %in% c(1:4)):

cps_narrow_resp <- cps_state %>%
  filter(PRTAGE >= 18,
         PRCITSHP %in% c(1:4))

To calculate the U.S. population from the filtered data, we sum the person weights (PWSSWGT):

targetpop <- cps_narrow_resp %>%
  pull(PWSSWGT) %>%
  sum()

scales::comma(targetpop)
## [1] "231,034,125"

The target population in 2020 is 231,034,125. This result gives us what we need to create the survey design object for estimating population statistics. Using the anes_2020 data, we adjust the weighting variable (V200010b) using the target population we just calculated (targetpop). We determine the proportion of the total weight for each individual's weight (V200010b / sum(V200010b)) and then multiply that proportion by the calculated target population.

anes_adjwgt <- anes_2020 %>%
  mutate(Weight = V200010b / sum(V200010b) * targetpop)

Once we have the adjusted weights, we can refer to the rest of the documentation to create the survey design. The documentation indicates that the study uses a stratified cluster sampling design. This means that we need to specify variables for strata and ids (cluster) and fill in the nest argument. The documentation provides guidance on which strata and cluster variables to use depending on whether we are analyzing pre- or post-election data. In this book, we analyze post-election data, so we need to use the post-election weight V200010b, strata variable V200010d, and PSU/cluster variable V200010c. Additionally, we set nest = TRUE to ensure the clusters are nested within the strata.

anes_des <- anes_adjwgt %>%
  as_survey_design(weights = Weight,
                   strata = V200010d,
                   ids = V200010c,
                   nest = TRUE)

anes_des

## Stratified 1 - level Cluster Sampling design (with replacement)
## With (101) clusters.
## Called via srvyr
## Sampling variables:
## - ids: V200010c
## - strata: V200010d
## - weights: Weight
## Data variables:
## - V200001 (dbl), CaseID (dbl), V200002 (hvn_lbll), InterviewMode
##   (fct), V200010b (dbl), Weight (dbl), V200010c (dbl), VarUnit (fct),
##   V200010d (dbl), Stratum (fct), V201006 (hvn_lbll), CampaignInterest
##   (fct), V201024 (hvn_lbll), V201025x (hvn_lbll), V201029 (hvn_lbll),
##   V201101 (hvn_lbll), V201102 (hvn_lbll), VotedPres2016 (fct),
##   V201103 (hvn_lbll), VotedPres2016_selection (fct), V201228
##   (hvn_lbll), V201229 (hvn_lbll), V201230 (hvn_lbll), V201231x
##   (hvn_lbll), PartyID (fct), V201233 (hvn_lbll), TrustGovernment
##   (fct), V201237 (hvn_lbll), TrustPeople (fct), V201507x (hvn_lbll),
##   Age (dbl), AgeGroup (fct), V201510 (hvn_lbll), Education (fct),
##   V201546 (hvn_lbll), V201547a (hvn_lbll), V201547b (hvn_lbll),
##   V201547c (hvn_lbll), V201547d (hvn_lbll), V201547e (hvn_lbll),
##   V201547z (hvn_lbll), V201549x (hvn_lbll), RaceEth (fct), V201600
##   (hvn_lbll), Gender (fct), V201607 (hvn_lbll), V201610 (hvn_lbll),
##   V201611 (hvn_lbll), V201613 (hvn_lbll), V201615 (hvn_lbll), V201616
##   (hvn_lbll), V201617x (hvn_lbll), Income (fct), Income7 (fct),
##   V202051 (hvn_lbll), V202066 (hvn_lbll), V202072 (hvn_lbll),
##   VotedPres2020 (fct), V202073 (hvn_lbll), V202109x (hvn_lbll),
##   V202110x (hvn_lbll), VotedPres2020_selection (fct), EarlyVote2020
##   (fct)

We can examine this new object to learn more about the survey design; the output shows that the ANES is a “Stratified 1 - level Cluster Sampling design (with replacement) With (101) clusters”. Additionally, the output displays the sampling variables and then lists the remaining variables in the dataset. This design object will be used throughout this book to conduct survey analysis.

4.3.2 Residential Energy Consumption Survey (RECS) Design Object

The RECS documentation (U.S. Energy Information Administration 2023a) provides information on the survey’s sampling and weighting implications for analysis. The documentation shows the 2020 RECS uses Jackknife weights, where the main analytic weight is NWEIGHT, and the Jackknife weights are NWEIGHT1-NWEIGHT60. In the survey design object code, we can specify these in the weights and repweights arguments, respectively. With Jackknife weights, additional information is required: type, scale, and mse. Chapter 10 goes into depth about each of these arguments, but to quickly get started, the documentation lets us know that type = "JK1", scale = 59/60, and mse = TRUE. We can use the following code to create the survey design object:

recs_des <- recs_2020 %>%
  as_survey_rep(
    weights = NWEIGHT,
    repweights = NWEIGHT1:NWEIGHT60,
    type = "JK1",
    scale = 59 / 60,
    mse = TRUE
  )

recs_des

## Call: Called via srvyr
## Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances.
## Sampling variables:
## - repweights: `NWEIGHT1 + NWEIGHT2 + NWEIGHT3 + NWEIGHT4 + NWEIGHT5 +
##   NWEIGHT6 + NWEIGHT7 + NWEIGHT8 + NWEIGHT9 + NWEIGHT10 + NWEIGHT11 +
##   NWEIGHT12 + NWEIGHT13 + NWEIGHT14 + NWEIGHT15 + NWEIGHT16 +
##   NWEIGHT17 + NWEIGHT18 + NWEIGHT19 + NWEIGHT20 + NWEIGHT21 +
##   NWEIGHT22 + NWEIGHT23 + NWEIGHT24 + NWEIGHT25 + NWEIGHT26 +
##   NWEIGHT27 + NWEIGHT28 + NWEIGHT29 + NWEIGHT30 + NWEIGHT31 +
##   NWEIGHT32 + NWEIGHT33 + NWEIGHT34 + NWEIGHT35 + NWEIGHT36 +
##   NWEIGHT37 + NWEIGHT38 + NWEIGHT39 + NWEIGHT40 + NWEIGHT41 +
##   NWEIGHT42 + NWEIGHT43 + NWEIGHT44 + NWEIGHT45 + NWEIGHT46 +
##   NWEIGHT47 + NWEIGHT48 + NWEIGHT49 + NWEIGHT50 + NWEIGHT51 +
##   NWEIGHT52 + NWEIGHT53 + NWEIGHT54 + NWEIGHT55 + NWEIGHT56 +
##   NWEIGHT57 + NWEIGHT58 + NWEIGHT59 + NWEIGHT60`
## - weights: NWEIGHT
## Data variables:
## - DOEID (dbl), ClimateRegion_BA (fct), Urbanicity (fct), Region
##   (fct), REGIONC (chr), Division (fct), STATE_FIPS (chr),
##   state_postal (fct), state_name (fct), HDD65 (dbl), CDD65 (dbl),
##   HDD30YR (dbl), CDD30YR (dbl), HousingUnitType (fct), YearMade
##   (ord), TOTSQFT_EN (dbl), TOTHSQFT (dbl), TOTCSQFT (dbl),
##   ZTOTSQFT_EN (fct), ZYearMade (fct), ZHousingUnitType (fct),
##   SpaceHeatingUsed (lgl), ZSpaceHeatingUsed (fct), ACUsed (lgl),
##   ZACUsed (fct), ZACBehavior (fct), HeatingBehavior (fct),
##   WinterTempDay (dbl), WinterTempAway (dbl), WinterTempNight (dbl),
##   ACBehavior (fct), SummerTempDay (dbl), SummerTempAway (dbl),
##   SummerTempNight (dbl), ZHeatingBehavior (fct), ZWinterTempAway
##   (fct), ZSummerTempAway (fct), ZWinterTempDay (fct), ZSummerTempDay
##   (fct), ZWinterTempNight (fct), ZSummerTempNight (fct), NWEIGHT
##   (dbl), NWEIGHT1 (dbl), NWEIGHT2 (dbl), NWEIGHT3 (dbl), NWEIGHT4
##   (dbl), NWEIGHT5 (dbl), NWEIGHT6 (dbl), NWEIGHT7 (dbl), NWEIGHT8
##   (dbl), NWEIGHT9 (dbl), NWEIGHT10 (dbl), NWEIGHT11 (dbl), NWEIGHT12
##   (dbl), NWEIGHT13 (dbl), NWEIGHT14 (dbl), NWEIGHT15 (dbl), NWEIGHT16
##   (dbl), NWEIGHT17 (dbl), NWEIGHT18 (dbl), NWEIGHT19 (dbl), NWEIGHT20
##   (dbl), NWEIGHT21 (dbl), NWEIGHT22 (dbl), NWEIGHT23 (dbl), NWEIGHT24
##   (dbl), NWEIGHT25 (dbl), NWEIGHT26 (dbl), NWEIGHT27 (dbl), NWEIGHT28
##   (dbl), NWEIGHT29 (dbl), NWEIGHT30 (dbl), NWEIGHT31 (dbl), NWEIGHT32
##   (dbl), NWEIGHT33 (dbl), NWEIGHT34 (dbl), NWEIGHT35 (dbl), NWEIGHT36
##   (dbl), NWEIGHT37 (dbl), NWEIGHT38 (dbl), NWEIGHT39 (dbl), NWEIGHT40
##   (dbl), NWEIGHT41 (dbl), NWEIGHT42 (dbl), NWEIGHT43 (dbl), NWEIGHT44
##   (dbl), NWEIGHT45 (dbl), NWEIGHT46 (dbl), NWEIGHT47 (dbl), NWEIGHT48
##   (dbl), NWEIGHT49 (dbl), NWEIGHT50 (dbl), NWEIGHT51 (dbl), NWEIGHT52
##   (dbl), NWEIGHT53 (dbl), NWEIGHT54 (dbl), NWEIGHT55 (dbl), NWEIGHT56
##   (dbl), NWEIGHT57 (dbl), NWEIGHT58 (dbl), NWEIGHT59 (dbl), NWEIGHT60
##   (dbl), BTUEL (dbl), DOLLAREL (dbl), ZBTUEL (fct), BTUNG (dbl),
##   DOLLARNG (dbl), ZBTUNG (fct), BTULP (dbl), DOLLARLP (dbl), ZBTULP
##   (fct), BTUFO (dbl), DOLLARFO (dbl), ZBTUFO (fct), BTUWOOD (dbl),
##   ZBTUWOOD (fct), TOTALBTU (dbl), TOTALDOL (dbl)

Viewing this new object provides information about the survey design; the output shows that the RECS is an “unstratified cluster jacknife (JK1) with 60 replicates and MSE variances”. Additionally, the output shows the sampling variables (NWEIGHT and NWEIGHT1-NWEIGHT60) and then lists the remaining variables in the dataset. This design object will be used throughout this book to conduct survey analysis.
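Before wrapping up, a quick sanity check on the weights can catch setup mistakes early. This check is our own suggestion, not part of the original text; it assumes the anes_adjwgt and recs_2020 objects created above are still in the environment.

# The adjusted ANES weights should sum to the CPS-derived target population
anes_adjwgt %>%
  summarize(total_weight = sum(Weight))
# expected: roughly 231,034,125 (the targetpop value computed earlier)

# The RECS main weight should sum to the estimated number of U.S. households
recs_2020 %>%
  summarize(total_weight = sum(NWEIGHT))
# expected: roughly 123.5 million housing units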
This chapter walked through the installation and loading of several packages, introduced the survey data available in the {srvyrexploR} package, and provided context on creating survey design objects for the ANES and RECS datasets. With this foundational knowledge, we can follow the instructions listed in the Prerequisite boxes at the start of each chapter.

References

DeBell, Matthew. 2010. “How to Analyze ANES Survey Data.” ANES Technical Report Series nes012492. Palo Alto, CA: Stanford University; Ann Arbor, MI: The University of Michigan. https://electionstudies.org/wp-content/uploads/2018/05/HowToAnalyzeANESData.pdf.

U.S. Energy Information Administration. 2023a. “2020 Residential Energy Consumption Survey: Household Characteristics Technical Documentation Summary.” https://www.eia.gov/consumption/residential/data/2020/pdf/2020%20RECS_Methodology%20Report.pdf.

Notes:
- {broom} is already included in the tidyverse, so no separate installation is required.
- CPS data dictionary: https://www2.census.gov/programs-surveys/cps/datasets/2020/basic/2020_Basic_CPS_Public_Use_Record_Layout_plus_IO_Code_list.txt

Chapter 5 Descriptive Analyses in {srvyr}

Prerequisites

For this chapter, load the following packages:

library(tidyverse)
library(survey)
library(srvyr)
library(srvyrexploR)
library(broom)

To help explain the similarities between {dplyr} functions and {srvyr} functions, this chapter will use the mtcars and iris datasets that are built into R and the apistrat data that comes in the {survey} package:

data(api)

dstrata <- apistrat %>%
  as_survey_design(strata = stype, weights = pw)

We will also be using data from ANES and RECS described in Chapter 4. As a reminder, here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter 4 for more information).

targetpop <- 231592693

data(anes_2020)

anes_adjwgt <- anes_2020 %>%
  mutate(Weight = Weight / sum(Weight) * targetpop)

anes_des <- anes_adjwgt %>%
  as_survey_design(weights = Weight,
                   strata = Stratum,
                   ids = VarUnit,
                   nest = TRUE)

For RECS, details are included in the RECS documentation and Chapters 4 and 10.

data(recs_2020)

recs_des <- recs_2020 %>%
  as_survey_rep(weights = NWEIGHT,
                repweights = NWEIGHT1:NWEIGHT60,
                type = "JK1",
                scale = 59 / 60,
                mse = TRUE)

5.1 Introduction

Descriptive analyses, such as basic counts, cross-tabulations, or means, are one of the first steps a researcher takes before conducting statistical tests or developing models. Reviewing findings from descriptive analyses can help researchers glean insight into the data, the underlying population, and any unique aspects of the data or population. For example, if the data shows a proportion of males of only 10%, this could indicate either a unique population or a potential error in the data. Additionally, researchers can use descriptive analyses to provide means, proportions, or other measures to summarize the data and make estimates about the population.
We will discuss many different types of descriptive analyses in this chapter, but it is important to know what type of data we have and what statistics to use for that type of data. In survey data, we typically consider data to be one of these four main data types:

- Categorical/nominal data: variables with levels or descriptions that cannot be ordered, such as the region of the country (North, South, East, and West)
- Ordinal data: variables that can be ordered, such as those from a Likert scale (strongly disagree, disagree, agree, and strongly agree)
- Discrete data: variables that are counted or measured, such as number of children
- Continuous data: variables that are measured and whose values can lie anywhere on an interval, such as weight

When we pull the data from surveys into R, the data will be listed as character, factor, numeric, or logical/Boolean. They will not clearly indicate the type of survey data (e.g., ordinal). When working with survey data, researchers need to properly use the questionnaire and codebook along with the data (see Chapter 3) to understand what the values for each variable represent. For example, our survey data may represent categorical variables (e.g., the North, South, East, and West regions of the United States) using numeric codes (e.g., 1, 2, 3, and 4). Though this is a categorical variable from the survey, this variable might be automatically read as numeric values when we import our data into R. This can lead to the common mistake of applying a mean function to categorical values instead of a proportion function. Choosing appropriate measures is crucial to reach valid conclusions. Different variable types have distinct properties and levels of measurement, and we cannot apply all measures to all variables.

This chapter will discuss how to analyze measures of distribution (e.g., cross-tabulations), central tendency (e.g., means), relationship (e.g., ratios), and dispersion (e.g., standard deviations).

- Measures of distribution describe how often an event or response occurs. These measures include counts and totals.
- Measures of central tendency find the central (or average) responses. These measures include means and medians.
- Measures of relationship describe how variables relate to each other. These measures include correlations and ratios.
- Measures of dispersion describe how data spreads around the central tendency for continuous variables. These measures include standard deviations and variances.

Specifically, we will cover the following functions from the {srvyr} package:

- Count of observations (survey_count() and survey_tally())
- Summation of variables (survey_total())
- Means and proportions (survey_mean() and survey_prop())
- Quantiles and medians (survey_quantile() and survey_median())
- Correlations (survey_corr())
- Ratios (survey_ratio())
- Variances and standard deviations (survey_var() and survey_sd())

To incorporate each of these survey functions, recall the general process for survey estimation from Chapter 10:

1. Create a tbl_svy object using srvyr::as_survey_design() or srvyr::as_survey_rep().
2. Subset the data for subpopulations using srvyr::filter(), if needed.
3. Specify domains of analysis using srvyr::group_by(), if needed.
4. Analyze the data with survey-specific functions.

We have already discussed how to create the survey design objects in Chapter 10, and the code for creating these for the two datasets used in this chapter is provided in the Prerequisites box at the beginning of this chapter. We will apply the survey functions covered in this chapter in Step 4, as sketched below.
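As a minimal sketch of those four steps strung together, using the dstrata object from the Prerequisites (the particular filter and grouping here are illustrative choices, not from the original text):

# Step 1: start from a tbl_svy object (dstrata was created above)
dstrata %>%
  # Step 2: subset to a subpopulation, if needed
  filter(stype != "M") %>%
  # Step 3: specify domains of analysis, if needed
  group_by(stype) %>%
  # Step 4: analyze with a survey-specific function
  summarize(api00_mean = survey_mean(api00))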
To look at the data by different subgroups, we can choose to filter and/or group the data. It is very important that we filter and group the data only after creating the design object. This is necessary to ensure that the results accurately account for the survey design. Removing any data before creating the survey design object means that the data for those cases is not included in the survey design information and estimations of the variance.

5.2 Similarities Between {dplyr} and {srvyr} Functions

One of the major advantages of using {srvyr} is that it applies {dplyr}-like syntax to the {survey} package. We can use pipes to specify a tbl_svy object, apply a function, and then feed that output into the next function's first argument. Functions follow the 'tidy' convention of snake_case function names. The example below calculates the mean and median for the variable mpg (miles per gallon) in the mtcars dataset.

mtcars %>%
  summarize(mpg_mean = mean(mpg),
            mpg_median = median(mpg))
##   mpg_mean mpg_median
## 1    20.09       19.2

Similarly, in the next example, the mean and median of the variable api00 are calculated for the tbl_svy object dstrata. Note the similarity in the syntax. When we dig into the functions later, we will show that the results output are similar in that one row is output for each group (if there are groups), but there will be more columns output. Specifically, by default, the standard error of the statistic is calculated in addition to the statistic.

dstrata %>%
  summarize(api00_mean = survey_mean(api00),
            api00_med = survey_median(api00))
## # A tibble: 1 × 4
##   api00_mean api00_mean_se api00_med api00_med_se
##        <dbl>         <dbl>     <dbl>        <dbl>
## 1       662.          9.54       668         13.7

The functions in {srvyr} also play nicely with other tidyverse functions. If we wanted to select columns that have something in common, we use {tidyselect} functions such as starts_with(), num_range(), etc. In the examples below, we use a combination of across() and starts_with() to calculate the mean of variables starting with “Sepal” in the iris data frame and then of those starting with “api” in the dstrata survey object.

iris %>%
  summarize(across(starts_with("Sepal"), mean))
##   Sepal.Length Sepal.Width
## 1        5.843       3.057

dstrata %>%
  summarize(across(starts_with("api"), survey_mean))
## # A tibble: 1 × 6
##   api00 api00_se api99 api99_se api.stu api.stu_se
##   <dbl>    <dbl> <dbl>    <dbl>   <dbl>      <dbl>
## 1  662.     9.54  629.     10.1    498.       16.4

We can use {dplyr} verbs such as mutate(), filter(), etc., on our survey object.
dstrata_mod <- dstrata %>%
  mutate(api_diff = api00 - api99) %>%
  filter(stype == "E") %>%
  select(stype, api99, api00, api_diff, api_students = api.stu)

dstrata_mod
## Stratified Independent Sampling design (with replacement)
## Called via srvyr
## Sampling variables:
## - ids: `1`
## - strata: stype
## - weights: pw
## Data variables:
## - stype (fct), api99 (int), api00 (int), api_diff (int), api_students
##   (int)

dstrata
## Stratified Independent Sampling design (with replacement)
## Called via srvyr
## Sampling variables:
## - ids: `1`
## - strata: stype
## - weights: pw
## Data variables:
## - cds (chr), stype (fct), name (chr), sname (chr), snum (dbl), dname
##   (chr), dnum (int), cname (chr), cnum (int), flag (int), pcttest
##   (int), api00 (int), api99 (int), target (int), growth (int),
##   sch.wide (fct), comp.imp (fct), both (fct), awards (fct), meals
##   (int), ell (int), yr.rnd (fct), mobility (int), acs.k3 (int),
##   acs.46 (int), acs.core (int), pct.resp (int), not.hsg (int), hsg
##   (int), some.col (int), col.grad (int), grad.sch (int), avg.ed
##   (dbl), full (int), emer (int), enroll (int), api.stu (int), pw
##   (dbl), fpc (dbl)

Instead of data frames or tibbles, {srvyr} functions are meant for tbl_svy objects. Attempting to run data manipulation on non-tbl_svy objects will result in an error, as shown in the example below when using the mtcars data frame (which is not a tbl_svy object).

mtcars %>%
  summarize(mpg_mean = survey_mean(mpg))
## Error in `summarize()`:
## ℹ In argument: `mpg_mean = survey_mean(mpg)`.
## Caused by error in `cur_svy()` at gergness-srvyr-1917f75/R/survey_statistics.r:114:3:
## ! Survey context not set

A few functions in {srvyr} parallel functions in {dplyr}, such as srvyr::summarize() and srvyr::group_by(). Unlike {srvyr}-specific verbs, the package recognizes these parallel functions on a non-survey object. It will not error and will instead give the equivalent output from {dplyr}:

mtcars %>%
  srvyr::summarize(mpg_mean = mean(mpg))
##   mpg_mean
## 1    20.09

Because this book focuses on survey analysis, most of our pipes will stem from a survey object. We will not include the namespace for each function (e.g., srvyr::summarize()). Several functions in {srvyr} must be called within srvyr::summarize(), with the exceptions of srvyr::survey_count() and srvyr::survey_tally(), much like dplyr::count() and dplyr::tally() are not called within dplyr::summarize(). These verbs can be used in conjunction with group_by() or by/.by, applying the functions on a group-by-group basis to create grouped summaries.

mtcars %>%
  group_by(cyl) %>%
  dplyr::summarize(mpg_mean = mean(mpg))
## # A tibble: 3 × 2
##     cyl mpg_mean
##   <dbl>    <dbl>
## 1     4     26.7
## 2     6     19.7
## 3     8     15.1

We use a similar setup to summarize data in {srvyr}.

dstrata %>%
  group_by(stype) %>%
  summarize(api00_mean = survey_mean(api00),
            api00_median = survey_median(api00))
## # A tibble: 3 × 5
##   stype api00_mean api00_mean_se api00_median api00_median_se
##   <fct>      <dbl>         <dbl>        <dbl>           <dbl>
## 1 E           674.          12.5          671            20.7
## 2 H           626.          15.5          635            21.6
## 3 M           637.          16.6          648            24.1

5.3 Counts and Cross-Tabulations

With survey_count() and survey_tally(), we can calculate the estimated population counts for a given variable or combination of variables. Sometimes, these are referred to as cross-tabulations, or crosstabs for short. These summaries should be applied to categorical data and are used to get estimated counts of the population size of groups from the survey.
5.3.1 Syntax

The syntax for survey_count() is very similar to the dplyr::count() syntax; however, as noted above, it can only be called on tbl_svy objects. Let's explore the syntax:

survey_count(
  x,
  ...,
  wt = NULL,
  sort = FALSE,
  name = "n",
  .drop = dplyr::group_by_drop_default(x),
  vartype = c("se", "ci", "var", "cv")
)

The arguments are:

- x: a tbl_svy object created by as_survey
- ...: variables to group by, passed to group_by
- wt: a variable to weight on in addition to the survey weights, defaults to NULL
- sort: how to sort the variables, defaults to FALSE
- name: the name of the count variable, defaults to n
- .drop: whether to drop empty groups
- vartype: type(s) of variation estimate to calculate, including any of c("se", "ci", "var", "cv"), defaults to se (standard error) (see 5.3.1 for more information)

To capture a count or crosstabs by different variables, we include them in the (...) argument. This argument can take any number of variables and will break down the counts by all combinations of the provided variables. This is the same as with dplyr::count(). We can also obtain an estimate of the overall population by not including any variables in the (...) argument or by using the survey_tally() function. The survey_tally() function has a similar syntax to the survey_count() function, but it does not include the (...) or the .drop arguments:

survey_tally(
  x,
  wt,
  sort = FALSE,
  name = "n",
  vartype = c("se", "ci", "var", "cv")
)

Both functions include the vartype argument with four different values:

- se: standard error. The estimated standard deviation of the estimate. Output has a column with the variable name specified in the name argument with a suffix of “_se”.
- ci: confidence interval. The lower and upper limits of a confidence interval. Output has columns with the variable name specified in the name argument with suffixes of “_low” and “_upp”. By default, this is a 95% confidence interval, but it can be changed by using the argument level and specifying a number between 0 and 1. For example, level = 0.8 would produce an 80% confidence interval.
- var: variance. The estimated variance of the estimate. Output has a column with the variable name specified in the name argument with a suffix of “_var”.
- cv: coefficient of variation. A ratio of the standard error and the estimate. Output has a column with the variable name specified in the name argument with a suffix of “_cv”.

The confidence intervals are always calculated using a symmetric t-distribution based confidence interval as follows:

\[ \text{estimate} \pm t^*_{df} \times SE \]

where \(t^*_{df}\) is the critical value from a t-distribution based on the confidence level and the degrees of freedom. By default, the degrees of freedom are calculated based on the design or number of replicates, but they can be specified using the argument df. For survey design objects, the degrees of freedom are calculated as the number of PSUs minus the number of strata. For replicate-based objects, the degrees of freedom are calculated as one less than the rank of the matrix of replicate weights, where the number of replicates is typically the rank. Note that specifying df = Inf is equivalent to using a normal (z-based) confidence interval. These variability types are the same for most of the survey functions, and we will provide examples using different types of variability throughout this chapter.
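To make the formula concrete, here is a small worked sketch of a t-based interval computed by hand in base R. The estimate and standard error echo the RECS household count that appears in the examples below, and the degrees of freedom (60 replicates minus 1) are our illustrative choice, not a value stated in the original text.

# Hedged sketch: assembling a 95% t-based confidence interval by hand
est <- 123529025  # an estimate (the RECS household count from the examples below)
se  <- 0.148      # its standard error
df  <- 59         # for a replicate design: rank of the weight matrix minus 1
est + c(-1, 1) * qt(0.975, df) * se  # lower and upper limits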
5.3.2 Examples

Example 1: Estimated Population Count

If we wanted to obtain the estimated number of households in the U.S. (the target population) using the Residential Energy Consumption Survey (RECS) data, we could use survey_count(). If we do not specify any variables in the survey_count() function, it will output the estimated population count (n) and its standard error (n_se).

recs_des %>%
  survey_count()
## # A tibble: 1 × 2
##            n  n_se
##        <dbl> <dbl>
## 1 123529025. 0.148

Thus, the estimated number of households in the U.S. is 123,529,025.

We could also use the survey_tally() function, and the example below yields the same results as using survey_count() previously.

recs_des %>%
  survey_tally()
## # A tibble: 1 × 2
##            n  n_se
##        <dbl> <dbl>
## 1 123529025. 0.148

Example 2: Estimated Counts by Subgroups (Crosstabs)

To calculate the estimated number of observations for subgroups, such as Region and Division, we can add the variables of interest into the survey_count() function. In the example below, the estimated number of housing units by region and division is calculated. Additionally, one of the arguments allows us to change the name of the count variable from the default (n) using name =. In this case, we are changing the name to "N".

recs_des %>%
  survey_count(Region, Division, name = "N")
## # A tibble: 10 × 4
##    Region    Division                     N         N_se
##    <fct>     <fct>                    <dbl>        <dbl>
##  1 Northeast New England            5876166 0.0000000137
##  2 Northeast Middle Atlantic       16043503 0.0000000487
##  3 Midwest   East North Central    18546912 0.000000437
##  4 Midwest   West North Central     8495815 0.0000000177
##  5 South     South Atlantic        24843261 0.0000000418
##  6 South     East South Central     7380717. 0.114
##  7 South     West South Central    14619094 0.000488
##  8 West      Mountain North         4615844 0.119
##  9 West      Mountain South         4602070 0.0000000492
## 10 West      Pacific               18505643. 0.00000295

When we run the crosstab, we see there are an estimated 5,876,166 housing units in the New England Division.

If we wanted to use survey_tally() to output the same results, we would get an error if we try to use the same syntax as survey_count():

recs_des %>%
  survey_tally(Region, Division, name = "N")
## Error in `dplyr::summarise()` at gergness-srvyr-1917f75/R/summarise.r:10:3:
## ℹ In argument: `N = survey_total(Region, vartype = vartype,
##   na.rm = TRUE)`.
## Caused by error:
## ! Factor not allowed in survey functions, should be used as a grouping variable.

Instead, use the group_by() function prior to survey_tally() to obtain this crosstab:

recs_des %>%
  group_by(Region, Division) %>%
  survey_tally(name = "N")
## # A tibble: 10 × 4
## # Groups: Region [4]
##    Region    Division                     N         N_se
##    <fct>     <fct>                    <dbl>        <dbl>
##  1 Northeast New England            5876166 0.0000000137
##  2 Northeast Middle Atlantic       16043503 0.0000000487
##  3 Midwest   East North Central    18546912 0.000000437
##  4 Midwest   West North Central     8495815 0.0000000177
##  5 South     South Atlantic        24843261 0.0000000418
##  6 South     East South Central     7380717. 0.114
##  7 South     West South Central    14619094 0.000488
##  8 West      Mountain North         4615844 0.119
##  9 West      Mountain South         4602070 0.0000000492
## 10 West      Pacific               18505643. 0.00000295

5.4 Totals and Sums

The survey_total() function is analogous to sum(). It can be used to find the estimated aggregate sum of an outcome and should be applied to continuous variables to obtain the estimated total quantity in a population. All the functions introduced from this point on in this chapter must be called from within summarize().
5.4.1 Syntax

Here is the syntax:

survey_total(
  x,
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  deff = FALSE,
  df = NULL
)

The arguments are:

- x: a variable, expression, or empty
- na.rm: an indicator of whether missing values should be dropped, defaults to FALSE
- vartype: type(s) of variation estimate to calculate, including any of c("se", "ci", "var", "cv"), defaults to se (standard error) (see 5.3.1 for more information)
- level: a number or a vector indicating the confidence level, defaults to 0.95
- deff: a logical value stating whether the design effect should be returned, defaults to FALSE (this is described in more detail in Section 5.10.3)
- df: (for vartype = 'ci'), a numeric value indicating degrees of freedom for the t-distribution

5.4.2 Examples

Example 1: Estimated Population Count

To calculate a population count estimate with survey_total(), the argument x can be left empty, as shown in the example below:

recs_des %>%
  summarize(Tot = survey_total())
## # A tibble: 1 × 2
##          Tot Tot_se
##        <dbl>  <dbl>
## 1 123529025.  0.148

Note that the result from recs_des %>% summarize(survey_total()) is equivalent to the survey_count() and survey_tally() functions. However, the survey_total() function is called within summarize(), whereas survey_count() and survey_tally() are not.

Example 2: Overall Summation of Continuous Variables

The difference between survey_total() and survey_count() is more evident when specifying continuous variables to sum. Let's compute the total cost of electricity in whole dollars from the variable DOLLAREL.

recs_des %>%
  summarize(elec_bill = survey_total(DOLLAREL))
## # A tibble: 1 × 2
##       elec_bill elec_bill_se
##           <dbl>        <dbl>
## 1 170473527909.   664893504.

It is estimated that American residential households spent a total of $170,473,527,909 on electricity in 2020, and the estimate has a standard error of $664,893,504.

Example 3: Summation by Groups

As we are using the {srvyr} package, we can use group_by() to calculate the cost of electricity by different groups. Let's see how much the cost of electricity in whole dollars differed between regions and output the confidence interval instead of the default standard error.

recs_des %>%
  group_by(Region) %>%
  summarize(elec_bill = survey_total(DOLLAREL, vartype = "ci"))
## # A tibble: 4 × 4
##   Region       elec_bill elec_bill_low elec_bill_upp
##   <fct>            <dbl>         <dbl>         <dbl>
## 1 Northeast 29430369947.  28788987554.  30071752341.
## 2 Midwest   34972544751.  34339576041.  35605513460.
## 3 South     72496840204.  71534780902.  73458899506.
## 4 West      33573773008.  32909111702.  34238434313.

The survey results estimate that households in the Northeast spent $29,430,369,947 with a confidence interval of ($28,788,987,554, $30,071,752,341) on electricity in 2020, while households in the South spent an estimated $72,496,840,204 with a confidence interval of ($71,534,780,902, $73,458,899,506).

5.5 Means and Proportions

Means and proportions are the backbone of most research. The estimates calculated are often the first things we look for when reviewing research on a given topic. The survey_mean() and survey_prop() functions calculate means and proportions while incorporating the survey design elements. The survey_mean() function should be used on continuous variables of survey data, while the survey_prop() function should be used on categorical variables. These topics are grouped together because a proportion is simply a mean of a logical (Boolean) variable, as the sketch below illustrates.
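A short sketch of that equivalence, using the logical ACUsed variable from the RECS design object (the explicit as.numeric() conversion is our own choice for clarity, not required by the original text):

# The mean of a 0/1 indicator is a proportion: estimated share of
# households that use air conditioning
recs_des %>%
  summarize(p_ac = survey_mean(as.numeric(ACUsed), na.rm = TRUE))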
5.5.1 Syntax

The syntax for both means and proportions is very similar:

survey_mean(
  x,
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  proportion = FALSE,
  prop_method = c("logit", "likelihood", "asin", "beta", "mean"),
  deff = FALSE,
  df = NULL
)

survey_prop(
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  proportion = TRUE,
  prop_method = c("logit", "likelihood", "asin", "beta", "mean", "xlogit"),
  deff = FALSE,
  df = NULL
)

Both functions have the following arguments and defaults:

- na.rm: an indicator of whether missing values should be dropped, defaults to FALSE
- vartype: type(s) of variation estimate to calculate, including any of c("se", "ci", "var", "cv"), defaults to se (standard error) (see 5.3.1 for more information)
- level: a number or a vector indicating the confidence level, defaults to 0.95
- prop_method: method to calculate the confidence interval for proportions
- deff: a logical value stating whether the design effect should be returned, defaults to FALSE (this is described in more detail in Section 5.10.3)
- df: (for vartype = 'ci'), a numeric value indicating degrees of freedom for the t-distribution

There are two main differences in the syntax. The survey_mean() function includes the first argument of x, while survey_prop() does not. The x argument includes the variable or expression on which the mean should be calculated. There is no argument to include variables in the survey_prop() function. Instead, prior to summarize(), we need to use the group_by() function to specify the variables of interest. For survey_mean(), including a group_by() function will allow us to obtain the means by the different groups.

The other main difference is with the proportion argument. In the survey_mean() function, this defaults to FALSE, while in the survey_prop() function, this defaults to TRUE. This is because the survey_mean() function can be used to calculate both means and proportions. If we wish to calculate a proportion using this function, we will need to set the proportion argument to TRUE.

In Section 5.3.1, we provide an overview of the different variability types. The interval used in confidence intervals for most measures, such as means and counts, is referred to as a Wald-type interval. While a Wald-type interval using a symmetric t-based confidence interval is an option for proportions, this generally does not have the correct coverage rate when sample sizes are small and/or the proportion is “near” 0 or 1. Thus, other methods have been developed to calculate confidence intervals and can be specified using the prop_method option in survey_prop(); a sketch of how to request one of them follows the list. The options include:

- logit: fits a logistic regression model and computes a Wald-type interval on the log-odds scale, which is then transformed to the probability scale. This is the default method.
- likelihood: uses the (Rao-Scott) scaled chi-squared distribution for the log-likelihood from a binomial distribution.
- asin: uses the variance-stabilizing transformation for the binomial distribution, the arcsine square root, and then back-transforms the interval to the probability scale.
- beta: uses the incomplete beta function with an effective sample size based on the estimated variance of the proportion.
- mean: the Wald-type interval.
- xlogit: uses a logit transformation of the proportion, calculates a Wald-type interval, and then back-transforms to the probability scale. This method is implemented in SUDAAN and SPSS.
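As a sketch of how one of these methods is requested (the choice of "beta" here is illustrative, not a recommendation from the original text):

# Hedged sketch: a beta-method confidence interval for a proportion
recs_des %>%
  group_by(ACUsed) %>%
  summarize(p = survey_prop(vartype = "ci", prop_method = "beta"))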
Each option will provide slightly different confidence interval bounds when dealing with proportions. Please note that when working with survey_mean(), this method does not need to be specified unless the proportion argument is TRUE. 5.5.2 Examples Example 1: One Variable Proportion If we are interested in obtaining the proportion of people in each region in the RECS data, we can use group_by() and survey_prop(). recs_des %>% group_by(Region) %>% summarize(p = survey_prop()) ## When `proportion` is unspecified, `survey_prop()` now defaults to `proportion = TRUE`. ## ℹ This should improve confidence interval coverage. ## This message is displayed once per session. ## # A tibble: 4 × 3 ## Region p p_se ## <fct> <dbl> <dbl> ## 1 Northeast 0.177 2.12e-10 ## 2 Midwest 0.219 2.62e-10 ## 3 South 0.379 7.40e-10 ## 4 West 0.224 8.16e-10 17.7% of the households are in the Northeast, 21.9% in the Midwest, and so on. Note: survey_prop() is essentially the same as using survey_mean() with a categorical variable and without specifying a numeric variable in the x argument. The following code will give us the same results as above: recs_des %>% group_by(Region) %>% summarize(p = survey_mean()) ## # A tibble: 4 × 3 ## Region p p_se ## <fct> <dbl> <dbl> ## 1 Northeast 0.177 2.12e-10 ## 2 Midwest 0.219 2.62e-10 ## 3 South 0.379 7.40e-10 ## 4 West 0.224 8.16e-10 Example 2: Conditional Proportions It is possible to obtain proportions by more than one variable. In the following example, we look at the proportion of housing units by Region and whether air conditioning is used (ACUsed).8 recs_des %>% group_by(Region, ACUsed) %>% summarize(p = survey_prop()) ## # A tibble: 8 × 4 ## # Groups: Region [4] ## Region ACUsed p p_se ## <fct> <lgl> <dbl> <dbl> ## 1 Northeast FALSE 0.110 0.00590 ## 2 Northeast TRUE 0.890 0.00590 ## 3 Midwest FALSE 0.0666 0.00508 ## 4 Midwest TRUE 0.933 0.00508 ## 5 South FALSE 0.0581 0.00278 ## 6 South TRUE 0.942 0.00278 ## 7 West FALSE 0.255 0.00759 ## 8 West TRUE 0.745 0.00759 When specifying multiple variables, the proportions are conditional. In the results above, notice that the proportions sum to 1 within each region. This can be interpreted as the proportion of housing units with air conditioning within each region. Example 3: Joint Proportions If we want the joint proportion instead, the interact() function is necessary. In the example below, the interact() function is used on Region and ACUsed: recs_des %>% group_by(interact(Region, ACUsed)) %>% summarize(p = survey_prop()) ## # A tibble: 8 × 4 ## Region ACUsed p p_se ## <fct> <lgl> <dbl> <dbl> ## 1 Northeast FALSE 0.0196 0.00105 ## 2 Northeast TRUE 0.158 0.00105 ## 3 Midwest FALSE 0.0146 0.00111 ## 4 Midwest TRUE 0.204 0.00111 ## 5 South FALSE 0.0220 0.00106 ## 6 South TRUE 0.357 0.00106 ## 7 West FALSE 0.0573 0.00170 ## 8 West TRUE 0.167 0.00170 As noted earlier, both the survey_prop() and survey_mean() functions can be used here and will provide the same results. Example 4: Overall Mean We can calculate the estimated average cost of electricity in the U.S. and include both the standard error and the confidence interval: recs_des %>% summarize(elec_bill = survey_mean(DOLLAREL, vartype = c("se", "ci"))) ## # A tibble: 1 × 4 ## elec_bill elec_bill_se elec_bill_low elec_bill_upp ## <dbl> <dbl> <dbl> <dbl> ## 1 1380. 5.38 1369. 1391. Nationally, the average household spent $1,380 on electricity in 2020. Example 5: Means by Subgroup We can also calculate the estimated average cost of electricity in the U.S. by each region.
To do this, we include a group_by() function with the variable of interest before the summarize() function: recs_des %>% group_by(Region) %>% summarize(elec_bill = survey_mean(DOLLAREL)) ## # A tibble: 4 × 3 ## Region elec_bill elec_bill_se ## <fct> <dbl> <dbl> ## 1 Northeast 1343. 14.6 ## 2 Midwest 1293. 11.7 ## 3 South 1548. 10.3 ## 4 West 1211. 12.0 Households from the West spent $1,211 on electricity, and in the South, they spent an average of $1,548. 5.6 Quantiles and Medians To better understand the distribution of a continuous variable, quantiles can be calculated at specific points to help gain insight. For example, we might want estimates of the quartiles (25%, 50%, 75%) of income in a population to understand how the income is distributed. We use the survey_quantile() function to calculate quantiles in survey data. Medians are often used to find the midpoint of a continuous distribution when the data is considered to be skewed, as medians are less subject to outliers than means. The median in the data is the same as the 50th percentile. In other words, it is the value where 50% of the data is higher than it and 50% is lower. Medians are a special case of quantiles that are used more often; thus, a unique function has been created for it (survey_median()). We can calculate the median of the data using both the survey_median() function and the survey_quantile() function with the 50% quantile provided as an argument. 5.6.1 Syntax The syntax for survey_quantile() and survey_median() are nearly identical: survey_quantile( x, quantiles, na.rm = FALSE, vartype = c("se", "ci", "var", "cv"), level = 0.95, interval_type = c("mean", "beta", "xlogit", "asin", "score", "quantile"), qrule = c("math", "school", "shahvaish", "hf1", "hf2", "hf3", "hf4", "hf5", "hf6", "hf7", "hf8", "hf9"), df = NULL ) survey_median( x, na.rm = FALSE, vartype = c("se", "ci", "var", "cv"), level = 0.95, interval_type = c("mean", "beta", "xlogit", "asin", "score", "quantile"), qrule = c("math", "school", "shahvaish", "hf1", "hf2", "hf3", "hf4", "hf5", "hf6", "hf7", "hf8", "hf9"), df = NULL ) The arguments that are in both functions are: x: a variable, expression, or empty na.rm: an indicator of whether missing values should be dropped, defaults to FALSE vartype: type(s) of variation estimate to calculate, defaults to se (standard error) level: a number or a vector indicating the confidence level, defaults to 0.95 interval_type: method for calculating a confidence interval qrule: rule for defining quantiles. The default is the lower end of the quantile interval (“math”). The midpoint of the quantile interval is the “school” rule. “hf1” to “hf9” are weighted analogs to type=1 to 9 in quantile(). “shahvaish” corresponds to a rule proposed by Shah and Vaish (2006). See vignette(\"qrule\", package=\"survey\") for more information. df: (for vartype = 'ci'), a numeric value indicating degrees of freedom for the t-distribution The only difference between survey_quantile() and survey_median() is the inclusion of the quantiles argument in the survey_quantile() function. This argument takes a vector with values between 0 and 1 to indicate which quantiles to calculate. For example, if we wanted the quartiles of a variable, we would provide quantiles = c(0.25, 0.5, 0.75). While we can specify quantiles of 0 and 1, which represent the minimum and maximum, this is not recommended. 
It only returns the minimum and maximum of the respondents and cannot be extrapolated to the population as there is no valid definition of the standard error. In Section 5.3.1, we provide an overview of the different variability types. The interval used in confidence intervals for most measures, such as means and counts, is referred to as a Wald-type interval. However, like confidence intervals for proportions, this is not always the most accurate interval for quantiles. With quantiles, many of the interval type methods are the same as those for proportions (asin, beta, mean, and xlogit; see Section 5.5.1), with the addition of two more methods: score: the Francisco & Fuller confidence interval based on inverting a score test (only available for design-based survey objects and not replicate-based objects) quantile: based on the replicates of the quantile. This is not valid for jackknife-type replicates but is available for bootstrap and BRR replicates. One thing of note with the score method is that when there are many ties in the data, this method can produce confidence intervals that do not contain the estimate. When dealing with a high propensity for ties (e.g., many respondents will have the same age), it is recommended to use another method. SUDAAN implements the score method but adds noise to the values to prevent the issue with ties; additionally, the documentation in the {survey} package indicates this method generally has lower performance than the beta and logit intervals. 5.6.2 Examples Example 1: Overall Quartiles Quantiles are useful in learning about the distribution of a variable. Let's look at the quartiles, specifically the first quartile (p=0.25), the median (p=0.5), and the third quartile (p=0.75) of electric bills. recs_des %>% summarize(elec_bill = survey_quantile(DOLLAREL, quantiles = c(0.25, 0.5, 0.75))) ## # A tibble: 1 × 6 ## elec_bill_q25 elec_bill_q50 elec_bill_q75 elec_bill_q25_se ## <dbl> <dbl> <dbl> <dbl> ## 1 795. 1215. 1770. 5.69 ## # ℹ 2 more variables: elec_bill_q50_se <dbl>, elec_bill_q75_se <dbl> The output above shows the three quartiles and their respective standard errors. Example 2: Quartiles by Subgroup We can also estimate the quantiles of electric bills by region by incorporating the group_by() function: recs_des %>% group_by(Region) %>% summarize(elec_bill = survey_quantile(DOLLAREL, quantiles = c(0.25, 0.5, 0.75))) ## # A tibble: 4 × 7 ## Region elec_bill_q25 elec_bill_q50 elec_bill_q75 elec_bill_q25_se ## <fct> <dbl> <dbl> <dbl> <dbl> ## 1 Northeast 740. 1148. 1712. 13.7 ## 2 Midwest 769. 1149. 1632. 8.88 ## 3 South 968. 1402. 1945. 10.6 ## 4 West 623. 1028. 1568. 10.8 ## # ℹ 2 more variables: elec_bill_q50_se <dbl>, elec_bill_q75_se <dbl> Example 3: Minimum and Maximum As mentioned in the syntax section, we can specify quantiles of 0 (minimum) and 1 (maximum). R will calculate these two values and provide results. However, these are only the minimum and maximum values in the data. There is not sufficient information to determine what the standard errors should be: recs_des %>% summarize(elec_bill = survey_quantile(DOLLAREL, quantiles = c(0, 1))) ## # A tibble: 1 × 4 ## elec_bill_q00 elec_bill_q100 elec_bill_q00_se elec_bill_q100_se ## <dbl> <dbl> <dbl> <dbl> ## 1 -151. 15680. NaN 0 Example 4: Overall Median We can calculate the estimated median cost of electricity in the U.S.
using the survey_median() function: recs_des %>% summarize(elec_bill = survey_median(DOLLAREL)) ## # A tibble: 1 × 2 ## elec_bill elec_bill_se ## <dbl> <dbl> ## 1 1215. 6.33 Nationally, the median amount households spent on electricity in 2020 was $1,215. This is the same result as we obtained using the survey_quantile() function ($1,215). It is also interesting to note that the average electric bill for households that we calculated in Section 5.5 is $1,380, but the estimated median electric bill is $1,215, indicating the distribution is likely right-skewed. Example 5: Medians by Subgroup We can also calculate the estimated median cost of electricity in the U.S. by each region. This is similar to finding the mean by region in that we include a group_by() function with the variable of interest before the summarize() function: recs_des %>% group_by(Region) %>% summarize(elec_bill = survey_median(DOLLAREL)) ## # A tibble: 4 × 3 ## Region elec_bill elec_bill_se ## <fct> <dbl> <dbl> ## 1 Northeast 1148. 16.6 ## 2 Midwest 1149. 11.6 ## 3 South 1402. 9.17 ## 4 West 1028. 14.3 The median electricity expenditure was $1,028 for households in the West and $1,402 for households in the South. 5.7 Ratios Many analysts are less familiar with the ratio estimate. A ratio estimate is the ratio of the sums of two variables, specifically of the form: \\[ \\frac{\\sum x_i}{\\sum y_i}.\\] The ratio is not the same as calculating the following: \\[ \\frac{1}{N} \\sum \\frac{x_i}{y_i} \\] which could be calculated with survey_mean() by creating a derived variable \\(z=x/y\\) and then calculating the mean of \\(z\\). Consider a survey of police agencies in the United States. We might want to estimate the ratio of female police officers to total police officers. We could run survey_ratio(Female_Officers, Total_Officers). If, instead, we used survey_mean(Female_Officers/Total_Officers), we would be estimating the average proportion of female officers across agencies, which is a different quantity. 5.7.1 Syntax The syntax for survey_ratio() is as follows: survey_ratio( numerator, denominator, na.rm = FALSE, vartype = c("se", "ci", "var", "cv"), level = 0.95, deff = FALSE, df = NULL ) The arguments are: numerator: The numerator of the ratio denominator: The denominator of the ratio na.rm: A logical value to indicate whether missing values should be dropped vartype: type(s) of variation estimate to calculate including any of c(\"se\", \"ci\", \"var\", \"cv\"), defaults to se (standard error) (see 5.3.1 for more information) level: A single number or vector of numbers indicating the confidence level deff: A logical value to indicate whether the design effect should be returned (this is described in more detail in Section 5.10.3) df: (For vartype = "ci" only) A numeric value indicating the degrees of freedom for the t-distribution 5.7.2 Examples Example 1: Overall Ratios Suppose we wanted to find the ratio of dollars spent on liquid propane per unit (in British thermal units [Btu]) nationally9. If we wanted to find the average cost to a household, we could use survey_mean(), but to find the national unit rate, we can use a ratio.
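To make the distinction concrete, consider two hypothetical agencies (invented numbers, for illustration only): one with 2 female officers out of 10 total, and one with 30 out of 40. In plain, unweighted R: female <- c(2, 30) total <- c(10, 40) sum(female) / sum(total) ## [1] 0.64 mean(female / total) ## [1] 0.475 The ratio of sums is 32/50 = 0.64, while the mean of the agency-level ratios is (0.20 + 0.75)/2 = 0.475; with survey data, survey_ratio() and survey_mean() target these two different quantities, with the weights applied.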
In the following example, we will show both methods and discuss the interpretation of each: recs_des %>% summarize(DOLLARLP_Tot = survey_total(DOLLARLP, vartype = NULL), BTULP_Tot = survey_total(BTULP, vartype = NULL), DOL_BTU_Rat = survey_ratio(DOLLARLP, BTULP), DOL_BTU_Avg = survey_mean(DOLLARLP/BTULP, na.rm = TRUE)) ## # A tibble: 1 × 6 ## DOLLARLP_Tot BTULP_Tot DOL_BTU_Rat DOL_BTU_Rat_se DOL_BTU_Avg ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 8122911173. 391425311586. 0.0208 0.000240 0.0240 ## # ℹ 1 more variable: DOL_BTU_Avg_se <dbl> The ratio of the total spent on liquid propane to the total consumption was 0.0208, but the average rate was 0.024. With a bit of calculation, we can show that the ratio is the ratio of the totals DOLLARLP_Tot/BTULP_Tot=8,122,911,173/391,425,311,586=0.0208. While the ratio could be calculated manually in this manner, the standard error requires the use of the survey_ratio() function. The average can be interpreted as the average rate paid by a household. Example 2: Ratios by Subgroup As previously done with other estimates, we can use group_by() to examine whether this rate varies by region. recs_des %>% group_by(Region) %>% summarize(DOL_BTU_Rat = survey_ratio(DOLLARLP, BTULP)) ## # A tibble: 4 × 3 ## Region DOL_BTU_Rat DOL_BTU_Rat_se ## <fct> <dbl> <dbl> ## 1 Northeast 0.0247 0.000488 ## 2 Midwest 0.0158 0.000240 ## 3 South 0.0245 0.000388 ## 4 West 0.0246 0.000875 Though not a statistical test, it does appear the cost rates in the Midwest for liquid propane are the lowest. 5.8 Correlations The correlation is a measure of linear relationship between two continuous variables, which ranges between -1 and 1. The most common one used is Pearson’s correlation (referred to as correlation henceforth). A sample correlation for a simple random sample is calculated as follows: \\[\\frac{\\sum (x_i-\\bar{x})(y_i-\\bar{y})}{\\sqrt{\\sum (x_i-\\bar{x})^2} \\sqrt{\\sum(y_i-\\bar{y})^2}} \\] When using survey_corr() for designs other than a simple random sample, the weights are applied when estimating the correlation. 5.8.1 Syntax The syntax for survey_corr() is as follows: survey_corr( x, y, na.rm = FALSE, vartype = c("se", "ci", "var", "cv"), level = 0.95, df = NULL ) The arguments are: x: A variable or expression y: A variable or expression na.rm: A logical value to indicate whether missing values should be dropped vartype: type(s) of variation estimate to calculate including any of c(\"se\", \"ci\", \"var\", \"cv\"), defaults to se (standard error) (see 5.3.1 for more information) level: (For vartype = “ci” only) A single number or vector of numbers indicating the confidence level df: (For vartype = “ci” only) A numeric value indicating the degrees of freedom for t-distribution 5.8.2 Examples Example 1: Overall Correlation We can calculate the correlation between total square footage (TOTSQFT_EN)10 and electricity consumption (BTUEL)11. recs_des %>% summarize(SQFT_Elec_Corr = survey_corr(TOTSQFT_EN, BTUEL)) ## # A tibble: 1 × 2 ## SQFT_Elec_Corr SQFT_Elec_Corr_se ## <dbl> <dbl> ## 1 0.417 0.00689 Example 2: Correlations by Subgroup Like with other statistics, we can do this by subgroups. For example, we can examine the correlation by whether air conditioning is used (ACUsed). 
recs_des %>% group_by(ACUsed) %>% summarize(SQFT_Elec_Corr = survey_corr(TOTSQFT_EN, DOLLAREL)) ## # A tibble: 2 × 3 ## ACUsed SQFT_Elec_Corr SQFT_Elec_Corr_se ## <lgl> <dbl> <dbl> ## 1 FALSE 0.290 0.0240 ## 2 TRUE 0.401 0.00808 5.9 Standard Deviation and Variance All survey functions produce an estimate of the variability of a given estimate, so no additional function is needed for variability estimates. However, if estimates of the population variance and population standard deviation are needed, we can use the survey_var() and survey_sd() functions. In our experience, most researchers will not use these functions. These are sometimes used when designing a future study, as understanding the variability in the population can help inform the precision of a future sampling design. 5.9.1 Syntax As with non-survey data, the standard deviation estimate is the square root of the variance estimate, and thus, the functions have the same arguments, except that survey_sd() does not have the vartype argument. survey_var( x, na.rm = FALSE, vartype = c("se", "ci", "var"), level = 0.95, df = NULL ) survey_sd( x, na.rm = FALSE ) The arguments are: x: A variable or expression, or empty na.rm: A logical value to indicate whether missing values should be dropped vartype: type(s) of variation estimate to calculate including any of c(\"se\", \"ci\", \"var\"), defaults to se (standard error) (see 5.3.1 for more information) level: (For vartype = "ci" only) A single number or vector of numbers indicating the confidence level. df: (For vartype = "ci" only) A numeric value indicating the degrees of freedom for the t-distribution 5.9.2 Examples Example 1: Overall Variability Returning to electricity bills, we look at the variability in electricity expenditure. recs_des %>% summarize(var_elbill = survey_var(DOLLAREL), sd_elbill = survey_sd(DOLLAREL)) ## Warning: There were 2 warnings in `dplyr::summarise()`. ## The first warning was: ## ℹ In argument: `var_elbill = survey_var(DOLLAREL)`. ## Caused by warning in `thetas - meantheta`: ## ! Recycling array of length 1 in vector-array arithmetic is deprecated. ## Use c() or as.vector() instead. ## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning. ## # A tibble: 1 × 3 ## var_elbill var_elbill_se sd_elbill ## <dbl> <dbl> <dbl> ## 1 704906. 13926. 840. A warning message may be displayed if using a replicate design. The results are still valid. The results above give an estimate of the population variance of electricity bills (var_elbill), the standard error of that variance (var_elbill_se), and the estimated population standard deviation of electricity bills (sd_elbill). Note that no standard error is associated with the standard deviation; it is the only estimate that does not include one. Example 2: Variability by Subgroup Like other estimates, we can calculate the variance by region. This would be useful to learn if the variability is similar across regions: recs_des %>% group_by(Region) %>% summarize(var_elbill = survey_var(DOLLAREL), sd_elbill = survey_sd(DOLLAREL)) ## Warning: There were 8 warnings in `dplyr::summarise()`. ## The first warning was: ## ℹ In argument: `var_elbill = survey_var(DOLLAREL)`. ## ℹ In group 1: `Region = Northeast`. ## Caused by warning in `thetas - meantheta`: ## ! Recycling array of length 1 in vector-array arithmetic is deprecated. ## Use c() or as.vector() instead. ## ℹ Run `dplyr::last_dplyr_warnings()` to see the 7 remaining warnings.
## # A tibble: 4 × 4 ## Region var_elbill var_elbill_se sd_elbill ## <fct> <dbl> <dbl> <dbl> ## 1 Northeast 775450. 38843. 881. ## 2 Midwest 552423. 25252. 743. ## 3 South 702521. 30641. 838. ## 4 West 717886. 30597. 847. 5.10 Additional Topics 5.10.1 Unweighted Analysis Sometimes, it is helpful to calculate an unweighted estimate of a given variable. For this, we use the unweighted() function in the summarize() function. The unweighted() function calculates unweighted summaries from tbl_svy object, which reflects the summary among the respondents and does not extrapolate to a population estimate. The unweighted function can be used in conjunction with any {dplyr} functions. Here is an example looking at the average household electricity cost. recs_des %>% summarize(elec_bill = survey_mean(DOLLAREL), elec_unweight = unweighted(mean(DOLLAREL))) ## # A tibble: 1 × 3 ## elec_bill elec_bill_se elec_unweight ## <dbl> <dbl> <dbl> ## 1 1380. 5.38 1425. It is estimated that American residential households spent an average of $1,380 on electricity in 2020, and the estimate has a standard error of $5. The unweighted function calculates the unweighted average and illustrates the average amount of money the respondents spent on electricity in 2020, which was $1,425. 5.10.2 Subpopulation Analysis Briefly, we mentioned using filter() to subset a survey object for analysis. This operation should be done after creating the design object. In rare circumstances, subsetting data before creating the object can lead to incorrect variability estimates. This can occur if subsetting removes an entire PSU. Suppose we wanted estimates of the average amount spent on natural gas among housing units that use natural gas using the variable BTUNG12. This could be obtained by first filtering records to only include records where BTUNG > 0 and then finding the average amount of money spent. recs_des %>% filter(BTUNG > 0) %>% summarize(NG_mean = survey_mean(DOLLARNG, vartype = c("se", "ci"))) ## # A tibble: 1 × 4 ## NG_mean NG_mean_se NG_mean_low NG_mean_upp ## <dbl> <dbl> <dbl> <dbl> ## 1 631. 4.64 621. 640. Note that this yields a higher mean than when not applying the filter. When including housing units that do not use natural gas, many $0 amounts are included in the mean calculation. recs_des %>% summarize(NG_mean = survey_mean(DOLLARNG, vartype = c("se", "ci"))) ## # A tibble: 1 × 4 ## NG_mean NG_mean_se NG_mean_low NG_mean_upp ## <dbl> <dbl> <dbl> <dbl> ## 1 382. 3.41 375. 389. 5.10.3 Design Effects The design effect measures how the precision of an estimate is impacted by the sampling design. A design effect is calculated as the ratio of the variance of an estimate under the design at hand to the variance of the estimate under a simple random sample without replacement (SRS). A design effect less than 1 indicates that the design is more statistically efficient than a SRS design. This is rare but possible in a stratified sampling design where the outcome is correlated with the stratification variable(s). A design effect greater than 1 indicates that the design is less statistically efficient than a SRS design. From a design effect, we can calculate the effective sample size as follows: \\[n_{eff}=\\frac{n}{D_{eff}} \\] where \\(n\\) is the nominal sample size (number of survey responses) and \\(D_{eff}\\) is the estimated design effect. 
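For example, with hypothetical values of \\(n = 1000\\) responses and an estimated design effect of \\(D_{eff} = 1.25\\), the effective sample size would be \\(n_{eff} = 1000/1.25 = 800\\).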
The effective sample size has an interesting interpretation that a survey using an SRS design would need a sample size of \\(n_{eff}\\) to obtain the same precision as the design at hand, which is where the efficiency interpretation comes in. Design effects are outcome-specific — outcomes that are less clustered in the population have smaller design effects than outcomes that are clustered. In the {srvyr} package, design effects can be calculated for totals, proportions, means, and ratio estimates by setting the deff argument to TRUE in the corresponding functions. For example, the design effect can be calculated for the average consumption of electricity (BTUEL), natural gas (BTUNG), liquid propane (BTULP), fuel oil (BTUFO), and wood (BTUWOOD). recs_des %>% summarize(across(c(BTUEL, BTUNG, BTULP, BTUFO, BTUWOOD), ~survey_mean(.x, deff = TRUE, vartype = NULL))) %>% select(ends_with("deff")) ## # A tibble: 1 × 5 ## BTUEL_deff BTUNG_deff BTULP_deff BTUFO_deff BTUWOOD_deff ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.597 0.938 1.21 0.720 1.10 5.10.4 Creating Summary Rows When using group_by() in analysis, results are returned with a row for each group or group combination. Often, we want both the breakdowns by group and a summary row for the estimate for the entire population. For example, we may want the average electricity consumption by region AND nationally. The {srvyr} package has a function cascade(), which adds summary rows for the total of a group. It is used in place of summarize() and has similar functions along with some additional features. Syntax The syntax is as follows: cascade( .data, ..., .fill = NA, .fill_level_top = FALSE, .groupings = NULL ) where the arguments are: .data: A tbl_svy object ...: Name-value pairs of summary functions (same as the summarize() function) .fill: Value to fill in for group summaries (defaults to NA) .fill_level_top: When filling factor variables, whether to put the value ‘.fill’ in the first position (defaults to FALSE, placing it in the bottom). .groupings: (Experimental) A list of quosures to manually specify the groupings to use, rather than the default. Example First, let’s look at a simple example and then build on it to examine the features of the function. In the first example, all default values are used. recs_des %>% group_by(Region) %>% cascade(DOLLAREL_mn = survey_mean(DOLLAREL)) ## # A tibble: 5 × 3 ## Region DOLLAREL_mn DOLLAREL_mn_se ## <fct> <dbl> <dbl> ## 1 Northeast 1343. 14.6 ## 2 Midwest 1293. 11.7 ## 3 South 1548. 10.3 ## 4 West 1211. 12.0 ## 5 <NA> 1380. 5.38 The last row where Region = NA is the national average electricity bill. We might wish to have a better name for it and can do that using the .fill argument. recs_des %>% group_by(Region) %>% cascade(DOLLAREL_mn = survey_mean(DOLLAREL), .fill = "National") ## # A tibble: 5 × 3 ## Region DOLLAREL_mn DOLLAREL_mn_se ## <fct> <dbl> <dbl> ## 1 Northeast 1343. 14.6 ## 2 Midwest 1293. 11.7 ## 3 South 1548. 10.3 ## 4 West 1211. 12.0 ## 5 National 1380. 5.38 We can also have more than one grouping variable as follows: recs_des %>% group_by(Region, Urbanicity) %>% cascade(DOLLAREL_mn = survey_mean(DOLLAREL), .fill = "Total") %>% ungroup() ## # A tibble: 17 × 4 ## Region Urbanicity DOLLAREL_mn DOLLAREL_mn_se ## <fct> <fct> <dbl> <dbl> ## 1 Northeast Urban Area 1315. 15.9 ## 2 Northeast Urban Cluster 1218. 59.5 ## 3 Northeast Rural 1529. 38.1 ## 4 Northeast Total 1343. 14.6 ## 5 Midwest Urban Area 1186. 13.6 ## 6 Midwest Urban Cluster 1214. 33.7 ## 7 Midwest Rural 1633. 
32.1 ## 8 Midwest Total 1293. 11.7 ## 9 South Urban Area 1466. 12.9 ## 10 South Urban Cluster 1473. 29.3 ## 11 South Rural 1812. 22.1 ## 12 South Total 1548. 10.3 ## 13 West Urban Area 1179. 13.2 ## 14 West Urban Cluster 1174. 43.4 ## 15 West Rural 1544. 43.5 ## 16 West Total 1211. 12.0 ## 17 Total Total 1380. 5.38 We can move the summary row to the first row: recs_des %>% group_by(Region) %>% cascade(DOLLAREL_mn = survey_mean(DOLLAREL), .fill = "National", .fill_level_top = TRUE) %>% ungroup() ## # A tibble: 5 × 3 ## Region DOLLAREL_mn DOLLAREL_mn_se ## <fct> <dbl> <dbl> ## 1 National 1380. 5.38 ## 2 Northeast 1343. 14.6 ## 3 Midwest 1293. 11.7 ## 4 South 1548. 10.3 ## 5 West 1211. 12.0 5.10.5 Calculating Estimates for Many Outcomes Often, we are interested in a summary statistic across many variables. Two useful tools in doing this are the across() function in {dplyr} which has been shown a few times above and the map() function in {purrr}. The across() function allows you to apply the same function to several columns within summarize(). This works well for usage with all functions shown above except survey_prop(). In a later example, we will tackle several proportions. Example 1: across() Suppose we want to calculate the total consumption for each fuel type and the average consumption for each fuel type with coefficients of variation. These include the consumption of electricity (BTUEL), natural gas (BTUNG), liquid propane (BTULP), fuel oil (BTUFO), and wood (BTUWOOD), as illustrated in the discussion on design effects. These are the only variables that start with “BTU”, so we can use that to our advantage. consumption_ests <- recs_des %>% summarize(across(starts_with("BTU"), list(Total = ~survey_total(.x, vartype = "cv"), Mean = ~survey_mean(.x, vartype = "cv")), .unpack = "{outer}.{inner}")) consumption_ests ## # A tibble: 1 × 20 ## BTUEL_Total.coef BTUEL_Total._cv BTUEL_Mean.coef BTUEL_Mean._cv ## <dbl> <dbl> <dbl> <dbl> ## 1 4453284510065 0.00377 36051. 0.00377 ## # ℹ 16 more variables: BTUNG_Total.coef <dbl>, BTUNG_Total._cv <dbl>, ## # BTUNG_Mean.coef <dbl>, BTUNG_Mean._cv <dbl>, ## # BTULP_Total.coef <dbl>, BTULP_Total._cv <dbl>, ## # BTULP_Mean.coef <dbl>, BTULP_Mean._cv <dbl>, ## # BTUFO_Total.coef <dbl>, BTUFO_Total._cv <dbl>, ## # BTUFO_Mean.coef <dbl>, BTUFO_Mean._cv <dbl>, ## # BTUWOOD_Total.coef <dbl>, BTUWOOD_Total._cv <dbl>, … In the example above, this results in a very wide table. We may instead want a row for each fuel type. Using the pivot_longer() and pivot_wider() functions from {tidyr} can help us get there. 
We will first make the data longer and split out the components of the name with pivot_longer(): consumption_ests_long <- consumption_ests %>% pivot_longer(cols = everything(), names_to = c("FuelType", "Stat", "Type"), names_pattern = "BTU(.*)_(.*)\\\\.(.*)") consumption_ests_long ## # A tibble: 20 × 4 ## FuelType Stat Type value ## <chr> <chr> <chr> <dbl> ## 1 EL Total coef 4.45e+12 ## 2 EL Total _cv 3.77e- 3 ## 3 EL Mean coef 3.61e+ 4 ## 4 EL Mean _cv 3.77e- 3 ## 5 NG Total coef 4.24e+12 ## 6 NG Total _cv 9.08e- 3 ## 7 NG Mean coef 3.43e+ 4 ## 8 NG Mean _cv 9.08e- 3 ## 9 LP Total coef 3.91e+11 ## 10 LP Total _cv 3.80e- 2 ## 11 LP Mean coef 3.17e+ 3 ## 12 LP Mean _cv 3.80e- 2 ## 13 FO Total coef 3.96e+11 ## 14 FO Total _cv 3.43e- 2 ## 15 FO Mean coef 3.20e+ 3 ## 16 FO Mean _cv 3.43e- 2 ## 17 WOOD Total coef 3.45e+11 ## 18 WOOD Total _cv 4.54e- 2 ## 19 WOOD Mean coef 2.79e+ 3 ## 20 WOOD Mean _cv 4.54e- 2 Then, we make the names for each element more descriptive and informative before using pivot_wider() to create a table that is almost ready for publication. A bit more on that will be covered in Chapter 8. consumption_ests_long %>% mutate(Type = case_when(Type == "coef" ~ "", Type == "_cv" ~ " (CV)")) %>% pivot_wider(id_cols = FuelType, names_from = c(Stat, Type), names_glue = "{Stat}{Type}", values_from = value) ## # A tibble: 5 × 5 ## FuelType Total `Total (CV)` Mean `Mean (CV)` ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 EL 4.45e12 0.00377 36051. 0.00377 ## 2 NG 4.24e12 0.00908 34330. 0.00908 ## 3 LP 3.91e11 0.0380 3169. 0.0380 ## 4 FO 3.96e11 0.0343 3203. 0.0343 ## 5 WOOD 3.45e11 0.0454 2794. 0.0454 Example 2: Proportions with across() As mentioned earlier, proportions will not work as well directly with the across() method. If we want the proportion of houses with air conditioning and the proportion of houses with heating, we need two group_by() statements as follows: recs_des %>% group_by(ACUsed) %>% summarize(p = survey_prop()) ## # A tibble: 2 × 3 ## ACUsed p p_se ## <lgl> <dbl> <dbl> ## 1 FALSE 0.113 0.00306 ## 2 TRUE 0.887 0.00306 recs_des %>% group_by(SpaceHeatingUsed) %>% summarize(p = survey_prop()) ## # A tibble: 2 × 3 ## SpaceHeatingUsed p p_se ## <lgl> <dbl> <dbl> ## 1 FALSE 0.0469 0.00207 ## 2 TRUE 0.953 0.00207 If we are only interested in the TRUE outcomes, that is, the proportion that have air conditioning and the proportion that have heating, we can use the fact that survey_mean() applied to a logical variable is the same as using survey_prop(), as shown below: cool_heat_tab <- recs_des %>% summarize(across(c(ACUsed, SpaceHeatingUsed), ~survey_mean(.x), .unpack = "{outer}.{inner}")) cool_heat_tab ## # A tibble: 1 × 4 ## ACUsed.coef ACUsed._se SpaceHeatingUsed.coef SpaceHeatingUsed._se ## <dbl> <dbl> <dbl> <dbl> ## 1 0.887 0.00306 0.953 0.00207 Note that the estimates are the same as when using the separate group_by() statements. Like previously done, we can use pivot_longer() to create a table in a format better suited for distribution. cool_heat_tab %>% pivot_longer(everything(), names_to = c("Comfort", ".value"), names_pattern = "(.*)\\\\.(.*)") %>% rename(p = coef, se = `_se`) ## # A tibble: 2 × 3 ## Comfort p se ## <chr> <dbl> <dbl> ## 1 ACUsed 0.887 0.00306 ## 2 SpaceHeatingUsed 0.953 0.00207 Example 3: purrr::map() Loops are a common tool if we want to calculate the same thing for many elements. The {purrr} package has the map() functions. Like a loop, they allow you to do something in the same way many times. 
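For instance, as a minimal sketch unrelated to the survey data, map(c(4, 9, 16), sqrt) applies sqrt() to each element and returns a list containing 2, 3, and 4, one result per input element, just as a loop would produce.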
In our case, we may want to calculate proportions from the same design multiple times. An easy way to do this is to think about how we would do it for one outcome, build a function from there, and then iterate. Suppose we want to create a table that shows the proportion of people who trust in their government (TrustGovernment)13 as well as those who trust in people (TrustPeople)14. First, we do this for a single variable. We create a table that has the variable name as a column, the answer as a column, and then the percentage and its standard error. anes_des %>% drop_na(TrustGovernment) %>% group_by(TrustGovernment) %>% summarize(p = survey_prop() * 100) %>% mutate(Variable = "TrustGovernment") %>% rename(Answer = TrustGovernment) %>% select(Variable, everything()) ## # A tibble: 5 × 4 ## Variable Answer p p_se ## <chr> <fct> <dbl> <dbl> ## 1 TrustGovernment Always 1.55 0.204 ## 2 TrustGovernment Most of the time 13.2 0.553 ## 3 TrustGovernment About half the time 30.9 0.829 ## 4 TrustGovernment Some of the time 43.4 0.855 ## 5 TrustGovernment Never 11.0 0.566 Then, we create a function to replace TrustGovernment with the function's argument. To do this, we need to use a bit of tidy evaluation, which is a more advanced skill. If you want to learn more, we recommend Wickham (2019). calcps <- function(var) { anes_des %>% drop_na(!!sym(var)) %>% group_by(!!sym(var)) %>% summarize(p = survey_prop() * 100) %>% mutate(Variable = var) %>% rename(Answer := !!sym(var)) %>% select(Variable, everything()) } We can then run this function on the two variables of interest: calcps("TrustGovernment") ## # A tibble: 5 × 4 ## Variable Answer p p_se ## <chr> <fct> <dbl> <dbl> ## 1 TrustGovernment Always 1.55 0.204 ## 2 TrustGovernment Most of the time 13.2 0.553 ## 3 TrustGovernment About half the time 30.9 0.829 ## 4 TrustGovernment Some of the time 43.4 0.855 ## 5 TrustGovernment Never 11.0 0.566 calcps("TrustPeople") ## # A tibble: 5 × 4 ## Variable Answer p p_se ## <chr> <fct> <dbl> <dbl> ## 1 TrustPeople Always 0.809 0.164 ## 2 TrustPeople Most of the time 41.4 0.857 ## 3 TrustPeople About half the time 28.2 0.776 ## 4 TrustPeople Some of the time 24.5 0.670 ## 5 TrustPeople Never 5.05 0.422 Finally, we can use map() to iterate over as many variables as we want. It will output a tibble with the variable name in the column "Variable", the responses in "Answer", the percentage, and then the standard error. This example extends nicely if we have many variables for which we want the percentage estimate. c("TrustGovernment", "TrustPeople") %>% map(calcps) %>% list_rbind() ## # A tibble: 10 × 4 ## Variable Answer p p_se ## <chr> <fct> <dbl> <dbl> ## 1 TrustGovernment Always 1.55 0.204 ## 2 TrustGovernment Most of the time 13.2 0.553 ## 3 TrustGovernment About half the time 30.9 0.829 ## 4 TrustGovernment Some of the time 43.4 0.855 ## 5 TrustGovernment Never 11.0 0.566 ## 6 TrustPeople Always 0.809 0.164 ## 7 TrustPeople Most of the time 41.4 0.857 ## 8 TrustPeople About half the time 28.2 0.776 ## 9 TrustPeople Some of the time 24.5 0.670 ## 10 TrustPeople Never 5.05 0.422 5.11 Exercises The exercises use the design objects anes_des and recs_des as provided in the Prerequisites box at the beginning of the chapter. How many females have a graduate degree? Hint: the variables Gender and Education will be useful. # Option 1: femgd_option1 <- anes_des %>% filter(Gender == "Female", Education == "Graduate") %>% survey_count(name = "n") femgd_option1 ## # A tibble: 1 × 2 ## n n_se ## <dbl> <dbl> ## 1 15072196.
837872. # Option 2: femgd_option2 <- anes_des %>% filter(Gender == "Female", Education == "Graduate") %>% summarize(N = survey_total(), .groups = "drop") femgd_option2 ## # A tibble: 1 × 2 ## N N_se ## <dbl> <dbl> ## 1 15072196. 837872. What percentage of people identify as "Strong Democrat"? Hint: The variable PartyID indicates someone's party affiliation. psd <- anes_des %>% group_by(PartyID) %>% summarize(p = survey_mean()) %>% filter(PartyID == "Strong democrat") psd ## # A tibble: 1 × 3 ## PartyID p p_se ## <fct> <dbl> <dbl> ## 1 Strong democrat 0.219 0.00646 What percentage of people who voted in the 2020 election identify as "Strong Republican"? Hint: The variable VotedPres2020 indicates whether someone voted in 2020. psr <- anes_des %>% filter(VotedPres2020 == "Yes") %>% group_by(PartyID) %>% summarize(p = survey_mean()) %>% filter(PartyID == "Strong republican") psr ## # A tibble: 1 × 3 ## PartyID p p_se ## <fct> <dbl> <dbl> ## 1 Strong republican 0.228 0.00815 What percentage of people voted in both the 2016 election and the 2020 election? Include the logit confidence interval. Hint: The variable VotedPres2016 indicates whether someone voted in 2016. pvb <- anes_des %>% filter(!is.na(VotedPres2016), !is.na(VotedPres2020)) %>% group_by(interact(VotedPres2016, VotedPres2020)) %>% summarize(p = survey_prop(vartype = "ci", prop_method = "logit")) %>% filter(VotedPres2016 == "Yes", VotedPres2020 == "Yes") pvb ## # A tibble: 1 × 5 ## VotedPres2016 VotedPres2020 p p_low p_upp ## <fct> <fct> <dbl> <dbl> <dbl> ## 1 Yes Yes 0.796 0.777 0.813 What is the design effect for the proportion of people who voted early? Hint: The variable EarlyVote2020 indicates whether someone voted early in 2020. pdeff <- anes_des %>% filter(!is.na(EarlyVote2020)) %>% group_by(EarlyVote2020) %>% summarize(p = survey_mean(deff = TRUE)) %>% filter(EarlyVote2020 == "Yes") pdeff ## # A tibble: 1 × 4 ## EarlyVote2020 p p_se p_deff ## <fct> <dbl> <dbl> <dbl> ## 1 Yes 0.0535 0.00426 2.27 What is the average temperature people set their thermostats to at night during the winter? Hint: The variable WinterTempNight indicates the temperature that people set their thermostats to in the winter at night. mean_wintertempnight <- recs_des %>% summarize(wtn_mean = survey_mean(x = WinterTempNight, na.rm = TRUE)) mean_wintertempnight ## # A tibble: 1 × 2 ## wtn_mean wtn_mean_se ## <dbl> <dbl> ## 1 68.3 0.0446 People sometimes set their thermostats differently across seasons and between day and night. What median temperatures do people set their thermostats to in the summer and winter, both during the day and at night? Include confidence intervals. Hint: Use the variables WinterTempDay, WinterTempNight, SummerTempDay, and SummerTempNight.
# Option 1 med_wintertempday <- recs_des %>% summarize(wtd_mean = survey_median(WinterTempDay, vartype = "se", na.rm = TRUE)) med_wintertempday ## # A tibble: 1 × 2 ## wtd_mean wtd_mean_se ## <dbl> <dbl> ## 1 70 0.250 med_wintertempnight <- recs_des %>% summarize(wtn_mean = survey_median(WinterTempNight, vartype = "se", na.rm = TRUE)) med_wintertempnight ## # A tibble: 1 × 2 ## wtn_mean wtn_mean_se ## <dbl> <dbl> ## 1 68 0.250 med_summertempday <- recs_des %>% summarize(std_mean = survey_median(SummerTempDay, vartype = "se", na.rm = TRUE)) med_summertempday ## # A tibble: 1 × 2 ## std_mean std_mean_se ## <dbl> <dbl> ## 1 72 0.250 med_summertempnight <- recs_des %>% summarize(stn_mean = survey_median(SummerTempNight, vartype = "se", na.rm = TRUE)) med_summertempnight ## # A tibble: 1 × 2 ## stn_mean stn_mean_se ## <dbl> <dbl> ## 1 72 0.250 # Alternatively, could use `survey_quantile()` as shown below for # WinterTempNight: quant_wintertemp <- recs_des %>% summarize(wnt_quant = survey_quantile(WinterTempNight, quantiles = 0.5, vartype = "se", na.rm = TRUE)) quant_wintertemp ## # A tibble: 1 × 2 ## wnt_quant_q50 wnt_quant_q50_se ## <dbl> <dbl> ## 1 68 0.250 What is the correlation between the temperatures that people set their thermostats to during the night and during the day in the summer? corr_summer_temp <- recs_des %>% summarize(summer_corr = survey_corr(SummerTempNight, SummerTempDay, na.rm = TRUE)) corr_summer_temp ## # A tibble: 1 × 2 ## summer_corr summer_corr_se ## <dbl> <dbl> ## 1 0.806 0.00806 What are the 1st, 2nd, and 3rd quartiles of the amount of money spent on energy by Building America (BA) climate zone? Hint: TOTALDOL indicates the total amount spent on energy, and ClimateRegion_BA indicates the BA climate zones. quant_baenergyexp <- recs_des %>% group_by(ClimateRegion_BA) %>% summarize(dol_quant = survey_quantile(TOTALDOL, quantiles = c(0.25, 0.5, 0.75), vartype = "se", na.rm = TRUE)) quant_baenergyexp ## # A tibble: 8 × 7 ## ClimateRegion_BA dol_quant_q25 dol_quant_q50 dol_quant_q75 ## <fct> <dbl> <dbl> <dbl> ## 1 Mixed-Dry 1091. 1541. 2139. ## 2 Mixed-Humid 1317. 1840. 2462. ## 3 Hot-Humid 1094. 1622. 2233. ## 4 Hot-Dry 926. 1513. 2223. ## 5 Very-Cold 1195. 1986. 2955. ## 6 Cold 1213. 1756. 2422. ## 7 Marine 938. 1380. 1987. ## 8 Subarctic 2404. 3535. 5219. ## # ℹ 3 more variables: dol_quant_q25_se <dbl>, dol_quant_q50_se <dbl>, ## # dol_quant_q75_se <dbl> References Shah, Babubhai V, and Akhil K Vaish. 2006. "Confidence Intervals for Quantile Estimation from Complex Survey Data." In Proceedings of the Section on Survey Research Methods. Wickham, Hadley. 2019. Advanced R. CRC Press. https://adv-r.hadley.nz/. RECS has two components: a household survey and an energy supplier survey. For each household that responds, their energy provider(s) are contacted to obtain their energy consumption and expenditure. This value reflects the dollars spent on electricity in 2020, according to the energy supplier.
See https://www.eia.gov/consumption/residential/data/2020/pdf/2020%20RECS%20CE%20Methodology_Final.pdf for more details. Question text: Is any air conditioning equipment used in your home? The value of DOLLARLP reflects the annualized amount spent on liquid propane and BTULP reflects the annualized consumption in Btu of liquid propane. Question text: What is the square footage of your home? BTUEL is derived from the supplier side component of the survey, where BTUEL represents the electricity consumption in British thermal units (Btus) converted from kilowatt hours (kWh) in a year. BTUNG is derived from the supplier side component of the survey, where BTUNG represents the natural gas consumption in British thermal units (Btus) in a year. Question: How often can you trust the federal government in Washington to do what is right? (Always, most of the time, about half the time, some of the time, or never / Never, some of the time, about half the time, most of the time, or always)? Question: Generally speaking, how often can you trust other people? (Always, most of the time, about half the time, some of the time, or never / Never, some of the time, about half the time, most of the time, or always)? Chapter 6 Statistical testing Prerequisites For this chapter, load the following packages: library(tidyverse) library(survey) library(srvyr) library(srvyrexploR) library(broom) library(gt) We will be using data from ANES and RECS described in Chapter 4. As a reminder, here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter 4 for more information). targetpop <- 231592693 data(anes_2020) anes_adjwgt <- anes_2020 %>% mutate(Weight = Weight / sum(Weight) * targetpop) anes_des <- anes_adjwgt %>% as_survey_design( weights = Weight, strata = Stratum, ids = VarUnit, nest = TRUE ) For RECS, details are included in the RECS documentation and Chapters 4 and 10. data(recs_2020) recs_des <- recs_2020 %>% as_survey_rep( weights = NWEIGHT, repweights = NWEIGHT1:NWEIGHT60, type = "JK1", scale = 59/60, mse = TRUE ) 6.1 Introduction When analyzing results from a survey, the point estimates described in Chapter 5 help us understand the data at a high level. Still, researchers and the public often want to make comparisons between different groups. These comparisons are calculated through statistical testing. The general idea of statistical testing is the same for data obtained through surveys and data obtained through other methods, where we compare the point estimates and variance estimates of each statistic to see if statistically significant differences exist. However, statistical testing for complex surveys involves additional considerations due to the need to account for the sampling design in order to obtain accurate variance estimates. Statistical testing, also called hypothesis testing, involves declaring a null and alternative hypothesis. A null hypothesis is denoted as \\(H_0\\) and the alternative hypothesis is denoted as \\(H_A\\). The null hypothesis is the default assumption in that there are no differences in the data, or that the data is operating under "standard" behaviors.
On the other hand, the alternative hypothesis is the break from the "standard" and is what we are trying to determine whether the data supports. Let's review an example outside of survey data. If we are flipping a coin, a null hypothesis would be that the coin is fair and that each side has an equal chance of being flipped. In other words, the probability of the coin landing on each side is 1/2. An alternative hypothesis, in contrast, could be that the coin is unfair and that one side has a higher probability of being flipped (e.g., a probability of 1/4 to get heads, but a probability of 3/4 to get tails). We write this set of hypotheses as: \\(H_0: \\rho_{heads} = \\rho_{tails}\\), where \\(\\rho_{x}\\) is the probability of flipping the coin and having it land on heads (\\(\\rho_{heads}\\)) or tails (\\(\\rho_{tails}\\)) \\(H_A: \\rho_{heads} \\neq \\rho_{tails}\\) When we conduct hypothesis testing, the statistical models calculate a p-value, which shows how likely we are to observe the data if the null hypothesis is true. If the p-value (a probability between 0 and 1) is small, we have strong evidence to reject the null hypothesis as it would be unlikely to see the data we are observing if the null hypothesis were true. However, if the p-value is large, we say we do not have evidence to reject the null hypothesis. The cutoff for what counts as a small p-value is determined by the type 1 error rate, known as \\(\\alpha\\). A common type 1 error rate for statistical testing is \\(\\alpha = 0.05\\).15 It is common for explanations of statistical testing to refer to the confidence level. The confidence level is the complement of the type 1 error rate. Thus, if \\(\\alpha = 0.05\\), the confidence level would be 95%. The functions in the {survey} package allow for the correct estimation of the variances. This chapter will cover the following statistical tests with survey data and their corresponding functions: comparison of proportions, svyttest(); comparison of means, svyttest(); goodness of fit tests, svygofchisq(); tests of independence, svychisq(); and tests of homogeneity, svychisq(). 6.2 Dot Notation Up to this point, we have shown functions that use wrappers from the {srvyr} package. This means that the functions work with tidyverse syntax. However, the functions in this chapter do not have wrappers in the {srvyr} package and are instead used directly from the {survey} package. Therefore, the design object is not the first argument, and to use these functions with the magrittr pipe (%>%) and tidyverse syntax, we will need to use dot (.) notation.16 Functions that work with the magrittr pipe (%>%) have the data as the first argument. When we run a function with the pipe, it automatically places anything to the left of the pipe into the first argument of the function to the right of the pipe. For example, if we wanted to take the mtcars data and filter to cars with six cylinders, we can write the code in at least four different ways: filter(mtcars, cyl == 6) mtcars %>% filter(cyl == 6) mtcars %>% filter(., cyl == 6) mtcars %>% filter(.data = ., cyl == 6) Each of these lines of code will produce the same output since the argument that takes the data is in the first spot in filter(). The first two are probably familiar to those who have worked with the tidyverse. The third option functions the same way as the second one but is explicit that mtcars goes into the first argument, and the fourth option indicates that mtcars is going into the named argument of .data.
Here, we are telling R to take what's on the left side of the pipe (mtcars) and pipe it into the spot with the dot (.), that is, the first argument. In functions that are not part of the tidyverse, the data argument may not be in the first spot. For example, in svyttest(), the data argument is in the second spot, which means we need to place the dot (.) in the second spot and not the first. For example: svydata_des %>% svyttest(x ~ y, .) By default, the pipe places the left-hand object in the first argument spot. Placing the dot (.) in the second argument spot indicates that the survey design object svydata_des should be used in the second argument and not the first. Alternatively, named arguments could be used to place the dot first, as named arguments can appear at any location, as in the following: svydata_des %>% svyttest(design = ., x ~ y) However, the following code will not work as the svyttest() function expects the formula as the first argument when arguments are not named: svydata_des %>% svyttest(., x ~ y) 6.3 Comparison of Proportions and Means We use t-tests to compare two proportions or means. T-tests allow us to determine if one proportion or mean is statistically different from another. They are commonly used to determine if a single estimate differs from a known value (e.g., 0 or 50%) or to compare two group means (e.g., North versus South). Comparing a single estimate to a known value is called a one-sample t-test, and we can set up the hypothesis test as follows: \\(H_0: \\mu = 0\\) where \\(\\mu\\) is the mean outcome and \\(0\\) is the value we are comparing it to \\(H_A: \\mu \\neq 0\\) For comparing two estimates, this is called a two-sample t-test, and we can set up the hypothesis test as follows: \\(H_0: \\mu_1 = \\mu_2\\) where \\(\\mu_i\\) is the mean outcome for group \\(i\\) \\(H_A: \\mu_1 \\neq \\mu_2\\) Two-sample t-tests can also be paired or unpaired. If the data come from two different populations (e.g., North versus South), the t-test will be an unpaired or independent samples t-test. Paired t-tests occur when the data come from the same population. This is commonly seen with measurements on the same population at two different time periods (e.g., before and after an intervention). The difference between t-tests with non-survey data and survey data lies in the underlying variance estimation. Chapter 10 provides a detailed overview of the math behind the mean and sampling error calculations for various sample designs. The functions in the {survey} package will account for these nuances, provided the design object is correctly defined. 6.3.1 Syntax When we do not have survey data, we can use the t.test() function from the {stats} package. This function does not allow for weights or the variance structure that needs to be accounted for with survey data. Therefore, we need to use the svyttest() function from {survey} when using survey data. Many of the arguments are the same between the two functions, but there are a few key differences: We need to use the survey design object instead of the original data frame We can only use a formula and not separate x and y data The confidence level cannot be specified and will always be set to 95%. However, we will show examples of how the confidence level can be changed after running the svyttest() function by using the confint() function. Here is the syntax for the svyttest() function: svyttest(formula, design, ...)
The arguments are: formula: Formula, outcome~group for two-sample, outcome~0 or outcome~1 for one-sample. The group variable must be a factor or character with two levels, or be coded 0/1 or 1/2. We give more details on formula set-up below for different types of tests. design: survey design object ...: This passes options on for one-sided tests only, and thus, we can specify na.rm=TRUE Notice that the first argument here is the formula and not the design. This means we must use the dot (.) if we pipe in the survey design object (as described in Section 6.2). The formula argument can take several different forms depending on what we are measuring. Here are a few common scenarios: One-sample t-test: Comparison to 0: var ~ 0, where var is the measure of interest, and we compare it to the value 0. For example, we could test if the population mean of household debt is different from 0 given the sample data collected. Comparison to a different value: var - value ~ 0, where var is the measure of interest and value is what we are comparing to. For example, we could test if the proportion of the population that has blue eyes is different from 25% by using var - 0.25 ~ 0. Note that specifying the formula as var ~ 0.25 is not equivalent and will result in a syntax error. Two-sample t-test: Unpaired: 2 level grouping variable: var ~ groupVar, where var is the measure of interest and groupVar is a variable with two categories. For example, we could test if the average age of the population who voted for president in 2020 differed from the age of people who did not vote. In this case, age would be used for var, and a binary variable indicating voting activity would be the groupVar. 3+ level grouping variable: var ~ groupVar == level, where var is the measure of interest, groupVar is the categorical variable, and level is the category level to isolate. For example, we could test if the test scores in one classroom differed from all other classrooms where groupVar would be the variable holding the values for classroom IDs and level is the classroom ID we want to compare to the others. Paired: var_1 - var_2 ~ 0, where var_1 is the first variable of interest and var_2 is the second variable of interest. For example, we could test if test scores on a subject differed between the start and the end of a course so var_1 would be the test score at the beginning of the course and var_2 would be the score at the end of the course. The na.rm argument defaults to FALSE, which means if any data is missing, the t-test will not compute. Throughout this chapter, we will always set na.rm = TRUE, but before analyzing the survey data, review the notes provided in Chapter 3 to better understand how to handle missing data. Let’s walk through a few examples using the ANES and RECS data. 6.3.2 Examples Example 1: One-sample t-test for Mean RECS asks respondents to indicate what temperature they set their house to during the summer at night.17 In our data, we have called this variable SummerTempNight. If we want to see if the average U.S. household sets its temperature at a value different from 68\\(^\\circ\\)F18, we could set up the hypothesis as follows: \\(H_0: \\mu = 68\\) where \\(\\mu\\) is the average temperature U.S. 
households set their thermostat to in the summer at night \\(H_A: \\mu \\neq 68\\) To conduct this in R, we use svyttest() and subtract the temperature on the left-hand side of the formula: ttest_ex1 <- recs_des %>% svyttest( formula = SummerTempNight - 68 ~ 0, design = ., na.rm = TRUE ) ttest_ex1 ## ## Design-based one-sample t-test ## ## data: SummerTempNight - 68 ~ 0 ## t = 85, df = 58, p-value <2e-16 ## alternative hypothesis: true mean is not equal to 0 ## 95 percent confidence interval: ## 3.288 3.447 ## sample estimates: ## mean ## 3.367 To pull out specific output, we can use R’s built-in $ operator. For instance, to obtain the estimate \\(\\mu - 68\\), we run ttest_ex1$estimate. If we want the average, we take our t-test estimate and add it to 68: ttest_ex1$estimate + 68 ## mean ## 71.37 Or, we can use the survey_mean() function described in Chapter 5: recs_des %>% summarize(mu = survey_mean(SummerTempNight, na.rm = TRUE)) ## # A tibble: 1 × 2 ## mu mu_se ## <dbl> <dbl> ## 1 71.4 0.0397 The result is the same in both methods, so we see that the average temperature U.S. households set their thermostat to in the summer at night is 71.4\\(^\\circ\\)F. Looking at the output from svyttest(), the t-statistic is 84.8, and the p-value is \\(<0.0001\\), indicating that the average is statistically different from 68\\(^\\circ\\)F at an \\(\\alpha\\) level of \\(0.05\\). If we want an 80% confidence interval for the test statistic, we can use the function confint() to change the confidence level. Below, we print both the original 95% confidence interval and the 80% confidence interval: confint(ttest_ex1, level = 0.95) ## 2.5 % 97.5 % ## as.numeric(SummerTempNight - 68) 3.288 3.447 ## attr(,"conf.level") ## [1] 0.95 confint(ttest_ex1, level = 0.8) ## [1] 3.316 3.419 ## attr(,"conf.level") ## [1] 0.8 In this case, neither confidence interval contains 0, and we draw the same conclusion from either that the average temperature households set their thermostat in the summer at night is significantly higher than 68\\(^\\circ\\)F. Example 2: One-sample t-test for Proportion RECS asked respondents if they use any air conditioning (AC) in their home.19 In our data, we call this variable ACUsed. Let’s look at the proportion of U.S. households that use AC in their homes using the survey_prop() function we learned in Chapter 5. acprop <- recs_des %>% group_by(ACUsed) %>% summarize(p = survey_prop()) acprop ## # A tibble: 2 × 3 ## ACUsed p p_se ## <lgl> <dbl> <dbl> ## 1 FALSE 0.113 0.00306 ## 2 TRUE 0.887 0.00306 Based on this, 88.7% of U.S. households use AC in their homes. If we wanted to know if this differs from 90%, we could set up our hypothesis as follows: \\(H_0: p = 0.90\\) where \\(p\\) is the proportion of the U.S. households that use AC in their homes \\(H_A: p \\neq 0.90\\) To conduct this in R, we use the svyttest() function as follows: ttest_ex2 <- recs_des %>% svyttest( formula = (ACUsed == TRUE) - 0.90 ~ 0, design = ., na.rm = TRUE ) ttest_ex2 ## ## Design-based one-sample t-test ## ## data: (ACUsed == TRUE) - 0.9 ~ 0 ## t = -4.4, df = 58, p-value = 5e-05 ## alternative hypothesis: true mean is not equal to 0 ## 95 percent confidence interval: ## -0.019603 -0.007348 ## sample estimates: ## mean ## -0.01348 The output from the svyttest() function can be a bit hard to read. Using the {broom} package from tidymodels, a collection of packages for modeling using the tidyverse principles, we can clean up the output into a tibble to more easily understand what the test tells us. 
broom::tidy(ttest_ex2)
## # A tibble: 1 × 8
##   estimate statistic   p.value parameter conf.low conf.high method      
##      <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl> <chr>       
## 1  -0.0135     -4.40 0.0000466        58  -0.0196  -0.00735 Design-base…
## # ℹ 1 more variable: alternative <chr>
Note that the estimate column does not display the proportion itself but rather \\(\\hat{p} - 0.90\\), the difference between the proportion of U.S. households that use AC and the value we are comparing to. We can see that there is a difference of -1.35 percentage points. Additionally, the t-statistic value in the statistic column is -4.4, and the p-value is <0.0001. These results indicate that significantly fewer than 90% of U.S. households use AC in their homes.
Example 3: Unpaired two-sample t-test
Two additional variables in the RECS data are the electric bill cost (DOLLAREL) and whether the house used AC or not (ACUsed).20 If we want to know if U.S. households that used AC had higher electric bills than those that did not, we could set up the hypothesis as follows:
\\(H_0: \\mu_{AC} = \\mu_{noAC}\\) where \\(\\mu_{AC}\\) is the electric bill cost for U.S. households that used AC and \\(\\mu_{noAC}\\) is the electric bill cost for U.S. households that did not use AC
\\(H_A: \\mu_{AC} \\neq \\mu_{noAC}\\)
Let's take a quick look at the format the data are in:
recs_des %>%
  group_by(ACUsed) %>%
  summarize(mean = survey_mean(DOLLAREL, na.rm = TRUE))
## # A tibble: 2 × 3
##   ACUsed  mean mean_se
##   <lgl>  <dbl>   <dbl>
## 1 FALSE  1056.   16.0 
## 2 TRUE   1422.    5.69
To conduct this in R, we use svyttest():
ttest_ex3 <- recs_des %>%
  svyttest(formula = DOLLAREL ~ ACUsed, design = ., na.rm = TRUE)
broom::tidy(ttest_ex3)
## # A tibble: 1 × 8
##   estimate statistic  p.value parameter conf.low conf.high method       
##      <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl> <chr>        
## 1     366.      21.3 4.29e-29        58     331.      400. Design-based…
## # ℹ 1 more variable: alternative <chr>
The results indicate that the difference in electric bills between those that used AC and those that did not is, on average, $365.72. The difference is statistically significant, as the t-statistic is 21.3 and the p-value is \\(<0.0001\\). Households that used AC spent, on average, $365.72 more in 2020 on electricity than households without AC.
Example 4: Paired two-sample t-test
Let's say we want to test whether the temperature that U.S. households set their thermostat to at night differs by season (comparing summer21 and winter22 temperatures). We could set up the hypothesis as follows:
\\(H_0: \\mu_{summer} = \\mu_{winter}\\) where \\(\\mu_{summer}\\) is the temperature that U.S. households set their thermostat to during summer nights, and \\(\\mu_{winter}\\) is the temperature that U.S. households set their thermostat to during winter nights
\\(H_A: \\mu_{summer} \\neq \\mu_{winter}\\)
To conduct this in R, we use svyttest() by calculating the temperature difference on the left-hand side of the formula:
ttest_ex4 <- recs_des %>%
  svyttest(
    design = .,
    formula = SummerTempNight - WinterTempNight ~ 0,
    na.rm = TRUE
  )
broom::tidy(ttest_ex4)
## # A tibble: 1 × 8
##   estimate statistic  p.value parameter conf.low conf.high method       
##      <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl> <chr>        
## 1     2.85      50.8 8.45e-50        58     2.74      2.96 Design-based…
## # ℹ 1 more variable: alternative <chr>
U.S. households set their thermostat on average 2.9\\(^\\circ\\)F warmer on summer nights than winter nights, a statistically significant difference (t = 50.8, p-value \\(<0.0001\\)).
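Before moving on, a quick cross-check of the paired estimate: since svyttest() here is simply testing the mean of a constructed difference, the same point estimate can be computed with survey_mean() from Chapter 5. This is a minimal sketch (the temporary variable name TempDiff is ours, not from the original analysis); the estimate should match the svyttest() output of about 2.85\\(^\\circ\\)F, though the test itself is still needed for the p-value and degrees of freedom.
# Point estimate of the paired difference, computed directly from the design object
recs_des %>%
  mutate(TempDiff = SummerTempNight - WinterTempNight) %>%
  summarize(diff = survey_mean(TempDiff, na.rm = TRUE))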
6.4 Chi-Square Tests
Chi-square tests (\\(\\chi^2\\)) allow us to examine multiple proportions using a goodness-of-fit test, a test of independence, or a test of homogeneity. These three tests have the same \\(\\chi^2\\) distributions but slightly different underlying assumptions.
First, goodness-of-fit tests are used when comparing observed data to expected data. For example, this could be used to determine if respondent demographics (the observed data in the sample) match known population information (the expected data). In this case, we can set up the hypothesis test as follows:
\\(H_0: p_1 = \\pi_1, ~ p_2 = \\pi_2, ~ ..., ~ p_k = \\pi_k\\) where \\(p_i\\) is the observed proportion for category \\(i\\), \\(\\pi_i\\) is the expected proportion for category \\(i\\), and \\(k\\) is the number of categories
\\(H_A:\\) at least one level of \\(p_i\\) does not match \\(\\pi_i\\)
Second, tests of independence are used when comparing two types of observed data to see if there is a relationship. For example, this could be used to determine if the proportion of respondents who voted for each political party in the presidential election matches the proportion of respondents who voted for each political party in a local election. In this case, we can set up the hypothesis test as follows:
\\(H_0:\\) The two variables/factors are independent
\\(H_A:\\) The two variables/factors are not independent
Third, tests of homogeneity are used to compare two distributions to see if they match. For example, this could be used to determine if the highest education achieved is the same for both men and women. In this case, we can set up the hypothesis test as follows:
\\(H_0: p_{1a} = p_{1b}, ~ p_{2a} = p_{2b}, ~ ..., ~ p_{ka} = p_{kb}\\) where \\(p_{ia}\\) is the observed proportion of category \\(i\\) for subgroup \\(a\\), \\(p_{ib}\\) is the observed proportion of category \\(i\\) for subgroup \\(b\\), and \\(k\\) is the number of categories
\\(H_A:\\) at least one category of \\(p_{ia}\\) does not match \\(p_{ib}\\)
As with t-tests, the difference between using \\(\\chi^2\\) tests with non-survey data and survey data lies in the underlying variance estimation. The functions in the {survey} package will account for these nuances, provided the design object is correctly defined. For basic variance estimation formulas for different survey design types, refer to Chapter 10.
6.4.1 Syntax
When we do not have survey data, we may be able to use the chisq.test() function from the {stats} package. However, this function does not allow for weights or the variance structure to be accounted for with survey data. Therefore, when using survey data, we need to use one of two functions:
- svygofchisq(): For goodness-of-fit tests
- svychisq(): For tests of independence and homogeneity
The non-survey function chisq.test() requires either a single set of counts and given proportions (for goodness-of-fit tests) or two sets of counts (for tests of independence and homogeneity). The functions we use with survey data require respondent-level data and formulas instead of counts. This ensures that the variances are correctly calculated.
First, the function for goodness-of-fit tests is svygofchisq():
svygofchisq(formula, p, design, na.rm = TRUE, ...)
The arguments are:
- formula: Formula specifying a single factor variable
- p: Vector of probabilities for the categories of the factor in the correct order. If the probabilities do not sum to 1, they will be rescaled to sum to 1.
- design: Survey design object
- ...: Other arguments to pass on, such as na.rm
Based on the order of the arguments, we again must use the dot (.) notation if we pipe in the survey design object, or we can explicitly name the arguments as described in Section 6.2. For goodness-of-fit tests, the formula will be a single variable, formula = ~var, as we compare the observed data from this variable to the expected data. The expected probabilities are then entered in the p argument and need to be a vector of the same length as the number of categories in the variable. For example, if we want to know if the proportion of males and females matches a distribution of 30/70, then the sex variable (with two categories) would be used as formula = ~SEX, and the proportions would be included as p = c(.3, .7). It is important to note that the variable entered into the formula should be formatted as either a factor or a character. The examples below provide more detail and tips on how to make sure the levels match up correctly (see also the quick check after this section).
For tests of homogeneity and independence, the svychisq() function should be used. The syntax is as follows:
svychisq(
  formula,
  design,
  statistic = c("F", "Chisq", "Wald", "adjWald", "lincom", "saddlepoint"),
  na.rm = TRUE
)
The arguments are:
- formula: Model formula specifying the table (shown in examples)
- design: Survey design object
- statistic: Type of test statistic to use in the test (details below)
- na.rm: Remove missing values
There are six statistics that are accepted by this function. For tests of homogeneity (when comparing cross-tabulations), the F or Chisq statistics should be used.23 The F statistic is the default and uses the Rao-Scott second-order correction. This correction is designed to assist with complicated sampling designs (i.e., those other than a simple random sample) (Scott 2007). The Chisq statistic is an adjusted version of the Pearson \\(\\chi^2\\) statistic. The version of this statistic in the svychisq() function compares the design effect estimate from the provided survey data to what the \\(\\chi^2\\) distribution would have been if the data had come from a simple random sample. For tests of independence, the Wald and adjWald statistics are recommended as they provide a better adjustment for variable comparisons (Lumley 2010). If the data has a small number of primary sampling units (PSUs) compared to the degrees of freedom, then the adjWald statistic should be used to account for this. The lincom and saddlepoint statistics are available for more complicated data structures.
The formula argument will always be one-sided, unlike in the svyttest() function. The two variables of interest should be included with a plus sign: formula = ~ var_1 + var_2. As with the svygofchisq() function, the variables entered into the formula should be formatted as either a factor or a character.
Additionally, as with the t-test function, both svygofchisq() and svychisq() have the na.rm argument. If any data is missing, the \\(\\chi^2\\) tests will assume that NA is a category and include it in the calculation. Throughout this chapter, we will always set na.rm = TRUE, but before analyzing the survey data, review the notes provided in Chapter 3 to better understand how to handle missing data.
6.4.2 Examples
Let's walk through a few examples using the ANES data.
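Because the p vector is matched to the factor categories by position, a quick way to avoid mismatches is to print the variable's levels first and order the probabilities accordingly. A minimal sketch, assuming the variable is stored as a factor in the raw data (as Education is in the anes_2020 data used in Example 1 below):
# Print the factor levels; p must follow this exact order
data(anes_2020)
levels(anes_2020$Education)
## [1] "Less than HS" "High school"  "Post HS"      "Bachelor's"   "Graduate"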
Example 1: Goodness of Fit Test
ANES asked respondents about their highest education level.24 Based on the data from the 2020 American Community Survey (ACS) 5-year estimates25, the education distribution of those aged 18+ in the United States (among the 50 states and the District of Columbia) is as follows:
- 11% had less than a high school degree
- 27% had a high school degree
- 29% had some college or an associate's degree
- 33% had a bachelor's degree or higher
If we want to see if the weighted distribution from the ANES 2020 data matches this distribution, we could set up the hypothesis as follows:
\\(H_0: p_1 = 0.11, ~ p_2 = 0.27, ~ p_3 = 0.29, ~ p_4 = 0.33\\)
\\(H_A:\\) at least one of the education levels does not match between the ANES and the ACS
To conduct this in R, let's first look at the education variable (Education) we have in the ANES data. Using the survey_mean() function discussed in Chapter 5, we can see the education levels and estimated proportions.
anes_des %>%
  drop_na(Education) %>%
  group_by(Education) %>%
  summarize(p = survey_mean())
## # A tibble: 5 × 3
##   Education         p    p_se
##   <fct>         <dbl>   <dbl>
## 1 Less than HS 0.0805 0.00568
## 2 High school  0.277  0.0102 
## 3 Post HS      0.290  0.00713
## 4 Bachelor's   0.226  0.00633
## 5 Graduate     0.126  0.00499
Based on this output, we can see that we have different levels than the ACS data provides. Specifically, the education data from ANES has two levels for bachelor's degree or higher (Bachelor's and Graduate), so these two categories need to be collapsed into a single category to match the ACS data. For this, among other methods, we can use the {forcats} package from the tidyverse. The package's fct_collapse() function helps us create a new variable by collapsing categories into a single one. Then, we use the svygofchisq() function to compare the ANES data to the ACS data, where we specify the updated design object, the formula using the collapsed education variable, the ACS estimates for education levels as p, and remove NA values.
anes_des_educ <- anes_des %>%
  mutate(Education2 = fct_collapse(Education, "Bachelor or Higher" = c("Bachelor's", "Graduate")))
anes_des_educ %>%
  drop_na(Education2) %>%
  group_by(Education2) %>%
  summarize(p = survey_mean())
## # A tibble: 4 × 3
##   Education2              p    p_se
##   <fct>               <dbl>   <dbl>
## 1 Less than HS       0.0805 0.00568
## 2 High school        0.277  0.0102 
## 3 Post HS            0.290  0.00713
## 4 Bachelor or Higher 0.352  0.00732
chi_ex1 <- anes_des_educ %>%
  svygofchisq(
    formula = ~ Education2,
    p = c(0.11, 0.27, 0.29, 0.33),
    design = .,
    na.rm = TRUE
  )
chi_ex1
## 
##  Design-based chi-squared test for given probabilities
## 
## data:  ~Education2
## X-squared = 2172220, scale = 1.1e+05, df = 2.3e+00, p-value =
## 9e-05
The output from svygofchisq() indicates that at least one proportion from ANES does not match the ACS data (\\(\\chi^2 =\\) 2,172,220; p-value <0.0001).
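The object returned by svygofchisq() is a standard htest object, so if a compact one-row summary is preferred, broom's generic htest tidier should also apply here. A sketch (column names follow broom's htest method):
# Tidy the htest object into a tibble of statistic, parameter, and p-value
broom::tidy(chi_ex1)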
To get a better idea of the differences, we can use the expected output along with survey_mean() to create a comparison table:
ex1_table <- anes_des_educ %>%
  drop_na(Education2) %>%
  group_by(Education2) %>%
  summarize(Observed = survey_mean(vartype = "ci")) %>%
  rename(Education = Education2) %>%
  mutate(Expected = c(0.11, 0.27, 0.29, 0.33)) %>%
  select(Education, Expected, everything())
ex1_table
## # A tibble: 4 × 5
##   Education          Expected Observed Observed_low Observed_upp
##   <fct>                 <dbl>    <dbl>        <dbl>        <dbl>
## 1 Less than HS           0.11   0.0805       0.0691       0.0919
## 2 High school            0.27   0.277        0.257        0.298 
## 3 Post HS                0.29   0.290        0.276        0.305 
## 4 Bachelor or Higher     0.33   0.352        0.337        0.367
This output includes the expected proportions from the ACS that we provided to the svygofchisq() function, along with the observed proportions and their confidence intervals. The table shows that the "High school" and "Post HS" categories have nearly identical observed and expected proportions, while the other two categories differ slightly. Looking at the confidence intervals, we can see that the ANES data skews to include fewer people in the "Less than HS" category and more people in the "Bachelor or Higher" category. This may be easier to see in a plot. The code below uses the tabular output to create Figure 6.1.
ex1_table %>%
  pivot_longer(
    cols = c("Expected", "Observed"),
    names_to = "Names",
    values_to = "Proportion"
  ) %>%
  mutate(
    Observed_low = if_else(Names == "Observed", Observed_low, NA_real_),
    Observed_upp = if_else(Names == "Observed", Observed_upp, NA_real_),
    Names = if_else(Names == "Observed", "ANES (observed)", "ACS (expected)")
  ) %>%
  ggplot(aes(x = Education, y = Proportion, color = Names)) +
  geom_point(alpha = 0.75, size = 2) +
  geom_errorbar(aes(ymin = Observed_low, ymax = Observed_upp), width = 0.25) +
  theme_bw() +
  scale_color_manual(name = "Type", values = book_colors[c(4, 1)]) +
  theme(legend.position = "bottom", legend.title = element_blank())
FIGURE 6.1: Expected and observed proportions of education, showing the confidence intervals for the observed proportions and whether the expected proportions lie within them.
Example 2: Test of Independence
ANES asked respondents two questions about trust:
- How often can you trust the federal government to do what is right?
- How often can you trust other people?
If we want to see if the distributions of these two questions are similar or not, we can conduct a test of independence. Here is how the hypothesis could be set up:
\\(H_0:\\) People's trust in the federal government and their trust in other people are independent (i.e., not related)
\\(H_A:\\) People's trust in the federal government and their trust in other people are not independent (i.e., they are related)
To conduct this in R, we use the svychisq() function to compare the two variables:
chi_ex2 <- anes_des %>%
  svychisq(
    formula = ~ TrustGovernment + TrustPeople,
    design = .,
    statistic = "Wald",
    na.rm = TRUE
  )
chi_ex2
## 
##  Design-based Wald test of association
## 
## data:  NextMethod()
## F = 21, ndf = 16, ddf = 51, p-value <2e-16
The output from svychisq() indicates that the distribution of people's trust in the federal government and their trust in other people are not independent, meaning they are related. Let's output the distributions in a table to see the relationship.
The observed output from the test provides a cross-tabulation of the counts for each category:
chi_ex2$observed
##                      TrustPeople
## TrustGovernment       Always Most of the time About half the time
##   Always              16.470           25.009              31.848
##   Most of the time    11.020          539.377             196.258
##   About half the time 11.772          934.858             861.971
##   Some of the time    17.007         1353.779             839.863
##   Never                3.174          236.785             174.272
##                      TrustPeople
## TrustGovernment       Some of the time   Never
##   Always                        36.854   5.523
##   Most of the time             206.556  27.184
##   About half the time          428.871  65.024
##   Some of the time             932.628  89.596
##   Never                        217.994 189.307
However, as researchers, we often want to know about the proportions and not just the respondent counts from the survey. There are a couple of different ways that we can do this. The first is to use the counts from chi_ex2$observed to calculate the proportions. We can then pivot the table to create a cross-tabulation similar to the counts table above. Adding group_by() to the code means that we are obtaining the proportions within each level of that variable. In this case, we are looking at the distribution of TrustGovernment for each level of TrustPeople. The resulting table is shown in Table 6.1; in Chapter 8, we discuss how to make publication-quality tables like this.
chi_ex2_table <- chi_ex2$observed %>%
  as_tibble() %>%
  group_by(TrustPeople) %>%
  mutate(prop = round(n / sum(n), 3)) %>%
  select(-n) %>%
  pivot_wider(names_from = TrustPeople, values_from = prop) %>%
  gt(rowname_col = "TrustGovernment") %>%
  tab_stubhead(label = "Trust in Government") %>%
  tab_spanner(label = "Trust in People", columns = everything()) %>%
  cols_label(`Most of the time` = md("Most of<br />the time"),
             `About half the time` = md("About half<br />the time"),
             `Some of the time` = md("Some of<br />the time"))
chi_ex2_table
TABLE 6.1: Proportion of adults in the U.S. by levels of trust in people and government, ANES 2020
                    | Trust in People
Trust in Government | Always | Most of the time | About half the time | Some of the time | Never
Always              | 0.277  | 0.008            | 0.015               | 0.020            | 0.015
Most of the time    | 0.185  | 0.175            | 0.093               | 0.113            | 0.072
About half the time | 0.198  | 0.303            | 0.410               | 0.235            | 0.173
Some of the time    | 0.286  | 0.438            | 0.399               | 0.512            | 0.238
Never               | 0.053  | 0.077            | 0.083               | 0.120            | 0.503
In Table 6.1, each column sums to 1. For example, of people who always trust in people, an estimated 27.7% also always trust in government (the top-left cell), while 5.3% never trust in government. The second option is to use the group_by() and survey_mean() functions to calculate the proportions from the ANES design object. A reminder that with more than one variable listed in the group_by() statement, the proportions are within the first variable listed. As mentioned above, we are looking at the distribution of TrustGovernment for each level of TrustPeople.
chi_ex2_obs <- anes_des %>%
  drop_na(TrustPeople, TrustGovernment) %>%
  group_by(TrustPeople, TrustGovernment) %>%
  summarize(Observed = round(survey_mean(vartype = "ci"), 3), .groups = "drop")
chi_ex2_obs_table <- chi_ex2_obs %>%
  mutate(prop = paste0(Observed, " (", Observed_low, ", ", Observed_upp, ")")) %>%
  select(TrustGovernment, TrustPeople, prop) %>%
  pivot_wider(names_from = TrustPeople, values_from = prop) %>%
  gt(rowname_col = "TrustGovernment") %>%
  tab_stubhead(label = "Trust in Government") %>%
  tab_spanner(label = "Trust in People", columns = everything()) %>%
  tab_options(page.orientation = "landscape")
chi_ex2_obs_table
TABLE 6.2: Proportion of adults in the U.S. by levels of trust in people and government with confidence intervals, ANES 2020
                    | Trust in People
Trust in Government | Always | Most of the time | About half the time | Some of the time | Never
Always              | 0.277 (0.11, 0.444)  | 0.008 (0.004, 0.012) | 0.015 (0.006, 0.024) | 0.02 (0.008, 0.033) | 0.015 (0, 0.029)
Most of the time    | 0.185 (-0.009, 0.38) | 0.175 (0.157, 0.192) | 0.093 (0.078, 0.109) | 0.113 (0.085, 0.141) | 0.072 (0.021, 0.123)
About half the time | 0.198 (0.046, 0.35)  | 0.303 (0.281, 0.324) | 0.41 (0.378, 0.441)  | 0.235 (0.2, 0.271)   | 0.173 (0.099, 0.246)
Some of the time    | 0.286 (0.069, 0.503) | 0.438 (0.415, 0.462) | 0.399 (0.365, 0.433) | 0.512 (0.481, 0.543) | 0.238 (0.178, 0.298)
Never               | 0.053 (-0.01, 0.117) | 0.077 (0.064, 0.089) | 0.083 (0.063, 0.103) | 0.12 (0.097, 0.142)  | 0.503 (0.422, 0.583)
Both methods produce the same proportions because the svychisq() function also accounts for the survey design. However, calculating the proportions directly from the design object means we can also obtain the variance information. In this case, the table output displays the survey estimate followed by the confidence interval. Based on the output, we can see that of those who never trust people, 50.3% also never trust the government, while the proportion never trusting the government is much lower at each of the other levels of trusting people. We may find it easier to look at these proportions graphically. We can use ggplot() and facets to provide an overview, as shown below, to create Figure 6.2:
chi_ex2_obs %>%
  mutate(TrustPeople = fct_reorder(str_c("Trust in People:\\n", TrustPeople), order(TrustPeople))) %>%
  ggplot(aes(x = TrustGovernment, y = Observed, color = TrustGovernment)) +
  facet_wrap(~ TrustPeople, ncol = 5) +
  geom_point() +
  geom_errorbar(aes(ymin = Observed_low, ymax = Observed_upp)) +
  ylab("Proportion") +
  xlab("") +
  theme_bw() +
  scale_color_manual(name = "Trust in Government", values = book_colors) +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        legend.position = "bottom") +
  guides(col = guide_legend(nrow = 2))
FIGURE 6.2: Proportion of adults in the U.S. by levels of trust in people and government with confidence intervals, ANES 2020
Example 3: Test of Homogeneity
Researchers and politicians often look at specific demographics each election cycle to understand how each group is leaning or voting toward candidates. The ANES data are collected post-election, but we can still see if there are differences in how specific demographic groups voted.
If we want to see if there is a difference in how each age group voted for the 2020 candidates, this would be a test of homogeneity, and we can set up the hypothesis as follows:
\\[\\begin{align*} H_0: p_{1_{Biden}} &= p_{1_{Trump}} = p_{1_{Other}},\\\\ p_{2_{Biden}} &= p_{2_{Trump}} = p_{2_{Other}},\\\\ p_{3_{Biden}} &= p_{3_{Trump}} = p_{3_{Other}},\\\\ p_{4_{Biden}} &= p_{4_{Trump}} = p_{4_{Other}},\\\\ p_{5_{Biden}} &= p_{5_{Trump}} = p_{5_{Other}},\\\\ p_{6_{Biden}} &= p_{6_{Trump}} = p_{6_{Other}} \\end{align*}\\]
where \\(p_{i_{Biden}}\\) is the observed proportion of each age group (\\(i\\)) that voted for Joseph Biden, \\(p_{i_{Trump}}\\) is the observed proportion of each age group (\\(i\\)) that voted for Donald Trump, and \\(p_{i_{Other}}\\) is the observed proportion of each age group (\\(i\\)) that voted for another candidate
\\(H_A:\\) at least one category of \\(p_{i_{Biden}}\\) does not match \\(p_{i_{Trump}}\\) or \\(p_{i_{Other}}\\)
To conduct this in R, we use the svychisq() function to compare the two variables:
chi_ex3 <- anes_des %>%
  drop_na(VotedPres2020_selection, AgeGroup) %>%
  svychisq(
    formula = ~ AgeGroup + VotedPres2020_selection,
    design = .,
    statistic = "Chisq",
    na.rm = TRUE
  )
chi_ex3
## 
##  Pearson's X^2: Rao & Scott adjustment
## 
## data:  NextMethod()
## X-squared = 171, df = 10, p-value <2e-16
The output from svychisq() indicates a difference in how each age group voted in the 2020 election. To get a better idea of the different distributions, let's output proportions to see the relationship. As we learned in Example 2 above, we can use chi_ex3$observed, or, if we want to get the variance information (which is crucial with survey data), we can use survey_mean(). Remember, when we have two variables in group_by(), we obtain the proportions within each level of the variable listed first. In this case, we are looking at the distribution of AgeGroup for each level of VotedPres2020_selection.
chi_ex3_obs <- anes_des %>%
  filter(VotedPres2020 == "Yes") %>%
  drop_na(VotedPres2020_selection, AgeGroup) %>%
  group_by(VotedPres2020_selection, AgeGroup) %>%
  summarize(Observed = round(survey_mean(vartype = "ci"), 3))
chi_ex3_obs_table <- chi_ex3_obs %>%
  mutate(prop = paste0(Observed, " (", Observed_low, ", ", Observed_upp, ")")) %>%
  select(AgeGroup, VotedPres2020_selection, prop) %>%
  pivot_wider(names_from = VotedPres2020_selection, values_from = prop) %>%
  gt(rowname_col = "AgeGroup") %>%
  tab_stubhead(label = "Age Group")
chi_ex3_obs_table
TABLE 6.3: Distribution of age group by presidential candidate selection with confidence intervals
Age Group   | Biden                | Trump                | Other
18-29       | 0.204 (0.177, 0.231) | 0.114 (0.095, 0.133) | 0.228 (0.151, 0.306)
30-39       | 0.169 (0.153, 0.185) | 0.147 (0.124, 0.17)  | 0.303 (0.21, 0.396)
40-49       | 0.163 (0.146, 0.18)  | 0.157 (0.136, 0.178) | 0.21 (0.129, 0.291)
50-59       | 0.154 (0.136, 0.173) | 0.234 (0.207, 0.261) | 0.107 (0.041, 0.173)
60-69       | 0.179 (0.16, 0.199)  | 0.192 (0.172, 0.213) | 0.102 (0.026, 0.179)
70 or older | 0.13 (0.118, 0.143)  | 0.156 (0.139, 0.174) | 0.049 (0, 0.099)
We can see that voters for Biden and other candidates skewed younger than voters for Trump. For example, 20.4% of those who voted for Biden were in the 18-29 age group, compared to only 11.4% of those who voted for Trump. Conversely, 23.4% of those who voted for Trump were in the 50-59 age group, compared to only 15.4% of those who voted for Biden.
6.5 Exercises
The exercises use the design objects anes_des and recs_des as provided in the Prerequisites box at the beginning of the chapter. Here are some exercises for practicing conducting t-tests using svyttest():
1. Using the RECS data, do more than 50% of U.S. households use AC (ACUsed)?
ttest_solution1 <- recs_des %>%
  svyttest(design = .,
           formula = ((ACUsed == TRUE) - 0.5) ~ 0,
           na.rm = TRUE)
ttest_solution1
## 
##  Design-based one-sample t-test
## 
## data:  ((ACUsed == TRUE) - 0.5) ~ 0
## t = 126, df = 58, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.3804 0.3927
## sample estimates:
##   mean 
## 0.3865
2. Using the RECS data, does the average temperature that U.S.
households set their thermostats to differ between the day and night in the winter (WinterTempDay and WinterTempNight)?
ttest_solution2 <- recs_des %>%
  svyttest(
    design = .,
    formula = WinterTempDay - WinterTempNight ~ 0,
    na.rm = TRUE
  )
ttest_solution2
## 
##  Design-based one-sample t-test
## 
## data:  WinterTempDay - WinterTempNight ~ 0
## t = 46, df = 58, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  1.594 1.740
## sample estimates:
##  mean 
## 1.667
3. Using the ANES data, does the average age (Age) of those who voted for Joseph Biden in 2020 (VotedPres2020_selection) differ from that of those who voted for another candidate?
ttest_solution3 <- anes_des %>%
  svyttest(
    design = .,
    formula = Age ~ VotedPres2020_selection == "Biden",
    na.rm = TRUE
  )
ttest_solution3
## 
##  Design-based t-test
## 
## data:  Age ~ VotedPres2020_selection == "Biden"
## t = -6, df = 50, p-value = 2e-07
## alternative hypothesis: true difference in mean is not equal to 0
## 95 percent confidence interval:
##  -4.809 -2.388
## sample estimates:
## difference in mean 
##             -3.598
4. If you wanted to determine whether political party affiliation differed for males and females, what test would you use?
a. Goodness-of-fit test (svygofchisq())
b. Test of independence (svychisq())
c. Test of homogeneity (svychisq())
chisq_solution1 <- "c. Test of homogeneity (`svychisq()`)"
chisq_solution1
## [1] "c. Test of homogeneity (`svychisq()`)"
5. In the RECS data, is there a relationship between the type of housing unit (HousingUnitType) and the year the house was built (YearMade)?
chisq_solution2 <- recs_des %>%
  svychisq(
    formula = ~ HousingUnitType + YearMade,
    design = .,
    statistic = "Wald",
    na.rm = TRUE
  )
chisq_solution2
## 
##  Design-based Wald test of association
## 
## data:  NextMethod()
## F = 68, ndf = 32, ddf = 59, p-value <2e-16
6. In the ANES data, is there a difference in the distribution of gender (Gender) across early voting status in 2020 (EarlyVote2020)?
chisq_solution3 <- anes_des %>%
  svychisq(
    formula = ~ Gender + EarlyVote2020,
    design = .,
    statistic = "F",
    na.rm = TRUE
  )
chisq_solution3
## 
##  Pearson's X^2: Rao & Scott adjustment
## 
## data:  NextMethod()
## F = 0.32, ndf = 1, ddf = 51, p-value = 0.6
References
Lumley, Thomas. 2010. Complex Surveys: A Guide to Analysis Using R. John Wiley & Sons.
Scott, Alastair. 2007. "Rao-Scott Corrections and Their Impact." Section on Survey Research Methods, ASA. http://www.asasrms.org/Proceedings/y2007/Files/JSM2007-000874.pdf
Footnotes:
- For more information on statistical testing, we recommend reviewing introductory statistics textbooks.
- This could change in the future if another package is built or {srvyr} is expanded to work with tidymodels, but no such plans are known at this time.
- During the summer, what is your home's typical indoor temperature inside your home at night?
- This is the temperature that Stephanie prefers at night during the summer, and she wanted to see if she was different from the population.
- Is any air conditioning equipment used in your home?
- Is any air conditioning equipment used in your home?
- During the summer, what is your home's typical indoor temperature inside your home at night?
- During the winter, what is your home's typical indoor temperature inside your home at night?
- These two statistics can also be used for goodness-of-fit tests if the svygofchisq() function is not used.
- What is the highest level of school you have completed or the highest degree you have received?
- Data was pulled from data.census.gov using the S1501 Educational Attainment 2020: ACS 5-Year Estimates Subject Tables.
Chapter 7 Modeling
Prerequisites
For this chapter, load the following packages:
library(tidyverse)
library(survey)
library(srvyr)
library(srvyrexploR)
library(broom)
We will be using data from ANES and RECS described in Chapter 4. As a reminder, here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter 4 for more information).
targetpop <- 231592693
data(anes_2020)
anes_adjwgt <- anes_2020 %>%
  mutate(Weight = Weight / sum(Weight) * targetpop)
anes_des <- anes_adjwgt %>%
  as_survey_design(
    weights = Weight,
    strata = Stratum,
    ids = VarUnit,
    nest = TRUE
  )
For RECS, details are included in the RECS documentation and Chapters 4 and 10.
data(recs_2020)
recs_des <- recs_2020 %>%
  as_survey_rep(
    weights = NWEIGHT,
    repweights = NWEIGHT1:NWEIGHT60,
    type = "JK1",
    scale = 59/60,
    mse = TRUE
  )
7.1 Introduction
Modeling data is a way for researchers to investigate the relationship between a single dependent variable and one or more independent variables. This builds upon the analyses conducted in Chapter 6, which looked at the relationships between just two variables. For example, in Example 3 in Section 6.3.2, we investigated whether there is a relationship between the electric bill cost and whether or not the household used air-conditioning. However, there are potentially other elements that could influence a household's electric bill (e.g., outside temperature, desired internal temperature, types and number of appliances, etc.). T-tests only allow us to investigate the relationship with one independent variable at a time, but using models we can look at multiple variables and even explore interactions between these variables. There are several types of models, but in this chapter we will cover Analysis of Variance (ANOVA) and linear regression models following common Gaussian and logit distributions. Jonas Kristoffer Lindeløv has an interesting discussion of many statistical tests and models being equivalent to a linear model.
For example, a one-way ANOVA is a linear model with one categorical independent variable, and a two-sample t-test is an ANOVA where the independent variable has exactly two levels.
When modeling data, it is helpful to first create an equation that provides an overview as to what it is that we are modeling. The main structure of these models is as follows:
\\[y_i=\\beta_0 +\\sum_{k=1}^p \\beta_k x_{ik} + \\epsilon_i\\]
where \\(y_i\\) is the outcome, \\(\\beta_0\\) is an intercept, \\(x_1, \\cdots, x_p\\) are the predictors with \\(\\beta_1, \\cdots, \\beta_p\\) as the associated coefficients, and \\(\\epsilon_i\\) is the error. Different models may not include an intercept, may have interactions between different independent variables (\\(x_k\\)), or may have different underlying structures for the dependent variable (\\(y_i\\)). However, all linear models have the independent variables related to the dependent variable in a linear form.
To specify these models in R, the formulas are the same with both survey data and other data. The left side of the formula is the response/dependent variable, and the right side of the formula has the predictor/independent variable(s). There are many symbols used in R to specify the formula. For example, a linear formula mathematically specified as \\[Y_i=\\beta_0+\\beta_1 X_i+\\epsilon_i\\] would be specified in R as y~x where the intercept is not explicitly included. To fit a model with no intercept, that is, \\[Y_i=\\beta_1 X_i+\\epsilon_i\\] it can be specified as y~x-1. Formula notation details in R can be found in the help file for formula26. A quick overview of the common formula notation is in the following table:
Common symbols in formula notation
Symbol | Example | Meaning
+ | +X | include this variable
- | -X | delete this variable
: | X:Z | include the interaction between these variables
* | X*Z | include these variables and the interactions between them
^n | (X+Z+Y)^3 | include these variables and all interactions up to n-way
I | I(X-Z) | as-is: include a new variable which is the difference of these variables
There are often multiple ways to specify the same formula. For example, consider the following equation using the mtcars data:
\\[mpg_i=\\beta_0+\\beta_1cyl_{i}+\\beta_2disp_{i}+\\beta_3hp_{i}+\\beta_4cyl_{i}disp_{i}+\\beta_5cyl_{i}hp_{i}+\\beta_6disp_{i}hp_{i}+\\epsilon_i\\]
This could be specified as any of the following (a short sketch verifying the equivalence appears after this section):
mpg~(cyl+disp+hp)^2
mpg~cyl+disp+hp+cyl:disp+cyl:hp+disp:hp
mpg~cyl*disp+cyl*hp+disp*hp
Note that the following two specifications are not the same:
mpg~cyl:disp:hp (this only has the interactions and not the main effects)
mpg~cyl*disp*hp (this also has the 3-way interaction in addition to the main effects and 2-way interactions)
When using non-survey data such as experimental or observational data, researchers will use the glm() function for linear models. With survey data, however, we use svyglm() from the {survey} package to ensure that we account for the survey design and weights in modeling27. This allows us to generalize a model to the target population and accounts for the fact that the observations in the survey data may not be independent. As discussed in Chapter 6, modeling survey data cannot be directly done in {srvyr}, but can be done in the {survey} (Lumley 2010, 2023) package. In this chapter, we will provide syntax and examples for linear models, including ANOVA, Gaussian linear regression, and logistic regression.
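As a quick check of the formula equivalences discussed above, the terms() function expands a formula into its individual term labels; a minimal sketch using the mtcars variable names (any of the three equivalent forms returns the same labels):
# Expand a formula into its model terms; no data is needed for this step
attr(terms(mpg ~ (cyl + disp + hp)^2), "term.labels")
## [1] "cyl" "disp" "hp" "cyl:disp" "cyl:hp" "disp:hp"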
For details on other types of regression, including ordinal regression, log-linear models, and survival analysis, refer to Lumley (2010). Lumley (2010) also discusses custom models such as a negative binomial or Poisson model in Appendix E of his book.
7.2 Analysis of Variance (ANOVA)
In ANOVA, we are testing whether the mean of an outcome is the same across two or more groups. Statistically, we set this up as follows:
\\(H_0: \\mu_1 = \\mu_2= \\dots = \\mu_k\\) where \\(\\mu_i\\) is the mean outcome for group \\(i\\)
\\(H_A: \\text{At least one mean is different}\\)
Some assumptions when using ANOVA on survey data include:
The outcome variable is normally distributed within each group
The variances of the outcome variable between each group are approximately equal
We do NOT assume independence between the groups as with general ANOVA. The covariance is accounted for in the survey design.
7.2.1 Syntax
To perform this type of analysis in R, the general syntax is as follows:
des_obj %>%
  svyglm(
    formula = outcome ~ group,
    design = .,
    na.action = na.omit,
    df.resid = NULL
  )
The arguments are:
formula: Formula in the form of outcome~group. The group variable must be a factor or character.
design: a tbl_svy object created by as_survey
na.action: handling of missing data
df.resid: degrees of freedom for Wald tests (optional) - defaults to using degf(design)-(g-1) where \\(g\\) is the number of groups
The function svyglm() does not have the design as the first argument, so the dot (.) notation is used to pass it with a pipe (see Chapter 6 for more details). The default for missing data is na.omit, which means that we remove all records with any missing data in either the predictors or the outcome from the analysis. There are other options for handling missing data, and we recommend looking at the help documentation for na.omit (run help(na.omit) or ?na.omit) for more information on options to use for na.action. For a discussion of how to handle missing data, see Chapter 3.
7.2.2 Example
Looking at an example will help us discuss the output and how to interpret the results. In RECS, respondents are asked what temperature they set their thermostat to during the day and evening when using the air-conditioning during the summer. To analyze this data, we filter the respondents to only those using AC (ACUsed). Then, if we want to see if there are differences by region, we can use group_by(). A descriptive analysis of the temperature at night (SummerTempNight) set by region and the sample sizes is displayed below.
recs_des %>%
  filter(ACUsed) %>%
  group_by(Region) %>%
  summarize(
    SMN = survey_mean(SummerTempNight, na.rm = TRUE),
    n = unweighted(n()),
    n_na = unweighted(sum(is.na(SummerTempNight)))
  )
## # A tibble: 4 × 5
## Region SMN SMN_se n n_na
## <fct> <dbl> <dbl> <int> <int>
## 1 Northeast 69.7 0.103 3204 0
## 2 Midwest 71.0 0.0897 3619 0
## 3 South 71.8 0.0536 6065 0
## 4 West 72.5 0.129 3283 0
In the following code, we test whether this temperature varies by region by first using svyglm() to run the test and then using broom::tidy() to display the output. Note that the temperature setting is set to NA when the household does not use air-conditioning, and thus na.action=na.omit is specified to ignore these cases.
anova_out <- recs_des %>%
  svyglm(design = .,
         formula = SummerTempNight ~ Region,
         na.action = na.omit)
tidy(anova_out)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 69.7 0.103 674.
3.69e-111
## 2 RegionMidwest 1.34 0.138 9.68 1.46e- 13
## 3 RegionSouth 2.05 0.128 16.0 1.36e- 22
## 4 RegionWest 2.80 0.177 15.9 2.27e- 22
In the output above, we can see the estimated coefficients (estimate), estimated standard errors of the coefficients (std.error), the t-statistic (statistic), and the p-value for each coefficient. In this output, the intercept represents the reference value of the Northeast region28. The other coefficients indicate the difference in temperature relative to the Northeast region. For example, in the Midwest, temperatures are set, on average, 1.34 degrees higher than in the Northeast during summer nights.
7.3 Gaussian Linear Regression
Gaussian linear regression is a more generalized method than ANOVA where we fit a model of a continuous outcome with any number of categorical or continuous predictors, such that
\\[y_i=\\beta_0 +\\sum_{k=1}^p \\beta_k x_{ik} + \\epsilon_i\\]
where \\(y_i\\) is the outcome, \\(\\beta_0\\) is an intercept, \\(x_1, \\cdots, x_p\\) are the predictors with \\(\\beta_1, \\cdots, \\beta_p\\) as the associated coefficients, and \\(\\epsilon_i\\) is the error.
Assumptions in Gaussian linear regression using survey data include:
The residuals (\\(\\epsilon_i\\)) are normally distributed, but there is not an assumption of independence, and the correlation structure is captured in the survey design object
There is a linear relationship between the outcome variable and the independent variables
The residuals are homoscedastic; that is, the variance of the error term is the same across all values of the independent variables
7.3.1 Syntax
The syntax for this regression uses the same function as ANOVA, but can have more than one variable listed on the right-hand side of the formula:
des_obj %>%
  svyglm(
    formula = outcomevar ~ x1 + x2 + x3,
    design = .,
    na.action = na.omit,
    df.resid = NULL
  )
The arguments are:
formula: Formula in the form of y~x
design: a tbl_svy object created by as_survey
na.action: handling of missing data
df.resid: degrees of freedom for Wald tests (optional) - defaults to using degf(design)-p where \\(p\\) is the rank of the design matrix
As discussed at the beginning of the chapter, the formula on the right-hand side can be specified in many ways, whether interactions are desired or not, for example.
7.3.2 Examples
Example 1: Linear Regression with Single Variable
On RECS, we can obtain information on the square footage of homes and the electric bills. We assume that square footage is related to the amount of money spent on electricity and examine a model for this. Before any modeling, we first plot the data to determine whether it is reasonable to assume a linear relationship. In Figure 7.1, each hexagon represents the weighted count of households in the bin, and we can see a general positive linear trend (as the square footage increases, so does the amount of money spent on electricity).
FIGURE 7.1: Relationship between square footage and dollars spent on electricity, RECS 2020
Given that the plot shows a potential relationship, fitting a model will allow us to determine if the relationship is statistically significant. The model is fit below with electricity expenditure as the outcome.
m_electric_sqft <- recs_des %>%
  svyglm(design = .,
         formula = DOLLAREL ~ TOTSQFT_EN,
         na.action = na.omit)
tidy(m_electric_sqft)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 837.
12.8 65.5 4.43e-56
## 2 TOTSQFT_EN 0.299 0.00717 41.7 6.34e-45
In the output above, we can see the estimated coefficients (estimate), estimated standard errors of the coefficients (std.error), the t-statistic (statistic), and the p-value for each coefficient. In these results, we can say that, on average, for every additional square foot of house size, the electricity bill increases by 29.9 cents and that square footage is significantly associated with electricity expenditure. This is a very simple model, and there are likely many more factors related to electricity expenditure, including the type of cooling, number of appliances, location, and more. However, starting with single-variable models can help researchers understand what potential relationships exist between variables before fitting more complex models; researchers often begin with known relationships and then add variables to determine the impact the additional variables have on the model.
Example 2: Linear Regression with Additional Variables and Interactions
In the following example, a model is fit to predict electricity expenditure, including Census region (factor/categorical), urbanicity (factor/categorical), square footage (double/numeric), and whether air-conditioning is used (logical/categorical), with all two-way interactions also included. As a reminder, using -1 means that we are fitting this model without an intercept.
m_electric_multi <- recs_des %>%
  svyglm(
    design = .,
    formula = DOLLAREL ~ (Region + Urbanicity + TOTSQFT_EN + ACUsed)^2 - 1,
    na.action = na.omit
  )
tidy(m_electric_multi) %>% print(n = 50)
## # A tibble: 25 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 RegionNortheast 5.44e+2 56.6 9.61 2.37e-11
## 2 RegionMidwest 7.02e+2 78.1 8.99 1.28e-10
## 3 RegionSouth 9.39e+2 47.0 20.0 1.02e-20
## 4 RegionWest 6.03e+2 36.3 16.6 3.54e-18
## 5 UrbanicityUrban Cluster 7.30e+1 81.5 0.896 3.76e- 1
## 6 UrbanicityRural 2.04e+2 80.7 2.53 1.61e- 2
## 7 TOTSQFT_EN 2.41e-1 0.0279 8.65 3.28e-10
## 8 ACUsedTRUE 2.52e+2 54.1 4.66 4.42e- 5
## 9 RegionMidwest:UrbanicityUrban … 1.83e+2 82.4 2.22 3.28e- 2
## 10 RegionSouth:UrbanicityUrban Cl… 1.53e+2 76.0 2.01 5.26e- 2
## 11 RegionWest:UrbanicityUrban Clu… 9.80e+1 75.2 1.30 2.01e- 1
## 12 RegionMidwest:UrbanicityRural 3.13e+2 50.9 6.15 4.92e- 7
## 13 RegionSouth:UrbanicityRural 2.20e+2 55.0 4.00 3.12e- 4
## 14 RegionWest:UrbanicityRural 1.81e+2 58.7 3.08 3.98e- 3
## 15 RegionMidwest:TOTSQFT_EN -4.88e-2 0.0234 -2.09 4.41e- 2
## 16 RegionSouth:TOTSQFT_EN 2.97e-3 0.0264 0.113 9.11e- 1
## 17 RegionWest:TOTSQFT_EN -2.93e-2 0.0294 -0.997 3.25e- 1
## 18 RegionMidwest:ACUsedTRUE -2.93e+2 60.2 -4.86 2.42e- 5
## 19 RegionSouth:ACUsedTRUE -2.94e+2 57.4 -5.12 1.12e- 5
## 20 RegionWest:ACUsedTRUE -7.77e+1 47.0 -1.65 1.08e- 1
## 21 UrbanicityUrban Cluster:TOTSQF… -3.93e-2 0.0241 -1.63 1.11e- 1
## 22 UrbanicityRural:TOTSQFT_EN -6.45e-2 0.0248 -2.60 1.37e- 2
## 23 UrbanicityUrban Cluster:ACUsed… -1.30e+2 60.3 -2.16 3.77e- 2
## 24 UrbanicityRural:ACUsedTRUE -3.38e+1 59.3 -0.570 5.72e- 1
## 25 TOTSQFT_EN:ACUsedTRUE 8.29e-2 0.0238 3.48 1.35e- 3
As shown above, there are many terms in this model. To test whether the coefficients for a term are different from zero, the function regTermTest() can be used.
For example, in the above regression, we can test whether the interaction of region and urbanicity is significant as follows:
urb_reg_test <- regTermTest(m_electric_multi, ~Urbanicity:Region)
urb_reg_test
## Wald test for Urbanicity:Region
## in svyglm(design = ., formula = DOLLAREL ~ (Region + Urbanicity +
## TOTSQFT_EN + ACUsed)^2 - 1, na.action = na.omit)
## F = 6.851 on 6 and 35 df: p= 7.2e-05
This output indicates there is a significant interaction between urbanicity and region (p-value \\(<0.0001\\)).
To examine the predictions, residuals, and more from the model, the function augment() from {broom} can be used. The augment() function returns a tibble with the independent and dependent variables and other fit statistics. The augment() function has not been specifically written for objects of class svyglm, and as such, it currently displays a warning indicating this. Because it was not written exactly for this class of objects, a little tweaking needs to be done after using augment() to get the predicted (.fitted) and standard error (.se.fit) values. To obtain the standard error of the fitted values, we need to use the attr() function on the .fitted values created by augment().
fitstats <- augment(m_electric_multi) %>%
  mutate(.se.fit = sqrt(attr(.fitted, "var")),
         .fitted = as.numeric(.fitted))
fitstats
## # A tibble: 18,496 × 13
## DOLLAREL Region Urbanicity TOTSQFT_EN ACUsed `(weights)` .fitted
## <dbl> <fct> <fct> <dbl> <lgl> <dbl> <dbl>
## 1 1955. West Urban Area 2100 TRUE 0.492 1397.
## 2 713. South Urban Area 590 TRUE 1.35 1090.
## 3 335. West Urban Area 900 TRUE 0.849 1043.
## 4 1425. South Urban Area 2100 TRUE 0.793 1584.
## 5 1087 Northeast Urban Area 800 TRUE 1.49 1055.
## 6 1896. South Urban Area 4520 TRUE 1.09 2375.
## 7 1418. South Urban Area 2100 TRUE 0.851 1584.
## 8 1237. South Urban Clust… 900 FALSE 1.45 1349.
## 9 538. South Urban Area 750 TRUE 0.185 1142.
## 10 625. West Urban Area 760 TRUE 1.06 1002.
## # ℹ 18,486 more rows
## # ℹ 6 more variables: .resid <dbl>, .hat <dbl>, .sigma <dbl>,
## # .cooksd <dbl>, .std.resid <dbl>, .se.fit <dbl>
These results can then be used in a variety of ways, including examining residual plots as illustrated in the code below and Figure 7.2.
fitstats %>%
  ggplot(aes(x = .fitted, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  theme_minimal() +
  xlab("Fitted value of electricity cost") +
  ylab("Residual of model") +
  scale_y_continuous(labels = scales::dollar_format()) +
  scale_x_continuous(labels = scales::dollar_format())
FIGURE 7.2: Residual plot of electric cost model with covariates Region, Urbanicity, TOTSQFT_EN, and ACUsed
Additionally, augment() can be used to predict outcomes for data not used in modeling. Perhaps we would like to predict the energy expenditure for a home in an urban area in the South that uses air-conditioning and is 2,500 square feet. To do this, we first make a tibble including that additional data and then use the newdata argument in the augment() function. As before, to obtain the standard error of the predicted values, we need to use the attr() function.
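# Approach for the prediction below (comments only): append a row with the
# hypothetical household's values to the observed data, keep just that row
# with tail(1), and then predict for it via augment(newdata = ...)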
add_data <- recs_2020 %>%
  select(DOEID, Region, Urbanicity, TOTSQFT_EN, ACUsed, DOLLAREL) %>%
  rbind(
    tibble(
      DOEID = NA,
      Region = "South",
      Urbanicity = "Urban Area",
      TOTSQFT_EN = 2500,
      ACUsed = TRUE,
      DOLLAREL = NA
    )
  ) %>%
  tail(1)
pred_data <- augment(m_electric_multi, newdata = add_data) %>%
  mutate(.se.fit = sqrt(attr(.fitted, "var")),
         .fitted = as.numeric(.fitted))
pred_data
## # A tibble: 1 × 8
## DOEID Region Urbanicity TOTSQFT_EN ACUsed DOLLAREL .fitted .se.fit
## <dbl> <fct> <fct> <dbl> <lgl> <dbl> <dbl> <dbl>
## 1 NA South Urban Area 2500 TRUE NA 1715. 22.6
In the above example, it is predicted that the energy expenditure would be $1714.57.
7.4 Logistic Regression
Logistic regression is used to model a binary outcome and is a specific case of the generalized linear model (GLM). A GLM uses a link function to connect the response variable to the linear model. In logistic regression, the link function is the logit function. Specifically, the model is specified as follows:
\\[ y_i \\sim \\text{Bernoulli}(\\pi_i)\\]
\\[\\begin{equation} \\log \\left(\\frac{\\pi_i}{1-\\pi_i} \\right)=\\beta_0 +\\sum_{k=1}^p \\beta_k x_{ik} \\tag{7.1} \\end{equation}\\]
which can be re-expressed as
\\[ \\pi_i=\\frac{\\exp \\left(\\beta_0 +\\sum_{k=1}^p \\beta_k x_{ik} \\right)}{1+\\exp \\left(\\beta_0 +\\sum_{k=1}^p \\beta_k x_{ik} \\right)}.\\]
where \\(y_i\\) is the outcome, \\(\\beta_0\\) is an intercept, and \\(x_1, \\cdots, x_p\\) are the predictors with \\(\\beta_1, \\cdots, \\beta_p\\) as the associated coefficients.
Assumptions in logistic regression using survey data include:
The outcome variable has two levels
There is a linear relationship between the independent variables and the log odds (Equation (7.1))
The residuals are homoscedastic; that is, the variance of the error term is the same across all values of the independent variables
7.4.1 Syntax
The syntax for logistic regression is as follows:
des_obj %>%
  svyglm(
    formula = outcomevar ~ x1 + x2 + x3,
    design = .,
    na.action = na.omit,
    df.resid = NULL,
    family = quasibinomial
  )
The arguments are:
formula: Formula in the form of y~x
design: a tbl_svy object created by as_survey
na.action: handling of missing data
df.resid: degrees of freedom for Wald tests (optional) - defaults to using degf(design)-p where \\(p\\) is the rank of the design matrix
family: the error distribution/link function to be used in the model
Note svyglm() is the same function used in both ANOVA and linear regression. However, we've added the family argument quasibinomial. While we could use the binomial family, it is recommended to use the quasibinomial, as our weights may not be integers and the quasibinomial also allows for overdispersion. The quasibinomial family has a default logit link, which is what is specified in the equations above.
When specifying the outcome variable, it will likely be specified in one of three ways with survey data (a short sketch illustrating these codings appears in the example below):
A two-level factor variable where the first level of the factor indicates a "failure" and the second level indicates a "success"
A numeric variable which is 1 or 0 where 1 indicates a success
A logical variable where TRUE indicates a success
7.4.2 Examples
Example 1: Logistic Regression with Single Variable
In the following example, the ANES data is used, and we are modeling whether someone usually has trust in the government29 by who someone voted for president in 2020. As a reminder, the leading candidates were Biden and Trump, though people could vote for someone else not in the Democratic or Republican parties. Those votes are all grouped into an "Other" category.
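As a quick illustration of the three outcome codings listed above, here is a minimal sketch on a toy vector (x is illustrative, not survey data); the ANES example below uses the logical form:
x <- c("No", "Yes", "Yes", "No")
factor(x, levels = c("No", "Yes"))  # factor: first level ("No") is the failure
as.numeric(x == "Yes")              # numeric: 1 indicates a success
x == "Yes"                          # logical: TRUE indicates a success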
We first create a binary outcome for trusting in the government and plot the data. A scatter plot of the raw data is not useful as it is all 0 and 1 outcomes, so instead, we plot a summary of the data.
anes_des_der <- anes_des %>%
  mutate(TrustGovernmentUsually = case_when(
    is.na(TrustGovernment) ~ NA,
    TRUE ~ TrustGovernment %in% c("Always", "Most of the time")
  ))
anes_des_der %>%
  group_by(VotedPres2020_selection) %>%
  summarize(pct_trust = survey_mean(TrustGovernmentUsually,
                                    na.rm = TRUE,
                                    proportion = TRUE,
                                    vartype = "ci"),
            .groups = "drop") %>%
  filter(complete.cases(.)) %>%
  ggplot(aes(x = VotedPres2020_selection, y = pct_trust,
             fill = VotedPres2020_selection)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = pct_trust_low, ymax = pct_trust_upp),
                width = .2) +
  scale_fill_manual(values = c("#0b3954", "#bfd7ea", "#8d6b94")) +
  xlab("Election choice (2020)") +
  ylab("Usually trust the government") +
  scale_y_continuous(labels = scales::percent) +
  guides(fill = "none") +
  theme_minimal()
FIGURE 7.3: Relationship between candidate selection and trust in government, ANES 2020
By looking at Figure 7.3, it appears that people who voted for Trump are more likely to say that they usually have trust in the government compared to those who voted for Biden and Other candidates. To determine if this insight is accurate, we next fit the model.
logistic_trust_vote <- anes_des_der %>%
  svyglm(design = .,
         formula = TrustGovernmentUsually ~ VotedPres2020_selection,
         family = quasibinomial)
tidy(logistic_trust_vote)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -1.96 0.0714 -27.5 2.07e-31
## 2 VotedPres2020_selectionTrump 0.435 0.0920 4.72 1.98e- 5
## 3 VotedPres2020_selectionOther -0.655 0.440 -1.49 1.43e- 1
In the output above, we can see the estimated coefficients (estimate), estimated standard errors of the coefficients (std.error), the t-statistic (statistic), and the p-value for each coefficient. This output indicates that the log odds of usually trusting the government are 0.435 higher for respondents who voted for Trump than for those who voted for Biden (the reference level). Sometimes it is easier to talk about the odds instead of the log odds. In this case, we can also see the exponentiated coefficients, which illustrate the odds:
tidy(logistic_trust_vote, exponentiate = TRUE) %>%
  select(term, estimate)
## # A tibble: 3 × 2
## term estimate
## <chr> <dbl>
## 1 (Intercept) 0.141
## 2 VotedPres2020_selectionTrump 1.54
## 3 VotedPres2020_selectionOther 0.520
We can interpret this as saying that the odds of usually trusting the government for someone who voted for Trump are 1.54 times (154% of) the odds for a person who voted for Biden (the reference level). In comparison, a person who voted for neither Biden nor Trump has odds of usually trusting the government that are 52% of those for someone who voted for Biden. As with linear regression, the augment() function can be used to predict values. By default, the prediction is on the link (log-odds) scale, not the probability scale.
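Because the estimates are on the log-odds scale, applying the inverse logit by hand recovers a predicted probability; a minimal sketch, assuming the logistic_trust_vote object fitted above:
# Inverse logit (plogis) of the intercept gives the predicted probability of
# usually trusting the government for the reference group (Biden voters)
plogis(coef(logistic_trust_vote)[["(Intercept)"]])
## approximately 0.123, matching the Biden .fitted values shown below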
To predict the probability, add an argument of type.predict="response" as demonstrated below:
logistic_trust_vote %>%
  augment(type.predict = "response") %>%
  mutate(.se.fit = sqrt(attr(.fitted, "var")),
         .fitted = as.numeric(.fitted)) %>%
  select(TrustGovernmentUsually, VotedPres2020_selection, .fitted, .se.fit)
## # A tibble: 6,212 × 4
## TrustGovernmentUsually VotedPres2020_selection .fitted .se.fit
## <lgl> <fct> <dbl> <dbl>
## 1 FALSE Other 0.0681 0.0279
## 2 FALSE Biden 0.123 0.00772
## 3 FALSE Biden 0.123 0.00772
## 4 FALSE Trump 0.178 0.00919
## 5 FALSE Biden 0.123 0.00772
## 6 FALSE Trump 0.178 0.00919
## 7 FALSE Biden 0.123 0.00772
## 8 FALSE Biden 0.123 0.00772
## 9 TRUE Biden 0.123 0.00772
## 10 FALSE Biden 0.123 0.00772
## # ℹ 6,202 more rows
Example 2: Interaction Effects
Let's look at another example with interaction effects. If we're interested in understanding the demographics of people who voted for Biden, we could include Gender and Education in our model. First, we need to create an indicator for having voted for Biden. Note that this indicator places anyone who did not vote at all into VoteBiden = 0.
anes_des_ind <- anes_des %>%
  mutate(VoteBiden = case_when(VotedPres2020_selection == "Biden"~1,
                               TRUE ~ 0))
Let's first look at the main effects of gender and education.
log_biden_main <- anes_des_ind %>%
  svyglm(design = .,
         formula = VoteBiden ~ Gender + Education,
         family = quasibinomial)
tidy(log_biden_main)
## # A tibble: 6 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -1.24 0.191 -6.48 0.0000000545
## 2 GenderFemale 0.157 0.0763 2.05 0.0458
## 3 EducationHigh school 0.384 0.202 1.90 0.0631
## 4 EducationPost HS 0.619 0.186 3.32 0.00175
## 5 EducationBachelor's 1.20 0.191 6.32 0.0000000961
## 6 EducationGraduate 1.53 0.211 7.26 0.00000000371
This main effects model indicates that the log odds of voting for Biden are 1.53 higher for respondents with a graduate degree than for respondents with less than a high school degree (the reference level). Gender, however, is only marginally significant (p = 0.046). It is possible that there is an interaction between gender and education. To determine this, we can create a model that includes the interaction effects:
log_biden_int <- anes_des_ind %>%
  svyglm(design = .,
         formula = VoteBiden ~ (Gender + Education)^2,
         family = quasibinomial)
tidy(log_biden_int)
## # A tibble: 10 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.994 0.260 -3.82 4.32e-4
## 2 GenderFemale -0.377 0.441 -0.856 3.97e-1
## 3 EducationHigh school 0.0762 0.290 0.263 7.94e-1
## 4 EducationPost HS 0.411 0.273 1.51 1.39e-1
## 5 EducationBachelor's 1.01 0.270 3.75 5.30e-4
## 6 EducationGraduate 1.13 0.282 4.02 2.36e-4
## 7 GenderFemale:EducationHigh scho… 0.665 0.490 1.36 1.82e-1
## 8 GenderFemale:EducationPost HS 0.474 0.452 1.05 3.00e-1
## 9 GenderFemale:EducationBachelor's 0.436 0.451 0.967 3.39e-1
## 10 GenderFemale:EducationGraduate 0.844 0.463 1.82 7.56e-2
The results from the interaction model show a single interaction effect, between gender and a graduate degree, that approaches significance (p = 0.076). To better understand what this interaction means, we will want to plot the predicted probabilities. Let's first obtain the predicted probabilities for each possible combination of variables using the augment() function.
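# Steps below (comments only): predict on the response (probability) scale,
# extract the fitted probabilities and their standard errors, and keep only
# the variables needed for the interaction plot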
log_biden_pred <- log_biden_int %>%
  augment(type.predict = "response") %>%
  mutate(.se.fit = sqrt(attr(.fitted, "var")),
         .fitted = as.numeric(.fitted)) %>%
  select(VoteBiden, Gender, Education, .fitted, .se.fit)
We can then use this information to plot the predicted probabilities to better understand the interaction effects. To create an interaction plot, the y-axis will be the predicted probabilities, one of our x-variables will be on the x-axis, and the other will be represented by multiple lines. Figure 7.4 shows the interaction plot with the gender variable on the x-axis and education represented by the lines.
biden_int_plot <- log_biden_pred %>%
  filter(VoteBiden==1) %>%
  distinct() %>%
  arrange(Gender, Education) %>%
  mutate(Education = fct_reorder2(Education, Gender, .fitted)) %>%
  ggplot(aes(x = Gender, y = .fitted, group = Education,
             color = Education, linetype = Education)) +
  geom_line(linewidth = 1.1) +
  scale_color_manual(values = book_colors) +
  ylab("Predicted Probability of Voting for Biden") +
  guides(fill = "none") +
  theme_minimal()
biden_int_plot
FIGURE 7.4: Interaction Plot of Gender and Education Predicting the Probability of Voting for Biden
From this plot, we can see that among respondents with less than a high school education, males were more likely to vote for Biden than females. Additionally, females with a graduate degree were more likely to vote for Biden than males with a graduate degree. Interactions in models can be difficult to understand from the coefficients alone. Using these interaction plots can help others understand the nuances of the results.
7.5 Exercises
The type of housing unit may have an impact on energy expenses. Is there any relationship between housing unit type (HousingUnitType) and total energy expenditure (TOTALDOL)? First, find the average energy expenditure by housing unit type as a descriptive analysis and then do the test. The reference level in the comparison should be the housing unit type that is most common.
recs_des %>%
  group_by(HousingUnitType) %>%
  summarize(Expense = survey_mean(TOTALDOL, na.rm = TRUE),
            HUs = survey_total()) %>%
  arrange(desc(HUs))
## # A tibble: 5 × 5
## HousingUnitType Expense Expense_se HUs HUs_se
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Single-family detached 2205. 9.36 77067692. 0.00000277
## 2 Apartment: 5 or more units 1108. 13.7 22835862. 0.000000226
## 3 Apartment: 2-4 Units 1407. 24.2 9341795. 0.119
## 4 Single-family attached 1653. 22.3 7451177. 0.114
## 5 Mobile home 1773. 26.2 6832499. 0.0000000927
exp_unit_out <- recs_des %>%
  mutate(HousingUnitType = fct_infreq(HousingUnitType, NWEIGHT)) %>%
  svyglm(
    design = .,
    formula = TOTALDOL ~ HousingUnitType,
    na.action = na.omit
  )
tidy(exp_unit_out)
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2205. 9.36 236. 2.53e-84
## 2 HousingUnitTypeApartment: 5 or … -1097. 16.5 -66.3 3.52e-54
## 3 HousingUnitTypeApartment: 2-4 U… -798. 28.0 -28.5 1.37e-34
## 4 HousingUnitTypeSingle-family at… -551. 25.0 -22.1 5.28e-29
## 5 HousingUnitTypeMobile home -431. 27.4 -15.7 5.36e-22
# Single-family detached units are most common
# There is a significant relationship between energy expenditure and housing unit type
Does temperature play a role in energy expenditure? Cooling degree days are a measure of how hot a place is. CDD65 for a given day indicates the number of degrees Fahrenheit warmer than 65°F (18.3°C) it is in a location.
On a day that averages 65°F or below, CDD65 = 0, while a day that averages 85°F would have CDD65 = 20 because it is 20 degrees warmer. For each day in the year, this value is summed to give an indicator of how hot the place is throughout the year. Similarly, HDD65 indicates the degrees colder than 65°F (18.3°C)30. Can energy expenditure be predicted using these temperature indicators along with square footage? Is there a significant relationship? Include main effects and two-way interactions.
temps_sqft_exp <- recs_des %>%
  svyglm(
    design = .,
    formula = DOLLAREL ~ (TOTSQFT_EN + CDD65 + HDD65) ^ 2,
    na.action = na.omit
  )
tidy(temps_sqft_exp)
## # A tibble: 7 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 741. 70.5 10.5 1.44e-14
## 2 TOTSQFT_EN 0.272 0.0471 5.77 4.27e- 7
## 3 CDD65 0.0293 0.0227 1.29 2.02e- 1
## 4 HDD65 -0.00111 0.0104 -0.107 9.15e- 1
## 5 TOTSQFT_EN:CDD65 0.0000459 0.0000154 2.97 4.43e- 3
## 6 TOTSQFT_EN:HDD65 -0.00000840 0.00000633 -1.33 1.90e- 1
## 7 CDD65:HDD65 0.00000533 0.00000355 1.50 1.39e- 1
Continuing with our results from question 2, create a plot between the actual and predicted expenditures and a residual plot for the predicted expenditures.
temps_sqft_exp_fit <- temps_sqft_exp %>%
  augment() %>%
  mutate(.se.fit = sqrt(attr(.fitted, "var")),
         # extract the variance of the fitted value
         .fitted = as.numeric(.fitted))
temps_sqft_exp_fit %>%
  ggplot(aes(x = DOLLAREL, y = .fitted)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  xlab("Actual expenditures") +
  ylab("Predicted expenditures") +
  theme_minimal()
FIGURE 7.5: Actual and predicted electricity expenditures
temps_sqft_exp_fit %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  xlab("Predicted expenditure") +
  ylab("Residual value of expenditure") +
  theme_minimal()
FIGURE 7.6: Residual plot of electric cost model with covariates TOTSQFT_EN, CDD65, and HDD65
Early voting expanded in 202031. Build a logistic model predicting early voting in 2020 (EarlyVote2020) using age (Age), education (Education), and party identification (PartyID). Include two-way interactions.
earlyvote_mod <- anes_des %>%
  filter(!is.na(EarlyVote2020)) %>%
  svyglm(
    design = .,
    formula = EarlyVote2020 ~ (Age + Education + PartyID) ^ 2,
    family = quasibinomial
  )
tidy(earlyvote_mod) %>% arrange(p.value)
## # A tibble: 46 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Age:PartyIDIndependent -0.0587 0.0165 -3.55 0.0121
## 2 PartyIDIndependent 4.98 1.71 2.92 0.0268
## 3 Age:PartyIDNot very strong repu… -0.0501 0.0197 -2.54 0.0440
## 4 PartyIDNot very strong republic… 4.14 1.65 2.51 0.0462
## 5 (Intercept) 1.47 0.863 1.70 0.139
## 6 EducationGraduate 1.52 0.954 1.59 0.163
## 7 PartyIDStrong republican 1.77 1.29 1.37 0.221
## 8 EducationHigh school:PartyIDStr… -1.37 1.01 -1.35 0.226
## 9 EducationGraduate:PartyIDStrong… -1.28 1.00 -1.28 0.249
## 10 EducationPost HS:PartyIDIndepen… -1.47 1.39 -1.06 0.331
## # ℹ 36 more rows
Continuing from the previous exercise, predict the probability of early voting for two people. Both are 28 years old and have a graduate degree, but one person is a strong Democrat, and the other is a strong Republican.
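# Approach (comments only): append rows for the two hypothetical voters to the
# observed data, keep them with tail(2), and predict their early-voting
# probabilities via augment(newdata = ..., type.predict = "response")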
add_vote_dat <- anes_2020 %>%
  select(EarlyVote2020, Age, Education, PartyID) %>%
  rbind(tibble(
    EarlyVote2020 = NA,
    Age = 28,
    Education = "Graduate",
    PartyID = c("Strong democrat", "Strong republican")
  )) %>%
  tail(2)
log_ex_2_out <- earlyvote_mod %>%
  augment(newdata = add_vote_dat, type.predict = "response") %>%
  mutate(.se.fit = sqrt(attr(.fitted, "var")),
         # extract the variance of the fitted value
         .fitted = as.numeric(.fitted))
References
Bollen, Kenneth A., Paul P. Biemer, Alan F. Karr, Stephen Tueller, and Marcus E. Berzofsky. 2016. "Are Survey Weights Needed? A Review of Diagnostic Tests in Regression Analysis." Annual Review of Statistics and Its Application 3 (1): 375–92. https://doi.org/10.1146/annurev-statistics-011516-012958.
Gelman, Andrew. 2007. "Struggles with Survey Weighting and Regression Modeling." Statistical Science 22 (2): 153–64. https://doi.org/10.1214/088342306000000691.
Lumley, Thomas. 2010. Complex Surveys: A Guide to Analysis Using R. John Wiley & Sons.
———. 2023. survey: Analysis of Complex Survey Samples. http://r-survey.r-forge.r-project.org/survey/.
Use help(formula) or ?formula in R or find the documentation online at https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html↩︎
There is some debate about whether weights should be used in regression (Gelman 2007; Bollen et al. 2016). However, for the purposes of providing complete information on how to analyze complex survey data, this chapter will include weights.↩︎
To change the reference level, reorder the factor before modeling using the function relevel() from {stats} or using one of many factor ordering functions in {forcats} such as fct_relevel() or fct_infreq()↩︎
Question: How often can you trust the federal government in Washington to do what is right?↩︎
https://www.eia.gov/energyexplained/units-and-calculators/degree-days.php↩︎
https://www.npr.org/2020/10/26/927803214/62-million-and-counting-americans-are-breaking-early-voting-records↩︎
Chapter 8 Communicating results
Prerequisites
For this chapter, load the following packages:
library(tidyverse)
library(survey)
library(srvyr)
library(srvyrexploR)
library(gt)
library(gtsummary)
We will be using data from ANES as described in Chapter 4. As a reminder, here is the code to create the design object to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter 4 for more information).
targetpop <- 231592693
data(anes_2020)
anes_adjwgt <- anes_2020 %>%
  mutate(Weight = Weight / sum(Weight) * targetpop)
anes_des <- anes_adjwgt %>%
  as_survey_design(
    weights = Weight,
    strata = Stratum,
    ids = VarUnit,
    nest = TRUE
  )
8.1 Introduction
After finishing the analysis and modeling, we proceed to the important task of communicating the survey results. Our audience may range from seasoned researchers familiar with our survey data to newcomers encountering the information for the first time. We should aim to explain the methodology and analysis while presenting findings in an accessible way, and it is our responsibility to report information with care. Before beginning any dissemination of results, consider questions such as:
How will we present results? Examples include a website, print, or other media.
Based on the media type, we might limit or enhance the use of graphical representation.
What is the audience's familiarity with the study and/or data? Audiences can range from the general public to data experts. If we anticipate limited knowledge about the study, we should provide detailed descriptions (we discuss recommendations later in the chapter).
What are we trying to communicate? It could be summary statistics, trends, patterns, or other insights. Tables might suit summary statistics, while plots are better at conveying trends and patterns.
Is the audience accustomed to interpreting plots? If not, include explanatory text to guide them on how to interpret the plots effectively.
What is the audience's statistical knowledge? If the audience does not have a strong statistics background, provide text on standard errors, confidence intervals, and other estimate types to enhance understanding.
8.2 Describing results through text
As analysts, our emphasis is often on the data, and communicating results can sometimes be overlooked. First, we need to identify the appropriate information to share with our audience. Chapters 2 and 3 provide insights into factors we need to consider during analysis, and they remain relevant when presenting results to others.
8.2.1 Methodology
If we are using existing data, methodologically sound surveys will provide documentation about how the survey was fielded, the questionnaires, and other necessary information for analyses. For example, the survey's methodology reports should include the population of interest, sampling procedures, response rates, questionnaire documentation, weighting, and a general overview of disclosure statements. Many American organizations follow the American Association for Public Opinion Research's (AAPOR) Transparency Initiative. The AAPOR Transparency Initiative requires organizations to include specific details in their methodology, making it clear how we can and should analyze the results. Being transparent about these methods is vital for the scientific rigor of the field.
The details provided in Chapter 2 about the survey process should be shared with the audience when presenting the results. When using publicly available data, like the examples in this book, we can often link to the methodology report in our final output. We should also provide high-level information for the audience to quickly grasp the context around the findings. For example, we can mention when and where the study was conducted, the population's age range, or other contextual details. This information helps the audience understand how generalizable the results are. Providing this material is especially important when there's no methodology report available for the analyzed data. For example, if a researcher conducted a new survey for a specific purpose, we should document and present all the pertinent information during the analysis and reporting process. Adhering to the AAPOR Transparency Initiative guidelines is a reliable method to guarantee that all essential information is communicated to the audience.
8.2.2 Analysis
Along with the survey methodology and weight calculations, we should also share our approach to preparing, cleaning, and analyzing the data. For example, in Chapter 6, we compared education distributions from the ANES survey to the American Community Survey (ACS). To make the comparison, we had to collapse education categories provided in the ANES data to match the ACS.
The process for this particular example may seem straightforward (like combining Bachelor's and Graduate Degrees into a single category), but there are multiple ways to deal with the data. Our choice is just one of many. We should document both the original ANES question and response options and the steps we took to match it with ACS data. This transparency helps clarify our analysis to our audience.
Missing data is another instance where we want to be unambiguous and upfront with our audience. In this book, numerous examples and exercises remove missing data, as this is often the easiest way to handle them. However, there are circumstances where missing data hold substantive importance, and excluding them could introduce bias (see Chapter 11). Being transparent about our handling of missing data is important to maintaining the integrity of our analysis and ensuring a comprehensive understanding of the results.
8.2.3 Results
While tables and graphs are commonly used to communicate results, there are instances where text can be more effective in sharing information. Narrative details, such as context around point estimates or model coefficients, can go a long way in improving our communication. We have several strategies to effectively convey the significance of the data to the audience through text.
First, we can highlight important data points in a sentence using plain language. For example, if we were looking at election polling data conducted before an election, we could say something like:
As of [DATE], an estimated XX% of registered U.S. voters say they will vote for [CANDIDATE NAME] for president in the [YEAR] general election.
This sentence provides key pieces of information in a straightforward way:
[DATE]: Given that polling data is time-specific, providing the date of reference lets the audience know when this data was valid.
Registered U.S. voters: This tells the audience who we surveyed, letting them know the target population.
XX%: This part provides the estimated percentage of people voting for a specific candidate for a specific office.
[YEAR] general election: As with the bullet above, adding this gives more context about the election type and year. The estimate would take on a different meaning if we changed it to a primary election instead of a general election.
We also included the word "estimated." When presenting aggregate survey results, we have errors around each estimate. We want to convey this uncertainty rather than talk in absolutes. Words like "estimated," "on average," or "around" can help communicate this uncertainty to the audience. Instead of saying "XX%," we can also say "XX% (+/- Y%)" to show the margin of error. Confidence intervals can also be incorporated into the text to assist readers.
Second, providing context and discussing the meaning behind a point estimate can help the audience glean some insight into why the data is important. For example, when comparing two values, it can be helpful to highlight if there are statistically significant differences and explain the impact and relevance of this information. This is where we, as analysts, should do our best to be mindful of biases and present the facts logically. Keep in mind that how we discuss these findings can greatly influence how the audience interprets them. If we include speculation, using phrases like "the authors speculate" or "these findings may indicate" relays the uncertainty around the notion while still lending a plausible explanation.
Additionally, we can present alternative viewpoints or competing discussion points to explain the uncertainty in the results.
8.3 Visualizing data
Although discussing key findings in the text is important, presenting large amounts of data is often more digestible for the audience in tables or visualizations. Effectively combining text, tables, and graphs can be powerful in communicating results. This section provides examples of using the {gt}, {gtsummary}, and {ggplot2} packages to enhance the dissemination of results.
8.3.1 Tables
Tables are a great way to provide a large amount of data when individual data points need to be examined. However, it is important to present tables in a reader-friendly format. Numbers should align, rows and columns should be easy to follow, and the table size should not compromise readability. Using key visualization techniques, we can create tables that are informative and nice to look at. Many packages create easy-to-read tables (e.g., {kable} + {kableExtra}, {gt}, {gtsummary}, {DT}, {formattable}, {flextable}, {reactable}). While we will focus on {gt} here, we encourage learning about others as they may have additional helpful features. We appreciate the flexibility, ability to use pipes (e.g., %>%), and numerous extensions of the {gt} package. Please note, at this time, {gtsummary} needs additional features to be widely used for survey analysis, particularly because it cannot work with replicate designs. We provide one example using {gtsummary} and hope it evolves into a more comprehensive tool over time.
8.3.1.1 Transitioning {srvyr} output to a {gt} table
Let's start by using some of the data we calculated earlier in this book. In Chapter 6, we looked at data on trust in government with the proportions calculated below:
trust_gov <- anes_des %>%
  drop_na(TrustGovernment) %>%
  group_by(TrustGovernment) %>%
  summarize(trust_gov_p = survey_prop())
trust_gov
## # A tibble: 5 × 3
## TrustGovernment trust_gov_p trust_gov_p_se
## <fct> <dbl> <dbl>
## 1 Always 0.0155 0.00204
## 2 Most of the time 0.132 0.00553
## 3 About half the time 0.309 0.00829
## 4 Some of the time 0.434 0.00855
## 5 Never 0.110 0.00566
The default output generated by R may work for initial viewing inside RStudio or when creating basic output in an R Markdown or Quarto document. However, when presenting these results in other publications, such as the print version of this book or with other formal dissemination modes, modifying the display can improve our reader's experience. Looking at the output from trust_gov, a couple of improvements are obvious: (1) switching to percentages instead of proportions and (2) replacing the variable names used as column headers with informative labels. The {gt} package is a good tool for implementing better labeling and creating publishable tables. Let's walk through some code as we make a few changes to improve the table's usefulness.
First, we initiate the table with the gt() function. Next, we use the rowname_col argument of gt() to designate the TrustGovernment column as the labels for each row (called the table "stub"). We apply the cols_label() function to create informative column labels instead of variable names, and then the tab_spanner() function to add a label across multiple columns. In this case, we label all columns except the stub with "Trust in Government, 2020". We then format the proportions into percentages with the fmt_percent() function and reduce the number of decimals shown with decimals = 1.
Finally, the tab_caption() function adds a table title for the HTML version of the book. We can use the caption for cross-referencing in R Markdown, Quarto, and bookdown, as well as adding it to the list of tables in the book.
trust_gov_gt <- trust_gov %>%
  gt(rowname_col = "TrustGovernment") %>%
  cols_label(trust_gov_p = "%",
             trust_gov_p_se = "s.e. (%)") %>%
  tab_spanner(label = "Trust in Government, 2020",
              columns = c(trust_gov_p, trust_gov_p_se)) %>%
  fmt_percent(decimals = 1)
trust_gov_gt %>%
  tab_caption("Example of gt table with trust in government estimate")
TABLE 8.1: Example of gt table with trust in government estimate
Trust in Government, 2020
 | % | s.e. (%)
Always | 1.6% | 0.2%
Most of the time | 13.2% | 0.6%
About half the time | 30.9% | 0.8%
Some of the time | 43.4% | 0.9%
Never | 11.0% | 0.6%
We can add a few more enhancements, such as a title, a data source note, and a footnote with the question information, using the functions tab_header(), tab_source_note(), and tab_footnote(). If having the percentage sign in both the header and the cells seems redundant, we can opt for fmt_number() instead of fmt_percent() and scale the number by 100 with scale_by = 100.
trust_gov_gt2 <- trust_gov_gt %>%
  tab_header("American voter's trust in the federal government, 2020") %>%
  tab_source_note("American National Election Studies, 2020") %>%
  tab_footnote(
    "Question text: How often can you trust the federal government in Washington to do what is right?"
) %>% fmt_number(scale_by = 100, decimals = 1) trust_gov_gt2 #tqzxzfmzhm table { font-family: system-ui, 'Segoe UI', Roboto, Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol', 'Noto Color Emoji'; -webkit-font-smoothing: antialiased; -moz-osx-font-smoothing: grayscale; } #tqzxzfmzhm thead, #tqzxzfmzhm tbody, #tqzxzfmzhm tfoot, #tqzxzfmzhm tr, #tqzxzfmzhm td, #tqzxzfmzhm th { border-style: none; } #tqzxzfmzhm p { margin: 0; padding: 0; } #tqzxzfmzhm .gt_table { display: table; border-collapse: collapse; line-height: normal; margin-left: auto; margin-right: auto; color: #333333; font-size: 16px; font-weight: normal; font-style: normal; background-color: #FFFFFF; width: auto; border-top-style: solid; border-top-width: 2px; border-top-color: #A8A8A8; border-right-style: none; border-right-width: 2px; border-right-color: #D3D3D3; border-bottom-style: solid; border-bottom-width: 2px; border-bottom-color: #A8A8A8; border-left-style: none; border-left-width: 2px; border-left-color: #D3D3D3; } #tqzxzfmzhm .gt_caption { padding-top: 4px; padding-bottom: 4px; } #tqzxzfmzhm .gt_title { color: #333333; font-size: 125%; font-weight: initial; padding-top: 4px; padding-bottom: 4px; padding-left: 5px; padding-right: 5px; border-bottom-color: #FFFFFF; border-bottom-width: 0; } #tqzxzfmzhm .gt_subtitle { color: #333333; font-size: 85%; font-weight: initial; padding-top: 3px; padding-bottom: 5px; padding-left: 5px; padding-right: 5px; border-top-color: #FFFFFF; border-top-width: 0; } #tqzxzfmzhm .gt_heading { background-color: #FFFFFF; text-align: center; border-bottom-color: #FFFFFF; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; } #tqzxzfmzhm .gt_bottom_border { border-bottom-style: solid; border-bottom-width: 2px; border-bottom-color: #D3D3D3; } #tqzxzfmzhm .gt_col_headings { border-top-style: solid; border-top-width: 2px; border-top-color: #D3D3D3; border-bottom-style: solid; border-bottom-width: 2px; border-bottom-color: #D3D3D3; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; } #tqzxzfmzhm .gt_col_heading { color: #333333; background-color: #FFFFFF; font-size: 100%; font-weight: normal; text-transform: inherit; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 6px; padding-left: 5px; padding-right: 5px; overflow-x: hidden; } #tqzxzfmzhm .gt_column_spanner_outer { color: #333333; background-color: #FFFFFF; font-size: 100%; font-weight: normal; text-transform: inherit; padding-top: 0; padding-bottom: 0; padding-left: 4px; padding-right: 4px; } #tqzxzfmzhm .gt_column_spanner_outer:first-child { padding-left: 0; } #tqzxzfmzhm .gt_column_spanner_outer:last-child { padding-right: 0; } #tqzxzfmzhm .gt_column_spanner { border-bottom-style: solid; border-bottom-width: 2px; border-bottom-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 5px; overflow-x: hidden; display: inline-block; width: 100%; } #tqzxzfmzhm .gt_spanner_row { border-bottom-style: hidden; } #tqzxzfmzhm .gt_group_heading { padding-top: 8px; padding-bottom: 8px; padding-left: 5px; padding-right: 5px; color: #333333; background-color: #FFFFFF; font-size: 100%; font-weight: initial; text-transform: inherit; 
TABLE 8.2: Example of gt table with trust in government estimates and additional context

American voter's trust in the federal government, 2020
Trust in Government, 2020      %        s.e. (%)
Always                         1.6      0.2
Most of the time              13.2      0.6
About half the time           30.9      0.8
Some of the time              43.4      0.9
Never                         11.0      0.6
American National Election Studies, 2020
Question text: How often can you trust the federal government in Washington to do what is right?
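Once we are satisfied with the styled table, we may want it outside of R. While not shown in the original example, {gt} provides gtsave() for exporting a table to a file; a minimal sketch (the file name here is illustrative):

trust_gov_gt2 %>%
  gtsave("trust_gov_table.html")  # the extension controls the output format (e.g., .html, .rtf)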
Expanding tables using {gtsummary}

The {gtsummary} package simultaneously summarizes data and creates publication-ready tables. Initially designed for clinical trial data, it has been extended to include survey analysis in certain capacities. At this time, it is only compatible with survey objects using Taylor's Series Linearization and not replicate methods. While it offers a restricted set of summary statistics, the following are available for categorical variables:

- {n} frequency
- {N} denominator, or cohort size
- {p} percentage
- {p.std.error} standard error of the sample proportion
- {deff} design effect of the sample proportion
- {n_unweighted} unweighted frequency
- {N_unweighted} unweighted denominator
- {p_unweighted} unweighted formatted percentage

The following summary statistics are available for continuous variables:

- {median} median
- {mean} mean
- {mean.std.error} standard error of the sample mean
- {deff} design effect of the sample mean
- {sd} standard deviation
- {var} variance
- {min} minimum
- {max} maximum
- {p##} any integer percentile, where ## is an integer from 0 to 100
- {sum} sum
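These placeholders can be mixed within a single glue string. For instance, a hypothetical specification pairing the unweighted frequency with the weighted percentage might look like the sketch below (the statistic string is illustrative and not one of the book's examples):

anes_des %>%
  tbl_svysummary(
    include = TrustGovernment,
    # hypothetical statistic string: unweighted n alongside weighted percent
    statistic = list(all_categorical() ~ "{n_unweighted} ({p}%)")
  )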
In the following example, we will build a table using {gtsummary}, similar to the table in the {gt} example. The main function we use is tbl_svysummary(). In this function, we include the variables we want to analyze in the include argument and define the statistics we want to display in the statistic argument. To specify the statistics, we apply the syntax from the {glue} package, where we enclose the variables we want to insert within curly brackets. We must specify the desired statistics using the names listed above. For example, to specify that we want the proportion followed by the standard error of the proportion in parentheses, we use {p} ({p.std.error}).

anes_des_gtsum <- anes_des %>%
  tbl_svysummary(
    include = TrustGovernment,
    statistic = list(all_categorical() ~ "{p} ({p.std.error})")
  )

anes_des_gtsum
TABLE 8.3: Example of gtsummary table with trust in government estimates

Characteristic                                                                 N = 231,034,125¹
PRE: How often trust government in Washington to do what is right [revised]
    Always                 1.6 (0.00)
    Most of the time       13 (0.01)
    About half the time    31 (0.01)
    Some of the time       43 (0.01)
    Never                  11 (0.01)
    Unknown                673,773
¹ % (SE(%))

The default table includes the weighted number of missing (or Unknown) records. The standard error is reported as a proportion, while the proportion is styled as a percentage. In the next step, we remove the Unknown category by setting the missing argument to “no” and format the standard error as a percentage using the digits argument. To improve the table for publication, we provide a more polished label for the “TrustGovernment” variable using the label argument.
anes_des_gtsum2 <- anes_des %>%
  tbl_svysummary(
    include = TrustGovernment,
    statistic = list(all_categorical() ~ "{p} ({p.std.error})"),
    missing = "no",
    digits = list(TrustGovernment ~ style_percent),
    label = list(TrustGovernment ~ "Trust in Government, 2020")
  )

anes_des_gtsum2
TABLE 8.4: Example of gtsummary table with trust in government estimates with labeling and digits options

Characteristic                 N = 231,034,125¹
Trust in Government, 2020
    Always                 1.6 (0.2)
    Most of the time       13 (0.6)
    About half the time    31 (0.8)
    Some of the time       43 (0.9)
    Never                  11 (0.6)
¹ % (SE(%))
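As an aside, the digits argument need not be a styling function such as style_percent; it can also take an integer vector with one element per statistic in the glue string, as the Age example later in this section does. A hypothetical variant of the table above with one decimal place for {p} and two for {p.std.error}:

anes_des %>%
  tbl_svysummary(
    include = TrustGovernment,
    statistic = list(all_categorical() ~ "{p} ({p.std.error})"),
    missing = "no",
    # hypothetical formatting choice: 1 decimal for {p}, 2 for {p.std.error}
    digits = list(TrustGovernment ~ c(1, 2))
  )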
To exclude the term “Characteristic” and the estimated population size, we can modify the header using the modify_header() function to update the label. Further adjustments can be made based on personal preferences, organizational guidelines, or other style guides. If we prefer having the standard error in the header, similar to the {gt} table, instead of in the footnote (the {gtsummary} default), we can make these changes by specifying stat_0 in the modify_header() function. Additionally, using modify_footnote() with update = everything() ~ NA removes the standard error from the footnote. After transforming the object into a gt table using as_gt(), we can add footnotes and a title using the same methods explained in Section 8.3.1.1.

anes_des_gtsum3 <- anes_des %>%
  tbl_svysummary(
    include = TrustGovernment,
    statistic = list(all_categorical() ~ "{p} ({p.std.error})"),
    missing = "no",
    digits = list(TrustGovernment ~ style_percent),
    label = list(TrustGovernment ~ "Trust in Government, 2020")
  ) %>%
  modify_footnote(update = everything() ~ NA) %>%
  modify_header(label = " ", stat_0 = "% (s.e.)") %>%
  as_gt() %>%
  tab_header("American voter's trust in the federal government, 2020") %>%
  tab_source_note("American National Election Studies, 2020") %>%
  tab_footnote(
    "Question text: How often can you trust the federal government in Washington to do what is right?"
  )

anes_des_gtsum3
TABLE 8.5: Example of gtsummary table with trust in government estimates with more labeling options and context

American voter's trust in the federal government, 2020
                               % (s.e.)
Trust in Government, 2020
    Always                 1.6 (0.2)
    Most of the time       13 (0.6)
    About half the time    31 (0.8)
    Some of the time       43 (0.9)
    Never                  11 (0.6)
American National Election Studies, 2020
Question text: How often can you trust the federal government in Washington to do what is right?
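Because as_gt() hands back a regular {gt} object, any further {gt} styling can be chained onto the result. A minimal sketch with a hypothetical font-size tweak (the value is arbitrary):

anes_des_gtsum3 %>%
  tab_options(table.font.size = px(12))  # tab_options() and px() are {gt} helpers; size chosen for illustration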
We can also include continuous variables in the table. Below, we add a summary of the age variable by updating the include, statistic, and digits arguments.

anes_des_gtsum4 <- anes_des %>%
  tbl_svysummary(
    include = c(TrustGovernment, Age),
    statistic = list(
      all_categorical() ~ "{p} ({p.std.error})",
      all_continuous() ~ "{mean} ({mean.std.error})"
    ),
    missing = "no",
    digits = list(TrustGovernment ~ style_percent, Age ~ c(1, 2)),
    label = list(TrustGovernment ~ "Trust in Government, 2020")
  ) %>%
  modify_footnote(update = everything() ~ NA) %>%
  modify_header(label = " ", stat_0 = "% (s.e.)") %>%
  as_gt() %>%
  tab_header("American voter's trust in the federal government, 2020") %>%
  tab_source_note("American National Election Studies, 2020") %>%
  tab_footnote(
    "Question text: How often can you trust the federal government in Washington to do what is right?"
  ) %>%
  tab_caption("Example of gtsummary table with trust in government estimates and average age")

anes_des_gtsum4
TABLE 8.6: Example of gtsummary table with trust in government estimates and average age

American voter's trust in the federal government, 2020
                                   % (s.e.)
Trust in Government, 2020
    Always                     1.6 (0.2)
    Most of the time           13 (0.6)
    About half the time        31 (0.8)
    Some of the time           43 (0.9)
    Never                      11 (0.6)
PRE: SUMMARY: Respondent age   47.3 (0.36)
American National Election Studies, 2020
Question text: How often can you trust the federal government in Washington to do what is right?
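For continuous variables, any of the placeholders listed earlier can be swapped in. A sketch of a hypothetical alternative that summarizes Age as a weighted median with quartiles:

anes_des %>%
  tbl_svysummary(
    include = Age,
    # median with 25th and 75th percentiles instead of mean and standard error
    statistic = list(all_continuous() ~ "{median} ({p25}, {p75})"),
    missing = "no"
  )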
With {gtsummary}, we can also calculate statistics by different groups. Let’s modify the previous example to analyze data on whether a respondent voted for president in 2020. We update the by argument and refine the header.

anes_des_gtsum5 <- anes_des %>%
  drop_na(VotedPres2020) %>%
  tbl_svysummary(
    include = TrustGovernment,
    statistic = list(all_categorical() ~ "{p} ({p.std.error})"),
    missing = "no",
    digits = list(TrustGovernment ~ style_percent),
    label = list(TrustGovernment ~ "Trust in Government, 2020"),
    by = VotedPres2020
  ) %>%
  modify_footnote(update = everything() ~ NA) %>%
  modify_header(label = " ", stat_1 = "Voted", stat_2 = "Didn't vote") %>%
  as_gt() %>%
  tab_header(
    "American voter's trust in the federal government by whether they voted in the 2020 presidential election"
  ) %>%
  tab_source_note("American National Election Studies, 2020") %>%
  tab_footnote(
    "Question text: How often can you trust the federal government in Washington to do what is right?"
  )

anes_des_gtsum5
TABLE 8.7: Example of gtsummary table with trust in government estimates by voting status

American voter's trust in the federal government by whether they voted in the 2020 presidential election
                               Voted        Didn’t vote
Trust in Government, 2020
    Always                 1.1 (0.2)    1.0 (1.1)
    Most of the time       13 (0.6)     19 (5.6)
    About half the time    31 (0.9)     27 (7.4)
    Some of the time       45 (0.9)     47 (7.7)
    Never                  9.0 (0.7)    5.1 (2.3)
American National Election Studies, 2020
Question text: How often can you trust the federal government in Washington to do what is right?
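With a grouped table like this, {gtsummary} can also append a p-value comparing the groups via add_p(), which applies a design-based test for survey objects. A minimal sketch, not part of the original example (add_p() must be called on the {gtsummary} object, before as_gt()):

anes_des %>%
  drop_na(VotedPres2020) %>%
  tbl_svysummary(
    include = TrustGovernment,
    statistic = list(all_categorical() ~ "{p} ({p.std.error})"),
    missing = "no",
    by = VotedPres2020
  ) %>%
  add_p()  # appends a design-based test comparing the Voted and Didn't vote columns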
TABLE 8.7: Example of gtsummary table with trust in government estimates by voting status

American voter's trust in the federal government by whether they voted in the 2020 presidential election

                               Voted       Didn't vote
  Trust in Government, 2020
    Always                     1.1 (0.2)   1.0 (1.1)
    Most of the time           13 (0.6)    19 (5.6)
    About half the time        31 (0.9)    27 (7.4)
    Some of the time           45 (0.9)    47 (7.7)
    Never                      9.0 (0.7)   5.1 (2.3)

  American National Election Studies, 2020
  Question text: How often can you trust the federal government in Washington to do what is right?

8.3.2 Charts and plots

Survey analysis can yield an abundance of printed summary statistics and models. Even with the most careful analysis, interpreting the results can be overwhelming. This is where charts and plots play a key role in our work. By transforming complex data into a visual representation, we can recognize patterns, relationships, and trends with greater ease.

R has numerous packages for creating compelling and insightful charts. In this section, we will focus on {ggplot2}, a member of the {tidyverse} collection of packages. Known for its power and flexibility, {ggplot2} is an invaluable tool for creating a wide range of data visualizations. The {ggplot2} package follows the "grammar of graphics," a framework that incrementally adds layers of chart components. This approach allows us to customize visual elements such as scales, colors, labels, and annotations to enhance the clarity of our results.

After creating the survey design object, we can modify it to include additional outcomes and calculate estimates for our desired data points. Below, we create a binary variable TrustGovernmentUsually, which is TRUE when TrustGovernment is "Always" or "Most of the time" and FALSE otherwise. Then, we calculate the percentage of people who usually trust the government based on their vote in the 2020 presidential election (VotedPres2020_selection). We remove the cases where people did not vote or did not indicate their choice.
anes_des_der <- anes_des %>%
  mutate(TrustGovernmentUsually = case_when(
    is.na(TrustGovernment) ~ NA,
    TRUE ~ TrustGovernment %in% c("Always", "Most of the time")
  )) %>%
  drop_na(VotedPres2020_selection) %>%
  group_by(VotedPres2020_selection) %>%
  summarize(
    pct_trust = survey_mean(
      TrustGovernmentUsually,
      na.rm = TRUE,
      proportion = TRUE,
      vartype = "ci"
    ),
    .groups = "drop"
  )

anes_des_der

## # A tibble: 3 × 4
##   VotedPres2020_selection pct_trust pct_trust_low pct_trust_upp
##   <fct>                       <dbl>         <dbl>         <dbl>
## 1 Biden                      0.123         0.109         0.140
## 2 Trump                      0.178         0.161         0.198
## 3 Other                      0.0681        0.0290        0.152

Now, we can begin creating our chart with {ggplot2}. First, we set up our plot with ggplot(). Next, we define the data points to be displayed using aesthetics, or aes. Aesthetics represent the visual properties of the objects in the plot. In the example below, we map the x variable to VotedPres2020_selection from the dataset and the y variable to pct_trust. Finally, we specify the type of plot with geom_*(), in this case, geom_bar(). The resulting plot is displayed in Figure 8.1.

p <- anes_des_der %>%
  ggplot(aes(x = VotedPres2020_selection, y = pct_trust)) +
  geom_bar(stat = "identity")

p

FIGURE 8.1: Bar chart of trust in government by chosen 2020 presidential candidate

This is a great starting point: we observe a higher percentage of people stating that they usually trust the government among those who voted for Trump than among those who voted for Biden or other candidates. Now, what if we want to introduce color to better differentiate the three groups? We can add fill under aesthetics, indicating that we want to use distinct values of VotedPres2020_selection to color the bars. In this instance, Biden and Trump will be displayed in different colors.

pcolor <- anes_des_der %>%
  ggplot(aes(x = VotedPres2020_selection,
             y = pct_trust,
             fill = VotedPres2020_selection)) +
  geom_bar(stat = "identity")

pcolor

FIGURE 8.2: Bar chart of trust in government by chosen 2020 presidential candidate with colors

Let's say we wanted to follow proper statistical analysis practice and incorporate variability in our plot. We can add another geom, geom_errorbar(), to display the confidence intervals on top of our existing geom_bar() layer. We add the layer using a plus sign +.

pcol_error <- anes_des_der %>%
  ggplot(aes(x = VotedPres2020_selection,
             y = pct_trust,
             fill = VotedPres2020_selection)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = pct_trust_low, ymax = pct_trust_upp),
                width = .2)

pcol_error

FIGURE 8.3: Bar chart of trust in government by chosen 2020 presidential candidate with colors and error bars

We can continue adding to our plot until we achieve our desired look. For example, we can eliminate the color legend, as it doesn't contribute meaningful information, with guides(fill = "none"). We can specify colors for fill using scale_fill_manual(). Inside the function, we provide a vector of values corresponding to the colors in our plot. These values are hexadecimal (hex) color codes, denoted by a leading pound sign # followed by six letters or numbers. The hex code #0b3954 used below is a dark blue. There are many tools online that help pick hex codes, such as htmlcolorcodes.com/.
pfull <- anes_des_der %>%
  ggplot(aes(x = VotedPres2020_selection,
             y = pct_trust,
             fill = VotedPres2020_selection)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = pct_trust_low, ymax = pct_trust_upp),
                width = .2) +
  scale_fill_manual(values = c("#0b3954", "#bfd7ea", "#8d6b94")) +
  xlab("Election choice (2020)") +
  ylab("Usually trust the government") +
  scale_y_continuous(labels = scales::percent) +
  guides(fill = "none") +
  labs(title = "Percent of voters who usually trust the government by chosen 2020 presidential candidate",
       caption = "Source: American National Election Studies, 2020")

pfull

FIGURE 8.4: Bar chart of trust in government by chosen 2020 presidential candidate with colors, labels, error bars, and title

We have explored just the foundational aspects of {ggplot2} in this section; the capabilities of this package extend far beyond what we have covered. Advanced features such as annotation, faceting, and theming allow for more sophisticated and customized visualizations. The book ggplot2: Elegant Graphics for Data Analysis (Wickham 2023) is a comprehensive guide to learning more about this powerful tool.

References

Wickham, Hadley. 2023. ggplot2: Elegant Graphics for Data Analysis. 3rd edition. Springer. https://ggplot2-book.org/.

Chapter 9 Reproducible research

9.1 Introduction

Reproducing a data analysis's results is a crucial aspect of any research. First, reproducibility serves as a form of quality assurance. If we pass an analysis project to another person, they should be able to run the entire project from start to finish and obtain the same results. They can critically assess the methodology and code while detecting potential errors. Another goal of reproducibility is enabling the verification of our analysis. When someone else is able to check our results, it ensures the integrity of the analyses by determining that the conclusions are not dependent on a particular person running the code or workflow on a particular day or in a particular environment.

Not only is reproducibility a key component in ethical and accurate research, but it is also a requirement for many scientific journals. For example, the Journal of Survey Statistics and Methodology (JSSAM) and Public Opinion Quarterly (POQ) require authors to make code, data, and methodology transparent and accessible to other researchers who wish to verify or build on existing work. Reproducible research requires that the key components of analysis are available, discoverable, documented, and shared with others.
The four main components that we should consider are:

- Code: the source code used for data cleaning, analysis, modeling, and reporting
- Data: the raw data used in the workflow, or, if the data are sensitive or proprietary, as much data as possible that would allow others to run our workflow (e.g., access to a restricted use file (RUF))
- Environment: the environment of the project, including the R version, packages, operating system, and other dependencies used in the analysis
- Methodology: the analysis methodology, including the rationale behind decisions, interpretations, and assumptions

In Chapter 8, we briefly mention how each of these is important to include in the methodology report and when communicating the findings of a study. However, to be transparent and effective researchers, we need to ensure we not only discuss these through text but also provide files and additional information when requested. Often, when starting a project, analysts will dive into the data and make decisions as they go without full documentation, which can be challenging if we need to go back and make changes or even understand what we did a few months ago. It benefits other analysts and potentially our future selves to document everything from the start. The good news is that many tools, practices, and project management techniques make survey analysis projects easy to reproduce. For best results, analysts should decide which techniques and tools will be used before starting a project (or very early on). This chapter covers some of our suggestions for tools and techniques we can use in projects. This list is not comprehensive but aims to provide a starting point for those looking to create a reproducible workflow.

9.2 Project-based workflows

We recommend a project-based workflow for analysis projects, as described by Wickham, Çetinkaya-Rundel, and Grolemund (2023). A project-based workflow maintains a "source of truth" for our analyses. It helps with file system discipline by putting everything related to a project in a designated folder. Since all associated files are in a single location, they are easy to find and organize. When we reopen the project, we can recreate the environment in which we originally ran the code to reproduce our results.

The RStudio IDE has built-in support for projects. When we create a project in RStudio, it creates a .Rproj file that stores settings specific to that project. Once we have created a project, we can create folders that help us organize our workflow. For example, a project directory could look like this:

| anes_analysis/
|   anes_analysis.Rproj
|   README.md
|   codebooks
|     codebook2020.pdf
|     codebook2016.pdf
|   rawdata
|     anes2020_raw.csv
|     anes2016_raw.csv
|   scripts
|     data-prep.R
|   data
|     anes2020_clean.csv
|     anes2016_clean.csv
|   report
|     anes_report.Rmd
|     anes_report.html
|     anes_report.pdf

In a project-based workflow, all paths are relative and, by default, relative to the project's folder. By using relative paths, others can open and run our files even if their directory configuration differs from ours. The {here} package enables easy file referencing, and we can use the here::here() function to build the path for loading or saving data. Below, we ask R to read the CSV file anes2020_clean.csv in the project directory's data folder:

anes <- read_csv(here::here("data", "anes2020_clean.csv"))

The combination of projects and the {here} package keeps all associated files organized. This workflow makes it more likely that our analyses can be reproduced by us or our colleagues.

9.3 Functions and packages

We may find ourselves repeating the same code across our scripts, and the chance of errors increases whenever we copy and paste our code. By creating a function, we build a consistent set of commands that reduces the likelihood of mistakes. Functions also organize our code, improve its readability, and allow others to execute the same commands. Throughout this book, we have created functions, such as in Chapter 13, to run sequences of rename, filter, group_by, and summarize statements across different variables. The function helps us avoid overlooking necessary steps. A small sketch of this pattern is shown below.
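To illustrate, here is a minimal sketch (not from the book) of a helper function that estimates the weighted distribution of any categorical variable in a survey design object; the function name calc_prop and its usage are hypothetical.

library(srvyr)

# calc_prop() is a hypothetical helper: it runs the same group_by/summarize
# sequence for whichever variable is passed in
calc_prop <- function(design, var) {
  design %>%
    group_by({{ var }}) %>%
    summarize(p = survey_prop())
}

# usage, assuming a design object like anes_des from earlier chapters:
# calc_prop(anes_des, TrustGovernment)
# calc_prop(anes_des, VotedPres2020_selection)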
A package is made up of a collection of functions. If we find ourselves sharing functions with others to replicate the same series of commands in a separate project, creating a package can be a useful tool for sharing the code along with data and documentation.

9.4 Version control with Git

Often, a survey analysis project produces a lot of code. Keeping track of the latest version can become challenging as files evolve throughout a project. If a team of analysts is working on the same script, someone may use an outdated version, resulting in incorrect results or redundant work. Version control systems like Git can help alleviate these pains. Git is a system that helps track changes in computer files. Analysts can use Git to follow code evolution and manage asynchronous work. With Git, it is easy to see any changes made in a script, revert changes, and resolve differences between code versions (called conflicts).

Services such as GitHub or GitLab provide hosting and sharing of files as well as version control with Git. For example, we can visit the GitHub repository for this book (https://github.com/tidy-survey-r/tidy-survey-book) and see the files that build the book, when they were committed to the repository, and the history of modifications over time.

In addition to code scripts, platforms like GitHub can store data and documentation. They provide a way to maintain a history of data modifications through versioning and timestamps. By saving the data and documentation alongside the code, it becomes easier for others to refer to and access everything they need in one place. Using version control in analysis projects makes collaboration and maintenance more manageable. For connecting Git with R, we recommend the book Happy Git and GitHub for the useR (Bryan and Hester 2023).

9.5 Package management with {renv}

Ensuring reproducibility involves not only using version control of code, but also managing the versions of packages. If two people run the same code but use different versions of a package, the results might differ because of changes in those packages. For example, this book currently uses a version of the {srvyr} package from GitHub and not from CRAN. This is because the version of {srvyr} on CRAN has some bugs (errors) that result in incorrect calculations. The version on GitHub has corrected these errors, so we have asked readers to install the GitHub version to obtain the same results.

One way to handle different package versions is with the {renv} package. This package allows researchers to set the versions for each package used and manage package dependencies. Specifically, {renv} creates isolated, project-specific environments that record the packages and their versions used in the code. When initiated by a new user, {renv} checks whether the installed packages are consistent with the recorded versions for the project. If not, it installs the appropriate versions so that others can replicate the project's environment to rerun the code and obtain consistent results. A typical workflow is sketched below.
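A minimal sketch of the core {renv} commands; these are real {renv} functions, though the sequence shown is a typical rather than required workflow.

# run once when setting up the project
renv::init()      # creates a project-specific library and an renv.lock lockfile

# run whenever package versions change
renv::snapshot()  # records the current package versions in renv.lock

# run by collaborators (or future us) to recreate the environment
renv::restore()   # installs the package versions recorded in renv.lock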
9.6 R environments with Docker

Just as different versions of packages can introduce discrepancies or compatibility issues, the version of R itself can also prevent reproducibility. Tools such as Docker can help with this potential issue by creating isolated environments that define the version of R being used, along with other dependencies and configurations. The entire environment is bundled in a container. The container, defined by a Dockerfile, can be shared so anybody, regardless of their local setup, can run the R code in the same environment.

9.7 Workflow management with {targets}

With complex studies involving multiple code files and dependencies, it is important to ensure each step is executed in the intended sequence. We can do this manually, e.g., by numbering files to indicate the order or providing detailed documentation on the order. Alternatively, we can automate the process so the code flows sequentially. Making sure that the code runs in the correct order helps ensure that the research is reproducible. Anyone should be able to pick up the set of scripts and get the same results by following the workflow.

The {targets} package has become a popular workflow manager that documents, automates, and executes complex data workflows with multiple steps and dependencies. With this package, we first define the order of execution for our code, and then it consistently executes the code in that order each time it is run. One beneficial feature of {targets} is that if we change code later in the workflow, only the affected code and its downstream targets (i.e., the subsequent code files) are re-executed. The {targets} package also provides interactive progress monitoring and reporting, allowing us to track the status and progress of our analysis pipeline. A sketch of a pipeline file is shown below.
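A minimal sketch (not from the book) of a _targets.R pipeline file; the file path and the clean_anes() and summarize_trust() helper functions are hypothetical.

library(targets)

# packages made available to every target
tar_option_set(packages = c("tidyverse", "srvyr"))

# the pipeline: each target depends on the targets it references
list(
  tar_target(raw_file, "rawdata/anes2020_raw.csv", format = "file"),
  tar_target(raw_data, readr::read_csv(raw_file)),
  tar_target(clean_data, clean_anes(raw_data)),       # hypothetical helper
  tar_target(estimates, summarize_trust(clean_data))  # hypothetical helper
)

Running tar_make() then executes the targets in dependency order, skipping any targets that are already up to date.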
9.8 Documentation with Quarto and R Markdown

Tools like Quarto and R Markdown aid in reproducibility by creating documents that weave together code, text, and results. We can present analysis results alongside the report's narrative, so there's no need to copy and paste code output into the final documentation. By eliminating manual steps, we can reduce the chances of errors in the final output. Quarto and R Markdown documents also allow users to re-execute the underlying code when needed. Another analyst can see the steps we took, follow the scripts, and recreate the report. We can include details about our work in one place thanks to the combination of text and code, making our work transparent and easier to verify.

9.8.1 Parameterization

Another useful feature of Quarto and R Markdown is the ability to reduce repetitive code by parameterizing the files. Parameters can control various aspects of the analysis, such as dates, geography, or other analysis variables. We can define and modify these parameters to explore different scenarios or inputs. For example, suppose we start by creating a document that provides survey analysis results for North Carolina but then later decide we want to look at another state. In that case, we can define a state parameter and rerun the same analysis for a state like Washington without having to edit the code throughout the document. Parameters can be defined in the header or code chunks of our Quarto or R Markdown documents and can easily be modified and documented. Parameterization reduces the errors that may occur from manually editing code throughout the script and offers a flexible way for others to replicate the analysis and explore variations.

9.9 Other tips for reproducibility

9.9.1 Random number seeds

Some tasks in survey analysis require randomness, such as imputation, model training, or creating random samples. By default, the random numbers generated by R change each time we rerun the code, making it difficult to reproduce the same results. By "setting the seed," we can control the randomness and ensure that the random numbers remain consistent whenever we rerun the code. Others can use the same seed value to reproduce our random numbers and achieve the same results.

In R, we can use the set.seed() function to control the randomness in our code. Set a seed value by providing an integer to the function:

set.seed(999)
runif(5)

The runif() function generates five random numbers from a uniform distribution. Since the seed is set to 999, running runif() multiple times will always produce the same sequence:

[1] 0.38907138 0.58306072 0.09466569 0.85263123 0.78674676

The choice of the seed number is up to the analyst. For example, this could be the date (20240102) or time of day (1056) when the analysis was first conducted, a phone number (8675309), or the first few numbers that come to mind (369). As long as the seed is set for a given analysis, the actual number is up to the analyst to decide. It is important to note that set.seed() should be used before random number generation. Run it once per program, and the seed will be applied to the entire script. We recommend setting the seed at the beginning of a script, where libraries are loaded.

9.9.2 Descriptive names and labels

Using descriptive variable names or labeling data can also assist with reproducible research. For example, in the ANES data, the variable names in the raw data all start with V20 and are a string of numbers. To make things easier to reproduce, we opted to change the variable names to be more descriptive of what they contained (e.g., Age). This can also be done with the data values themselves. One way to accomplish this is by creating factors for categorical data, which can ensure that we know that a value of 1 really means Female, for example (sketched below). There are other ways of handling this, such as attaching labels to the data instead of recoding variables to be descriptive (see Chapter 11). As with random number seeds, the exact method is up to the analyst, but providing this information can help ensure our research is reproducible.
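A minimal sketch (not from the book) of recoding numeric codes into a labeled factor; the variable and the coding are hypothetical.

# hypothetical raw values where 1 = Female and 2 = Male
gender_code <- c(1, 2, 2, 1, 1)

# a factor makes the meaning of each code explicit in the data itself
Gender <- factor(gender_code, levels = c(1, 2), labels = c("Female", "Male"))
Gender
## [1] Female Male   Male   Female Female
## Levels: Female Male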
"],["c10-specifying-sample-designs.html", "Chapter 10 Specifying sample designs and replicate weights in {srvyr} 10.1 Introduction 10.2 Common sampling designs 10.3 Combining sampling methods 10.4 Replicate weights 10.5 Exercises", " Chapter 10 Specifying sample designs and replicate weights in {srvyr} Prerequisites For this chapter, load the following packages: library(tidyverse) library(survey) library(srvyr) library(srvyrexploR) To help explain the different types of sample designs, this chapter will use the api and scd data that are included in the {survey} package: data(api) data(scd) This chapter uses data from the Residential Energy Consumption Survey (RECS) - both 2015 and 2020, so we will use the following code to load the RECS data from the {srvyr.data} package: data(recs_2015) data(recs_2020) 10.1 Introduction The primary reason for using packages like {survey} and {srvyr} is to account for the sampling design or replicate weights into estimates. By incorporating the sampling design or replicate weights, precision estimates (e.g., standard errors and confidence intervals) are appropriately calculated. In this chapter, we will introduce common sampling designs and common types of replicate weights, the mathematical methods for calculating estimates and standard errors for a given sampling design, and the R syntax to specify the sampling design or replicate weights. While we will show the math behind the estimates, the functions in these packages will do the calculation. To deeply understand the math and the derivation, refer to Penn State (2019), Särndal, Swensson, and Wretman (2003), Wolter (2007), or Fuller (2011) (these are listed in order of increasing statistical rigorousness). The general process for estimation in the {srvyr} package is to: Create a tbl_svy object (a survey object) using: as_survey_design() or as_survey_rep() Subset data (if needed) using filter() (subpopulations) Specify domains of analysis using group_by() Within summarize(), specify variables to calculate, including means, totals, proportions, quantiles, and more This chapter includes details on the first step - creating the survey object. Once this survey object is created, it can be used in the other steps (detailed in chapters 5 through 7) to account for the complex survey design. 10.2 Common sampling designs A sampling design is the method used to draw a sample. Both logistical and statistical elements are considered when developing a sampling design. When specifying a sampling design in R, the levels of sampling are specified along with the weights. The weight for each record is constructed so that the particular record represents that many units in the population. For example, in a survey of 6th-grade students in the United States, the weight associated with each responding student reflects how many 6th grade students across the country that record represents. Generally, the weights represent the inverse of the probability of selection such that the sum of the weights corresponds to the total population size, although some studies may have the sum of the weights equal to the number of respondent records. 
Some common terminology across the designs:

- sample size, generally denoted as \(n\): the number of units selected to be sampled
- population size, generally denoted as \(N\): the number of units in the target population
- sampling frame: the list of units from which the sample is drawn (see Chapter 2 for more information)

10.2.1 Simple random sample without replacement

The simple random sample (SRS) without replacement is a sampling design where a fixed sample size is selected from a sampling frame, and every possible subsample has an equal probability of selection. "Without replacement" refers to the fact that once a sampling unit has been selected, it is removed from the sampling frame and cannot be selected again.

- Requirements: The sampling frame must include the entire population.
- Advantages: SRS requires no information about the units apart from contact information.
- Disadvantages: The sampling frame may not be available for the entire population.
- Example: Randomly select students in a university from a roster provided by the registrar's office.

The math

The estimate for the population mean of variable \(y\) is:

\[\bar{y}=\frac{1}{n}\sum_{i=1}^n y_i\]

where \(\bar{y}\) represents the sample mean, \(n\) is the total number of respondents (or observations), and \(y_i\) is each individual value of \(y\). The estimate of the standard error of the mean is:

\[se(\bar{y})=\sqrt{\frac{s^2}{n}\left( 1-\frac{n}{N} \right)}\]

where

\[s^2=\frac{1}{n-1}\sum_{i=1}^n\left(y_i-\bar{y}\right)^2\]

and \(N\) is the population size. This standard error estimate might look very similar to equations in other applications except for the factor on the right side of the equation: \(1-\frac{n}{N}\). This is called the finite population correction (FPC) factor. If the size of the frame, \(N\), is very large in comparison to the sample, the FPC is negligible, so it is often ignored. A common guideline is that if the sample is less than 10% of the population, the FPC is negligible.

To estimate proportions, we define \(x_i\) as an indicator of whether the outcome is observed. That is, \(x_i=1\) if the outcome is observed, and \(x_i=0\) if the outcome is not observed for respondent \(i\). Then the estimated proportion from an SRS design is:

\[\hat{p}=\frac{1}{n}\sum_{i=1}^n x_i\]

and the estimated standard error of the proportion is:

\[se(\hat{p})=\sqrt{\frac{\hat{p}(1-\hat{p})}{n-1}\left(1-\frac{n}{N}\right)}\]

The syntax

If a sample was drawn through SRS and had no nonresponse or other weighting adjustments, in R, we specify this design as:

srs1_des <- dat %>%
  as_survey_design(fpc = fpcvar)

where dat is a tibble or data.frame with the survey data, and fpcvar is a variable in the data indicating the sampling frame's size (this variable will have the same value for all cases in an SRS design). If the frame is very large, sometimes the frame size is not provided. In that case, the FPC is not needed, and we specify the design as:

srs2_des <- dat %>%
  as_survey_design()

If some post-survey adjustments were implemented and the weights are not all equal, we specify the design as:

srs3_des <- dat %>%
  as_survey_design(weights = wtvar, fpc = fpcvar)

where wtvar is a variable in the data indicating the weight for each case. Again, the FPC can be omitted if it is unnecessary because the frame is large compared to the sample size. A small numeric sketch of the formulas above follows.
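To connect the formulas to code, here is a minimal sketch (not from the book) that computes the SRS mean and its standard error by hand; the outcome vector y and frame size N are hypothetical.

# hypothetical outcome values from an SRS of n = 8 units
y <- c(4, 7, 2, 9, 5, 6, 3, 8)
N <- 100                               # hypothetical frame size
n <- length(y)

ybar <- mean(y)                        # sample mean
s2 <- var(y)                           # sum((y - ybar)^2) / (n - 1)
se_ybar <- sqrt(s2 / n * (1 - n / N))  # the FPC shrinks the standard error

c(mean = ybar, se = se_ybar)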
Example

The {survey} package in R provides some example datasets that we will use throughout this chapter. The documentation provides detailed information about the variables. One of the example datasets we will use is from the Academic Performance Index (API). The API was a program administered by the California Department of Education, and the {survey} package includes a population file (sampling frame) of all schools with at least 100 students and several different samples pulled from that data using different sampling methods. For this first example, we use the apisrs dataset, which contains an SRS of 200 schools. For printing purposes, we create a new dataset called apisrs_slim, which sorts the data by school district and school ID and subsets the data to only a few columns. The SRS sample data are illustrated below:

apisrs_slim <- apisrs %>%
  as_tibble() %>%
  arrange(dnum, snum) %>%
  select(cds, dnum, snum, dname, sname, fpc, pw)

apisrs_slim

## # A tibble: 200 × 7
##    cds             dnum  snum dname                   sname    fpc    pw
##    <chr>          <int> <dbl> <chr>                   <chr>  <dbl> <dbl>
##  1 19642126061220     1  1121 ABC Unified             Haske…  6194  31.0
##  2 19642126066716     1  1124 ABC Unified             Stowe…  6194  31.0
##  3 36675876035174     5  3895 Adelanto Elementary     Adela…  6194  31.0
##  4 33669776031512    19  3347 Alvord Unified          Arlan…  6194  31.0
##  5 33669776031595    19  3352 Alvord Unified          Wells…  6194  31.0
##  6 31667876031033    39  3271 Auburn Union Elementary Cain …  6194  31.0
##  7 19642876011407    42  1169 Baldwin Park Unified    Deanz…  6194  31.0
##  8 19642876011464    42  1175 Baldwin Park Unified    Heath…  6194  31.0
##  9 19642956011589    48  1187 Bassett Unified         Erwin…  6194  31.0
## 10 41688586043392    49  4948 Bayshore Elementary     Baysh…  6194  31.0
## # ℹ 190 more rows

Table 10.1 provides details on all the variables in this dataset.

TABLE 10.1: Overview of Variables in api Data

Variable Name   Description
cds             Unique identifier for each school
dnum            School district identifier within county
snum            School identifier within district
dname           District Name
sname           School Name
fpc             Finite population correction factor (FPC)
pw              Weight

To create the tbl_svy object for this SRS data, the design should be specified as follows:

apisrs_des <- apisrs_slim %>%
  as_survey_design(weights = pw, fpc = fpc)

apisrs_des

## Independent Sampling design
## Called via srvyr
## Sampling variables:
##  - ids: `1`
##  - fpc: fpc
##  - weights: pw
## Data variables:
##  - cds (chr), dnum (int), snum (dbl), dname (chr), sname (chr), fpc
##    (dbl), pw (dbl)

In the printed design object above, the design is described as an "Independent Sampling design," which is another term for SRS. The ids are specified as 1, which means there is no clustering (a topic described in Section 10.2.4), the FPC variable is indicated, and the weights are indicated. We can also look at the summary of the design object and see the distribution of the probabilities (the inverse of the weights) along with the population size and a list of the variables in the dataset.

summary(apisrs_des)

## Independent Sampling design
## Called via srvyr
## Probabilities:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0323  0.0323  0.0323  0.0323  0.0323  0.0323
## Population size (PSUs): 6194
## Data variables:
## [1] "cds"   "dnum"  "snum"  "dname" "sname" "fpc"   "pw"

10.2.2 Simple random sample with replacement

Similar to the SRS design, the simple random sample with replacement (SRSWR) design randomly selects the sample from the entire sampling frame. However, while SRS removes sampled units before selecting again, SRSWR instead replaces each sampled unit before drawing again, so units can be selected more than once.
- Requirements: The sampling frame must include the entire population.
- Advantages: SRSWR requires no information about the units apart from contact information.
- Disadvantages: The sampling frame may not be available for the entire population. Units can be selected more than once, resulting in a smaller realized sample size, because receiving duplicate information from a single respondent does not provide additional information. For small populations, SRSWR has larger standard errors than SRS designs.
- Example: A professor puts all students' names on paper slips and selects them randomly to ask students questions, but the professor replaces the paper after calling on the student so they can be selected again at any time.

In general, for surveys, using an SRS design (without replacement) is preferred, as we do not want respondents to answer a survey more than once.

The math

The estimate for the population mean of variable \(y\) is:

\[\bar{y}=\frac{1}{n}\sum_{i=1}^n y_i\]

and the estimate of the standard error of the mean is:

\[se(\bar{y})=\sqrt{\frac{s^2}{n}}\]

where

\[s^2=\frac{1}{n-1}\sum_{i=1}^n\left(y_i-\bar{y}\right)^2\]

To calculate the estimated proportion, we define \(x_i\) as the indicator that the outcome is observed (as we did with SRS):

\[\hat{p}=\frac{1}{n}\sum_{i=1}^n x_i\]

and the estimated standard error of the proportion is:

\[se(\hat{p})=\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]

The syntax

If we had a sample that was drawn through SRSWR and had no nonresponse or other weighting adjustments, in R, we should specify this design as:

srswr1_des <- dat %>%
  as_survey_design()

where dat is a tibble or data.frame containing our survey data. This syntax is the same as for an SRS design, except a finite population correction (FPC) is not included. This is because when we sample with replacement, the population pool to select from is no longer finite, so a correction is not needed. Therefore, for large populations where the FPC is negligible, the underlying formulas for SRS and SRSWR designs are the same.

If some post-survey adjustments were implemented and the weights are not all equal, we specify the design as:

srswr2_des <- dat %>%
  as_survey_design(weights = wtvar)

where wtvar is the variable for the weight in the data.

Example

The {survey} package does not include an example of SRSWR, so to illustrate this design, we need to create an example. We use the api population data provided by the {survey} package, apipop, and select a sample of 200 cases using the slice_sample() function from the tidyverse. One of the arguments in the slice_sample() function is replace. If replace = TRUE, then we are conducting an SRSWR. We then calculate selection weights as the inverse of the probability of selection and call this new dataset apisrswr.
set.seed(409963)

apisrswr <- apipop %>%
  as_tibble() %>%
  slice_sample(n = 200, replace = TRUE) %>%
  select(cds, dnum, snum, dname, sname) %>%
  mutate(weight = nrow(apipop) / 200)

head(apisrswr)

## # A tibble: 6 × 6
##   cds             dnum  snum dname                    sname       weight
##   <chr>          <int> <dbl> <chr>                    <chr>        <dbl>
## 1 43696416060065   533  5348 Palo Alto Unified        Jordan (Da…   31.0
## 2 07618046005060   650   509 San Ramon Valley Unified Alamo Elem…   31.0
## 3 19648086085674   457  2134 Montebello Unified       La Merced …   31.0
## 4 07617056003719   346   377 Knightsen Elementary     Knightsen …   31.0
## 5 19650606023022   744  2351 Torrance Unified         Carr (Evel…   31.0
## 6 01611196090120     6    13 Alameda City Unified     Paden (Wil…   31.0

Because this is an SRS design with replacement, there may be duplicates in the data. It is important to keep the duplicates in the data for proper estimation, but for reference, we can view the duplicates in the example data we just created.

apisrswr %>%
  group_by(cds) %>%
  filter(n() > 1) %>%
  arrange(cds)

## # A tibble: 4 × 6
## # Groups:   cds [2]
##   cds             dnum  snum dname                 sname           weight
##   <chr>          <int> <dbl> <chr>                 <chr>            <dbl>
## 1 15633216008841    41   869 Bakersfield City Elem Chipman Junio…    31.0
## 2 15633216008841    41   869 Bakersfield City Elem Chipman Junio…    31.0
## 3 39686766042782   716  4880 Stockton City Unified Tyler Skills …    31.0
## 4 39686766042782   716  4880 Stockton City Unified Tyler Skills …    31.0

We created a weight variable in this example data, which is the inverse of the probability of selection. To specify the sampling design for apisrswr, the following syntax should be used:

apisrswr_des <- apisrswr %>%
  as_survey_design(weights = weight)

apisrswr_des

## Independent Sampling design (with replacement)
## Called via srvyr
## Sampling variables:
##  - ids: `1`
##  - weights: weight
## Data variables:
##  - cds (chr), dnum (int), snum (dbl), dname (chr), sname (chr), weight
##    (dbl)

summary(apisrswr_des)

## Independent Sampling design (with replacement)
## Called via srvyr
## Probabilities:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0323  0.0323  0.0323  0.0323  0.0323  0.0323
## Data variables:
## [1] "cds"    "dnum"   "snum"   "dname"  "sname"  "weight"

In the output above, the design object and the object summary are shown. Both note that the sampling is done "with replacement" because no FPC was specified. The probabilities, which are derived from the weights, are summarized in the summary output.

10.2.3 Stratified sampling

Stratified sampling occurs when a population is divided into mutually exclusive subpopulations (strata), and then samples are selected independently within each stratum.

- Requirements: The sampling frame must include the information needed to divide the population into groups for every unit.
- Advantages: This design ensures sample representation in all subpopulations. If the strata are correlated with survey outcomes, a stratified sample has smaller standard errors than an SRS sample of the same size, resulting in a more efficient design.
- Disadvantages: Auxiliary data may not exist to divide the sampling frame into groups, or the data may be outdated.
- Examples:
  - Example 1: A population of North Carolina residents could be separated (stratified) into urban and rural areas, and then an SRS of residents from both rural and urban areas is selected independently. This ensures there are residents from both areas in the sample.
  - Example 2: Law enforcement agencies could be separated (stratified) into the three primary general-purpose categories in the U.S.: local police, sheriff's departments, and state police.
An SRS of agencies from each of the three types is then selected independently to ensure all three types of agencies are represented.

The math

Let \(\bar{y}_h\) be the sample mean for stratum \(h\), \(N_h\) be the population size of stratum \(h\), and \(n_h\) be the sample size of stratum \(h\). Then the estimate for the population mean under stratified SRS sampling is:

\[\bar{y}=\frac{1}{N}\sum_{h=1}^H N_h\bar{y}_h\]

and the estimate of the standard error of \(\bar{y}\) is:

\[se(\bar{y})=\sqrt{\frac{1}{N^2} \sum_{h=1}^H N_h^2 \frac{s_h^2}{n_h}\left(1-\frac{n_h}{N_h}\right)}\]

where

\[s_h^2=\frac{1}{n_h-1}\sum_{i=1}^{n_h}\left(y_{i,h}-\bar{y}_h\right)^2\]

For estimates of proportions, let \(\hat{p}_h\) be the estimated proportion in stratum \(h\). Then the population proportion estimate is:

\[\hat{p}= \frac{1}{N}\sum_{h=1}^H N_h \hat{p}_h\]

where \(H\) is the total number of strata. The standard error of the proportion is:

\[se(\hat{p}) = \frac{1}{N} \sqrt{ \sum_{h=1}^H N_h^2 \frac{\hat{p}_h(1-\hat{p}_h)}{n_h-1} \left(1-\frac{n_h}{N_h}\right)}\]

The syntax

In addition to the fpc and weights arguments discussed in the designs above, stratified designs require the strata argument. For example, to specify a stratified SRS design in {srvyr} using the FPC, that is, where the population sizes of the strata are known and not too large, specify the design as:

stsrs1_des <- dat %>%
  as_survey_design(fpc = fpcvar, strata = stratvar)

where fpcvar is a variable in the data indicating \(N_h\) for each row, and stratvar is a variable indicating the stratum for each row. The FPC can be omitted if it is not applicable. Additionally, we can indicate the weight variable, where wtvar is a variable in the data with a numeric weight:

stsrs2_des <- dat %>%
  as_survey_design(weights = wtvar, strata = stratvar)

Example

In the example API data, apistrat is a stratified random sample, stratified by school type (stype) with three levels: E for elementary school, M for middle school, and H for high school. As with the SRS example above, we sort and select specific variables for use in printing. The data are illustrated below, including a count of the number of cases per stratum:

apistrat_slim <- apistrat %>%
  as_tibble() %>%
  arrange(dnum, snum) %>%
  select(cds, dnum, snum, dname, sname, stype, fpc, pw)

apistrat_slim %>%
  count(stype, fpc)

## # A tibble: 3 × 3
##   stype   fpc     n
##   <fct> <dbl> <int>
## 1 E      4421   100
## 2 H       755    50
## 3 M      1018    50

The FPC is the same for each case within each stratum. This output also shows that 100 elementary schools, 50 middle schools, and 50 high schools were sampled. It is common for the number of units sampled from each stratum to differ based on the goals of the project, or to mirror the size of each stratum in the population. This design should be specified as follows:

apistrat_des <- apistrat_slim %>%
  as_survey_design(strata = stype, weights = pw, fpc = fpc)

apistrat_des

## Stratified Independent Sampling design
## Called via srvyr
## Sampling variables:
##  - ids: `1`
##  - strata: stype
##  - fpc: fpc
##  - weights: pw
## Data variables:
##  - cds (chr), dnum (int), snum (dbl), dname (chr), sname (chr), stype
##    (fct), fpc (dbl), pw (dbl)

summary(apistrat_des)

## Stratified Independent Sampling design
## Called via srvyr
## Probabilities:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0226  0.0226  0.0359  0.0401  0.0534  0.0662
## Stratum Sizes:
##              E  H  M
## obs        100 50 50
## design.PSU 100 50 50
## actual.PSU 100 50 50
## Population stratum sizes (PSUs):
##    E    H    M
## 4421  755 1018
## Data variables:
## [1] "cds"   "dnum"  "snum"  "dname" "sname" "stype" "fpc"   "pw"

When printing the object, it is specified as a "Stratified Independent Sampling design," also known as a stratified SRS, and the strata variable is included. Printing the summary, we see a distribution of probabilities, as we saw with SRS, but we also see the sample and population sizes by stratum.

10.2.4 Clustered sampling

Clustered sampling occurs when a population is divided into mutually exclusive subgroups called clusters or primary sampling units (PSUs). A random selection of PSUs is sampled, and then another level of sampling is done within these clusters. There can be multiple levels of this selection. Clustered sampling is often used when a list of the entire population is not available or when data collection involves interviewers needing direct contact with respondents.

- Requirements: There must be a way to divide the population into clusters. Clusters are commonly structural, such as institutions (e.g., schools, prisons) or geography (e.g., states, counties).
- Advantages: Clustered sampling is advantageous when data collection is done in person, so interviewers are sent to specific sampled areas rather than completely at random across a country. With clustered sampling, a list of the entire population is not necessary. For example, if sampling students, we do not need a list of all students but only a list of all schools. Once the schools are sampled, lists of students can be obtained within the sampled schools.
- Disadvantages: Compared to a simple random sample of the same size, clustered samples generally have larger standard errors of estimates.
- Examples:
  - Example 1: Consider a study needing a sample of 6th-grade students in the United States; no list likely exists of all these students. However, it is more feasible to obtain a list of schools that have 6th graders, so a study design could select a random sample of schools that have 6th graders. The selected schools can then provide a list of students for a second stage of sampling, where 6th-grade students are randomly sampled within each of the sampled schools. This is a one-stage sample design (the "one" referring to the single level of clustering) and is the type of design we discuss in the formulas below.
  - Example 2: Consider a study sending interviewers to households for a survey. This is a more complicated example that requires two levels of clustering (a two-stage sample design) to efficiently use interviewers in geographic clusters. First, in the U.S., counties could be selected as the PSU, and then Census block groups within counties could be selected as the secondary sampling unit (SSU). Households could then be randomly sampled within the block groups. This type of design is popular for in-person surveys, as it reduces the travel necessary for interviewers.

The math

Consider a survey where a sample of \(a\) clusters is sampled from a population of \(A\) clusters via SRS. Units within each sampled cluster are sampled via SRS as well. Within each sampled cluster \(i\), there are \(B_i\) units, of which \(b_i\) units are sampled via SRS. Let \(\bar{y}_{i}\) be the sample mean of cluster \(i\).
Then, a ratio estimator of the population mean is:

\[\bar{y}=\frac{\sum_{i=1}^a B_i \bar{y}_{i}}{ \sum_{i=1}^a B_i}\]

Note this is a consistent but biased estimator. Often the population size is not known, so this is a method to estimate a mean without knowing the population size. The estimated standard error of the mean is:

\[se(\bar{y})= \frac{1}{\hat{N}}\sqrt{\left(1-\frac{a}{A}\right)\frac{s_a^2}{a} + \frac{A}{a} \sum_{i=1}^a \left(1-\frac{b_i}{B_i}\right) \frac{s_i^2}{b_i} }\]

where \(\hat{N}\) is the estimated population size, \(s_a^2\) is the between-cluster variance, and \(s_i^2\) is the within-cluster variance.

The formula for the between-cluster variance (\(s_a^2\)) is:

\[s_a^2=\frac{1}{a-1}\sum_{i=1}^a \left( \hat{y}_i - \frac{\sum_{i=1}^a \hat{y}_{i} }{a}\right)^2\]

where \(\hat{y}_i = B_i\bar{y}_i\). The formula for the within-cluster variance (\(s_i^2\)) is:

\[s_i^2=\frac{1}{a(b_i-1)} \sum_{j=1}^{b_i} \left(y_{ij}-\bar{y}_i\right)^2\]

where \(y_{ij}\) is the outcome for sampled unit \(j\) within cluster \(i\).

The syntax

Clustered sampling designs require the addition of the ids argument, which specifies which variables are the cluster levels. To specify a two-stage clustered design without replacement, use the following syntax:

clus2_des <- dat %>%
  as_survey_design(weights = wtvar,
                   ids = c(PSU, SSU),
                   fpc = c(A, B))

where PSU and SSU are the variables indicating the PSU and SSU identifiers, and A and B are the variables indicating the population sizes for each level (i.e., A is the number of clusters, and B is the number of units within each cluster). Note that A will be the same for all records (within a stratum), and B will be the same for all records within the same cluster.

If clusters were sampled with replacement or from a very large population, an FPC is unnecessary. Additionally, only the first stage of selection needs to be specified, regardless of whether the units were selected with replacement at any stage. The subsequent stages of selection are ignored in computation, as their contribution to the variance is overpowered by the first stage (see Särndal, Swensson, and Wretman (2003) or Wolter (2007) for a more in-depth discussion). Therefore, the syntax below will yield the same estimates:

clus2wra_des <- dat %>%
  as_survey_design(weights = wtvar,
                   ids = c(PSU, SSU))

clus2wrb_des <- dat %>%
  as_survey_design(weights = wtvar,
                   ids = PSU)

There is one additional argument that is sometimes necessary: nest = TRUE. This option relabels cluster IDs to enforce nesting within strata. Sometimes, for example, there may be a cluster 1 and a cluster 2 within each stratum, but these are actually different clusters. This option indicates that the repeated use of numbering does not mean they are the same cluster. If this option is not used and there are repeated cluster IDs across different strata, an error is generated.

Example

The {survey} package includes a two-stage cluster sample dataset, apiclus2, in which school districts were sampled, and then a random sample of five schools was selected within each district. For districts with fewer than five schools, all schools were sampled. School districts are identified by dnum, and schools are identified by snum. The variable fpc1 indicates how many districts there are in California (A), and fpc2 indicates how many schools were in a given district with at least 100 students (B). The data have a row for each school.
In the data printed below, there are 757 school districts, as indicated by fpc1, and there are nine schools in District 731, one school in District 742, two schools in District 768, and so on, as indicated by fpc2. For illustration purposes, the object apiclus2_slim has been created from apiclus2, which subsets the data to only the necessary columns and sorts the data.

apiclus2_slim <- apiclus2 %>%
  as_tibble() %>%
  arrange(desc(dnum), snum) %>%
  select(cds, dnum, snum, fpc1, fpc2, pw)

apiclus2_slim

## # A tibble: 126 × 6
##    cds             dnum  snum  fpc1      fpc2    pw
##    <chr>          <int> <dbl> <dbl> <int[1d]> <dbl>
##  1 47704826050942   795  5552   757         1  18.9
##  2 07618126005169   781   530   757         6  22.7
##  3 07618126005177   781   531   757         6  22.7
##  4 07618126005185   781   532   757         6  22.7
##  5 07618126005193   781   533   757         6  22.7
##  6 07618126005243   781   535   757         6  22.7
##  7 19650786023337   768  2371   757         2  18.9
##  8 19650786023345   768  2372   757         2  18.9
##  9 54722076054423   742  5898   757         1  18.9
## 10 50712906053086   731  5781   757         9  34.1
## # ℹ 116 more rows

To specify this design in R, the following syntax should be used:

apiclus2_des <- apiclus2_slim %>%
  as_survey_design(ids = c(dnum, snum),
                   fpc = c(fpc1, fpc2),
                   weights = pw)

apiclus2_des

## 2 - level Cluster Sampling design
## With (40, 126) clusters.
## Called via srvyr
## Sampling variables:
##  - ids: `dnum + snum`
##  - fpc: `fpc1 + fpc2`
##  - weights: pw
## Data variables:
##  - cds (chr), dnum (int), snum (dbl), fpc1 (dbl), fpc2 (int[1d]), pw
##    (dbl)

summary(apiclus2_des)

## 2 - level Cluster Sampling design
## With (40, 126) clusters.
## Called via srvyr
## Probabilities:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## 0.00367 0.03774 0.05284 0.04239 0.05284 0.05284
## Population size (PSUs): 757
## Data variables:
## [1] "cds"  "dnum" "snum" "fpc1" "fpc2" "pw"

The design objects are described as a "2 - level Cluster Sampling design" and include the ids (cluster), FPC, and weight variables. The summary notes that the sample includes 40 first-level clusters (PSUs), which are school districts, and 126 second-level clusters (SSUs), which are schools. Additionally, the summary includes a numeric summary of the probabilities of selection and the population size (number of PSUs), 757.

10.3 Combining sampling methods

SRS, stratified, and clustered designs are the backbone of sampling designs, and their features are often combined in one design. Additionally, rather than using SRS for selection, other sampling mechanisms are commonly used, such as probability proportional to size (PPS), systematic sampling, or selection with unequal probabilities, which are briefly described here. In PPS sampling, a size measure is constructed for each unit (e.g., the population of the PSU or the number of occupied housing units), and units with larger size measures are more likely to be sampled (see the sketch below). Systematic sampling is commonly used to ensure representation across a population: units are sorted by a feature, and then every \(k\)th unit is selected from a random start point so the sample is spread across the population. In addition to PPS, other unequal probabilities of selection may be used. For example, in a study of establishments (e.g., businesses or public institutions) that conducts a survey every year, an establishment that recently participated (e.g., participated last year) may have a reduced chance of selection in a subsequent round to reduce the burden on the establishment. To learn more about sampling designs, refer to Valliant, Dever, and Kreuter (2013), Cox et al. (2011), Cochran (1977), and Deming (1991).
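A minimal sketch (not from the book) of PPS-style selection using base R; the size measures are hypothetical. Note that sample() with the default replace = FALSE draws sequentially, so inclusion probabilities are only approximately proportional to size; production samples typically use specialized routines (e.g., those in the {sampling} package).

# hypothetical size measures (e.g., population) for 10 PSUs
psu_size <- c(120, 45, 300, 80, 60, 150, 25, 90, 200, 30)

set.seed(52)
# select 3 PSUs; PSUs with larger size measures are more likely to be drawn
sampled_psus <- sample(seq_along(psu_size), size = 3, prob = psu_size)
sampled_psus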
A common method of sampling is to stratify PSUs, select PSUs within each stratum using PPS selection, and then select units within the PSUs either with SRS or PPS. Reading survey documentation is an important first step in survey analysis to understand the design of the survey we are using and the variables necessary to specify the design. Good documentation highlights the variables necessary to specify the design. This is often found in user's guides, methodology reports, analysis guides, or technical documentation (see Chapter 3 for more details).

Example

For example, the 2017-2019 National Survey of Family Growth (NSFG) had a stratified multi-stage area probability sample:

1. In the first stage, PSUs are counties or collections of counties and are stratified by Census region/division, size (population), and MSA status. Within each stratum, PSUs were selected via PPS.
2. In the second stage, neighborhoods were selected within the sampled PSUs using PPS selection.
3. In the third stage, housing units were selected within the sampled neighborhoods.
4. In the fourth stage, a person was randomly chosen within the selected housing units among eligible persons using unequal probabilities based on the person's age and sex.

The public use file does not include all these levels of selection and instead has pseudo-strata and pseudo-clusters, which are the variables used in R to specify the design. As specified on page 4 of the documentation, the stratum variable is SEST, the cluster variable is SECU, and the weight variable is WGT2017_2019. Thus, to specify this design in R, use the following syntax:

nsfg_des <- nsfgdata %>%
  as_survey_design(ids = SECU,
                   strata = SEST,
                   weights = WGT2017_2019)

10.4 Replicate weights

Replicate weights are often included in analysis files instead of, or in addition to, the design variables (strata and PSUs). Replicate weights are another method to estimate variability. Researchers often choose to use replicate weights to avoid publishing design variables (strata or clustering variables) as a measure to reduce the risk of disclosure. There are several types of replicate weights, including balanced repeated replication (BRR), Fay's BRR, jackknife, and bootstrap methods. An overview of the process for using replicate weights is as follows:

1. Divide the sample into subsample replicates that mirror the design of the sample
2. Calculate weights for each replicate using the same procedures as for the full-sample weight (i.e., nonresponse and post-stratification)
3. Calculate estimates for each replicate using the same method as the full-sample estimate
4. Calculate the estimated variance, which will be proportional to the variance of the replicate estimates

The different types of replicate weights largely differ in step 1 (how the sample is divided into subsamples) and step 4 (which multiplication factors, or scales, are used to multiply the variance). The general format for the standard error is:

\[se(\hat{\theta}) = \sqrt{\alpha \sum_{r=1}^R \alpha_r (\hat{\theta}_r - \hat{\theta})^2 }\]

where \(R\) is the number of replicates, \(\alpha\) is a constant that depends on the replication method, \(\alpha_r\) is a factor associated with each replicate, \(\hat{\theta}\) is the weighted estimate based on the full sample, and \(\hat{\theta}_r\) is the weighted estimate of \(\theta\) based on the \(r^{\text{th}}\) replicate. A small numeric illustration of this formula is sketched below.
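A minimal sketch (not from the book) that applies the generic replicate-variance formula to made-up numbers; the estimates, and the BRR choices alpha = 1/R and alpha_r = 1, are illustrative.

# hypothetical full-sample estimate and four replicate estimates
theta_hat <- 0.52
theta_r   <- c(0.50, 0.55, 0.49, 0.54)

R       <- length(theta_r)
alpha   <- 1 / R          # constant for the BRR method
alpha_r <- rep(1, R)      # per-replicate factors for BRR

se_theta <- sqrt(alpha * sum(alpha_r * (theta_r - theta_hat)^2))
se_theta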
To create the design object for surveys with replicate weights, we use as_survey_rep() instead of the as_survey_design() function that we used for the common sampling designs in the sections above.

10.4.1 Balanced Repeated Replication (BRR) method

The BRR method requires a stratified sample design with two PSUs in each stratum. Each replicate is constructed by deleting one PSU per stratum using a Hadamard matrix. For the PSU that is included, the weight is generally multiplied by two but may have other adjustments, such as post-stratification. A Hadamard matrix is a special square matrix with entries of +1 or -1 and mutually orthogonal rows. Hadamard matrices must have one row, two rows, or a multiple of four rows. The size of the Hadamard matrix is determined by the first multiple of 4 greater than or equal to the number of strata. For example, if a survey had 7 strata, the Hadamard matrix would be an \\(8\\times8\\) matrix. Additionally, a survey with 8 strata would also have an \\(8\\times8\\) Hadamard matrix. The columns in the matrix specify the strata and the rows specify the replicates. In each replicate (row), a +1 means to use the first PSU and a -1 means to use the second PSU in the estimate. For example, here is a \\(4\\times4\\) Hadamard matrix:

\\[ \\begin{array}{rrrr} +1 &+1 &+1 &+1\\\\ +1&-1&+1&-1\\\\ +1&+1&-1&-1\\\\ +1 &-1&-1&+1 \\end{array} \\]

In the first replicate (row), all the values are +1, so in each stratum, the first PSU would be used in the estimate. In the second replicate, the first PSU would be used in strata 1 and 3, while the second PSU would be used in strata 2 and 4. In the third replicate, the first PSU would be used in strata 1 and 2, while the second PSU would be used in strata 3 and 4. Finally, in the fourth replicate, the first PSU would be used in strata 1 and 4, while the second PSU would be used in strata 2 and 3. For more information about Hadamard matrices, see Wolter (2007). Note that BRR weights supplied by a data provider will already incorporate this adjustment, and the {survey} package generates the Hadamard matrix if necessary for calculating BRR weights, so an analyst will not need to provide the matrix.

The math

A weighted estimate for the full sample is calculated as \\(\\hat{\\theta}\\), and then a weighted estimate for each replicate is calculated as \\(\\hat{\\theta}_r\\) for \\(R\\) replicates. Using the generic notation above, \\(\\alpha=\\frac{1}{R}\\) and \\(\\alpha_r=1\\) for each \\(r\\). The standard error of the estimate is calculated as follows:

\\[se(\\hat{\\theta})=\\sqrt{\\frac{1}{R} \\sum_{r=1}^R \\left( \\hat{\\theta}_r-\\hat{\\theta}\\right)^2}\\]

Specifying replicate weights in R requires specifying the type of replicate weights, the main weight variable, the replicate weight variables, and other options. One of the key options is for the mean squared error (MSE). If mse = TRUE, variances are computed around the point estimate \\((\\hat{\\theta})\\), whereas if mse = FALSE, variances are computed around the mean of the replicates \\((\\bar{\\theta})\\) instead, which looks like this:

\\[se(\\hat{\\theta})=\\sqrt{\\frac{1}{R} \\sum_{r=1}^R \\left( \\hat{\\theta}_r-\\bar{\\theta}\\right)^2}\\]

where

\\[\\bar{\\theta}=\\frac{1}{R}\\sum_{r=1}^R \\hat{\\theta}_r\\]

The default option for mse is to use the global option "survey.replicates.mse", which is set to FALSE initially unless a user changes it. To determine if mse should be set to TRUE or FALSE, read the survey documentation.
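The difference between the two options can be seen with a few hypothetical replicate estimates (a toy sketch of the two formulas above, not package code):

theta_r   <- c(52.1, 48.7, 50.3, 49.9)  # hypothetical replicate estimates
theta_hat <- 50.2                       # hypothetical full-sample estimate
R <- length(theta_r)

# mse = TRUE: deviations measured from the full-sample estimate
sqrt(sum((theta_r - theta_hat)^2) / R)

# mse = FALSE: deviations measured from the mean of the replicates
sqrt(sum((theta_r - mean(theta_r))^2) / R)

The two results are usually close but not identical, which is why matching the setting used by the data provider matters for reproducing published standard errors.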
If there is no indication in the survey documentation, for BRR, we recommend setting mse to TRUE, as this is the default in other software (e.g., SAS, SUDAAN).

The syntax

Replicate weights generally come in groups and are sequentially numbered, such as PWGTP1, PWGTP2, …, PWGTP80 for the person weights in the American Community Survey (ACS) (U.S. Census Bureau 2021) or BRRWT1, BRRWT2, …, BRRWT96 in the 2015 Residential Energy Consumption Survey (RECS) (U.S. Energy Information Administration 2017). This makes it easy to use some of the tidy selection33 functions in R.

To specify a BRR design, we need to specify the weight variable (weights), the replicate weight variables (repweights), that the type of replicate weights is BRR (type = "BRR"), and whether the mean squared error should be used (mse = TRUE) or not (mse = FALSE). For example, if a dataset had WT0 for the main weight and had 20 BRR weights indicated as WT1, WT2, …, WT20, we can use the following syntax (both are equivalent):

brr_des <- dat %>%
  as_survey_rep(weights = WT0,
                repweights = all_of(str_c("WT", 1:20)),
                type = "BRR",
                mse = TRUE)

brr_des <- dat %>%
  as_survey_rep(weights = WT0,
                repweights = num_range("WT", 1:20),
                type = "BRR",
                mse = TRUE)

If a dataset had WT for the main weight and had 20 BRR weights indicated as REPWT1, REPWT2, …, REPWT20, the following syntax could be used (both are equivalent):

brr_des <- dat %>%
  as_survey_rep(weights = WT,
                repweights = all_of(str_c("REPWT", 1:20)),
                type = "BRR",
                mse = TRUE)

brr_des <- dat %>%
  as_survey_rep(weights = WT,
                repweights = starts_with("REPWT"),
                type = "BRR",
                mse = TRUE)

If the replicate weight variables are stored consecutively in the file, the following syntax can also be used:

brr_des <- dat %>%
  as_survey_rep(weights = WT,
                repweights = REPWT1:REPWT20,
                type = "BRR",
                mse = TRUE)

Typically, each replicate weight sums to a value similar to the main weight, as both the replicate weights and the main weight are supposed to provide population estimates. Rarely, an alternative method is used where the replicate weights have values of 0 or 2 in the case of BRR weights. This would be indicated in the documentation (see Chapter 3 for more information on how to understand the provided documentation). In this case, the replicate weights are not combined, and the option combined_weights = FALSE should be indicated, as the default value for this argument is TRUE. This specific syntax is shown below:

brr_des <- dat %>%
  as_survey_rep(weights = WT,
                repweights = starts_with("REPWT"),
                type = "BRR",
                combined_weights = FALSE,
                mse = TRUE)

Example

The {survey} package includes a data example from Section 12.2 of Levy and Lemeshow (2013). In this fictional data, two out of five ambulance stations were sampled from each of three emergency service areas (ESAs); thus, BRR weights are appropriate with 2 PSUs (stations) sampled in each stratum (ESA). In the code below, BRR weights are created as was done by Levy and Lemeshow (2013).
scdbrr <- scd %>%
  as_tibble() %>%
  mutate(wt = 5 / 2,
         rep1 = 2 * c(1, 0, 1, 0, 1, 0),
         rep2 = 2 * c(1, 0, 0, 1, 0, 1),
         rep3 = 2 * c(0, 1, 1, 0, 0, 1),
         rep4 = 2 * c(0, 1, 0, 1, 1, 0))

scdbrr

## # A tibble: 6 × 9
##     ESA ambulance arrests alive    wt  rep1  rep2  rep3  rep4
##   <int>     <int>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     1         1     120    25   2.5     2     2     0     0
## 2     1         2      78    24   2.5     0     0     2     2
## 3     2         1     185    30   2.5     2     0     2     0
## 4     2         2     228    49   2.5     0     2     0     2
## 5     3         1     670    80   2.5     2     0     0     2
## 6     3         2     530    70   2.5     0     2     2     0

To specify the BRR weights, the following syntax is used:

scdbrr_des <- scdbrr %>%
  as_survey_rep(type = "BRR",
                repweights = starts_with("rep"),
                combined_weights = FALSE,
                weight = wt)

scdbrr_des

## Call: Called via srvyr
## Balanced Repeated Replicates with 4 replicates.
## Sampling variables:
## - repweights: `rep1 + rep2 + rep3 + rep4`
## - weights: wt
## Data variables:
## - ESA (int), ambulance (int), arrests (dbl), alive (dbl), wt (dbl),
##   rep1 (dbl), rep2 (dbl), rep3 (dbl), rep4 (dbl)

summary(scdbrr_des)

## Call: Called via srvyr
## Balanced Repeated Replicates with 4 replicates.
## Sampling variables:
## - repweights: `rep1 + rep2 + rep3 + rep4`
## - weights: wt
## Data variables:
## - ESA (int), ambulance (int), arrests (dbl), alive (dbl), wt (dbl),
##   rep1 (dbl), rep2 (dbl), rep3 (dbl), rep4 (dbl)
## Variables:
## [1] "ESA"       "ambulance" "arrests"   "alive"     "wt"
## [6] "rep1"      "rep2"      "rep3"      "rep4"

Note that combined_weights was specified as FALSE because these weights are simply specified as 0 and 2 and do not incorporate the overall weight. When printing the object, the type of replication is noted as Balanced Repeated Replicates, and the replicate weights and the weight variable are specified. Additionally, the summary lists the variables included.

10.4.2 Fay's BRR method

Fay's BRR method for replicate weights is similar to the BRR method in that it uses a Hadamard matrix to construct the replicate weights. However, rather than deleting PSUs for each replicate, with Fay's BRR half of the PSUs have a replicate weight which is the main weight multiplied by \\(\\rho\\), and the other half have the main weight multiplied by \\((2-\\rho)\\), where \\(0 \\le \\rho < 1\\). Note that when \\(\\rho=0\\), this is equivalent to the standard BRR weights, and as \\(\\rho\\) becomes closer to 1, this method becomes more similar to the jackknife, discussed in the next section. To obtain the value of \\(\\rho\\), it is necessary to read the survey documentation (see Chapter 3).

The math

The standard error estimate for \\(\\hat{\\theta}\\) is slightly different than for BRR, due to the addition of the multiplier \\(\\rho\\). Using the generic notation above, \\(\\alpha=\\frac{1}{R \\left(1-\\rho\\right)^2}\\) and \\(\\alpha_r=1 \\text{ for all } r\\). The standard error is calculated as:

\\[se(\\hat{\\theta})=\\sqrt{\\frac{1}{R (1-\\rho)^2} \\sum_{r=1}^R \\left( \\hat{\\theta}_r-\\hat{\\theta}\\right)^2}\\]

The syntax

The syntax is very similar for BRR and Fay's BRR. To specify a Fay's BRR design, we need to specify the weight variable (weights), the replicate weight variables (repweights), that the type of replicate weights is Fay's BRR (type = "Fay"), whether the mean squared error should be used (mse = TRUE) or not (mse = FALSE), and Fay's multiplier (rho).
For example, if a dataset had WT0 for the main weight, 20 BRR weights indicated as WT1, WT2, …, WT20, and a Fay's multiplier of 0.3, use the following syntax:

fay_des <- dat %>%
  as_survey_rep(weights = WT0,
                repweights = num_range("WT", 1:20),
                type = "Fay",
                mse = TRUE,
                rho = 0.3)

Example

The 2015 RECS (U.S. Energy Information Administration 2017) uses Fay's BRR weights with the final weight as NWEIGHT and replicate weights as BRRWT1 - BRRWT96, and the documentation specifies a Fay's multiplier of 0.5. On the file, DOEID is a unique identifier for each respondent, TOTALDOL is the total cost of energy, TOTSQFT_EN is the total square footage of the residence, and REGIONC is the Census region. We have already pulled in the 2015 RECS data from the {srvyrexploR} package that provides data for this book. To specify the design for the recs_2015 data, use the following syntax:

recs_2015_des <- recs_2015 %>%
  as_survey_rep(weights = NWEIGHT,
                repweights = BRRWT1:BRRWT96,
                type = "Fay",
                rho = 0.5,
                mse = TRUE,
                variables = c(DOEID, TOTALDOL, TOTSQFT_EN, REGIONC))

recs_2015_des

## Call: Called via srvyr
## Fay's variance method (rho= 0.5 ) with 96 replicates and MSE variances.
## Sampling variables:
## - repweights: `BRRWT1 + BRRWT2 + BRRWT3 + BRRWT4 + BRRWT5 + BRRWT6 +
##   BRRWT7 + BRRWT8 + BRRWT9 + BRRWT10 + BRRWT11 + BRRWT12 + BRRWT13 +
##   BRRWT14 + BRRWT15 + BRRWT16 + BRRWT17 + BRRWT18 + BRRWT19 + BRRWT20
##   + BRRWT21 + BRRWT22 + BRRWT23 + BRRWT24 + BRRWT25 + BRRWT26 +
##   BRRWT27 + BRRWT28 + BRRWT29 + BRRWT30 + BRRWT31 + BRRWT32 + BRRWT33
##   + BRRWT34 + BRRWT35 + BRRWT36 + BRRWT37 + BRRWT38 + BRRWT39 +
##   BRRWT40 + BRRWT41 + BRRWT42 + BRRWT43 + BRRWT44 + BRRWT45 + BRRWT46
##   + BRRWT47 + BRRWT48 + BRRWT49 + BRRWT50 + BRRWT51 + BRRWT52 +
##   BRRWT53 + BRRWT54 + BRRWT55 + BRRWT56 + BRRWT57 + BRRWT58 + BRRWT59
##   + BRRWT60 + BRRWT61 + BRRWT62 + BRRWT63 + BRRWT64 + BRRWT65 +
##   BRRWT66 + BRRWT67 + BRRWT68 + BRRWT69 + BRRWT70 + BRRWT71 + BRRWT72
##   + BRRWT73 + BRRWT74 + BRRWT75 + BRRWT76 + BRRWT77 + BRRWT78 +
##   BRRWT79 + BRRWT80 + BRRWT81 + BRRWT82 + BRRWT83 + BRRWT84 + BRRWT85
##   + BRRWT86 + BRRWT87 + BRRWT88 + BRRWT89 + BRRWT90 + BRRWT91 +
##   BRRWT92 + BRRWT93 + BRRWT94 + BRRWT95 + BRRWT96`
## - weights: NWEIGHT
## Data variables:
## - DOEID (dbl), TOTALDOL (dbl), TOTSQFT_EN (dbl), REGIONC (dbl)

summary(recs_2015_des)

## Call: Called via srvyr
## Fay's variance method (rho= 0.5 ) with 96 replicates and MSE variances.
## Sampling variables:
## - repweights: `BRRWT1 + BRRWT2 + BRRWT3 + BRRWT4 + BRRWT5 + BRRWT6 +
##   BRRWT7 + BRRWT8 + BRRWT9 + BRRWT10 + BRRWT11 + BRRWT12 + BRRWT13 +
##   BRRWT14 + BRRWT15 + BRRWT16 + BRRWT17 + BRRWT18 + BRRWT19 + BRRWT20
##   + BRRWT21 + BRRWT22 + BRRWT23 + BRRWT24 + BRRWT25 + BRRWT26 +
##   BRRWT27 + BRRWT28 + BRRWT29 + BRRWT30 + BRRWT31 + BRRWT32 + BRRWT33
##   + BRRWT34 + BRRWT35 + BRRWT36 + BRRWT37 + BRRWT38 + BRRWT39 +
##   BRRWT40 + BRRWT41 + BRRWT42 + BRRWT43 + BRRWT44 + BRRWT45 + BRRWT46
##   + BRRWT47 + BRRWT48 + BRRWT49 + BRRWT50 + BRRWT51 + BRRWT52 +
##   BRRWT53 + BRRWT54 + BRRWT55 + BRRWT56 + BRRWT57 + BRRWT58 + BRRWT59
##   + BRRWT60 + BRRWT61 + BRRWT62 + BRRWT63 + BRRWT64 + BRRWT65 +
##   BRRWT66 + BRRWT67 + BRRWT68 + BRRWT69 + BRRWT70 + BRRWT71 + BRRWT72
##   + BRRWT73 + BRRWT74 + BRRWT75 + BRRWT76 + BRRWT77 + BRRWT78 +
##   BRRWT79 + BRRWT80 + BRRWT81 + BRRWT82 + BRRWT83 + BRRWT84 + BRRWT85
##   + BRRWT86 + BRRWT87 + BRRWT88 + BRRWT89 + BRRWT90 + BRRWT91 +
##   BRRWT92 + BRRWT93 + BRRWT94 + BRRWT95 + BRRWT96`
## - weights: NWEIGHT
## Data variables:
## - DOEID (dbl), TOTALDOL (dbl), TOTSQFT_EN (dbl), REGIONC (dbl)
## Variables:
## [1] "DOEID"      "TOTALDOL"   "TOTSQFT_EN" "REGIONC"

In specifying the design, the variables option was also used to specify which variables might be used in analyses. This is optional but can make our object smaller and easier to work with. When printing the design object or looking at the summary, the replicate weight type is re-iterated as Fay's variance method (rho= 0.5) with 96 replicates and MSE variances, and the variables are included. Unlike some of the other design objects we have seen, no weight or probability summary is included in this output.

10.4.3 Jackknife method

There are three jackknife estimators implemented in {srvyr}: jackknife 1 (JK1), jackknife n (JKn), and jackknife 2 (JK2). The JK1 method can be used for unstratified designs, and replicates are created by removing one PSU at a time, so the number of replicates is the same as the number of PSUs. If there is no clustering, then each sampled unit is its own PSU (the ultimate sampling unit). The JKn method is used for stratified designs and requires two or more PSUs per stratum. In this case, each replicate is created by deleting one PSU from a single stratum, so the number of replicates is the total number of PSUs across all strata. The JK2 method is a special case of JKn when exactly 2 PSUs are sampled per stratum. For variance estimation, scaling constants must also be specified.

The math

Using the generic notation above, \\(\\alpha=\\frac{R-1}{R}\\) and \\(\\alpha_r=1 \\text{ for all } r\\). For the JK1 method, the standard error estimate for \\(\\hat{\\theta}\\) is calculated as:

\\[se(\\hat{\\theta})=\\sqrt{\\frac{R-1}{R} \\sum_{r=1}^R \\left( \\hat{\\theta}_r-\\hat{\\theta}\\right)^2}\\]

The JKn method is a bit more complex, but the coefficients are generally provided with restricted and public-use files. For each replicate, one stratum has a PSU removed, and the weights are adjusted by \\(n_h/(n_h-1)\\), where \\(n_h\\) is the number of PSUs in stratum \\(h\\). The coefficients in other strata are set to 1.
Denote the coefficient that results from this process for replicate \\(r\\) as \\(\\alpha_r\\); then the standard error estimate for \\(\\hat{\\theta}\\) is calculated as:

\\[se(\\hat{\\theta})=\\sqrt{\\sum_{r=1}^R \\alpha_r \\left( \\hat{\\theta}_r-\\hat{\\theta}\\right)^2}\\]

The syntax

To specify the jackknife method, we use the survey documentation to understand the type of jackknife (1, n, or 2) and the multiplier. In the syntax, we need to specify the weight variable (weights), the replicate weight variables (repweights), the type of replicate weights as jackknife 1 (type = "JK1"), n (type = "JKN"), or 2 (type = "JK2"), whether the mean squared error should be used (mse = TRUE) or not (mse = FALSE), and the multiplier (scale). For example, if the survey is a jackknife 1 method with a multiplier of \\(\\alpha_r=(R-1)/R=19/20=0.95\\), and the dataset has WT0 for the main weight and 20 replicate weights indicated as WT1, WT2, …, WT20, use the following syntax:

jk1_des <- dat %>%
  as_survey_rep(weights = WT0,
                repweights = num_range("WT", 1:20),
                type = "JK1",
                mse = TRUE,
                scale = 0.95)

For a jackknife n method, we need to specify the multiplier for all replicates. In this case, we use the rscales argument to specify each one. The documentation will provide details on what the multipliers (\\(\\alpha_r\\)) are, and they may be the same for all replicates. For example, consider a case where \\(\\alpha_r=0.1\\) for all replicates and the dataset had WT0 for the main weight and 20 replicate weights indicated as WT1, WT2, …, WT20. We specify the type as type = "JKN" and the multipliers as rscales = rep(0.1, 20):

jkn_des <- dat %>%
  as_survey_rep(weights = WT0,
                repweights = num_range("WT", 1:20),
                type = "JKN",
                mse = TRUE,
                rscales = rep(0.1, 20))

Example

The 2020 RECS (U.S. Energy Information Administration 2023b) uses jackknife weights with the final weight as NWEIGHT and replicate weights as NWEIGHT1 - NWEIGHT60, with a scale of \\((R-1)/R=59/60\\). On the file, DOEID is a unique identifier for each respondent, TOTALDOL is the total cost of energy, TOTSQFT_EN is the total square footage of the residence, and REGIONC is the Census region. We have already read in the RECS data and created a dataset called recs_2020 above in the prerequisites. To specify this design, use the following syntax:

recs_des <- recs_2020 %>%
  as_survey_rep(
    weights = NWEIGHT,
    repweights = NWEIGHT1:NWEIGHT60,
    type = "JK1",
    scale = 59/60,
    mse = TRUE,
    variables = c(DOEID, TOTALDOL, TOTSQFT_EN, REGIONC)
  )

recs_des

## Call: Called via srvyr
## Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances.
## Sampling variables:
## - repweights: `NWEIGHT1 + NWEIGHT2 + NWEIGHT3 + NWEIGHT4 + NWEIGHT5 +
##   NWEIGHT6 + NWEIGHT7 + NWEIGHT8 + NWEIGHT9 + NWEIGHT10 + NWEIGHT11 +
##   NWEIGHT12 + NWEIGHT13 + NWEIGHT14 + NWEIGHT15 + NWEIGHT16 +
##   NWEIGHT17 + NWEIGHT18 + NWEIGHT19 + NWEIGHT20 + NWEIGHT21 +
##   NWEIGHT22 + NWEIGHT23 + NWEIGHT24 + NWEIGHT25 + NWEIGHT26 +
##   NWEIGHT27 + NWEIGHT28 + NWEIGHT29 + NWEIGHT30 + NWEIGHT31 +
##   NWEIGHT32 + NWEIGHT33 + NWEIGHT34 + NWEIGHT35 + NWEIGHT36 +
##   NWEIGHT37 + NWEIGHT38 + NWEIGHT39 + NWEIGHT40 + NWEIGHT41 +
##   NWEIGHT42 + NWEIGHT43 + NWEIGHT44 + NWEIGHT45 + NWEIGHT46 +
##   NWEIGHT47 + NWEIGHT48 + NWEIGHT49 + NWEIGHT50 + NWEIGHT51 +
##   NWEIGHT52 + NWEIGHT53 + NWEIGHT54 + NWEIGHT55 + NWEIGHT56 +
##   NWEIGHT57 + NWEIGHT58 + NWEIGHT59 + NWEIGHT60`
## - weights: NWEIGHT
## Data variables:
## - DOEID (dbl), TOTALDOL (dbl), TOTSQFT_EN (dbl), REGIONC (chr)

summary(recs_des)

## Call: Called via srvyr
## Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances.
## Sampling variables:
## - repweights: `NWEIGHT1 + NWEIGHT2 + NWEIGHT3 + NWEIGHT4 + NWEIGHT5 +
##   NWEIGHT6 + NWEIGHT7 + NWEIGHT8 + NWEIGHT9 + NWEIGHT10 + NWEIGHT11 +
##   NWEIGHT12 + NWEIGHT13 + NWEIGHT14 + NWEIGHT15 + NWEIGHT16 +
##   NWEIGHT17 + NWEIGHT18 + NWEIGHT19 + NWEIGHT20 + NWEIGHT21 +
##   NWEIGHT22 + NWEIGHT23 + NWEIGHT24 + NWEIGHT25 + NWEIGHT26 +
##   NWEIGHT27 + NWEIGHT28 + NWEIGHT29 + NWEIGHT30 + NWEIGHT31 +
##   NWEIGHT32 + NWEIGHT33 + NWEIGHT34 + NWEIGHT35 + NWEIGHT36 +
##   NWEIGHT37 + NWEIGHT38 + NWEIGHT39 + NWEIGHT40 + NWEIGHT41 +
##   NWEIGHT42 + NWEIGHT43 + NWEIGHT44 + NWEIGHT45 + NWEIGHT46 +
##   NWEIGHT47 + NWEIGHT48 + NWEIGHT49 + NWEIGHT50 + NWEIGHT51 +
##   NWEIGHT52 + NWEIGHT53 + NWEIGHT54 + NWEIGHT55 + NWEIGHT56 +
##   NWEIGHT57 + NWEIGHT58 + NWEIGHT59 + NWEIGHT60`
## - weights: NWEIGHT
## Data variables:
## - DOEID (dbl), TOTALDOL (dbl), TOTSQFT_EN (dbl), REGIONC (chr)
## Variables:
## [1] "DOEID"      "TOTALDOL"   "TOTSQFT_EN" "REGIONC"

When printing the design object or looking at the summary, the replicate weight type is re-iterated as Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances, and the variables are included. No weight or probability summary is included.

10.4.4 Bootstrap method

In bootstrap resampling, replicates are created by selecting random samples of the PSUs with replacement (SRSWR). If there are \\(M\\) PSUs in the sample, then each replicate is created by selecting a random sample of \\(M\\) PSUs with replacement. Each replicate is created independently, and the weights for each replicate are adjusted to reflect the population, generally using the same method as how the analysis weight was adjusted.

The math

A weighted estimate for the full sample is calculated as \\(\\hat{\\theta}\\), and then a weighted estimate for each replicate is calculated as \\(\\hat{\\theta}_r\\) for \\(R\\) replicates. The standard error of the estimate is then calculated as follows:

\\[se(\\hat{\\theta})=\\sqrt{\\alpha \\sum_{r=1}^R \\left( \\hat{\\theta}_r-\\hat{\\theta}\\right)^2}\\]

where \\(\\alpha\\) is the scaling constant. Note that the scaling constant (\\(\\alpha\\)) is provided in the survey documentation, as there are many types of bootstrap methods that generate custom scaling constants.
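To illustrate the resampling idea behind these replicates, here is a small sketch (ours, with hypothetical PSU identifiers) showing how a single bootstrap replicate draws PSUs with replacement; replicate weights then scale each PSU's weight by how often it was drawn, with further adjustments so that each replicate still represents the population:

set.seed(407)
psus <- c(61, 135, 178, 197, 255)   # hypothetical PSU identifiers
M <- length(psus)

# One bootstrap replicate: sample M PSUs with replacement
one_replicate <- sample(psus, size = M, replace = TRUE)
table(one_replicate)

PSUs drawn zero times get a replicate weight of 0 in that replicate, which is why many zeros appear in supplied bootstrap weight columns.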
The syntax

To specify a bootstrap method, we need to specify the weight variable (weights), the replicate weight variables (repweights), the type of replicate weights as bootstrap (type = "bootstrap"), whether the mean squared error should be used (mse = TRUE) or not (mse = FALSE), and the multiplier (scale). For example, if a dataset had WT0 for the main weight, 20 bootstrap weights indicated as WT1, WT2, …, WT20, and a multiplier of \\(\\alpha=.02\\), use the following syntax:

bs_des <- dat %>%
  as_survey_rep(weights = WT0,
                repweights = num_range("WT", 1:20),
                type = "bootstrap",
                mse = TRUE,
                scale = .02)

Example

Returning to the api example, we are going to create a dataset with bootstrap weights to use as an example. In this example, we construct a one-cluster design with fifty replicate weights.34

apiclus1_slim <-
  apiclus1 %>%
  as_tibble() %>%
  arrange(dnum) %>%
  select(cds, dnum, fpc, pw)

set.seed(662152)

apibw <-
  bootweights(psu = apiclus1_slim$dnum,
              strata = rep(1, nrow(apiclus1_slim)),
              fpc = apiclus1_slim$fpc,
              replicates = 50)

bwmata <-
  apibw$repweights$weights[apibw$repweights$index, ] * apiclus1_slim$pw

apiclus1_slim <-
  bwmata %>%
  as.data.frame() %>%
  set_names(str_c("pw", 1:50)) %>%
  cbind(apiclus1_slim) %>%
  as_tibble() %>%
  select(cds, dnum, fpc, pw, everything())

apiclus1_slim

## # A tibble: 183 × 54
##    cds        dnum   fpc    pw   pw1   pw2   pw3   pw4   pw5   pw6   pw7
##    <chr>     <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 43693776…    61   757  33.8  33.8     0     0  33.8     0  33.8     0
##  2 43693776…    61   757  33.8  33.8     0     0  33.8     0  33.8     0
##  3 43693776…    61   757  33.8  33.8     0     0  33.8     0  33.8     0
##  4 43693776…    61   757  33.8  33.8     0     0  33.8     0  33.8     0
##  5 43693776…    61   757  33.8  33.8     0     0  33.8     0  33.8     0
##  6 43693776…    61   757  33.8  33.8     0     0  33.8     0  33.8     0
##  7 43693776…    61   757  33.8  33.8     0     0  33.8     0  33.8     0
##  8 43693776…    61   757  33.8  33.8     0     0  33.8     0  33.8     0
##  9 43693776…    61   757  33.8  33.8     0     0  33.8     0  33.8     0
## 10 43693776…    61   757  33.8  33.8     0     0  33.8     0  33.8     0
## # ℹ 173 more rows
## # ℹ 43 more variables: pw8 <dbl>, pw9 <dbl>, pw10 <dbl>, pw11 <dbl>,
## #   pw12 <dbl>, pw13 <dbl>, pw14 <dbl>, pw15 <dbl>, pw16 <dbl>,
## #   pw17 <dbl>, pw18 <dbl>, pw19 <dbl>, pw20 <dbl>, pw21 <dbl>,
## #   pw22 <dbl>, pw23 <dbl>, pw24 <dbl>, pw25 <dbl>, pw26 <dbl>,
## #   pw27 <dbl>, pw28 <dbl>, pw29 <dbl>, pw30 <dbl>, pw31 <dbl>,
## #   pw32 <dbl>, pw33 <dbl>, pw34 <dbl>, pw35 <dbl>, pw36 <dbl>, …

The output of apiclus1_slim includes the same variables we have seen in other api examples (see Table 10.1), but now additionally includes the bootstrap weights pw1, …, pw50. When creating the survey design object, we use the bootstrap weights as the replicate weights. Additionally, with replicate weights, we need to include the scale (\\(\\alpha\\)). For the example we created,

\\[\\alpha=\\frac{M}{(M-1)(R-1)}=\\frac{15}{(15-1)*(50-1)}=0.02186589\\]

where \\(M\\) is the average number of PSUs per stratum and \\(R\\) is the number of replicates. There is only one stratum and the number of clusters/PSUs is 15, so \\(M=15\\).

api1_bs_des <- apiclus1_slim %>%
  as_survey_rep(weights = pw,
                repweights = pw1:pw50,
                type = "bootstrap",
                scale = 0.02186589,
                mse = TRUE)

api1_bs_des

## Call: Called via srvyr
## Survey bootstrap with 50 replicates and MSE variances.
## Sampling variables:
## - repweights: `pw1 + pw2 + pw3 + pw4 + pw5 + pw6 + pw7 + pw8 + pw9 +
##   pw10 + pw11 + pw12 + pw13 + pw14 + pw15 + pw16 + pw17 + pw18 + pw19
##   + pw20 + pw21 + pw22 + pw23 + pw24 + pw25 + pw26 + pw27 + pw28 +
##   pw29 + pw30 + pw31 + pw32 + pw33 + pw34 + pw35 + pw36 + pw37 + pw38
##   + pw39 + pw40 + pw41 + pw42 + pw43 + pw44 + pw45 + pw46 + pw47 +
##   pw48 + pw49 + pw50`
## - weights: pw
## Data variables:
## - cds (chr), dnum (int), fpc (dbl), pw (dbl), pw1 (dbl), pw2 (dbl),
##   pw3 (dbl), pw4 (dbl), pw5 (dbl), pw6 (dbl), pw7 (dbl), pw8 (dbl),
##   pw9 (dbl), pw10 (dbl), pw11 (dbl), pw12 (dbl), pw13 (dbl), pw14
##   (dbl), pw15 (dbl), pw16 (dbl), pw17 (dbl), pw18 (dbl), pw19 (dbl),
##   pw20 (dbl), pw21 (dbl), pw22 (dbl), pw23 (dbl), pw24 (dbl), pw25
##   (dbl), pw26 (dbl), pw27 (dbl), pw28 (dbl), pw29 (dbl), pw30 (dbl),
##   pw31 (dbl), pw32 (dbl), pw33 (dbl), pw34 (dbl), pw35 (dbl), pw36
##   (dbl), pw37 (dbl), pw38 (dbl), pw39 (dbl), pw40 (dbl), pw41 (dbl),
##   pw42 (dbl), pw43 (dbl), pw44 (dbl), pw45 (dbl), pw46 (dbl), pw47
##   (dbl), pw48 (dbl), pw49 (dbl), pw50 (dbl)

summary(api1_bs_des)

## Call: Called via srvyr
## Survey bootstrap with 50 replicates and MSE variances.
## Sampling variables:
## - repweights: `pw1 + pw2 + pw3 + pw4 + pw5 + pw6 + pw7 + pw8 + pw9 +
##   pw10 + pw11 + pw12 + pw13 + pw14 + pw15 + pw16 + pw17 + pw18 + pw19
##   + pw20 + pw21 + pw22 + pw23 + pw24 + pw25 + pw26 + pw27 + pw28 +
##   pw29 + pw30 + pw31 + pw32 + pw33 + pw34 + pw35 + pw36 + pw37 + pw38
##   + pw39 + pw40 + pw41 + pw42 + pw43 + pw44 + pw45 + pw46 + pw47 +
##   pw48 + pw49 + pw50`
## - weights: pw
## Data variables:
## - cds (chr), dnum (int), fpc (dbl), pw (dbl), pw1 (dbl), pw2 (dbl),
##   pw3 (dbl), pw4 (dbl), pw5 (dbl), pw6 (dbl), pw7 (dbl), pw8 (dbl),
##   pw9 (dbl), pw10 (dbl), pw11 (dbl), pw12 (dbl), pw13 (dbl), pw14
##   (dbl), pw15 (dbl), pw16 (dbl), pw17 (dbl), pw18 (dbl), pw19 (dbl),
##   pw20 (dbl), pw21 (dbl), pw22 (dbl), pw23 (dbl), pw24 (dbl), pw25
##   (dbl), pw26 (dbl), pw27 (dbl), pw28 (dbl), pw29 (dbl), pw30 (dbl),
##   pw31 (dbl), pw32 (dbl), pw33 (dbl), pw34 (dbl), pw35 (dbl), pw36
##   (dbl), pw37 (dbl), pw38 (dbl), pw39 (dbl), pw40 (dbl), pw41 (dbl),
##   pw42 (dbl), pw43 (dbl), pw44 (dbl), pw45 (dbl), pw46 (dbl), pw47
##   (dbl), pw48 (dbl), pw49 (dbl), pw50 (dbl)
## Variables:
##  [1] "cds"  "dnum" "fpc"  "pw"   "pw1"  "pw2"  "pw3"  "pw4"  "pw5"
## [10] "pw6"  "pw7"  "pw8"  "pw9"  "pw10" "pw11" "pw12" "pw13" "pw14"
## [19] "pw15" "pw16" "pw17" "pw18" "pw19" "pw20" "pw21" "pw22" "pw23"
## [28] "pw24" "pw25" "pw26" "pw27" "pw28" "pw29" "pw30" "pw31" "pw32"
## [37] "pw33" "pw34" "pw35" "pw36" "pw37" "pw38" "pw39" "pw40" "pw41"
## [46] "pw42" "pw43" "pw44" "pw45" "pw46" "pw47" "pw48" "pw49" "pw50"

As with other replicate design objects, when printing the object or looking at the summary, the replicate weights are provided along with the data variables.

10.5 Exercises

1. The National Health Interview Survey (NHIS) is an annual household survey conducted by the National Center for Health Statistics (NCHS). The NHIS includes a wide variety of health topics for adults, including health status and conditions, functioning and disability, health care access and health service utilization, health-related behaviors, health promotion, mental health, barriers to care, and community engagement. Like many national in-person surveys, the sampling design is a stratified clustered design, with details included in the Survey Description35.
The Survey Description provides information on setting up the syntax in SUDAAN, Stata, SPSS, SAS, and R ({survey} package implementation). How would you specify the design using {srvyr}, with either as_survey_design() or as_survey_rep()?

nhis_adult_des <- nhis_adult_data %>%
  as_survey_design(ids = PPSU,
                   strata = PSTRAT,
                   nest = TRUE,
                   weights = WTFA_A)

2. The General Social Survey (GSS) is a survey that has been administered since 1972 on social, behavioral, and attitudinal topics. The 2016-2020 GSS Panel codebook36 provides examples of setting up syntax in SAS and Stata but not R. How would you specify the design in R?

gss_des <- gss_data %>%
  as_survey_design(ids = VPSU_2,
                   strata = VSTRAT_2,
                   weights = WTSSNR_2)

References

Cochran, William G. 1977. Sampling Techniques. John Wiley & Sons.
Cox, Brenda G, David A Binder, B Nanjamma Chinnappa, Anders Christianson, Michael J Colledge, and Phillip S Kott. 2011. Business Survey Methods. John Wiley & Sons.
Deming, W Edwards. 1991. Sample Design in Business Research. Vol. 23. John Wiley & Sons.
Fuller, Wayne A. 2011. Sampling Statistics. John Wiley & Sons.
Levy, Paul S, and Stanley Lemeshow. 2013. Sampling of Populations: Methods and Applications. John Wiley & Sons.
Penn State. 2019. "STAT 506: Sampling Theory and Methods [Online Course]." https://online.stat.psu.edu/stat506/.
Särndal, Carl-Erik, Bengt Swensson, and Jan Wretman. 2003. Model Assisted Survey Sampling. Springer Science & Business Media.
U.S. Census Bureau. 2021. "Understanding and Using the American Community Survey Public Use Microdata Sample Files: What Data Users Need to Know." U.S. Government Printing Office; https://www.census.gov/content/dam/Census/library/publications/2021/acs/acs_pums_handbook_2021.pdf.
U.S. Energy Information Administration. 2017. "Residential Energy Consumption Survey (RECS): Using the 2015 microdata file to compute estimates and standard errors (RSEs)." https://www.eia.gov/consumption/residential/data/2015/pdf/microdata_v3.pdf.
———. 2023b. "2020 Residential Energy Consumption Survey: Using the microdata file to compute estimates and relative standard errors (RSEs)." https://www.eia.gov/consumption/residential/data/2020/pdf/microdata-guide.pdf.
Valliant, Richard, Jill A Dever, and Frauke Kreuter. 2013. Practical Tools for Designing and Weighting Survey Samples. Vol. 1. Springer.
Wolter, Kirk M. 2007. Introduction to Variance Estimation. Vol. 53. Springer.

2017-2019 National Survey of Family Growth (NSFG): Sample Design Documentation - https://www.cdc.gov/nchs/data/nsfg/NSFG-2017-2019-Sample-Design-Documentation-508.pdf↩︎
dplyr documentation on tidy-select: https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html↩︎
We provide the code here for you to replicate this example, but we are not focusing on the creation of the weights, as that is outside the scope of this book. We recommend you reference Wolter (2007) for more information on creating bootstrap weights.↩︎
2022 National Health Interview Survey (NHIS) Survey Description: https://www.cdc.gov/nchs/nhis/2022nhis.htm↩︎
2016-2020 GSS Panel Codebook Release 1a: https://gss.norc.org/Documents/codebook/2016-2020%20GSS%20Panel%20Codebook%20-%20R1a.pdf↩︎

Chapter 11 Missing data

Prerequisites

For this chapter, load the following packages:

library(tidyverse)
library(survey)
library(srvyr)
library(srvyrexploR)
library(naniar)
library(haven)
library(gt)

We will be using data from ANES and RECS. Here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter 3 for more information).

targetpop <- 231592693

data(anes_2020)

anes_adjwgt <- anes_2020 %>%
  mutate(Weight = Weight / sum(Weight) * targetpop)

anes_des <- anes_adjwgt %>%
  as_survey_design(
    weights = Weight,
    strata = Stratum,
    ids = VarUnit,
    nest = TRUE
  )

For RECS, details are included in the RECS documentation and Chapter 10.

data(recs_2020)

recs_des <- recs_2020 %>%
  as_survey_rep(
    weights = NWEIGHT,
    repweights = NWEIGHT1:NWEIGHT60,
    type = "JK1",
    scale = 59/60,
    mse = TRUE
  )

11.1 Introduction

Missing data in surveys refers to situations where participants do not provide complete responses to survey questions. Respondents may not have seen a question by design, or they may not respond to a question for various other reasons, such as not wanting to answer a particular question, not understanding the question, or simply forgetting to answer. Missing data is important to consider and account for, as it can introduce bias and reduce the representativeness of the data. This chapter provides an overview of the types of missing data, how to assess missing data in surveys, and how to conduct analysis when missing data is present. Understanding this complex topic can help ensure accurate reporting of survey results and can provide insight into potential changes to the survey design for the future.

11.2 Missing data mechanisms

There are two main categories that missing data typically fall into: missing by design and unintentional missing data. Missing by design is part of the survey plan and can be more easily incorporated into weights and analyses. Unintentional missing data, on the other hand, can lead to bias in survey estimates if not correctly accounted for. Below we provide more information on the types of missing data.

Missing by design/questionnaire skip logic: This type of missingness occurs when certain respondents are intentionally directed to skip specific questions based on their previous responses or characteristics. For example, in a survey about employment, if a respondent indicates that they are not employed, they may be directed to skip questions related to their job responsibilities. Additionally, some surveys randomize questions or modules so that not all participants respond to all questions. In these instances, respondents would have missing data for the modules not randomly assigned to them.

Unintentional missing data: This type of missingness occurs when researchers do not intend for there to be missing data on a particular question, for example, if respondents did not finish the survey or refused to answer individual questions.
There are three main types of unintentional missing data, each of which should be considered and handled differently (Mack, Su, and Westreich 2018; Schafer and Graham 2002):

Missing completely at random (MCAR): The missing data is unrelated to both observed and unobserved data, and the probability of being missing is the same across all cases. For example, if a respondent missed a question because they had to leave the survey early due to an emergency.

Missing at random (MAR): The missing data is related to observed data but not unobserved data, and the probability of being missing is the same within groups. For example, if older respondents choose not to answer specific questions but younger respondents do answer them, and we know the respondent's age.

Missing not at random (MNAR): The missing data is related to unobserved data, and the probability of being missing varies for reasons we are not measuring. For example, if respondents with depression do not answer a question about depression severity.

11.3 Assessing missing data

Before beginning analysis, we should explore the data to determine if there is missing data and what types of missing data are present. Conducting this descriptive analysis can help with analysis and reporting of survey data (see Section 12) and can inform the survey design in future studies. For example, large amounts of unexpected missing data may indicate that the questions were unclear or difficult to recall. There are several ways to explore missing data, which we walk through below. When assessing the missing data, we recommend using a data.frame object and not the survey object, as most of the analysis is about patterns of records, and weights are not necessary.

11.3.1 Summarize data

A very rudimentary first exploration is to use the summary() function, which will illuminate NA values in the data. Let's look at a few analytic variables on the ANES 2020 data using summary():

anes_2020 %>%
  select(V202051:EarlyVote2020) %>%
  summary()

##     V202051          V202066       V202072       VotedPres2020
##  Min.   :-9.000   Min.   :-9.0   Min.   :-9.000   Yes :5952
##  1st Qu.:-1.000   1st Qu.: 4.0   1st Qu.: 1.000   No  :  77
##  Median :-1.000   Median : 4.0   Median : 1.000   NA's:1424
##  Mean   :-0.726   Mean   : 3.4   Mean   : 0.623
##  3rd Qu.:-1.000   3rd Qu.: 4.0   3rd Qu.: 1.000
##  Max.   : 3.000   Max.   : 4.0   Max.   : 2.000
##     V202073          V202109x         V202110x
##  Min.   :-9.000   Min.   :-2.000   Min.   :-9.00
##  1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.00
##  Median : 1.000   Median : 1.000   Median : 1.00
##  Mean   : 0.942   Mean   : 0.858   Mean   : 0.99
##  3rd Qu.: 2.000   3rd Qu.: 1.000   3rd Qu.: 2.00
##  Max.   :12.000   Max.   : 1.000   Max.   : 5.00
##  VotedPres2020_selection EarlyVote2020
##  Biden:3509              Yes : 371
##  Trump:2567              No  :5949
##  Other: 158              NA's:1133
##  NA's :1219

We see that there are NA values in several of the derived variables (those not beginning with "V") and negative values in the original variables (those beginning with "V"). We can also use the count() function to get an understanding of the different types of missing data on the original variables. For example, let's look at the count of data for V202072, which corresponds to our VotedPres2020 variable.

anes_2020 %>%
  count(VotedPres2020, V202072)

## # A tibble: 5 × 3
##   VotedPres2020 V202072                                  n
##   <fct>         <dbl+lbl>                            <int>
## 1 Yes            1 [1. Yes, voted for President]      5952
## 2 No             2 [2. No, didn't vote for President]   77
## 3 <NA>          -9 [-9. Refused]                          2
## 4 <NA>          -6 [-6. No post-election interview]       4
## 5 <NA>          -1 [-1. Inapplicable]                  1418

Here we can see that there are three types of missing data, and that the majority of them fall under the "Inapplicable" category. This is usually a term associated with data missing due to skip patterns and is considered to be missing by design. Based on the documentation from ANES (DeBell 2010), we can see that this question was only asked of respondents who voted in the election.

11.3.2 Visualization of missing data

It can be challenging to look at tables for every variable; instead, it may be more efficient to view missing data in a graphical format to help narrow in on patterns or unique variables. The {naniar} package is very useful for exploring missing data visually. It provides quick graphics to explore the missingness patterns in the data. We can use the vis_miss() function, available in both the {visdat} and {naniar} packages, to view the amount of missing data by variable.

anes_2020_derived <- anes_2020 %>%
  select(!starts_with("V2"), -CaseID, -InterviewMode,
         -Weight, -Stratum, -VarUnit)

anes_2020_derived %>%
  vis_miss(cluster = TRUE, show_perc = FALSE) +
  scale_fill_manual(values = book_colors[c(3, 1)],
                    labels = c("Present", "Missing"),
                    name = "")

FIGURE 11.1: Visual depiction of missing data in the ANES 2020 data

From this visualization, we can start to get a picture of which questions may be related to each other in terms of missing data. Even if we did not have the informative variable names, we would be able to deduce that VotedPres2020, VotedPres2020_selection, and EarlyVote2020 are likely related since their missing data patterns are similar. Additionally, we can also look at VotedPres2016_selection and see that there is a lot of missing data in that variable. Most likely this is due to a skip pattern, and we can look at further graphics to see how it might be related to other variables. The {naniar} package has multiple visualization functions that can help dive deeper, such as the gg_miss_fct() function, which looks at missing data for all variables by levels of another variable.

anes_2020_derived %>%
  gg_miss_fct(VotedPres2016) +
  scale_fill_gradientn(
    guide = "colorbar",
    name = "% Miss",
    colors = book_colors[c(3, 2, 1)]
  ) +
  ylab("Variable") +
  xlab("Voted for President in 2016")

## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.

FIGURE 11.2: Missingness in variables for each level of VotedPres2016 in the ANES 2020 data

In this case, we can see that if respondents did not vote for president in 2016 or did not answer that question, then they were not asked about who they voted for in 2016 (the percentage of missing data is 100%). Additionally, we can see with this graphic that there is more missing data across all questions if they did not provide an answer to VotedPres2016.

There are other graphics that work well with numeric data. For example, in the RECS 2020 data, we can plot two continuous variables and the missing data associated with them to see if there are any patterns in the missingness. To do this, we can use the bind_shadow() function from the {naniar} package. This creates a nabular (a combination of "na" with "tabular"), which features the original columns followed by the same number of columns with a specific NA format. These NA columns are indicators of whether the value in the original data is missing or not. The example printed below shows how most levels of HeatingBehavior are not missing (!NA) in the shadow variable HeatingBehavior_NA, while those missing in HeatingBehavior are also missing in HeatingBehavior_NA.
recs_2020_shadow <- recs_2020 %>%
  bind_shadow()

ncol(recs_2020)
## [1] 118

ncol(recs_2020_shadow)
## [1] 236

recs_2020_shadow %>%
  count(HeatingBehavior, HeatingBehavior_NA)

## # A tibble: 7 × 3
##   HeatingBehavior                               HeatingBehavior_NA     n
##   <fct>                                         <fct>              <int>
## 1 Set one temp and leave it                     !NA                 7806
## 2 Manually adjust at night/no one home          !NA                 4654
## 3 Programmable or smart thermostat automatical… !NA                 3310
## 4 Turn on or off as needed                      !NA                 1491
## 5 No control                                    !NA                  438
## 6 Other                                         !NA                   46
## 7 <NA>                                          NA                   751

We can then use these new variables to plot the missing data alongside the actual data. For example, let's plot a histogram of the total electric bill, grouped by whether heating behavior is missing or not.

recs_2020_shadow %>%
  filter(TOTALDOL < 5000) %>%
  ggplot(aes(x = TOTALDOL, fill = HeatingBehavior_NA)) +
  geom_histogram() +
  scale_fill_manual(values = book_colors[c(3, 1)],
                    labels = c("Present", "Missing"),
                    name = "Heating Behavior") +
  theme_minimal() +
  xlab("Total Energy Cost (Truncated at $5000)") +
  ylab("Number of Households") +
  labs(title = "Histogram of Energy Cost by Heating Behavior Missing Data")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

FIGURE 11.3: Histogram of Energy Cost by Heating Behavior Missing Data

This plot indicates that respondents who did not provide a response to the heating behavior question may have a different distribution of total energy cost compared to respondents who did provide a response. This view of the raw data and missingness could indicate some bias in the data. Researchers take these different aspects of bias into account when calculating weights, and we need to make sure that the weights are incorporated when analyzing the data.

There are many other visualizations that can be helpful in reviewing the data, and we recommend reviewing the {naniar} documentation for more information (Tierney and Cook 2023).

11.4 Analysis with missing data

Once we understand the types of missingness, we can begin the analysis of the data. Different missingness types may be handled in different ways. In most publicly available datasets, researchers will have already calculated weights and imputed missing values if deemed necessary. For those interested in learning more about how to calculate weights and impute data for different missing data mechanisms, we recommend Kim and Shao (2021) and Valliant and Dever (2018). Even with weights and imputation, missing data will still most likely exist in the data and need to be accounted for in analysis. This section provides an overview of how to recode missing data in R and how to account for skip patterns in analysis.

11.4.1 Recoding missing data

Even within a variable, there can be different reasons for missing data. In publicly released data, negative values are often used to distinguish different types of missingness. For example, the ANES 2020 data use the following negative values to represent different types of missing data:

-9: Refused
-8: Don't Know
-7: No post-election data, deleted due to incomplete interview
-6: No post-election interview
-5: Interview breakoff (sufficient partial IW)
-4: Technical error
-3: Restricted
-2: Other missing reason (question specific)
-1: Inapplicable

When we created the derived variables for use in this book, we coded all negative values as NA and proceeded to analyze the data. For most cases this is an appropriate approach, as long as you filter the data appropriately to account for skip patterns (see next section).
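Before recoding, it can be helpful to see how widespread these codes are. Here is a quick diagnostic (our own sketch, which assumes, as the recoding code below does, that all of the original variables beginning with "V2" are numeric):

# Count how many negative (missing-coded) values each original variable has
anes_2020 %>%
  select(starts_with("V2")) %>%
  summarize(across(everything(), ~ sum(.x < 0, na.rm = TRUE))) %>%
  pivot_longer(everything(),
               names_to = "variable",
               values_to = "n_negative") %>%
  arrange(desc(n_negative))

Variables at the top of this list are often downstream of skip patterns, which is a useful clue when deciding how to treat their missingness.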
However, the {naniar} package does have the option to code special missing values. For example, if we wanted to have two NA values, one indicating that the question was missing by design (e.g., due to skip patterns) and one for the other missing categories, we can use the nabular format to incorporate these with the recode_shadow() function.

anes_2020_shadow <- anes_2020 %>%
  select(starts_with("V2")) %>%
  mutate(across(everything(), ~ case_when(.x < -1 ~ NA,
                                          TRUE ~ .x))) %>%
  bind_shadow() %>%
  recode_shadow(V201103 = .where(V201103 == -1 ~ "skip"))

anes_2020_shadow %>%
  count(V201103, V201103_NA)

## # A tibble: 5 × 3
##   V201103                 V201103_NA     n
##   <dbl+lbl>               <fct>      <int>
## 1 -1 [-1. Inapplicable]   NA_skip     1643
## 2  1 [1. Hillary Clinton] !NA         2911
## 3  2 [2. Donald Trump]    !NA         2466
## 4  5 [5. Other {SPECIFY}] !NA          390
## 5 NA                      NA            43

However, it is important to note that, at the time of publication, there is no easy way to apply recode_shadow() to multiple variables at once (e.g., we cannot use the tidyverse feature across()). The example code above only implements this for a single variable, so this would have to be done for all variables of interest manually or in a loop.

11.4.2 Accounting for skip patterns

When questions are skipped by design in a survey, it is meaningful that the data is later missing. For example, the RECS survey asks people how they control the heat in their home in the winter (HeatingBehavior). This is only asked of those who have heat in their home (SpaceHeatingUsed). If no heating equipment is used, the value of HeatingBehavior is missing. One has several choices when analyzing this data: 1) only include those with a valid value of HeatingBehavior and specify the universe as those with heat, or 2) include those who do not have heat. It is important to specify what population an analysis generalizes to.

Here is example code where we only include those with a valid value of HeatingBehavior (choice 1). Note that we use the design object (recs_des) and then filter to those that are not missing on HeatingBehavior.

heat_cntl_1 <- recs_des %>%
  filter(!is.na(HeatingBehavior)) %>%
  group_by(HeatingBehavior) %>%
  summarize(
    p = survey_prop()
  )

heat_cntl_1

## # A tibble: 6 × 3
##   HeatingBehavior                                              p    p_se
##   <fct>                                                    <dbl>   <dbl>
## 1 Set one temp and leave it                              0.430   4.69e-3
## 2 Manually adjust at night/no one home                   0.264   4.54e-3
## 3 Programmable or smart thermostat automatically adjust… 0.168   3.12e-3
## 4 Turn on or off as needed                               0.102   2.89e-3
## 5 No control                                             0.0333  1.70e-3
## 6 Other                                                  0.00208 3.59e-4

Here is example code where we include those that do not have heat (choice 2). To help understand what we are looking at, we have included output showing both variables, SpaceHeatingUsed and HeatingBehavior.

heat_cntl_2 <- recs_des %>%
  group_by(interact(SpaceHeatingUsed, HeatingBehavior)) %>%
  summarize(
    p = survey_prop()
  )

heat_cntl_2

## # A tibble: 7 × 4
##   SpaceHeatingUsed HeatingBehavior                             p    p_se
##   <lgl>            <fct>                                   <dbl>   <dbl>
## 1 FALSE            <NA>                                  0.0469  2.07e-3
## 2 TRUE             Set one temp and leave it             0.410   4.60e-3
## 3 TRUE             Manually adjust at night/no one home  0.251   4.36e-3
## 4 TRUE             Programmable or smart thermostat aut… 0.160   2.95e-3
## 5 TRUE             Turn on or off as needed              0.0976  2.79e-3
## 6 TRUE             No control                            0.0317  1.62e-3
## 7 TRUE             Other                                 0.00198 3.41e-4

If we ran the first analysis, we would say that 16.8% of households with heat use a programmable or smart thermostat for the heating of their home.
While if we used the results from the second analysis, we could say that 16% of households use a programmable or smart thermostat for the heating of their home. The distinction of the two statements is bolded for emphasis. Skip patterns often change the universe that we are talking about and need to be carefully examined. Filtering to the correct universe is important when handling these types of missing data. The nabular we created above can also help with this. If we have NA_skip values in the shadow, we can make sure that we filter out all of these values and only include relevant missing. To do this with survey data we could first create the nabular, then create the design object on that data, and then use the shadow variables to assist with filtering the data. Let’s use the nabular we created above for ANES 2020 (anes_2020_shadow) to create the design object. anes_adjwgt_shadow <- anes_2020_shadow %>% mutate(V200010b = V200010b/sum(V200010b)*targetpop) anes_des_shadow <- anes_adjwgt_shadow %>% as_survey_design( weights = V200010b, strata = V200010d, ids = V200010c, nest = TRUE ) Then we can use this design object to look at the percent of the population that voted for each candidate in 2016 (V201103). First, let’s look at the percentages without removing any cases: pres16_select1<-anes_des_shadow %>% group_by(V201103) %>% summarize( All_Missing=survey_prop() ) pres16_select1 ## # A tibble: 5 × 3 ## V201103 All_Missing All_Missing_se ## <dbl+lbl> <dbl> <dbl> ## 1 -1 [-1. Inapplicable] 0.324 0.00933 ## 2 1 [1. Hillary Clinton] 0.330 0.00728 ## 3 2 [2. Donald Trump] 0.299 0.00728 ## 4 5 [5. Other {SPECIFY}] 0.0409 0.00230 ## 5 NA 0.00627 0.00121 Next, we will look at the percentages removing only those that were missing due to skip patterns (i.e., they did not receive this question). pres16_select2<-anes_des_shadow %>% filter(V201103_NA!="NA_skip") %>% group_by(V201103) %>% summarize( No_Skip_Missing=survey_prop() ) pres16_select2 ## # A tibble: 4 × 3 ## V201103 No_Skip_Missing No_Skip_Missing_se ## <dbl+lbl> <dbl> <dbl> ## 1 1 [1. Hillary Clinton] 0.488 0.00870 ## 2 2 [2. Donald Trump] 0.443 0.00856 ## 3 5 [5. Other {SPECIFY}] 0.0606 0.00330 ## 4 NA 0.00928 0.00178 Finally, we will look at the percentages removing all missing values both due to skip patterns and due to those who refused to answer the question. pres16_select3<-anes_des_shadow %>% filter(V201103_NA=="!NA") %>% group_by(V201103) %>% summarize( No_Missing=survey_prop() ) pres16_select3 ## # A tibble: 3 × 3 ## V201103 No_Missing No_Missing_se ## <dbl+lbl> <dbl> <dbl> ## 1 1 [1. Hillary Clinton] 0.492 0.00875 ## 2 2 [2. Donald Trump] 0.447 0.00861 ## 3 5 [5. 
Other {SPECIFY}] 0.0611 0.00332 #edxahdlkim table { font-family: system-ui, 'Segoe UI', Roboto, Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol', 'Noto Color Emoji'; -webkit-font-smoothing: antialiased; -moz-osx-font-smoothing: grayscale; } #edxahdlkim thead, #edxahdlkim tbody, #edxahdlkim tfoot, #edxahdlkim tr, #edxahdlkim td, #edxahdlkim th { border-style: none; } #edxahdlkim p { margin: 0; padding: 0; } #edxahdlkim .gt_table { display: table; border-collapse: collapse; line-height: normal; margin-left: auto; margin-right: auto; color: #333333; font-size: 16px; font-weight: normal; font-style: normal; background-color: #FFFFFF; width: auto; border-top-style: solid; border-top-width: 2px; border-top-color: #A8A8A8; border-right-style: none; border-right-width: 2px; border-right-color: #D3D3D3; border-bottom-style: solid; border-bottom-width: 2px; border-bottom-color: #A8A8A8; border-left-style: none; border-left-width: 2px; border-left-color: #D3D3D3; } #edxahdlkim .gt_caption { padding-top: 4px; padding-bottom: 4px; } #edxahdlkim .gt_title { color: #333333; font-size: 125%; font-weight: initial; padding-top: 4px; padding-bottom: 4px; padding-left: 5px; padding-right: 5px; border-bottom-color: #FFFFFF; border-bottom-width: 0; } #edxahdlkim .gt_subtitle { color: #333333; font-size: 85%; font-weight: initial; padding-top: 3px; padding-bottom: 5px; padding-left: 5px; padding-right: 5px; border-top-color: #FFFFFF; border-top-width: 0; } #edxahdlkim .gt_heading { background-color: #FFFFFF; text-align: center; border-bottom-color: #FFFFFF; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; } #edxahdlkim .gt_bottom_border { border-bottom-style: solid; border-bottom-width: 2px; border-bottom-color: #D3D3D3; } #edxahdlkim .gt_col_headings { border-top-style: solid; border-top-width: 2px; border-top-color: #D3D3D3; border-bottom-style: solid; border-bottom-width: 2px; border-bottom-color: #D3D3D3; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; } #edxahdlkim .gt_col_heading { color: #333333; background-color: #FFFFFF; font-size: 100%; font-weight: normal; text-transform: inherit; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 6px; padding-left: 5px; padding-right: 5px; overflow-x: hidden; } #edxahdlkim .gt_column_spanner_outer { color: #333333; background-color: #FFFFFF; font-size: 100%; font-weight: normal; text-transform: inherit; padding-top: 0; padding-bottom: 0; padding-left: 4px; padding-right: 4px; } #edxahdlkim .gt_column_spanner_outer:first-child { padding-left: 0; } #edxahdlkim .gt_column_spanner_outer:last-child { padding-right: 0; } #edxahdlkim .gt_column_spanner { border-bottom-style: solid; border-bottom-width: 2px; border-bottom-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 5px; overflow-x: hidden; display: inline-block; width: 100%; } #edxahdlkim .gt_spanner_row { border-bottom-style: hidden; } #edxahdlkim .gt_group_heading { padding-top: 8px; padding-bottom: 8px; padding-left: 5px; padding-right: 5px; color: #333333; background-color: #FFFFFF; font-size: 100%; font-weight: initial; text-transform: inherit; border-top-style: solid; 
border-top-width: 2px; border-top-color: #D3D3D3; border-bottom-style: solid; border-bottom-width: 2px; border-bottom-color: #D3D3D3; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: middle; text-align: left; } #edxahdlkim .gt_empty_group_heading { padding: 0.5px; color: #333333; background-color: #FFFFFF; font-size: 100%; font-weight: initial; border-top-style: solid; border-top-width: 2px; border-top-color: #D3D3D3; border-bottom-style: solid; border-bottom-width: 2px; border-bottom-color: #D3D3D3; vertical-align: middle; } #edxahdlkim .gt_from_md > :first-child { margin-top: 0; } #edxahdlkim .gt_from_md > :last-child { margin-bottom: 0; } #edxahdlkim .gt_row { padding-top: 8px; padding-bottom: 8px; padding-left: 5px; padding-right: 5px; margin: 10px; border-top-style: solid; border-top-width: 1px; border-top-color: #D3D3D3; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: middle; overflow-x: hidden; } #edxahdlkim .gt_stub { color: #333333; background-color: #FFFFFF; font-size: 100%; font-weight: initial; text-transform: inherit; border-right-style: solid; border-right-width: 2px; border-right-color: #D3D3D3; padding-left: 5px; padding-right: 5px; } #edxahdlkim .gt_stub_row_group { color: #333333; background-color: #FFFFFF; font-size: 100%; font-weight: initial; text-transform: inherit; border-right-style: solid; border-right-width: 2px; border-right-color: #D3D3D3; padding-left: 5px; padding-right: 5px; vertical-align: top; } #edxahdlkim .gt_row_group_first td { border-top-width: 2px; } #edxahdlkim .gt_row_group_first th { border-top-width: 2px; } #edxahdlkim .gt_summary_row { color: #333333; background-color: #FFFFFF; text-transform: inherit; padding-top: 8px; padding-bottom: 8px; padding-left: 5px; padding-right: 5px; } #edxahdlkim .gt_first_summary_row { border-top-style: solid; border-top-color: #D3D3D3; } #edxahdlkim .gt_first_summary_row.thick { border-top-width: 2px; } #edxahdlkim .gt_last_summary_row { padding-top: 8px; padding-bottom: 8px; padding-left: 5px; padding-right: 5px; border-bottom-style: solid; border-bottom-width: 2px; border-bottom-color: #D3D3D3; } #edxahdlkim .gt_grand_summary_row { color: #333333; background-color: #FFFFFF; text-transform: inherit; padding-top: 8px; padding-bottom: 8px; padding-left: 5px; padding-right: 5px; } #edxahdlkim .gt_first_grand_summary_row { padding-top: 8px; padding-bottom: 8px; padding-left: 5px; padding-right: 5px; border-top-style: double; border-top-width: 6px; border-top-color: #D3D3D3; } #edxahdlkim .gt_last_grand_summary_row_top { padding-top: 8px; padding-bottom: 8px; padding-left: 5px; padding-right: 5px; border-bottom-style: double; border-bottom-width: 6px; border-bottom-color: #D3D3D3; } #edxahdlkim .gt_striped { background-color: rgba(128, 128, 128, 0.05); } #edxahdlkim .gt_table_body { border-top-style: solid; border-top-width: 2px; border-top-color: #D3D3D3; border-bottom-style: solid; border-bottom-width: 2px; border-bottom-color: #D3D3D3; } #edxahdlkim .gt_footnotes { color: #333333; background-color: #FFFFFF; border-bottom-style: none; border-bottom-width: 2px; border-bottom-color: #D3D3D3; border-left-style: none; border-left-width: 2px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 2px; border-right-color: #D3D3D3; } #edxahdlkim .gt_footnote { 
TABLE 11.1: Percentage of Votes by Candidate for Different Missing Data Inclusions

Candidate                            Including All      Removing Skip      Removing All
                                     Missing Data       Patterns Only      Missing Data
                                     %      s.e. (%)    %      s.e. (%)    %      s.e. (%)
Did not Vote for President in 2016   32.4%  0.9%        NA     NA          NA     NA
Hillary Clinton                      33.0%  0.7%        48.8%  0.9%        49.2%  0.9%
Donald Trump                         29.9%  0.7%        44.3%  0.9%        44.7%  0.9%
Other Candidate                      4.1%   0.2%        6.1%   0.3%        6.1%   0.3%
Missing                              0.6%   0.1%        0.9%   0.2%        NA     NA

As Table 11.1 shows, the results can vary greatly depending on which types of missing data are removed. If we remove only the skip patterns, the margin between Clinton and Trump is 4.5 percentage points, but if we include all data, even respondents who did not vote in 2016, the margin is 3.1 percentage points. How we handle the different types of missing values is important for the interpretation of the data.
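To make the comparison concrete, the first and third columns of Table 11.1 roughly correspond to estimating with and without a filter on missing values. Below is a minimal sketch, assuming the anes_des design object and the VotedPres2016_selection variable used elsewhere in the book; the middle column additionally requires recoding the skip patterns, which is not shown here, and the book's table derives its categories from a combination of the voting variables.

# A sketch, assuming anes_des and VotedPres2016_selection exist as in
# earlier chapters of this book.

# Including all missing data: NA values form their own group
anes_des %>%
  group_by(VotedPres2016_selection) %>%
  summarize(p = survey_mean())

# Removing all missing data: drop NA values before estimating
anes_des %>%
  filter(!is.na(VotedPres2016_selection)) %>%
  group_by(VotedPres2016_selection) %>%
  summarize(p = survey_mean())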
"],["c12-pitfalls.html", "Chapter 12 Common pitfalls", " Chapter 12 Common pitfalls "],["c13-ncvs-vignette.html", "Chapter 13 National Crime Victimization Survey Vignette 13.1 Introduction 13.2 Data Structure 13.3 Survey Notation 13.4 Data File Preparation 13.5 Survey Design Objects 13.6 Calculating Estimates 13.7 Statistical testing 13.8 Exercises", " Chapter 13 National Crime Victimization Survey Vignette Prerequisites For this chapter, load the following packages: library(tidyverse) library(survey) library(srvyr) library(srvyrexploR) library(gt) We will use data from the United States National Crime Victimization Survey (NCVS). Here is the code to read in the three datasets from the {srvyrexploR} package: data(ncvs_2021_incident) data(ncvs_2021_household) data(ncvs_2021_person) 13.1 Introduction The NCVS is a household survey sponsored by the Bureau of Justice Statistics (BJS), which collects data on criminal victimization, including characteristics of the crimes, offenders, and victims. Crime types include both household and personal crimes, as well as violent and non-violent crimes. The target population of this survey is all people in the United States age 12 and older living in housing units and noninstitutional group quarters. The NCVS has been ongoing since 1992. An earlier survey, the National Crime Survey, was run from 1972 to 1991 (Bureau of Justice Statistics 2017). The survey is administered using a rotating panel. When an address enters the sample, the residents of that address are interviewed every six months for a total of seven interviews. If the initial residents move away from the address during the period, the new residents are included in the survey, as people are not followed when they move. NCVS data is publicly available and distributed by Inter-university Consortium for Political and Social Research (ICPSR)37, with data going back to 1992. The vignette in this book will include data from 2021 (United States. Bureau of Justice Statistics 2022). The NCVS data structure is complicated, and the User’s Guide contains examples for analysis in SAS, SUDAAN, SPSS, and Stata, but not R (Shook-Sa, Bonnie, Couzens, G. Lance, and Berzofsky, Marcus 2015). This vignette will adapt those examples for R. 13.2 Data Structure The data from ICPSR is distributed with five files, each having its unique identifier indicated: Address Record - YEARQ, IDHH Household Record - YEARQ, IDHH Person Record - YEARQ, IDHH, IDPER Incident Record - YEARQ, IDHH, IDPER 2021 Collection Year Incident - YEARQ, IDHH, IDPER We will focus on the household, person, and incident files. From these files, we selected a subset of columns for examples to use in this vignette. We have included data in the {srvyexploR} package with a subset of columns, but you can download the complete files at ICPSR38. 13.3 Survey Notation The NCVS User Guide (Shook-Sa, Bonnie, Couzens, G. Lance, and Berzofsky, Marcus 2015) uses the following notation: \\(i\\) represents NCVS households, identified on the household-level file with the household identification number IDHH. \\(j\\) represents NCVS individual respondents within households \\(i\\), identified on the person-level file with the person identification number IDPER. \\(k\\) represents reporting periods (i.e., YEARQ) for households \\(i\\) and individual respondent \\(j\\). \\(l\\) represents victimization records for respondent \\(j\\) in household \\(i\\) and reporting period \\(k\\). 
  Each record on the NCVS incident-level file is associated with a victimization record \(l\).
- \(D\) represents one or more domain characteristics of interest in the calculation of NCVS estimates. For victimization totals and proportions, domains can be defined on the basis of crime types (e.g., violent crimes, property crimes), characteristics of victims (e.g., age, sex, household income), or characteristics of the victimizations (e.g., victimizations reported to police, victimizations committed with a weapon present). Domains could also be a combination of these types of characteristics. For example, in the calculation of victimization rates, domains are defined on the basis of the characteristics of the victims.
- \(A_a\) represents the level \(a\) of covariate \(A\). Covariate \(A\) is defined in the calculation of victimization proportions and represents the characteristic for which the analyst wants to obtain the distribution of victimizations in domain \(D\).
- \(C\) represents the personal or property crime for which we want to obtain a victimization rate.

In this vignette, we discuss four estimates:

Victimization totals estimate the number of criminal victimizations with a given characteristic. As demonstrated below, these can be calculated from any of the data files. The estimated victimization total \(\hat{t}_D\) for domain \(D\) is

\[ \hat{t}_D = \sum_{ijkl \in D} v_{ijkl} \]

where \(v_{ijkl}\) is the series-adjusted victimization weight (WGTVICCY) for household \(i\), respondent \(j\), reporting period \(k\), and victimization \(l\).

Victimization proportions estimate characteristics among victimizations or victims. Victimization proportions are calculated using the incident data file. The estimated victimization proportion for domain \(D\) across level \(a\) of covariate \(A\), \(\hat{p}_{A_a,D}\), is

\[ \hat{p}_{A_a,D} = \frac{\sum_{ijkl \in A_a, D} v_{ijkl}}{\sum_{ijkl \in D} v_{ijkl}}. \]

The numerator is the number of incidents with a particular characteristic in a domain, and the denominator is the number of incidents in a domain.

Victimization rates are estimates of the number of victimizations per 1,000 persons or households in the population[39]. Victimization rates are calculated using the household or person-level data files. The estimated victimization rate for crime \(C\) in domain \(D\) is

\[ \hat{VR}_{C,D} = \frac{\sum_{ijkl \in C,D} v_{ijkl}}{\sum_{ijk \in D} w_{ijk}} \times 1000 \]

where \(w_{ijk}\) is the person weight (WGTPERCY) or household weight (WGTHHCY) for personal and household crimes, respectively. The numerator is the number of incidents in a domain, and the denominator is the number of persons or households in a domain. Notice that the weights in the numerator and denominator are different; this is important, and in the syntax and examples below, we discuss how to make an estimate that involves two weights.

Prevalence rates are estimates of the percentage of the population (persons or households) who are victims of a crime. These are estimated using the household or person-level data files. The estimated prevalence rate for crime \(C\) in domain \(D\) is

\[ \hat{PR}_{C,D} = \frac{\sum_{ijk \in C,D} I_{ij} w_{ijk}}{\sum_{ijk \in D} w_{ijk}} \times 100 \]

where \(I_{ij}\) is an indicator that a person or household in domain \(D\) was a victim of crime \(C\) at any time in the year.
The numerator is the number of victims in domain \(D\) for crime \(C\), and the denominator is the number of people or households in the population.

13.4 Data File Preparation

Some work is necessary to prepare the files before analysis. The design variables indicating pseudostratum (V2117) and half-sample code (V2118) are only included on the household file, so they must be added to the person and incident files for any analysis. For victimization rates, we need to know the victimization status for both victims and non-victims, so the incident file must be summarized and merged onto the household or person files for household-level and person-level crimes, respectively. We begin by discussing how to create these incident summary files, following Section 2.2 of the NCVS User's Guide (Shook-Sa, Couzens, and Berzofsky 2015).

13.4.1 Preparing Files for Estimation of Victimization Rates

Each record on the incident file represents one victimization, which is not the same as one incident. Some victimizations consist of several instances whose details are difficult for the victim to differentiate; these are labeled "series crimes". Appendix A of the User's Guide indicates how to calculate the series weight in other statistical languages, and here we adapt that code for R. Essentially, if a victimization is a series crime, its series weight is the number of actual victimizations top-coded at 10; that is, even if the crime occurred more than 10 times, it is counted as 10 times to reduce the influence of extreme outliers. If an incident is a series crime but the number of occurrences is unknown, the series weight is set to 6. A description of the variables used to create the series indicators and the associated weights is included in Table 13.1.

TABLE 13.1: Codebook for incident variables - related to series weight

V4016     How many times incident occur last 6 mos
          1-996 = Number of times
          997 = Don't know
V4017     How many incidents
          1 = 1-5 incidents (not a "series")
          2 = 6 or more incidents
          8 = Residue (invalid data)
V4018     Incidents similar in detail
          1 = Similar
          2 = Different (not in a "series")
          8 = Residue (invalid data)
V4019     Enough detail to distinguish incidents
          1 = Yes (not a "series")
          2 = No (is a "series")
          8 = Residue (invalid data)
WGTVICCY  Adjusted victimization weight
          Numeric

We want to create four variables to indicate if an incident is a series crime. First, we create a variable called series using V4017, V4018, and V4019, where an incident is considered a series crime if there are 6 or more incidents (V4017), the incidents are similar in detail (V4018), or there is not enough detail to distinguish the incidents (V4019). Next, we top-code the number of incidents (V4016) by creating a variable n10v4016, which is set to 10 if V4016 > 10. Finally, we create the series weight using our new top-coded variable and the existing weight.

inc_series <- ncvs_2021_incident %>%
  mutate(
    series = case_when(V4017 %in% c(1, 8) ~ 1,
                       V4018 %in% c(2, 8) ~ 1,
                       V4019 %in% c(1, 8) ~ 1,
                       TRUE ~ 2),
    n10v4016 = case_when(V4016 %in% c(997, 998) ~ NA_real_,
                         V4016 > 10 ~ 10,
                         TRUE ~ V4016),
    serieswgt = case_when(series == 2 & is.na(n10v4016) ~ 6,
                          series == 2 ~ n10v4016,
                          TRUE ~ 1),
    NEWWGT = WGTVICCY * serieswgt
  )
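A quick tabulation of the new variables can confirm the series logic (a check we add here; it is not part of the User's Guide workflow). Records with series equal to 1 (not a series crime) should always have a series weight of 1, while series crimes should have weights between 1 and 10, or 6 when the number of occurrences is unknown:

# Sanity check (ours): series == 1 should always pair with serieswgt == 1;
# series == 2 should pair with weights in 1-10, or 6 when unknown
inc_series %>%
  count(series, serieswgt)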
The next step in preparing the files for estimation is to create indicators on the victimization file for characteristics of interest. Almost all BJS publications limit the analysis to records where the victimization occurred inside the United States (i.e., where V4022 is not equal to 1), and we do this for all estimates as well. A brief codebook of variables for this task is located in Table 13.2.

TABLE 13.2: Codebook for incident variables - crime type indicators and characteristics

V4022  In what city/town/village
       1 = Outside U.S.
       2 = Not inside a city/town/village
       3 = Same city/town/village as present residence
       4 = Different city/town/village as present residence
       5 = Don't know
       6 = Don't know if 2, 4, or 5
V4049  Did offender have weapon
       1 = Yes
       2 = No
       3 = Don't know
V4050  What was weapon
       1 = At least one good entry
       3 = Indicates "Yes-Type Weapon-NA"
       7 = Indicates "Gun Type Unknown"
       8 = No good entry
V4051  Hand gun
       0 = No
       1 = Yes
V4052  Other gun
       0 = No
       1 = Yes
V4053  Knife
       0 = No
       1 = Yes
V4399  Reported to police
       1 = Yes
       2 = No
       3 = Don't know
V4529  Type of crime code
       01 = Completed rape
       02 = Attempted rape
       03 = Sexual attack with serious assault
       04 = Sexual attack with minor assault
       05 = Completed robbery with injury from serious assault
       06 = Completed robbery with injury from minor assault
       07 = Completed robbery without injury from minor assault
       08 = Attempted robbery with injury from serious assault
       09 = Attempted robbery with injury from minor assault
       10 = Attempted robbery without injury
       11 = Completed aggravated assault with injury
       12 = Attempted aggravated assault with weapon
       13 = Threatened assault with weapon
       14 = Simple assault completed with injury
       15 = Sexual assault without injury
       16 = Unwanted sexual contact without force
       17 = Assault without weapon without injury
       18 = Verbal threat of rape
       19 = Verbal threat of sexual assault
       20 = Verbal threat of assault
       21 = Completed purse snatching
       22 = Attempted purse snatching
       23 = Pocket picking (completed only)
       31 = Completed burglary, forcible entry
       32 = Completed burglary, unlawful entry without force
       33 = Attempted forcible entry
       40 = Completed motor vehicle theft
       41 = Attempted motor vehicle theft
       54 = Completed theft less than $10
       55 = Completed theft $10 to $49
       56 = Completed theft $50 to $249
       57 = Completed theft $250 or greater
       58 = Completed theft value NA
       59 = Attempted theft

Using these variables, we create the following indicators:

- Property crime: V4529 >= 31; variable: Property
- Violent crime: V4529 <= 20; variable: Violent
- Property crime reported to the police: V4529 >= 31 and V4399 = 1; variable: Property_ReportPolice
- Violent crime reported to the police: V4529 <= 20 and V4399 = 1; variable: Violent_ReportPolice
- Aggravated assault without a weapon: V4529 in 11:13 and V4049 = 2; variable: AAST_NoWeap
- Aggravated assault with a firearm: V4529 in 11:13, V4049 = 1, and (V4051 = 1 or V4052 = 1 or V4050 = 7); variable: AAST_Firearm
- Aggravated assault with a knife or sharp object: V4529 in 11:13, V4049 = 1, and (V4053 = 1 or V4054 = 1); variable: AAST_Knife
- Aggravated assault with another type of weapon: V4529 in 11:13, V4049 = 1, V4050 = 1, and not firearm or knife; variable: AAST_Other

inc_ind <- inc_series %>%
  filter(V4022 != 1) %>%
  mutate(
    WeapCat = case_when(
      is.na(V4049) ~ NA_character_,
      V4049 == 2 ~ "NoWeap",
      V4049 == 3 ~ "UnkWeapUse",
      V4050 == 3 ~ "Other",
      V4051 == 1 | V4052 == 1 | V4050 == 7 ~ "Firearm",
      V4053 == 1 | V4054 == 1 ~ "Knife",
      TRUE ~ "Other"
    ),
    V4529_num = parse_number(as.character(V4529)),
    ReportPolice = V4399 == 1,
    Property = V4529_num >= 31,
    Violent = V4529_num <= 20,
    Property_ReportPolice = Property & ReportPolice,
    Violent_ReportPolice = Violent & ReportPolice,
    AAST = V4529_num %in% 11:13,
    AAST_NoWeap = AAST & WeapCat == "NoWeap",
    AAST_Firearm = AAST & WeapCat == "Firearm",
    AAST_Knife = AAST & WeapCat == "Knife",
    AAST_Other = AAST & WeapCat == "Other"
  )
This is a good point to pause and review crosswalks between each original variable and its derived counterpart, to check that the logic was programmed correctly and that everything ends up in the expected category.

inc_series %>%
  count(V4022)
## # A tibble: 6 × 2
##   V4022     n
##   <fct> <int>
## 1 1        34
## 2 2        65
## 3 3      7697
## 4 4      1143
## 5 5        39
## 6 8         4

inc_ind %>%
  count(V4022)
## # A tibble: 5 × 2
##   V4022     n
##   <fct> <int>
## 1 2        65
## 2 3      7697
## 3 4      1143
## 4 5        39
## 5 8         4

inc_ind %>%
  count(WeapCat, V4049, V4050, V4051, V4052, V4052, V4053, V4054)
## # A tibble: 13 × 8
##    WeapCat    V4049 V4050 V4051 V4052 V4053 V4054     n
##    <chr>      <fct> <fct> <fct> <fct> <fct> <fct> <int>
##  1 Firearm    1     1     0     1     0     0        15
##  2 Firearm    1     1     0     1     1     1         1
##  3 Firearm    1     1     1     0     0     0       125
##  4 Firearm    1     1     1     0     1     0         2
##  5 Firearm    1     1     1     1     0     0         3
##  6 Firearm    1     7     0     0     0     0         3
##  7 Knife      1     1     0     0     0     1        14
##  8 Knife      1     1     0     0     1     0        71
##  9 NoWeap     2     <NA>  <NA>  <NA>  <NA>  <NA>   1794
## 10 Other      1     1     0     0     0     0       147
## 11 Other      1     3     0     0     0     0        26
## 12 UnkWeapUse 3     <NA>  <NA>  <NA>  <NA>  <NA>    519
## 13 <NA>       <NA>  <NA>  <NA>  <NA>  <NA>  <NA>   6228

inc_ind %>%
  count(V4529, Property, Violent, AAST) %>%
  print(n = 40)
## # A tibble: 34 × 5
##    V4529 Property Violent AAST      n
##    <fct> <lgl>    <lgl>   <lgl> <int>
##  1 1     FALSE    TRUE    FALSE    45
##  2 2     FALSE    TRUE    FALSE    20
##  3 3     FALSE    TRUE    FALSE    11
##  4 4     FALSE    TRUE    FALSE     3
##  5 5     FALSE    TRUE    FALSE    24
##  6 6     FALSE    TRUE    FALSE    26
##  7 7     FALSE    TRUE    FALSE    59
##  8 8     FALSE    TRUE    FALSE     5
##  9 9     FALSE    TRUE    FALSE     7
## 10 10    FALSE    TRUE    FALSE    57
## 11 11    FALSE    TRUE    TRUE     97
## 12 12    FALSE    TRUE    TRUE     91
## 13 13    FALSE    TRUE    TRUE    163
## 14 14    FALSE    TRUE    FALSE   165
## 15 15    FALSE    TRUE    FALSE    24
## 16 16    FALSE    TRUE    FALSE    12
## 17 17    FALSE    TRUE    FALSE   357
## 18 18    FALSE    TRUE    FALSE    14
## 19 19    FALSE    TRUE    FALSE     3
## 20 20    FALSE    TRUE    FALSE   607
## 21 21    FALSE    FALSE   FALSE     2
## 22 22    FALSE    FALSE   FALSE     2
## 23 23    FALSE    FALSE   FALSE    19
## 24 31    TRUE     FALSE   FALSE   248
## 25 32    TRUE     FALSE   FALSE   634
## 26 33    TRUE     FALSE   FALSE   188
## 27 40    TRUE     FALSE   FALSE   256
## 28 41    TRUE     FALSE   FALSE    97
## 29 54    TRUE     FALSE   FALSE   407
## 30 55    TRUE     FALSE   FALSE  1006
## 31 56    TRUE     FALSE   FALSE  1686
## 32 57    TRUE     FALSE   FALSE  1420
## 33 58    TRUE     FALSE   FALSE   798
## 34 59    TRUE     FALSE   FALSE   395

inc_ind %>%
  count(ReportPolice, V4399)
## # A tibble: 4 × 3
##   ReportPolice V4399     n
##   <lgl>        <fct> <int>
## 1 FALSE        2      5670
## 2 FALSE        3       103
## 3 FALSE        8        12
## 4 TRUE         1      3163

inc_ind %>%
  count(AAST, WeapCat, AAST_NoWeap, AAST_Firearm, AAST_Knife, AAST_Other)
## # A tibble: 11 × 7
##    AAST  WeapCat    AAST_NoWeap AAST_Firearm AAST_Knife AAST_Other     n
##    <lgl> <chr>      <lgl>       <lgl>        <lgl>      <lgl>      <int>
##  1 FALSE Firearm    FALSE       FALSE        FALSE      FALSE         34
##  2 FALSE Knife      FALSE       FALSE        FALSE      FALSE         23
##  3 FALSE NoWeap     FALSE       FALSE        FALSE      FALSE       1769
##  4 FALSE Other      FALSE       FALSE        FALSE      FALSE         27
##  5 FALSE UnkWeapUse FALSE       FALSE        FALSE      FALSE        516
##  6 FALSE <NA>       FALSE       FALSE        FALSE      FALSE       6228
##  7 TRUE  Firearm    FALSE       TRUE         FALSE      FALSE        115
##  8 TRUE  Knife      FALSE       FALSE        TRUE       FALSE         62
##  9 TRUE  NoWeap     TRUE        FALSE        FALSE      FALSE         25
## 10 TRUE  Other      FALSE       FALSE        FALSE      TRUE         146
## 11 TRUE  UnkWeapUse FALSE       FALSE        FALSE      FALSE          3

After creating indicators of victimization types and characteristics, the file is summarized, and crimes are summed across persons or households by YEARQ.
Property crimes (i.e., crimes committed against households, such as household burglary or motor vehicle theft) are summed across households, and personal crimes (i.e., crimes committed against an individual, such as assault, robbery, and personal theft) are summed across persons. The indicators are summed using serieswgt, and the variable WGTVICCY needs to be retained for later analysis.

inc_hh_sums <- inc_ind %>%
  filter(V4529_num > 23) %>% # restrict to household crimes
  group_by(YEARQ, IDHH) %>%
  summarize(WGTVICCY = WGTVICCY[1],
            across(starts_with("Property"),
                   ~ sum(. * serieswgt),
                   .names = "{.col}"),
            .groups = "drop")

inc_pers_sums <- inc_ind %>%
  filter(V4529_num <= 23) %>% # restrict to person crimes
  group_by(YEARQ, IDHH, IDPER) %>%
  summarize(WGTVICCY = WGTVICCY[1],
            across(c(starts_with("Violent"), starts_with("AAST")),
                   ~ sum(. * serieswgt),
                   .names = "{.col}"),
            .groups = "drop")

Now, we merge the victimization summary files onto the appropriate analysis files. For any record on the household or person file that is not on the victimization file, the victimization counts are set to 0 after merging. In this step, we also create the victimization adjustment factor; see Section 2.2.4 of the User's Guide for details on why this adjustment is needed (Shook-Sa, Couzens, and Berzofsky 2015). It is calculated as

\[ A_{ijk} = \frac{v_{ijk}}{w_{ijk}} \]

where \(w_{ijk}\) is the person weight (WGTPERCY) for personal crimes or the household weight (WGTHHCY) for household crimes, and \(v_{ijk}\) is the victimization weight (WGTVICCY) for household \(i\), respondent \(j\), in reporting period \(k\). The adjustment factor is set to 0 if no incidents are reported.

# Set up a list of 0s for each crime type/characteristic to replace NAs
hh_z_list <- rep(0, ncol(inc_hh_sums) - 3) %>%
  as.list() %>%
  setNames(names(inc_hh_sums)[-(1:3)])

pers_z_list <- rep(0, ncol(inc_pers_sums) - 4) %>%
  as.list() %>%
  setNames(names(inc_pers_sums)[-(1:4)])

hh_vsum <- ncvs_2021_household %>%
  full_join(inc_hh_sums, by = c("YEARQ", "IDHH")) %>%
  replace_na(hh_z_list) %>%
  mutate(ADJINC_WT = if_else(is.na(WGTVICCY), 0, WGTVICCY / WGTHHCY))

pers_vsum <- ncvs_2021_person %>%
  full_join(inc_pers_sums, by = c("YEARQ", "IDHH", "IDPER")) %>%
  replace_na(pers_z_list) %>%
  mutate(ADJINC_WT = if_else(is.na(WGTVICCY), 0, WGTVICCY / WGTPERCY))

13.4.2 Derived Demographic Variables

A final step in file preparation is creating any derived variables on the household and person files, such as income categories or age categories, for subgroup analysis. We can do this step before or after merging the victimization counts.

13.4.2.1 Household Variables

For the household file, we create categories for tenure (rental status), urbanicity, income, place size, and region. A codebook of the household variables is located in Table 13.3.
TABLE 13.3: Codebook for household variables

V2015   Tenure
        1 = Owned or being bought
        2 = Rented for cash
        3 = No cash rent
SC214A  Household Income
        01 = Less than $5,000
        02 = $5,000 to $7,499
        03 = $7,500 to $9,999
        04 = $10,000 to $12,499
        05 = $12,500 to $14,999
        06 = $15,000 to $17,499
        07 = $17,500 to $19,999
        08 = $20,000 to $24,999
        09 = $25,000 to $29,999
        10 = $30,000 to $34,999
        11 = $35,000 to $39,999
        12 = $40,000 to $49,999
        13 = $50,000 to $74,999
        15 = $75,000 to $99,999
        16 = $100,000-$149,999
        17 = $150,000-$199,999
        18 = $200,000 or more
V2126B  Place Size Code
        00 = Not in a place
        13 = Under 10,000
        16 = 10,000-49,999
        17 = 50,000-99,999
        18 = 100,000-249,999
        19 = 250,000-499,999
        20 = 500,000-999,999
        21 = 1,000,000-2,499,999
        22 = 2,500,000-4,999,999
        23 = 5,000,000 or more
V2127B  Region
        1 = Northeast
        2 = Midwest
        3 = South
        4 = West
V2143   Urbanicity
        1 = Urban
        2 = Suburban
        3 = Rural

hh_vsum_der <- hh_vsum %>%
  mutate(
    Tenure = factor(case_when(V2015 == 1 ~ "Owned",
                              !is.na(V2015) ~ "Rented"),
                    levels = c("Owned", "Rented")),
    Urbanicity = factor(case_when(V2143 == 1 ~ "Urban",
                                  V2143 == 2 ~ "Suburban",
                                  V2143 == 3 ~ "Rural"),
                        levels = c("Urban", "Suburban", "Rural")),
    SC214A_num = as.numeric(as.character(SC214A)),
    Income = case_when(SC214A_num <= 8 ~ "Less than $25,000",
                       SC214A_num <= 12 ~ "$25,000-49,999",
                       SC214A_num <= 15 ~ "$50,000-99,999",
                       SC214A_num <= 17 ~ "$100,000-199,999",
                       SC214A_num <= 18 ~ "$200,000 or more"),
    Income = fct_reorder(Income, SC214A_num, .na_rm = FALSE),
    PlaceSize = case_match(as.numeric(as.character(V2126B)),
                           0 ~ "Not in a place",
                           13 ~ "Under 10,000",
                           16 ~ "10,000-49,999",
                           17 ~ "50,000-99,999",
                           18 ~ "100,000-249,999",
                           19 ~ "250,000-499,999",
                           20 ~ "500,000-999,999",
                           c(21, 22, 23) ~ "1,000,000 or more"),
    PlaceSize = fct_reorder(PlaceSize, as.numeric(V2126B)),
    Region = case_match(as.numeric(V2127B),
                        1 ~ "Northeast",
                        2 ~ "Midwest",
                        3 ~ "South",
                        4 ~ "West"),
    Region = fct_reorder(Region, as.numeric(V2127B))
  )

As before, we want to check that the recoded variables we create match the existing data as expected.
hh_vsum_der %>%
  count(Tenure, V2015)
## # A tibble: 4 × 3
##   Tenure V2015      n
##   <fct>  <fct>  <int>
## 1 Owned  1     101944
## 2 Rented 2      46269
## 3 Rented 3       1925
## 4 <NA>   <NA>  106322

hh_vsum_der %>%
  count(Urbanicity, V2143)
## # A tibble: 3 × 3
##   Urbanicity V2143      n
##   <fct>      <fct>  <int>
## 1 Urban      1      26878
## 2 Suburban   2     173491
## 3 Rural      3      56091

hh_vsum_der %>%
  count(Income, SC214A)
## # A tibble: 18 × 3
##    Income            SC214A     n
##    <fct>             <fct>  <int>
##  1 Less than $25,000 1       7841
##  2 Less than $25,000 2       2626
##  3 Less than $25,000 3       3949
##  4 Less than $25,000 4       5546
##  5 Less than $25,000 5       5445
##  6 Less than $25,000 6       4821
##  7 Less than $25,000 7       5038
##  8 Less than $25,000 8      11887
##  9 $25,000-49,999    9      11550
## 10 $25,000-49,999    10     13689
## 11 $25,000-49,999    11     13655
## 12 $25,000-49,999    12     23282
## 13 $50,000-99,999    13     44601
## 14 $50,000-99,999    15     33353
## 15 $100,000-199,999  16     34287
## 16 $100,000-199,999  17     15317
## 17 $200,000 or more  18     16892
## 18 <NA>              <NA>    2681

hh_vsum_der %>%
  count(PlaceSize, V2126B)
## # A tibble: 10 × 3
##    PlaceSize         V2126B     n
##    <fct>             <fct>  <int>
##  1 Not in a place    0      69484
##  2 Under 10,000      13     39873
##  3 10,000-49,999     16     53002
##  4 50,000-99,999     17     27205
##  5 100,000-249,999   18     24461
##  6 250,000-499,999   19     13111
##  7 500,000-999,999   20     15194
##  8 1,000,000 or more 21      6167
##  9 1,000,000 or more 22      3857
## 10 1,000,000 or more 23      4106

hh_vsum_der %>%
  count(Region, V2127B)
## # A tibble: 4 × 3
##   Region    V2127B     n
##   <fct>     <fct>  <int>
## 1 Northeast 1      41585
## 2 Midwest   2      74666
## 3 South     3      87783
## 4 West      4      52426

13.4.2.2 Person Variables

For the person file, we create categories for sex, race/Hispanic origin, age, and marital status. A codebook of the person variables is located in Table 13.4. We also merge the household demographics and the design variables (V2117 and V2118) onto the person file.
TABLE 13.4: Codebook for person variables

V3014   Age
        12 through 90
V3015   Current Marital Status
        1 = Married
        2 = Widowed
        3 = Divorced
        4 = Separated
        5 = Never married
V3018   Sex
        1 = Male
        2 = Female
V3023A  Race
        01 = White only
        02 = Black only
        03 = American Indian, Alaska native only
        04 = Asian only
        05 = Hawaiian/Pacific Islander only
        06 = White-Black
        07 = White-American Indian
        08 = White-Asian
        09 = White-Hawaiian
        10 = Black-American Indian
        11 = Black-Asian
        12 = Black-Hawaiian/Pacific Islander
        13 = American Indian-Asian
        14 = Asian-Hawaiian/Pacific Islander
        15 = White-Black-American Indian
        16 = White-Black-Asian
        17 = White-American Indian-Asian
        18 = White-Asian-Hawaiian
        19 = 2 or 3 races
        20 = 4 or 5 races
V3024   Hispanic Origin
        1 = Yes
        2 = No

# Set label for usage later
NHOPI <- "Native Hawaiian or Other Pacific Islander"

pers_vsum_der <- pers_vsum %>%
  mutate(
    Sex = factor(case_when(V3018 == 1 ~ "Male",
                           V3018 == 2 ~ "Female")),
    RaceHispOrigin = factor(case_when(V3024 == 1 ~ "Hispanic",
                                      V3023A == 1 ~ "White",
                                      V3023A == 2 ~ "Black",
                                      V3023A == 4 ~ "Asian",
                                      V3023A == 5 ~ NHOPI,
                                      TRUE ~ "Other"),
                            levels = c("White", "Black", "Hispanic",
                                       "Asian", NHOPI, "Other")),
    V3014_num = as.numeric(as.character(V3014)),
    AgeGroup = case_when(V3014_num <= 17 ~ "12-17",
                         V3014_num <= 24 ~ "18-24",
                         V3014_num <= 34 ~ "25-34",
                         V3014_num <= 49 ~ "35-49",
                         V3014_num <= 64 ~ "50-64",
                         V3014_num <= 90 ~ "65 or older"),
    AgeGroup = fct_reorder(AgeGroup, V3014_num),
    MaritalStatus = factor(case_when(V3015 == 1 ~ "Married",
                                     V3015 == 2 ~ "Widowed",
                                     V3015 == 3 ~ "Divorced",
                                     V3015 == 4 ~ "Separated",
                                     V3015 == 5 ~ "Never married"),
                           levels = c("Never married", "Married",
                                      "Widowed", "Divorced", "Separated"))
  ) %>%
  left_join(hh_vsum_der %>%
              select(YEARQ, IDHH, V2117, V2118, Tenure:Region),
            by = c("YEARQ", "IDHH"))

As before, we want to check that the recoded variables we create match the existing data as expected.
pers_vsum_der %>%
  count(Sex, V3018)
## # A tibble: 2 × 3
##   Sex    V3018      n
##   <fct>  <fct>  <int>
## 1 Female 2     150956
## 2 Male   1     140922

pers_vsum_der %>%
  count(RaceHispOrigin, V3024)
## # A tibble: 11 × 3
##    RaceHispOrigin                            V3024      n
##    <fct>                                     <fct>  <int>
##  1 White                                     2     197292
##  2 White                                     8        883
##  3 Black                                     2      29947
##  4 Black                                     8        120
##  5 Hispanic                                  1      41450
##  6 Asian                                     2      16015
##  7 Asian                                     8         61
##  8 Native Hawaiian or Other Pacific Islander 2        891
##  9 Native Hawaiian or Other Pacific Islander 8          9
## 10 Other                                     2       5161
## 11 Other                                     8         49

pers_vsum_der %>%
  filter(RaceHispOrigin != "Hispanic" |
           is.na(RaceHispOrigin)) %>%
  count(RaceHispOrigin, V3023A)
## # A tibble: 20 × 3
##    RaceHispOrigin                            V3023A      n
##    <fct>                                     <fct>   <int>
##  1 White                                     1      198175
##  2 Black                                     2       30067
##  3 Asian                                     4       16076
##  4 Native Hawaiian or Other Pacific Islander 5         900
##  5 Other                                     3        1319
##  6 Other                                     6        1217
##  7 Other                                     7        1025
##  8 Other                                     8         837
##  9 Other                                     9         184
## 10 Other                                     10        178
## 11 Other                                     11         87
## 12 Other                                     12         27
## 13 Other                                     13         13
## 14 Other                                     14         53
## 15 Other                                     15        136
## 16 Other                                     16         45
## 17 Other                                     17         11
## 18 Other                                     18         33
## 19 Other                                     19         22
## 20 Other                                     20         23

pers_vsum_der %>%
  group_by(AgeGroup) %>%
  summarize(minAge = min(V3014),
            maxAge = max(V3014),
            .groups = "drop")
## # A tibble: 6 × 3
##   AgeGroup    minAge maxAge
##   <fct>        <dbl>  <dbl>
## 1 12-17           12     17
## 2 18-24           18     24
## 3 25-34           25     34
## 4 35-49           35     49
## 5 50-64           50     64
## 6 65 or older     65     90

pers_vsum_der %>%
  count(MaritalStatus, V3015)
## # A tibble: 6 × 3
##   MaritalStatus V3015      n
##   <fct>         <fct>  <int>
## 1 Never married 5      90425
## 2 Married       1     148131
## 3 Widowed       2      17668
## 4 Divorced      3      28596
## 5 Separated     4       4524
## 6 <NA>          8       2534

We then create tibbles that contain only the variables we need, which makes the later analyses easier to work with.

hh_vsum_slim <- hh_vsum_der %>%
  select(YEARQ:V2118,
         WGTVICCY:ADJINC_WT,
         Tenure,
         Urbanicity,
         Income,
         PlaceSize,
         Region)

pers_vsum_slim <- pers_vsum_der %>%
  select(YEARQ:WGTPERCY, WGTVICCY:ADJINC_WT, Sex:Region)

To calculate estimates about types of crime, such as what percentage of violent crimes are reported to the police, we must use the incident file. The incident file is not guaranteed to have every pseudostratum and half-sample code, so dummy records are created and appended before estimation. Finally, we merge demographic variables onto the incident tibble.

dummy_records <- hh_vsum_slim %>%
  distinct(V2117, V2118) %>%
  mutate(Dummy = 1,
         WGTVICCY = 1,
         NEWWGT = 1)

inc_analysis <- inc_ind %>%
  mutate(Dummy = 0) %>%
  left_join(select(pers_vsum_slim, YEARQ, IDHH, IDPER, Sex:Region),
            by = c("YEARQ", "IDHH", "IDPER")) %>%
  bind_rows(dummy_records) %>%
  select(YEARQ:IDPER, WGTVICCY, NEWWGT, V4529, WeapCat,
         ReportPolice, Property:Region)
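A quick count of the Dummy flag confirms that the dummy records were appended (a check we add here, not part of the original workflow; records with Dummy equal to 1 are the appended pseudostratum/half-sample placeholders):

# Sanity check (ours): Dummy == 0 are real incidents,
# Dummy == 1 are the appended placeholder records
inc_analysis %>%
  count(Dummy)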
The tibbles hh_vsum_slim, pers_vsum_slim, and inc_analysis can now be used to create design objects and calculate crime rate estimates.

13.5 Survey Design Objects

All the data preparation above is necessary before survey analysis can begin. At this point, we can create the design objects and finally begin analysis. We create three design objects, one for each type of analysis, since the appropriate object depends on which estimate we are creating. For the incident data, the weight for analysis is NEWWGT, which we constructed previously. The household- and person-level data use WGTHHCY and WGTPERCY, respectively. For all analyses, V2117 is the strata variable, and V2118 is the cluster/PSU variable.

inc_des <- inc_analysis %>%
  as_survey(
    weight = NEWWGT,
    strata = V2117,
    ids = V2118,
    nest = TRUE
  )

hh_des <- hh_vsum_slim %>%
  as_survey(
    weight = WGTHHCY,
    strata = V2117,
    ids = V2118,
    nest = TRUE
  )

pers_des <- pers_vsum_slim %>%
  as_survey(
    weight = WGTPERCY,
    strata = V2117,
    ids = V2118,
    nest = TRUE
  )

13.6 Calculating Estimates

Now that we have prepared our data and created the design objects, we can calculate our estimates. As a reminder, those are:

- Victimization totals estimate the number of criminal victimizations with a given characteristic.
- Victimization proportions estimate characteristics among victimizations or victims.
- Victimization rates are estimates of the number of victimizations per 1,000 persons or households in the population.
- Prevalence rates are estimates of the percentage of the population (persons or households) who are victims of a crime.

13.6.1 Estimation 1: Victimization Totals

There are two ways to calculate victimization totals. Using the incident design object (inc_des) is the most straightforward method, but the person (pers_des) and household (hh_des) design objects can be used as well if the adjustment factor (ADJINC_WT) is incorporated. In the example below, the total number of property and violent victimizations is first calculated using the incident file and then using the household and person design objects. The incident file is smaller, so estimation is faster with it, but the estimates are the same, as illustrated below:

vt1 <- inc_des %>%
  summarize(Property_Vzn = survey_total(Property, na.rm = TRUE),
            Violent_Vzn = survey_total(Violent, na.rm = TRUE))

vt2a <- hh_des %>%
  summarize(Property_Vzn = survey_total(Property * ADJINC_WT,
                                        na.rm = TRUE))

vt2b <- pers_des %>%
  summarize(Violent_Vzn = survey_total(Violent * ADJINC_WT,
                                       na.rm = TRUE))

vt1
## # A tibble: 1 × 4
##   Property_Vzn Property_Vzn_se Violent_Vzn Violent_Vzn_se
##          <dbl>           <dbl>       <dbl>          <dbl>
## 1    11682056.         263844.    4598306.        198115.

vt2a
## # A tibble: 1 × 2
##   Property_Vzn Property_Vzn_se
##          <dbl>           <dbl>
## 1    11682056.         263844.

vt2b
## # A tibble: 1 × 2
##   Violent_Vzn Violent_Vzn_se
##         <dbl>          <dbl>
## 1    4598306.        198115.

The estimates from the incident file are equivalent to those from the person and household files: an estimated 11,682,056 property incidents and 4,598,306 violent incidents in a six-month period.

13.6.2 Estimation 2: Victimization Proportions

Victimization proportions describe features of the victimizations themselves. The key here is that these are questions among victimizations, not among the population. These types of estimates can only be calculated using the incident design object (inc_des). For example, we may be interested in the percentage of property victimizations reported to the police, as shown in the following code with an estimate, the standard error, and a 95% confidence interval:

prop1 <- inc_des %>%
  filter(Property) %>%
  summarize(Pct = survey_mean(ReportPolice,
                              na.rm = TRUE,
                              proportion = TRUE,
                              vartype = c("se", "ci")) * 100)

prop1
## # A tibble: 1 × 4
##     Pct Pct_se Pct_low Pct_upp
##   <dbl>  <dbl>   <dbl>   <dbl>
## 1  30.8  0.798    29.2    32.4

Or, the percentage of violent victimizations that occurred in urban areas:

prop2 <- inc_des %>%
  filter(Violent) %>%
  summarize(Pct = survey_mean(Urbanicity == "Urban", na.rm = TRUE) * 100)

prop2
## # A tibble: 1 × 2
##     Pct Pct_se
##   <dbl>  <dbl>
## 1  18.1   1.49

In 2021, we estimate that 30.8% of property crimes were reported to the police and 18.1% of violent crimes occurred in urban areas.
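Any categorical characteristic on the incident file can be summarized the same way by adding a group_by() statement. As a further sketch (ours, not one of the original examples), the distribution of violent victimizations across the weapon categories derived earlier; an NA group may appear for incidents where the weapon items were not asked:

# A sketch (ours): proportions of violent victimizations by weapon category
inc_des %>%
  filter(Violent) %>%
  group_by(WeapCat) %>%
  summarize(Prop = survey_mean())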
13.6.3 Estimation 3: Victimization Rates

Victimization rates measure the number of victimizations relative to the population size. They are not an estimate of the proportion of households or persons who are victimized; that is a prevalence rate, described in Section 13.6.4. Victimization rates are estimated using the household (hh_des) or person (pers_des) design objects, depending on the type of crime, and the adjustment factor (ADJINC_WT) must be incorporated. We return to the property and violent victimization examples from Section 13.6.1. In the following example, the property victimization total is calculated as above, along with the property victimization rate (using survey_mean()) and the population size (using survey_total()).

As mentioned in the introduction, victimization rates use the incident weight in the numerator and the person or household weight in the denominator. This is accomplished by multiplying the weight adjustment (ADJINC_WT) by the variable of interest when calculating the rates. Let's look at an example for property victimization.

vr_prop <- hh_des %>%
  summarize(
    Property_Vzn = survey_total(Property * ADJINC_WT, na.rm = TRUE),
    Property_Rate = survey_mean(Property * ADJINC_WT * 1000,
                                na.rm = TRUE),
    PopSize = survey_total(1, vartype = NULL)
  )

vr_prop
## # A tibble: 1 × 5
##   Property_Vzn Property_Vzn_se Property_Rate Property_Rate_se    PopSize
##          <dbl>           <dbl>         <dbl>            <dbl>      <dbl>
## 1    11682056.         263844.          90.3             1.95 129319232.

In the output above, we see that the estimated property victimization rate in 2021 was 90.3 per 1,000 households, which is consistent with calculating it manually as the number of victimizations per 1,000 population, as demonstrated in the next chunk:

vr_prop %>%
  select(-ends_with("se")) %>%
  mutate(Property_Rate_manual = Property_Vzn / PopSize * 1000)
## # A tibble: 1 × 4
##   Property_Vzn Property_Rate    PopSize Property_Rate_manual
##          <dbl>         <dbl>      <dbl>                <dbl>
## 1    11682056.          90.3 129319232.                 90.3

Victimization rates can also be calculated for particular characteristics of the victimization. In the following example, we calculate the rates of aggravated assault with no weapon, with a firearm, with a knife or sharp object, and with another type of weapon.

pers_des %>%
  summarize(across(
    starts_with("AAST_"),
    ~ survey_mean(. * ADJINC_WT * 1000, na.rm = TRUE)
  ))
## # A tibble: 1 × 8
##   AAST_NoWeap AAST_NoWeap_se AAST_Firearm AAST_Firearm_se AAST_Knife
##         <dbl>          <dbl>        <dbl>           <dbl>      <dbl>
## 1       0.249         0.0595        0.860           0.101      0.455
## # ℹ 3 more variables: AAST_Knife_se <dbl>, AAST_Other <dbl>,
## #   AAST_Other_se <dbl>
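Rates for a single subgroup can be obtained by adding a group_by() statement before summarize(). For instance, a quick sketch of the violent victimization rate by sex, the same pattern the function below generalizes:

# A sketch (ours): violent victimization rate per 1,000 persons, by sex
pers_des %>%
  group_by(Sex) %>%
  summarize(
    Violent_Rate = survey_mean(Violent * ADJINC_WT * 1000, na.rm = TRUE)
  )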
A common desire is to calculate victimization rates by several characteristics at once. For example, we may want the violent victimization rate and aggravated assault rate by sex, race/Hispanic origin, age group, marital status, and household income. This requires a separate group_by() statement for each categorization, so we write a function to do the calculation and then use map() from the {purrr} package (part of the tidyverse) to loop through the variables. The function takes a demographic variable as its input (byvar) and calculates the violent crime and aggravated assault victimization rates for each of its levels. It also creates columns with the variable name, the level of the variable, and a numeric version of the variable (LevelNum) for sorting later. The function is run across the variables using map(), and the results are stacked into a single output using bind_rows().

pers_est_by <- function(byvar) {
  pers_des %>%
    rename(Level := {{byvar}}) %>%
    filter(!is.na(Level)) %>%
    group_by(Level) %>%
    summarize(
      Violent = survey_mean(Violent * ADJINC_WT * 1000, na.rm = TRUE),
      AAST = survey_mean(AAST * ADJINC_WT * 1000, na.rm = TRUE)
    ) %>%
    mutate(
      Variable = byvar,
      LevelNum = as.numeric(Level),
      Level = as.character(Level)
    ) %>%
    select(Variable, Level, LevelNum, everything())
}

pers_est_df <-
  c("Sex", "RaceHispOrigin", "AgeGroup", "MaritalStatus", "Income") %>%
  map(pers_est_by) %>%
  bind_rows()

The output from all the estimates is then cleaned to create better labels, such as going from "RaceHispOrigin" to "Race/Hispanic origin", and the {gt} package is used to make a publishable table (Table 13.5). Using functions from the {gt} package, we add column labels and footnotes and present estimates rounded to the first decimal place.

vr_gt <- pers_est_df %>%
  mutate(
    Variable = case_when(
      Variable == "RaceHispOrigin" ~ "Race/Hispanic origin",
      Variable == "MaritalStatus" ~ "Marital status",
      Variable == "AgeGroup" ~ "Age",
      TRUE ~ Variable
    )
  ) %>%
  select(-LevelNum) %>%
  group_by(Variable) %>%
  gt(rowname_col = "Level") %>%
  tab_spanner(
    label = "Violent crime",
    id = "viol_span",
    columns = c("Violent", "Violent_se")
  ) %>%
  tab_spanner(label = "Aggravated assault",
              columns = c("AAST", "AAST_se")) %>%
  cols_label(
    Violent = "Rate",
    Violent_se = "SE",
    AAST = "Rate",
    AAST_se = "SE",
  ) %>%
  fmt_number(
    columns = c("Violent", "Violent_se", "AAST", "AAST_se"),
    decimals = 1
  ) %>%
  tab_footnote(
    footnote = "Includes rape or sexual assault, robbery, aggravated assault, and simple assault.",
    locations = cells_column_spanners(spanners = "viol_span")
  ) %>%
  tab_footnote(
    footnote = "Excludes persons of Hispanic origin.",
    locations = cells_stub(rows = Level %in%
                             c("White", "Black", "Asian", NHOPI, "Other"))
  ) %>%
  tab_footnote(
    footnote = "Includes persons who identified as Native Hawaiian or Other Pacific Islander only.",
    locations = cells_stub(rows = Level == NHOPI)
  ) %>%
  tab_footnote(
    footnote = "Includes persons who identified as American Indian or Alaska Native only or as two or more races.",
    locations = cells_stub(rows = Level == "Other")
  ) %>%
  tab_source_note(
    source_note = "Note: Rates per 1,000 persons age 12 or older.") %>%
  tab_source_note(
    source_note = "Source: Bureau of Justice Statistics, National Crime Victimization Survey, 2021.") %>%
  tab_stubhead(label = "Victim demographic") %>%
  tab_caption("Rate and standard error of violent victimization, by type of crime and demographic characteristics, 2021")

vr_gt
TABLE 13.5: Rate and standard error of violent victimization, by type of crime and demographic characteristics, 2021

Victim demographic                                  Violent crime [1]    Aggravated assault
                                                    Rate      SE         Rate      SE
Sex
  Female                                            15.5      0.9        2.3       0.2
  Male                                              17.5      1.1        3.2       0.3
Race/Hispanic origin
  White [2]                                         16.1      0.9        2.7       0.3
  Black [2]                                         18.5      2.2        3.7       0.7
  Hispanic                                          15.9      1.7        2.3       0.4
  Asian [2]                                         8.6       1.3        1.9       0.6
  Native Hawaiian or Other Pacific Islander [2,3]   36.1      34.4       0.0       0.0
  Other [2,4]                                       45.4      13.0       6.2       2.0
Age
  12-17                                             13.2      2.2        2.5       0.8
  18-24                                             23.1      2.1        3.9       0.9
  25-34                                             22.0      2.1        4.0       0.6
  35-49                                             19.4      1.6        3.6       0.5
  50-64                                             16.9      1.9        2.0       0.3
  65 or older                                       6.4       1.1        1.1       0.3
Marital status
  Never married                                     22.2      1.4        4.0       0.4
  Married                                           9.5       0.9        1.5       0.2
  Widowed                                           10.7      3.5        0.9       0.2
  Divorced                                          27.4      2.9        4.0       0.7
  Separated                                         36.8      6.7        8.8       3.1
Income
  Less than $25,000                                 29.6      2.5        5.1       0.7
  $25,000-49,999                                    16.9      1.5        3.0       0.4
  $50,000-99,999                                    14.6      1.1        1.9       0.3
  $100,000-199,999                                  12.2      1.3        2.5       0.4
  $200,000 or more                                  9.7       1.4        1.7       0.6

Note: Rates per 1,000 persons age 12 or older.
Source: Bureau of Justice Statistics, National Crime Victimization Survey, 2021.
[1] Includes rape or sexual assault, robbery, aggravated assault, and simple assault.
[2] Excludes persons of Hispanic origin.
[3] Includes persons who identified as Native Hawaiian or Other Pacific Islander only.
[4] Includes persons who identified as American Indian or Alaska Native only or as two or more races.

13.6.4 Estimation 4: Prevalence Rates

Prevalence rates differ from victimization rates in that the numerator is the number of people or households victimized rather than the number of victimizations. To calculate prevalence rates, we run another summary of the data, calculating an indicator for whether a person or household is a victim of a particular crime at any point in the year. Below is an example of calculating first the indicator and then the prevalence rate of violent crime and aggravated assault.

pers_prev_des <- pers_vsum_slim %>%
  mutate(Year = floor(YEARQ)) %>%
  mutate(Violent_Ind = sum(Violent) > 0,
         AAST_Ind = sum(AAST) > 0,
         .by = c("Year", "IDHH", "IDPER")) %>%
  as_survey(
    weight = WGTPERCY,
    strata = V2117,
    ids = V2118,
    nest = TRUE
  )

pers_prev_ests <- pers_prev_des %>%
  summarize(Violent_Prev = survey_mean(Violent_Ind * 100),
            AAST_Prev = survey_mean(AAST_Ind * 100))

pers_prev_ests
## # A tibble: 1 × 4
##   Violent_Prev Violent_Prev_se AAST_Prev AAST_Prev_se
##          <dbl>           <dbl>     <dbl>        <dbl>
## 1        0.980          0.0349     0.215       0.0143

In the example above, the indicator is multiplied by 100 to return a percentage rather than a proportion. In 2021, we estimate that 0.98% of people age 12 and older were victims of violent crime in the United States, and 0.22% were victims of aggravated assault.

13.7 Statistical testing

For any of the types of estimates discussed, we can also perform statistical testing. For example, we could test whether property victimization rates differ between owned and rented households. First, we calculate the point estimates.

prop_tenure <- hh_des %>%
  group_by(Tenure) %>%
  summarize(
    Property_Rate = survey_mean(Property * ADJINC_WT * 1000,
                                na.rm = TRUE, vartype = "ci"),
  )

prop_tenure
## # A tibble: 3 × 4
##   Tenure Property_Rate Property_Rate_low Property_Rate_upp
##   <fct>          <dbl>             <dbl>             <dbl>
## 1 Owned           68.2              64.3              72.1
## 2 Rented         130.              123.              137.
## 3 <NA>           NaN               NaN               NaN
The property victimization rate for rented households is 129.8 per 1,000 households, while the rate for owned households is 68.2. These look very different, especially given the non-overlapping confidence intervals. However, with complex survey data, the observations are not independent, so statistical testing cannot be done by simply comparing confidence intervals. To conduct the test, we first create the variable to compare, incorporating the adjusted incident weight (ADJINC_WT), and then run the test as discussed in Chapter 6.

prop_tenure_test <- hh_des %>%
  mutate(
    Prop_Adj = Property * ADJINC_WT * 1000
  ) %>%
  svyttest(
    formula = Prop_Adj ~ Tenure,
    design = .,
    na.rm = TRUE
  ) %>%
  broom::tidy()

prop_tenure_test
## # A tibble: 1 × 8
##   estimate statistic  p.value parameter conf.low conf.high method
##      <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl> <chr>
## 1     61.6      16.0 8.91e-36       169     54.0      69.2 Design-based…
## # ℹ 1 more variable: alternative <chr>

The output shows the same difference of 61.6 between the property victimization rates of renters and owners, and the test is highly significant with a p-value below 0.0001.
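The same comparison can also be framed as a design-based regression, which generalizes to predictors with more than two levels. Below is a sketch of this alternative (ours, not from the User's Guide) using svyglm() from the {survey} package; the Tenure coefficient should reproduce the difference estimated above:

# A sketch (ours): the two-sample comparison as a design-based regression
hh_des %>%
  mutate(Prop_Adj = Property * ADJINC_WT * 1000) %>%
  svyglm(formula = Prop_Adj ~ Tenure, design = .) %>%
  broom::tidy()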
References

Bureau of Justice Statistics. 2017. “National Crime Victimization Survey, 2016: Technical Documentation.” https://bjs.ojp.gov/sites/g/files/xyckuh236/files/media/document/ncvstd16.pdf.

Shook-Sa, Bonnie, G. Lance Couzens, and Marcus Berzofsky. 2015. “Users’ Guide to the National Crime Victimization Survey (NCVS) Direct Variance Estimation.” Bureau of Justice Statistics. https://bjs.ojp.gov/sites/g/files/xyckuh236/files/media/document/ncvs_variance_user_guide_11.06.14.pdf.

United States. Bureau of Justice Statistics. 2022. “National Crime Victimization Survey, [United States], 2021.” Inter-university Consortium for Political and Social Research [distributor]. https://doi.org/10.3886/ICPSR38429.v1.

Footnotes:
- https://www.icpsr.umich.edu/web/ICPSR/series/95
- https://www.icpsr.umich.edu/web/NACJD/studies/38429
- BJS publishes victimization rates per 1,000, which are also presented in these examples.

Chapter 14 AmericasBarometer Vignette

Prerequisites

For this chapter, load the following packages:

library(tidyverse)
library(survey)
library(srvyr)
library(sf)
library(rnaturalearth)
library(rnaturalearthdata)
library(gt)
library(ggpattern)

In this vignette, we use a subset of data from the 2021 AmericasBarometer survey. Download the raw files, available on the LAPOP website. We work with version 1.2 of the data, and there are separate files for each of the 22 countries. To read all files into R while ignoring the Stata labels, we recommend running code like this:

library(haven) # provides read_stata(), zap_labels(), and zap_label()
library(here)  # provides here() for project-relative file paths

stata_files <- list.files(here("RawData", "LAPOP_2021"), "*.dta")

read_stata_unlabeled <- function(file) {
  read_stata(file) %>%
    zap_labels() %>%
    zap_label()
}

ambarom_in <- here("RawData", "LAPOP_2021", stata_files) %>%
  map_df(read_stata_unlabeled) %>%
  select(pais, strata, upm, weight1500, core_a_core_b, q2, q1tb,
         covid2at, a4, idio2, idio2cov, it1, jc13, m1, mil10a, mil10e,
         ccch1, ccch3, ccus1, ccus3, edr, ocup4a, q14, q11n, q12c,
         q12bn, starts_with("covidedu1"), gi0n, r15, r18n, r18)

The code above reads all .dta files and combines them into one tibble.

14.1 Introduction

The AmericasBarometer surveys, conducted by the LAPOP Lab (LAPOP 2023b), are public opinion surveys of the Americas focused on democracy. The study was launched in 2004/2005 with 11 countries. Although the set of participating countries has grown and fluctuated over time, the AmericasBarometer maintains a consistent methodology across many countries. In 2021, the study included 22 countries ranging from Canada in the north to Chile and Argentina in the south (LAPOP 2023a). Historically, surveys were administered through in-person household interviews, but the COVID-19 pandemic changed the study significantly. Now, random-digit dialing (RDD) of mobile phones is used in all countries except the United States and Canada (LAPOP 2021c). In Canada, LAPOP collaborated with the Environics Institute to collect data from a panel of Canadians using a web survey (LAPOP 2021a). In the United States, YouGov conducted a web survey among its panelists on behalf of LAPOP (LAPOP 2021b). The survey includes a core set of questions for all countries, but not every question is asked in each country. Additionally, some questions are posed to only half of the respondents in a country, with different sections randomized across respondents (LAPOP 2021d).

14.2 Data structure

Each country and year has its own file available in Stata format (.dta). In this vignette, we download and combine all the data from the 22 participating countries in 2021. We subset the data to a smaller set of columns, as noted in the prerequisites box.
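Before combining files like this, it can be worth confirming that every country file contains the variables we plan to keep. Below is a minimal sketch, reusing stata_files and read_stata_unlabeled from the prerequisites box; vars_needed is an illustrative, abbreviated list of the columns selected above.

# Check that key variables exist in every country file (illustrative list)
vars_needed <- c("pais", "strata", "upm", "weight1500", "covid2at")

here("RawData", "LAPOP_2021", stata_files) %>%
  map(read_stata_unlabeled) %>%
  map_lgl(\(dat) all(vars_needed %in% names(dat))) %>%
  all()

If this returns TRUE, at least these key variables are present in every file.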
Review the core questionnaire to understand the common variables across the countries (LAPOP 2021d).

14.3 Preparing files

Many of the variables are coded as numeric and do not have intuitive variable names, so the next step is to create derived variables and wrangle the data for analysis. Using the core questionnaire as a codebook, we reference the factor descriptions to create derived variables with informative names:

ambarom <- ambarom_in %>%
  mutate(
    Country = factor(
      case_match(pais,
                 1 ~ "Mexico",
                 2 ~ "Guatemala",
                 3 ~ "El Salvador",
                 4 ~ "Honduras",
                 5 ~ "Nicaragua",
                 6 ~ "Costa Rica",
                 7 ~ "Panama",
                 8 ~ "Colombia",
                 9 ~ "Ecuador",
                 10 ~ "Bolivia",
                 11 ~ "Peru",
                 12 ~ "Paraguay",
                 13 ~ "Chile",
                 14 ~ "Uruguay",
                 15 ~ "Brazil",
                 17 ~ "Argentina",
                 21 ~ "Dominican Republic",
                 22 ~ "Haiti",
                 23 ~ "Jamaica",
                 24 ~ "Guyana",
                 40 ~ "United States",
                 41 ~ "Canada")),
    CovidWorry = fct_reorder(
      case_match(covid2at,
                 1 ~ "Very worried",
                 2 ~ "Somewhat worried",
                 3 ~ "A little worried",
                 4 ~ "Not worried at all"),
      covid2at,
      .na_rm = FALSE)
  ) %>%
  rename(Educ_NotInSchool = covidedu1_1,
         Educ_NormalSchool = covidedu1_2,
         Educ_VirtualSchool = covidedu1_3,
         Educ_Hybrid = covidedu1_4,
         Educ_NoSchool = covidedu1_5,
         BroadbandInternet = r18n,
         Internet = r18)

At this point, it is a good time to check the cross-tabs between the original and newly derived variables. These tables help us confirm that we have correctly matched the numeric data from the original dataset to the renamed factor data in the new dataset. For instance, let's check the original variable pais and the derived variable Country. We can consult the questionnaire or codebook to confirm that Argentina is coded as 17, Bolivia as 10, and so on. Similarly, for CovidWorry and covid2at, we can verify that Very worried is coded as 1, and so on for the other levels.

ambarom %>%
  count(Country, pais) %>%
  print(n = 22)
## # A tibble: 22 × 3
##    Country             pais     n
##    <fct>              <dbl> <int>
##  1 Argentina             17  3011
##  2 Bolivia               10  3002
##  3 Brazil                15  3016
##  4 Canada                41  2201
##  5 Chile                 13  2954
##  6 Colombia               8  2993
##  7 Costa Rica             6  2977
##  8 Dominican Republic    21  3000
##  9 Ecuador                9  3005
## 10 El Salvador            3  3245
## 11 Guatemala              2  3000
## 12 Guyana                24  3011
## 13 Haiti                 22  3088
## 14 Honduras               4  2999
## 15 Jamaica               23  3121
## 16 Mexico                 1  2998
## 17 Nicaragua              5  2997
## 18 Panama                 7  3183
## 19 Paraguay              12  3004
## 20 Peru                  11  3038
## 21 United States         40  1500
## 22 Uruguay               14  3009

ambarom %>%
  count(CovidWorry, covid2at)
## # A tibble: 5 × 3
##   CovidWorry         covid2at     n
##   <fct>                 <dbl> <int>
## 1 Very worried              1 24327
## 2 Somewhat worried          2 13233
## 3 A little worried          3 11478
## 4 Not worried at all        4  8628
## 5 <NA>                     NA  6686

14.4 Survey design objects

The technical report is the best reference for understanding how to specify the sampling design in R (LAPOP 2021c). The data include two weights: wt and weight1500. The first weight variable is specific to each country and sums to the sample size, but it is calibrated to reflect each country's demographics. The second weight variable sums to 1,500 for each country and is recommended for multi-country analyses. Although not explicitly stated in the documentation, the Stata syntax example (svyset upm [pw=weight1500], strata(strata)) indicates that upm is a clustering variable and strata is the stratification variable.
Therefore, the design object is created in R as follows:

ambarom_des <- ambarom %>%
  as_survey_design(ids = upm,
                   strata = strata,
                   weight = weight1500)

Note that these weight variables support comparisons between countries but not multi-country estimates, because the weights do not account for the countries' different population sizes. For example, Canada has about 10% of the population of the United States, but an estimate that pools records from both countries would weight them equally.

14.5 Calculating estimates

When calculating estimates from the data, we use the survey design object ambarom_des and then apply the survey_mean() function. The next sections walk through a few examples.

14.5.1 Example: Worried about COVID

This survey was administered between March and August of 2021, with the specific timing varying by country (see Table 2 in LAPOP 2021c for dates by country). Given the state of the pandemic at that time, several questions about COVID were included. The first question about COVID asked:

How worried are you about the possibility that you or someone in your household will get sick from coronavirus in the next 3 months?
- Very worried
- Somewhat worried
- A little worried
- Not worried at all

If we are interested in those who are very worried or somewhat worried, we can create a new variable (CovidWorry_bin) that groups levels of the original question using the fct_collapse() function from the {forcats} package. We then use the survey_count() function to understand how responses are distributed across each category of the original variable (CovidWorry) and the new variable (CovidWorry_bin).

covid_worry_collapse <- ambarom_des %>%
  mutate(CovidWorry_bin = fct_collapse(
    CovidWorry,
    WorriedHi = c("Very worried", "Somewhat worried"),
    WorriedLo = c("A little worried", "Not worried at all")
  ))

covid_worry_collapse %>%
  survey_count(CovidWorry_bin, CovidWorry)
## # A tibble: 5 × 4
##   CovidWorry_bin CovidWorry              n  n_se
##   <fct>          <fct>               <dbl> <dbl>
## 1 WorriedHi      Very worried       12369.  83.6
## 2 WorriedHi      Somewhat worried    6378.  63.4
## 3 WorriedLo      A little worried    5896.  62.6
## 4 WorriedLo      Not worried at all  4840.  59.7
## 5 <NA>           <NA>                3518.  42.2

With this new variable, we can now use survey_mean() to calculate the percentage of people in each country who are either very or somewhat worried about COVID. There are missing data, as indicated in the survey_count() output above, so we need to use na.rm = TRUE in the survey_mean() function to handle the missing values.

covid_worry_country_ests <- covid_worry_collapse %>%
  group_by(Country) %>%
  summarize(p = survey_mean(CovidWorry_bin == "WorriedHi",
                            na.rm = TRUE) * 100)

covid_worry_country_ests
## # A tibble: 22 × 3
##    Country                p  p_se
##    <fct>              <dbl> <dbl>
##  1 Argentina           65.8 1.08
##  2 Bolivia             71.6 0.960
##  3 Brazil              83.5 0.962
##  4 Canada              48.9 1.34
##  5 Chile               81.8 0.828
##  6 Colombia            67.9 1.12
##  7 Costa Rica          72.6 0.952
##  8 Dominican Republic  50.1 1.13
##  9 Ecuador             71.7 0.967
## 10 El Salvador         52.5 1.02
## # ℹ 12 more rows
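Before formatting a full table, a quick sort of these estimates (a minimal sketch using the object just created) highlights the extremes:

# Countries with the highest and lowest shares worried
covid_worry_country_ests %>%
  arrange(desc(p)) %>%
  slice(c(1, n()))
# Brazil has the highest estimate (83.5%) and Jamaica the lowest (28.4%)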
To view the results for all countries, we can use the {gt} package to create Table 14.1.

covid_worry_country_ests_gt <- covid_worry_country_ests %>%
  gt(rowname_col = "Country") %>%
  cols_label(p = "Percent", p_se = "SE") %>%
  fmt_number(decimals = 1) %>%
  tab_source_note("AmericasBarometer Surveys, 2021")

covid_worry_country_ests_gt

TABLE 14.1: Percentage worried about the possibility that they or someone in their household will get sick from coronavirus in the next 3 months

                     Percent   SE
Argentina            65.8      1.1
Bolivia              71.6      1.0
Brazil               83.5      1.0
Canada               48.9      1.3
Chile                81.8      0.8
Colombia             67.9      1.1
Costa Rica           72.6      1.0
Dominican Republic   50.1      1.1
Ecuador              71.7      1.0
El Salvador          52.5      1.0
Guatemala            69.3      1.0
Guyana               60.0      1.6
Haiti                54.4      1.8
Honduras             64.6      1.1
Jamaica              28.4      0.9
Mexico               63.6      1.0
Nicaragua            80.0      1.0
Panama               70.2      1.0
Paraguay             61.5      1.1
Peru                 77.1      2.5
United States        46.6      1.7
Uruguay              60.9      1.1
AmericasBarometer Surveys, 2021

14.5.2 Example: Education affected by COVID

Respondents were also asked how the pandemic affected education. This question was asked of households with children under the age of 13, and respondents could select more than one option:

Did any of these children have their school education affected due to the pandemic?
- No, because they are not yet school age or because they do not attend school for another reason
- No, their classes continued normally
- Yes, they went to virtual or remote classes
- Yes, they switched to a combination of virtual and in-person classes
- Yes, they cut all ties with the school

Working with multiple-choice questions can be both challenging and interesting. Let's walk through how to analyze this question. If we are interested in the impact on education, we should focus on the data from those whose children are attending school. This means we need to exclude those who selected the first response option: "No, because they are not yet school age or because they do not attend school for another reason." To do this, we use the Educ_NotInSchool variable in the dataset, which takes values of 0 and 1. A value of 1 indicates that the respondent chose the first response option (none of the children are in school), and a value of 0 means that at least one of their children is in school.
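Before filtering on this indicator, a quick unweighted tabulation (a minimal sketch on the combined data) shows how it is distributed:

# Unweighted distribution of the not-in-school indicator
ambarom %>%
  count(Educ_NotInSchool)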
By filtering the data to those with a value of 0 (they have at least one child in school), we can consider only respondents with at least one child attending school. Now, let's review the data for those who selected one of the next three response options:

- No, their classes continued normally: Educ_NormalSchool
- Yes, they went to virtual or remote classes: Educ_VirtualSchool
- Yes, they switched to a combination of virtual and in-person classes: Educ_Hybrid

The unweighted cross-tab for these responses is included below. It reveals a wide range of impacts, where many combinations of effects on education are possible.

ambarom %>%
  filter(Educ_NotInSchool == 0) %>%
  count(Educ_NormalSchool, Educ_VirtualSchool, Educ_Hybrid)
## # A tibble: 8 × 4
##   Educ_NormalSchool Educ_VirtualSchool Educ_Hybrid     n
##               <dbl>              <dbl>       <dbl> <int>
## 1                 0                  0           0   861
## 2                 0                  0           1  1192
## 3                 0                  1           0  7554
## 4                 0                  1           1   280
## 5                 1                  0           0   833
## 6                 1                  0           1    18
## 7                 1                  1           0    72
## 8                 1                  1           1     7

In reviewing the survey question, we might be interested in knowing the answers to the following:

- What percentage of households indicated that school continued as normal with no virtual or hybrid option?
- What percentage of households indicated that the education medium was changed to either virtual or hybrid?
- What percentage of households indicated that they cut ties with their school?

To find the answers, we create indicators for the first two questions, make national estimates for all three questions, and then construct a summary table for easy viewing. First, we create and inspect the indicators and their distributions using survey_count().

ambarom_des_educ <- ambarom_des %>%
  filter(Educ_NotInSchool == 0) %>%
  mutate(
    Educ_OnlyNormal = (Educ_NormalSchool == 1 &
                         Educ_VirtualSchool == 0 &
                         Educ_Hybrid == 0),
    Educ_MediumChange = (Educ_VirtualSchool == 1 | Educ_Hybrid == 1)
  )

ambarom_des_educ %>%
  survey_count(Educ_OnlyNormal, Educ_NormalSchool,
               Educ_VirtualSchool, Educ_Hybrid)
## # A tibble: 8 × 6
##   Educ_OnlyNormal Educ_NormalSchool Educ_VirtualSchool Educ_Hybrid
##   <lgl>                       <dbl>              <dbl>       <dbl>
## 1 FALSE                           0                  0           0
## 2 FALSE                           0                  0           1
## 3 FALSE                           0                  1           0
## 4 FALSE                           0                  1           1
## 5 FALSE                           1                  0           1
## 6 FALSE                           1                  1           0
## 7 FALSE                           1                  1           1
## 8 TRUE                            1                  0           0
## # ℹ 2 more variables: n <dbl>, n_se <dbl>

ambarom_des_educ %>%
  survey_count(Educ_MediumChange, Educ_VirtualSchool, Educ_Hybrid)
## # A tibble: 4 × 5
##   Educ_MediumChange Educ_VirtualSchool Educ_Hybrid     n  n_se
##   <lgl>                          <dbl>       <dbl> <dbl> <dbl>
## 1 FALSE                              0           0  880. 26.1
## 2 TRUE                               0           1  561. 19.2
## 3 TRUE                               1           0 3812. 49.4
## 4 TRUE                               1           1  136.  9.86

Next, we group the data by country and calculate the population estimates for our three questions.
covid_educ_ests <- ambarom_des_educ %>%
  group_by(Country) %>%
  summarize(
    p_onlynormal = survey_mean(Educ_OnlyNormal, na.rm = TRUE) * 100,
    p_mediumchange = survey_mean(Educ_MediumChange, na.rm = TRUE) * 100,
    p_noschool = survey_mean(Educ_NoSchool, na.rm = TRUE) * 100
  )

covid_educ_ests
## # A tibble: 16 × 7
##    Country p_onlynormal p_onlynormal_se p_mediumchange p_mediumchange_se
##    <fct>          <dbl>           <dbl>          <dbl>             <dbl>
##  1 Argent…        5.39            1.14           87.1              1.72
##  2 Brazil         4.28            1.17           81.5              2.33
##  3 Chile          0.715           0.267          96.2              0.962
##  4 Colomb…        2.84            0.727          90.3              1.40
##  5 Domini…        3.75            0.793          87.4              1.45
##  6 Ecuador        5.18            0.963          87.5              1.39
##  7 El Sal…        2.92            0.680          85.8              1.53
##  8 Guatem…        3.00            0.727          82.2              1.73
##  9 Guyana         3.34            0.702          85.3              1.67
## 10 Haiti         81.1             2.25            7.25             1.48
## 11 Hondur…        3.68            0.882          80.7              1.72
## 12 Jamaica        5.42            0.950          88.1              1.43
## 13 Panama         7.20            1.18           89.4              1.42
## 14 Paragu…        4.66            0.939          90.7              1.37
## 15 Peru           2.04            0.604          91.8              1.20
## 16 Uruguay        8.60            1.40           84.3              2.02
## # ℹ 2 more variables: p_noschool <dbl>, p_noschool_se <dbl>

Finally, to view the results for all countries, we can use the {gt} package to construct Table 14.2.

covid_educ_ests_gt <- covid_educ_ests %>%
  gt(rowname_col = "Country") %>%
  cols_label(
    p_onlynormal = "%", p_onlynormal_se = "SE",
    p_mediumchange = "%", p_mediumchange_se = "SE",
    p_noschool = "%", p_noschool_se = "SE"
  ) %>%
  tab_spanner(label = "Normal school only",
              columns = c("p_onlynormal", "p_onlynormal_se")) %>%
  tab_spanner(label = "Medium change",
              columns = c("p_mediumchange", "p_mediumchange_se")) %>%
  tab_spanner(label = "Cut ties with school",
              columns = c("p_noschool", "p_noschool_se")) %>%
  fmt_number(decimals = 1) %>%
  tab_source_note("AmericasBarometer Surveys, 2021")

covid_educ_ests_gt

TABLE 14.2: Impact on education in households with children under the age of 13 who would generally attend school

                     Normal school only   Medium change   Cut ties with school
                     %       SE           %       SE      %       SE
Argentina            5.4     1.1          87.1    1.7     9.9     1.6
Brazil               4.3     1.2          81.5    2.3     22.1    2.5
Chile                0.7     0.3          96.2    1.0     4.0     1.0
Colombia             2.8     0.7          90.3    1.4     7.5     1.3
Dominican Republic   3.8     0.8          87.4    1.5     10.5    1.4
Ecuador              5.2     1.0          87.5    1.4     7.9     1.1
El Salvador          2.9     0.7          85.8    1.5     11.8    1.4
Guatemala            3.0     0.7          82.2    1.7     17.7    1.8
Guyana               3.3     0.7          85.3    1.7     13.0    1.6
Haiti                81.1    2.3          7.2     1.5     11.7    1.8
Honduras             3.7     0.9          80.7    1.7     16.9    1.6
Jamaica              5.4     0.9          88.1    1.4     7.5     1.2
Panama               7.2     1.2          89.4    1.4     3.8     0.9
Paraguay             4.7     0.9          90.7    1.4     6.4     1.2
Peru                 2.0     0.6          91.8    1.2     6.8     1.1
Uruguay              8.6     1.4          84.3    2.0     8.0     1.6
AmericasBarometer Surveys, 2021

In the countries where this question was asked, many households experienced a change in their child's education medium. Haiti is the exception: only 7.2% of Haitian households with school-age children switched to virtual or hybrid learning.

14.6 Mapping survey data

While the table effectively presents the data, a map can also be insightful. To generate maps of the countries, we can use the {rnaturalearth} package and subset North and South America with the ne_countries() function. The function returns an sf (simple features) object with many columns, most importantly sovereignt (sovereignty), geounit (country or territory), and geometry (the shape). As an example of the difference between sovereignty and country/territory, the United States, Puerto Rico, and the US Virgin Islands are all separate units with the same sovereignty. A map without data is plotted in Figure 14.1.

country_shape <- ne_countries(
  scale = "medium",
  returnclass = "sf",
  continent = c("North America", "South America")
)

country_shape %>%
  ggplot() +
  geom_sf()

FIGURE 14.1: Map of North and South America

The map in Figure 14.1 appears very wide because the Aleutian Islands in Alaska extend into the Eastern Hemisphere. We can crop the shapefile to include only the Western Hemisphere, which removes some of the trailing islands of Alaska.

country_shape_crop <- country_shape %>%
  st_crop(c(xmin = -180, xmax = 0, ymin = -90, ymax = 90))

Now that we have the necessary shape files, our next step is to match our survey data to the map. Countries can be named differently in different sources (e.g., "U.S.", "U.S.A.", "United States"). To visualize our survey data on the map, we need to match the country names in the survey data and the map data. To do this, we can use the anti_join() function to identify the countries in the survey data that are not in the map data. For example, as shown below, the United States is referred to as "United States" in the survey data but "United States of America" in the map data. Table 14.3 shows the countries in the survey data but not the map data, and Table 14.4 shows the countries in the map data but not the survey data.
survey_country_list <- ambarom %>%
  distinct(Country)

survey_country_list_gt <- survey_country_list %>%
  anti_join(country_shape_crop, by = c("Country" = "geounit")) %>%
  gt()

survey_country_list_gt

TABLE 14.3: Countries in the survey data but not the map data

Country
United States

map_country_list_gt <- country_shape_crop %>%
  as_tibble() %>%
  select(geounit, sovereignt) %>%
  anti_join(survey_country_list, by = c("geounit" = "Country")) %>%
  arrange(geounit) %>%
  gt()

map_country_list_gt

TABLE 14.4: Countries in the map data but not the survey data

geounit                              sovereignt
Anguilla                             United Kingdom
Antigua and Barbuda                  Antigua and Barbuda
Aruba                                Netherlands
Barbados                             Barbados
Belize                               Belize
Bermuda                              United Kingdom
British Virgin Islands               United Kingdom
Cayman Islands                       United Kingdom
Cuba                                 Cuba
Curaçao                              Netherlands
Dominica                             Dominica
Falkland Islands                     United Kingdom
Greenland                            Denmark
Grenada                              Grenada
Montserrat                           United Kingdom
Puerto Rico                          United States of America
Saint Barthelemy                     France
Saint Kitts and Nevis                Saint Kitts and Nevis
Saint Lucia                          Saint Lucia
Saint Martin                         France
Saint Pierre and Miquelon            France
Saint Vincent and the Grenadines     Saint Vincent and the Grenadines
Sint Maarten                         Netherlands
Suriname                             Suriname
The Bahamas                          The Bahamas
Trinidad and Tobago                  Trinidad and Tobago
Turks and Caicos Islands             United Kingdom
United States Virgin Islands         United States of America
United States of America             United States of America
Venezuela                            Venezuela

There are several ways to fix mismatched names for a successful join. The simplest solution is to rename the data in the shape object before merging. Since only one country name in the survey data differs from the map data, we rename the map data accordingly.

country_shape_upd <- country_shape_crop %>%
  mutate(geounit = if_else(geounit == "United States of America",
                           "United States", geounit))

Now that the country names match, we can merge the survey and map data and then plot the results. We begin with the map file and merge it with the survey estimates generated in Section 14.5 (covid_worry_country_ests and covid_educ_ests). We use the tidyverse function full_join(), which joins the rows in the map data and the survey estimates based on the columns geounit and Country. A full join keeps all the rows from both datasets, matching rows when possible; for any rows without matches, the function fills in NA for the missing values.

covid_sf <- country_shape_upd %>%
  full_join(covid_worry_country_ests, by = c("geounit" = "Country")) %>%
  full_join(covid_educ_ests, by = c("geounit" = "Country"))

After the merge, we create two figures that display the population estimates for the percentage of people worried about COVID (Figure 14.2) and the percentage of households with at least one child participating in virtual or hybrid learning (Figure 14.3).

ggplot() +
  geom_sf(data = covid_sf,
          aes(fill = p, geometry = geometry),
          color = "darkgray") +
  scale_fill_gradientn(
    guide = "colorbar",
    name = "Percent",
    labels = scales::comma,
    colors = c("#BFD7EA", "#087e8b", "#0B3954"),
    na.value = NA
  ) +
  geom_sf_pattern(
    data = filter(covid_sf, is.na(p)),
    pattern = "crosshatch",
    pattern_fill = "lightgray",
    pattern_color = "lightgray",
    fill = NA,
    color = "darkgray"
  ) +
  theme_minimal()

FIGURE 14.2: Percent of households worried someone in their household will get COVID-19 in the next 3 months by country

ggplot() +
  geom_sf(
    data = covid_sf,
    aes(fill = p_mediumchange, geometry = geometry),
    color = "darkgray"
  ) +
  scale_fill_gradientn(
    guide = "colorbar",
    name = "Percent",
    labels = scales::comma,
    colors = c("#BFD7EA", "#087e8b", "#0B3954"),
    na.value = NA
  ) +
  geom_sf_pattern(
    data = filter(covid_sf, is.na(p_mediumchange)),
    pattern = "crosshatch",
    pattern_fill = "lightgray",
    pattern_color = "lightgray",
    fill = NA,
    color = "darkgray"
  ) +
  theme_minimal()

FIGURE 14.3: Percent of households who had at least one child participate in virtual or hybrid learning

In Figure 14.3, we observe missing data (represented by the crosshatch pattern) for Canada, Mexico, and the United States. The questionnaires indicate that these three countries did not include the education question in the survey. To focus on countries with available data, we can remove North America from the map and show only Central and South America.
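The shapefile includes a World Bank region column, region_wb, that we can use for this restriction. A quick sketch to confirm the exact label, dropping the geometry column first:

# Inspect the World Bank region labels in the cropped shapefile
country_shape_upd %>%
  as_tibble() %>%
  count(region_wb)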
covid_c_s <- covid_sf %>%
  filter(region_wb == "Latin America & Caribbean")

ggplot() +
  geom_sf(
    data = covid_c_s,
    aes(fill = p_mediumchange, geometry = geometry),
    color = "darkgray"
  ) +
  scale_fill_gradientn(
    guide = "colorbar",
    name = "Percent",
    labels = scales::comma,
    colors = c("#BFD7EA", "#087e8b", "#0B3954"),
    na.value = NA
  ) +
  geom_sf_pattern(
    data = filter(covid_c_s, is.na(p_mediumchange)),
    pattern = "crosshatch",
    pattern_fill = "lightgray",
    pattern_color = "lightgray",
    fill = NA,
    color = "darkgray"
  ) +
  theme_minimal()

FIGURE 14.4: Percent of households who had at least one child participate in virtual or hybrid learning, Central and South America

In Figure 14.4, we can see that most countries with available data have similar percentages (reflected in their similar shades). However, Haiti stands out with a lighter shade, indicating a considerably lower percentage of households with at least one child participating in virtual or hybrid learning.

14.7 Exercises

Calculate the percentage of households with broadband internet and those with any internet at home, including from a phone or tablet. Hint: if you come across countries with 0% internet usage, you may want to filter by something first.

int_ests <- ambarom_des %>%
  filter(!is.na(Internet) | !is.na(BroadbandInternet)) %>%
  group_by(Country) %>%
  summarize(
    p_broadband = survey_mean(BroadbandInternet, na.rm = TRUE) * 100,
    p_internet = survey_mean(Internet, na.rm = TRUE) * 100
  )

int_ests %>%
  print(n = 30)

## # A tibble: 20 × 5
##    Country           p_broadband p_broadband_se p_internet p_internet_se
##    <fct>                   <dbl>          <dbl>      <dbl>         <dbl>
##  1 Argentina                62.3          1.13        86.2         0.871
##  2 Bolivia                  41.4          1.03        77.2         0.956
##  3 Brazil                   68.3          1.25        88.9         0.879
##  4 Chile                    63.1          1.06        93.5         0.550
##  5 Colombia                 45.7          1.15        68.7         1.09
##  6 Costa Rica               49.6          1.07        84.4         0.798
##  7 Dominican Republ…        37.1          1.04        73.7         1.05
##  8 Ecuador                  59.7          1.06        79.9         0.898
##  9 El Salvador              30.2          0.906       63.9         0.985
## 10 Guatemala                33.4          0.993       61.5         1.08
## 11 Guyana                   63.7          1.09        86.8         0.781
## 12 Haiti                    11.8          0.791       58.5         1.25
## 13 Honduras                 28.2          0.968       60.7         1.11
## 14 Jamaica                  64.2          0.986       91.5         0.602
## 15 Mexico                   44.9          1.05        70.9         1.05
## 16 Nicaragua                39.1          1.12        76.3         1.09
## 17 Panama                   43.4          1.02        73.1         0.976
## 18 Paraguay                 33.3          0.971       72.9         1.01
## 19 Peru                     42.4          1.07        71.1         1.07
## 20 Uruguay                  62.7          1.08        90.6         0.699

Create a faceted map showing both broadband internet and any internet usage.

internet_sf <- country_shape_upd %>%
  full_join(select(int_ests, p = p_internet, geounit = Country),
            by = "geounit") %>%
  mutate(Type = "Internet")

broadband_sf <- country_shape_upd %>%
  full_join(select(int_ests, p = p_broadband, geounit = Country),
            by = "geounit") %>%
  mutate(Type = "Broadband")

b_int_sf <- internet_sf %>%
  bind_rows(broadband_sf) %>%
  filter(region_wb == "Latin America & Caribbean")

b_int_sf %>%
  ggplot(aes(fill = p), color = "darkgray") +
  geom_sf() +
  facet_wrap(~Type) +
  scale_fill_gradientn(
    guide = "colorbar",
    name = "Percent",
    labels = scales::comma,
    colors = c("#BFD7EA", "#087E8B", "#0B3954"),
    na.value = NA
  ) +
  geom_sf_pattern(
    data = filter(b_int_sf, is.na(p)),
    pattern = "crosshatch",
    pattern_fill = "lightgray",
    pattern_color = "lightgray",
    fill = NA,
    color = "darkgray"
  ) +
  theme_minimal()

FIGURE 14.5: Percent of broadband internet and any internet usage, Central and South America

References

LAPOP. 2021a. “AmericasBarometer 2021 - Canada: Technical Information.” Vanderbilt University. http://datasets.americasbarometer.org/database/files/ABCAN2021-Technical-Report-v1.0-FINAL-eng-110921.pdf.
———. 2021b. “AmericasBarometer 2021 - U.S.: Technical Information.” Vanderbilt University. http://datasets.americasbarometer.org/database/files/ABUSA2021-Technical-Report-v1.0-FINAL-eng-110921.pdf.

———. 2021c. “AmericasBarometer 2021: Technical Information.” Vanderbilt University. https://www.vanderbilt.edu/lapop/ab2021/AB2021-Technical-Report-v1.0-FINAL-eng-030722.pdf.

———. 2021d. “Core Questionnaire.” https://www.vanderbilt.edu/lapop/ab2021/AB2021-Core-Questionnaire-v17.5-Eng-210514-W-v2.pdf.

———. 2023a. “About the AmericasBarometer.” https://www.vanderbilt.edu/lapop/about-americasbarometer.php.

———. 2023b. “The AmericasBarometer by the LAPOP Lab.” www.vanderbilt.edu/lapop.

See Table 2 in LAPOP (2021c) for dates by country.

A ANES Derived Variable Codebook

The full codebook with the original variables is available at https://electionstudies.org/wp-content/uploads/2022/02/anes_timeseries_2020_userguidecodebook_20220210.pdf

A.1 ADMIN

V200001
Description: 2020 Case ID
Variable class: numeric

CaseID
Description: 2020 Case ID
Variable class: numeric

V200002
Description: Mode of interview: pre-election interview
Variable class: haven_labelled, vctrs_vctr, double

V200002  Label      n     Unweighted Freq
1        Video      274   0.037
2        Telephone  115   0.015
3        Web        7064  0.948
Total               7453  1.000

InterviewMode
Description: Mode of interview: pre-election interview
Variable class: factor

InterviewMode  n     Unweighted Freq
Video          274   0.037
Telephone      115   0.015
Web            7064  0.948
Total          7453  1.000

A.2 WEIGHTS

V200010b
Description: Full sample post-election weight
Variable class: numeric
N Missing: 0; Minimum: 0.0083; Median: 0.6863; Maximum: 6.651

Weight
Description: Full sample post-election weight
Variable class: numeric
N Missing: 0; Minimum: 0.0083; Median: 0.6863; Maximum: 6.651

V200010c
Description: Full sample variance unit
Variable class: numeric
N Missing: 0; Minimum: 1; Median: 2; Maximum: 3

VarUnit
Description: Full sample variance unit
Variable class: factor

VarUnit  n     Unweighted Freq
1        3689  0.495
2        3750  0.503
3        14    0.002
Total    7453  1.000

V200010d
Description: Full sample variance stratum
Variable class: numeric
N Missing: 0; Minimum: 1; Median: 24; Maximum: 50

Stratum
Description: Full sample variance stratum
Variable class: factor
Stratum: n (Unweighted Freq) — 1: 167 (0.022); 2: 148 (0.020); 3: 158 (0.021); 4: 151 (0.020); 5: 147 (0.020); 6: 172 (0.023); 7: 163 (0.022); 8: 159 (0.021); 9: 160 (0.021); 10: 159 (0.021); 11: 137 (0.018); 12: 179 (0.024); 13: 148 (0.020); 14: 160 (0.021); 15: 159 (0.021); 16: 148 (0.020); 17: 158 (0.021); 18: 156 (0.021); 19: 154 (0.021); 20: 144 (0.019); 21: 170 (0.023); 22: 146 (0.020); 23: 165 (0.022); 24: 147 (0.020); 25: 169 (0.023); 26: 165 (0.022); 27: 172 (0.023); 28: 133 (0.018); 29: 157 (0.021); 30: 167 (0.022); 31: 154 (0.021); 32: 143 (0.019); 33: 143 (0.019); 34: 124 (0.017); 35: 138 (0.019); 36: 130 (0.017); 37: 136 (0.018); 38: 145 (0.019); 39: 140 (0.019); 40: 125 (0.017); 41: 158 (0.021); 42: 146 (0.020); 43: 130 (0.017); 44: 126 (0.017); 45: 126 (0.017); 46: 135 (0.018); 47: 133 (0.018); 48: 140 (0.019); 49: 133 (0.018); 50: 130 (0.017); Total: 7453 (1.000)
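The weight, stratum, and variance unit variables above are the pieces needed to build a survey design object with {srvyr}. Chapter 10 walks through this in full (including population-size adjustments); the following is only a minimal sketch, assuming the anes_2020 data from {srvyrexploR}:

library(dplyr)
library(srvyr)
library(srvyrexploR)

# Minimal sketch of a stratified design using the variables
# documented in this section (see Chapter 10 for the full version)
anes_des <- anes_2020 %>%
  as_survey_design(
    weights = Weight,  # full sample post-election weight (V200010b)
    strata = Stratum,  # variance stratum (V200010d)
    ids = VarUnit,     # variance unit (V200010c)
    nest = TRUE        # variance units are numbered within strata
  )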
A.3 PRE-ELECTION SURVEY QUESTIONNAIRE

V201006
Description: PRE: How interested in following campaigns
Question: Some people don’t pay much attention to political campaigns. How about you? Would you say that you have been very much interested, somewhat interested or not much interested in the political campaigns so far this year?
Variable class: haven_labelled, vctrs_vctr, double

V201006  Label                 n     Unweighted Freq
-9       -9. Refused           1     0.000
1        Very much interested  3940  0.529
2        Somewhat interested   2569  0.345
3        Not much interested   943   0.127
Total                          7453  1.000

CampaignInterest
Description: PRE: How interested in following campaigns
Question: Some people don’t pay much attention to political campaigns. How about you? Would you say that you have been very much interested, somewhat interested or not much interested in the political campaigns so far this year?
Variable class: factor

CampaignInterest      n     Unweighted Freq
Very much interested  3940  0.529
Somewhat interested   2569  0.345
Not much interested   943   0.127
NA                    1     0.000
Total                 7453  1.000

V201024
Description: PRE: In what manner did R vote
Question: Which one of the following best describes how you voted?
Variable class: haven_labelled, vctrs_vctr, double

V201024  Label  n  Unweighted Freq
-9  -9. Refused  1  0.000
-1  -1. Inapplicable  7078  0.950
1   Definitely voted in person at a polling place before election day  101  0.014
2   Definitely voted by mailing a ballot to elections officials before election day  242  0.032
3   Definitely voted in some other way  28  0.004
4   Not completely sure whether you voted or not  3  0.000
Total  7453  1.000

V201025x
Description: PRE: SUMMARY: Registration and early vote status
Variable class: haven_labelled, vctrs_vctr, double

V201025x  Label  n  Unweighted Freq
-4  -4. Technical error  1  0.000
1   Not registered (or DK/RF), does not intend to register (or DK/RF intent)  339  0.045
2   Not registered (or DK/RF), intends to register  290  0.039
3   Registered but did not vote early (or DK/RF)  6452  0.866
4   Registered and voted early  371  0.050
Total  7453  1.000

V201029
Description: PRE: For whom did R vote for President
Question: Who did you vote for? [Joe Biden, Donald Trump/Donald Trump, Joe Biden], Jo Jorgensen, Howie Hawkins, or someone else?
Variable class: haven_labelled, vctrs_vctr, double

V201029  Label  n  Unweighted Freq
-9  -9. Refused  10  0.001
-1  -1. Inapplicable  7092  0.952
1   Joe Biden  239  0.032
2   Donald Trump  103  0.014
3   Jo Jorgensen  2  0.000
4   Howie Hawkins  1  0.000
5   Other candidate {SPECIFY}  4  0.001
12  Specified as refused  2  0.000
Total  7453  1.000

V201101
Description: PRE: Did R vote for President in 2016 [revised]
Question: Four years ago, in 2016, Hillary Clinton ran on the Democratic ticket against Donald Trump for the Republicans. We talk to many people who tell us they did not vote. And we talk to a few people who tell us they did vote, who really did not. We can tell they did not vote by checking with official government records. What about you? If we check the official government voter records, will they show that you voted in the 2016 presidential election, or that you did not vote in that election?
Variable class: haven_labelled, vctrs_vctr, double

V201101  Label  n  Unweighted Freq
-9  -9. Refused  13  0.002
-8  -8. Don’t know  1  0.000
-1  -1. Inapplicable  3780  0.507
1   Yes, voted  2780  0.373
2   No, didn’t vote  879  0.118
Total  7453  1.000

V201102
Description: PRE: Did R vote for President in 2016
Question: Four years ago, in 2016, Hillary Clinton ran on the Democratic ticket against Donald Trump for the Republicans. Do you remember for sure whether or not you voted in that election?
Variable class: haven_labelled, vctrs_vctr, double

V201102  Label  n  Unweighted Freq
-9  -9. Refused  6  0.001
-8  -8. Don’t know  1  0.000
-1  -1. Inapplicable  3673  0.493
1   Yes, voted  3030  0.407
2   No, didn’t vote  743  0.100
Total  7453  1.000
VotedPres2016
Description: PRE: Did R vote for President in 2016
Question: Derived from V201102, V201101
Variable class: factor

VotedPres2016  n     Unweighted Freq
Yes            5810  0.780
No             1622  0.218
NA             21    0.003
Total          7453  1.000

V201103
Description: PRE: Recall of last (2016) Presidential vote choice
Question: Which one did you vote for?
Variable class: haven_labelled, vctrs_vctr, double

V201103  Label  n  Unweighted Freq
-9  -9. Refused  41  0.006
-8  -8. Don’t know  2  0.000
-1  -1. Inapplicable  1643  0.220
1   Hillary Clinton  2911  0.391
2   Donald Trump  2466  0.331
5   Other {SPECIFY}  390  0.052
Total  7453  1.000

VotedPres2016_selection
Description: PRE: Recall of last (2016) Presidential vote choice
Question: Which one did you vote for?
Variable class: factor

VotedPres2016_selection  n     Unweighted Freq
Clinton                  2911  0.391
Trump                    2466  0.331
Other                    390   0.052
NA                       1686  0.226
Total                    7453  1.000

V201228
Description: PRE: Party ID: Does R think of self as Democrat, Republican, or Independent
Question: Generally speaking, do you usually think of yourself as [a Democrat, a Republican / a Republican, a Democrat], an independent, or what?
Variable class: haven_labelled, vctrs_vctr, double

V201228  Label  n  Unweighted Freq
-9  -9. Refused  37  0.005
-8  -8. Don’t know  4  0.001
-4  -4. Technical error  1  0.000
0   No preference {VOL - video/phone only}  6  0.001
1   Democrat  2589  0.347
2   Republican  2304  0.309
3   Independent  2277  0.306
5   Other party {SPECIFY}  235  0.032
Total  7453  1.000

V201229
Description: PRE: Party Identification strong - Democrat Republican
Question: Would you call yourself a strong [Democrat / Republican] or a not very strong [Democrat / Republican]?
Variable class: haven_labelled, vctrs_vctr, double

V201229  Label  n  Unweighted Freq
-9  -9. Refused  4  0.001
-1  -1. Inapplicable  2560  0.343
1   Strong  3341  0.448
2   Not very strong  1548  0.208
Total  7453  1.000

V201230
Description: PRE: No Party Identification - closer to Democratic Party or Republican Party
Question: Do you think of yourself as closer to the Republican Party or to the Democratic Party?
Variable class: haven_labelled, vctrs_vctr, double

V201230  Label  n  Unweighted Freq
-9  -9. Refused  19  0.003
-8  -8. Don’t know  2  0.000
-1  -1. Inapplicable  4893  0.657
1   Closer to Republican  782  0.105
2   Neither {VOL in video and phone}  876  0.118
3   Closer to Democratic  881  0.118
Total  7453  1.000

V201231x
Description: PRE: SUMMARY: Party ID
Question: Derived from V201228, V201229, and PTYID_LEANPTY
Variable class: haven_labelled, vctrs_vctr, double

V201231x  Label  n  Unweighted Freq
-9  -9. Refused  23  0.003
-8  -8. Don’t know  2  0.000
1   Strong Democrat  1796  0.241
2   Not very strong Democrat  790  0.106
3   Independent-Democrat  881  0.118
4   Independent  876  0.118
5   Independent-Republican  782  0.105
6   Not very strong Republican  758  0.102
7   Strong Republican  1545  0.207
Total  7453  1.000

PartyID
Description: PRE: SUMMARY: Party ID
Question: Derived from V201228, V201229, and PTYID_LEANPTY
Variable class: factor

PartyID  n  Unweighted Freq
Strong democrat  1796  0.241
Not very strong democrat  790  0.106
Independent-democrat  881  0.118
Independent  876  0.118
Independent-republican  782  0.105
Not very strong republican  758  0.102
Strong republican  1545  0.207
NA  25  0.003
Total  7453  1.000
V201233
Description: PRE: How often trust government in Washington to do what is right [revised]
Question: How often can you trust the federal government in Washington to do what is right?
Variable class: haven_labelled, vctrs_vctr, double

V201233  Label  n  Unweighted Freq
-9  -9. Refused  26  0.003
-8  -8. Don’t know  3  0.000
1   Always  80  0.011
2   Most of the time  1016  0.136
3   About half the time  2313  0.310
4   Some of the time  3313  0.445
5   Never  702  0.094
Total  7453  1.000

TrustGovernment
Description: PRE: How often trust government in Washington to do what is right [revised]
Question: How often can you trust the federal government in Washington to do what is right?
Variable class: factor

TrustGovernment  n  Unweighted Freq
Always  80  0.011
Most of the time  1016  0.136
About half the time  2313  0.310
Some of the time  3313  0.445
Never  702  0.094
NA  29  0.004
Total  7453  1.000

V201237
Description: PRE: How often can people be trusted
Question: Generally speaking, how often can you trust other people?
Variable class: haven_labelled, vctrs_vctr, double

V201237  Label  n  Unweighted Freq
-9  -9. Refused  12  0.002
-8  -8. Don’t know  1  0.000
1   Always  48  0.006
2   Most of the time  3511  0.471
3   About half the time  2020  0.271
4   Some of the time  1597  0.214
5   Never  264  0.035
Total  7453  1.000

TrustPeople
Description: PRE: How often can people be trusted
Question: Generally speaking, how often can you trust other people?
Variable class: factor

TrustPeople  n  Unweighted Freq
Always  48  0.006
Most of the time  3511  0.471
About half the time  2020  0.271
Some of the time  1597  0.214
Never  264  0.035
NA  13  0.002
Total  7453  1.000

V201507x
Description: PRE: SUMMARY: Respondent age
Question: Derived from birth month, day and year
Variable class: haven_labelled, vctrs_vctr, double
N Missing: 0; N Refused (-9): 294; Minimum: 18; Median: 53; Maximum: 80

Age
Description: PRE: SUMMARY: Respondent age
Question: Derived from birth month, day and year
Variable class: numeric
N Missing: 294; Minimum: 18; Median: 53; Maximum: 80

AgeGroup
Description: PRE: SUMMARY: Respondent age
Question: Derived from birth month, day and year
Variable class: factor

AgeGroup     n     Unweighted Freq
18-29        871   0.117
30-39        1241  0.167
40-49        1081  0.145
50-59        1200  0.161
60-69        1436  0.193
70 or older  1330  0.178
NA           294   0.039
Total        7453  1.000

V201510
Description: PRE: Highest level of Education
Question: What is the highest level of school you have completed or the highest degree you have received?
Variable class: haven_labelled, vctrs_vctr, double

V201510  Label  n  Unweighted Freq
-9  -9. Refused  25  0.003
-8  -8. Don’t know  1  0.000
1   Less than high school credential  312  0.042
2   High school graduate - High school diploma or equivalent (e.g. GED)  1160  0.156
3   Some college but no degree  1519  0.204
4   Associate degree in college - occupational/vocational  550  0.074
5   Associate degree in college - academic  445  0.060
6   Bachelor’s degree (e.g. BA, AB, BS)  1877  0.252
7   Master’s degree (e.g. MA, MS, MEng, MEd, MSW, MBA)  1092  0.147
8   Professional school degree (e.g. MD, DDS, DVM, LLB, JD)/Doctoral degree (e.g. PHD, EDD)  382  0.051
95  Other {SPECIFY}  90  0.012
Total  7453  1.000

Education
Description: PRE: Highest level of Education
Question: What is the highest level of school you have completed or the highest degree you have received?
Variable class: factor

Education     n     Unweighted Freq
Less than HS  312   0.042
High school   1160  0.156
Post HS       2514  0.337
Bachelor’s    1877  0.252
Graduate      1474  0.198
NA            116   0.016
Total         7453  1.000
V201546
Description: PRE: R: Are you Spanish, Hispanic, or Latino
Question: Are you of Hispanic, Latino, or Spanish origin?
Variable class: haven_labelled, vctrs_vctr, double

V201546  Label  n  Unweighted Freq
-9  -9. Refused  45  0.006
-8  -8. Don’t know  3  0.000
1   Yes  662  0.089
2   No  6743  0.905
Total  7453  1.000

V201547a
Description: RESTRICTED: PRE: Race of R: White [mention]
Question: I am going to read you a list of five race categories. You may choose one or more races. For this survey, Hispanic origin is not a race. Are you White?
Variable class: haven_labelled, vctrs_vctr, double

V201547a  Label  n  Unweighted Freq
-3  -3. Restricted  7453  1
Total  7453  1

V201547b
Description: RESTRICTED: PRE: Race of R: Black or African-American [mention]
Question: I am going to read you a list of five race categories. You may choose one or more races. For this survey, Hispanic origin is not a race. Are you Black or African American?
Variable class: haven_labelled, vctrs_vctr, double

V201547b  Label  n  Unweighted Freq
-3  -3. Restricted  7453  1
Total  7453  1

V201547c
Description: RESTRICTED: PRE: Race of R: Asian [mention]
Question: I am going to read you a list of five race categories. You may choose one or more races. For this survey, Hispanic origin is not a race. Are you Asian?
Variable class: haven_labelled, vctrs_vctr, double

V201547c  Label  n  Unweighted Freq
-3  -3. Restricted  7453  1
Total  7453  1

V201547d
Description: RESTRICTED: PRE: Race of R: Native Hawaiian or Pacific Islander [mention]
Question: I am going to read you a list of five race categories. You may choose one or more races. For this survey, Hispanic origin is not a race. Are you White; Black or African American; American Indian or Alaska Native; Asian; or Native Hawaiian or Other Pacific Islander?
Variable class: haven_labelled, vctrs_vctr, double

V201547d  Label  n  Unweighted Freq
-3  -3. Restricted  7453  1
Total  7453  1

V201547e
Description: RESTRICTED: PRE: Race of R: Native American or Alaska Native [mention]
Question: I am going to read you a list of five race categories. You may choose one or more races. For this survey, Hispanic origin is not a race. Are you American Indian or Alaska Native?
Variable class: haven_labelled, vctrs_vctr, double

V201547e  Label  n  Unweighted Freq
-3  -3. Restricted  7453  1
Total  7453  1

V201547z
Description: RESTRICTED: PRE: Race of R: other specify
Question: I am going to read you a list of five race categories. You may choose one or more races. For this survey, Hispanic origin is not a race. Reported other
Variable class: haven_labelled, vctrs_vctr, double

V201547z  Label  n  Unweighted Freq
-3  -3. Restricted  7453  1
Total  7453  1

V201549x
Description: PRE: SUMMARY: R self-identified race/ethnicity
Question: Derived from V201546, V201547a-V201547e, and V201547z
Variable class: haven_labelled, vctrs_vctr, double

V201549x  Label  n  Unweighted Freq
-9  -9. Refused  75  0.010
-8  -8. Don’t know  6  0.001
1   White, non-Hispanic  5420  0.727
2   Black, non-Hispanic  650  0.087
3   Hispanic  662  0.089
4   Asian or Native Hawaiian/other Pacific Islander, non-Hispanic alone  248  0.033
5   Native American/Alaska Native or other race, non-Hispanic alone  155  0.021
6   Multiple races, non-Hispanic  237  0.032
Total  7453  1.000

RaceEth
Description: PRE: SUMMARY: R self-identified race/ethnicity
Question: Derived from V201546, V201547a-V201547e, and V201547z
Variable class: factor

RaceEth              n     Unweighted Freq
White                5420  0.727
Black                650   0.087
Hispanic             662   0.089
Asian, NH/PI         248   0.033
AI/AN                155   0.021
Other/multiple race  237   0.032
NA                   81    0.011
Total                7453  1.000
V201600
Description: PRE: What is your (R) sex? [revised]
Question: What is your sex?
Variable class: haven_labelled, vctrs_vctr, double

V201600  Label  n  Unweighted Freq
-9  -9. Refused  51  0.007
1   Male  3375  0.453
2   Female  4027  0.540
Total  7453  1.000

Gender
Description: PRE: What is your (R) sex? [revised]
Question: What is your sex?
Variable class: factor

Gender  n     Unweighted Freq
Male    3375  0.453
Female  4027  0.540
NA      51    0.007
Total   7453  1.000

V201607
Description: RESTRICTED: PRE: Total income amount - revised
Question: The next question is about [the total combined income of all members of your family / your total income] during the past 12 months. This includes money from jobs, net income from business, farm or rent, pensions, dividends, interest, Social Security payments, and any other money income received by members of your family who are 15 years of age or older. What was the total income of your family during the past 12 months? TYPE THE NUMBER. YOUR BEST GUESS IS FINE.
Variable class: haven_labelled, vctrs_vctr, double

V201607  Label  n  Unweighted Freq
-3  -3. Restricted  7453  1
Total  7453  1

V201610
Description: RESTRICTED: PRE: Income amt missing - categories lt 20K
Question: Please choose the answer that includes the income of all members of your family during the past 12 months before taxes.
Variable class: haven_labelled, vctrs_vctr, double

V201610  Label  n  Unweighted Freq
-3  -3. Restricted  7453  1
Total  7453  1

V201611
Description: RESTRICTED: PRE: Income amt missing - categories 20-40K
Question: Please choose the answer that includes the income of all members of your family during the past 12 months before taxes.
Variable class: haven_labelled, vctrs_vctr, double

V201611  Label  n  Unweighted Freq
-3  -3. Restricted  7453  1
Total  7453  1

V201613
Description: RESTRICTED: PRE: Income amt missing - categories 40-70K
Question: Please choose the answer that includes the income of all members of your family during the past 12 months before taxes.
Variable class: haven_labelled, vctrs_vctr, double

V201613  Label  n  Unweighted Freq
-3  -3. Restricted  7453  1
Total  7453  1

V201615
Description: RESTRICTED: PRE: Income amt missing - categories 70-100K
Question: Please choose the answer that includes the income of all members of your family during the past 12 months before taxes.
Variable class: haven_labelled, vctrs_vctr, double

V201615  Label  n  Unweighted Freq
-3  -3. Restricted  7453  1
Total  7453  1

V201616
Description: RESTRICTED: PRE: Income amt missing - categories 100+K
Question: Please choose the answer that includes the income of all members of your family during the past 12 months before taxes.
Variable class: haven_labelled, vctrs_vctr, double

V201616  Label  n  Unweighted Freq
-3  -3. Restricted  7453  1
Total  7453  1
V201617x
Description: PRE: SUMMARY: Total (family) income
Question: Derived from V201607, V201610, V201611, V201613, V201615, V201616
Variable class: haven_labelled, vctrs_vctr, double

V201617x  Label  n  Unweighted Freq
-9  -9. Refused  502  0.067
-5  -5. Interview breakoff (sufficient partial IW)  15  0.002
1   Under $9,999  647  0.087
2   $10,000-14,999  244  0.033
3   $15,000-19,999  185  0.025
4   $20,000-24,999  301  0.040
5   $25,000-29,999  228  0.031
6   $30,000-34,999  296  0.040
7   $35,000-39,999  226  0.030
8   $40,000-44,999  286  0.038
9   $45,000-49,999  213  0.029
10  $50,000-59,999  485  0.065
11  $60,000-64,999  294  0.039
12  $65,000-69,999  168  0.023
13  $70,000-74,999  243  0.033
14  $75,000-79,999  215  0.029
15  $80,000-89,999  383  0.051
16  $90,000-99,999  291  0.039
17  $100,000-109,999  451  0.061
18  $110,000-124,999  312  0.042
19  $125,000-149,999  323  0.043
20  $150,000-174,999  366  0.049
21  $175,000-249,999  374  0.050
22  $250,000 or more  405  0.054
Total  7453  1.000

Income
Description: PRE: SUMMARY: Total (family) income
Question: Derived from V201607, V201610, V201611, V201613, V201615, V201616
Variable class: factor

Income  n  Unweighted Freq
Under $9,999  647  0.087
$10,000-14,999  244  0.033
$15,000-19,999  185  0.025
$20,000-24,999  301  0.040
$25,000-29,999  228  0.031
$30,000-34,999  296  0.040
$35,000-39,999  226  0.030
$40,000-44,999  286  0.038
$45,000-49,999  213  0.029
$50,000-59,999  485  0.065
$60,000-64,999  294  0.039
$65,000-69,999  168  0.023
$70,000-74,999  243  0.033
$75,000-79,999  215  0.029
$80,000-89,999  383  0.051
$90,000-99,999  291  0.039
$100,000-109,999  451  0.061
$110,000-124,999  312  0.042
$125,000-149,999  323  0.043
$150,000-174,999  366  0.049
$175,000-249,999  374  0.050
$250,000 or more  405  0.054
NA  517  0.069
Total  7453  1.000

Income7
Description: PRE: SUMMARY: Total (family) income
Question: Derived from V201607, V201610, V201611, V201613, V201615, V201616
Variable class: factor

Income7        n     Unweighted Freq
Under $20k     1076  0.144
$20-40k        1051  0.141
$40-60k        984   0.132
$60-80k        920   0.123
$80-100k       674   0.090
$100-125k      763   0.102
$125k or more  1468  0.197
NA             517   0.069
Total          7453  1.000

A.4 POST-ELECTION SURVEY QUESTIONNAIRE

V202051
Description: POST: R registered to vote (post-election)
Question: Now on a different topic. Are you registered to vote at [Respondent’s preloaded address], registered at a different address, or not currently registered?
Variable class: haven_labelled, vctrs_vctr, double

V202051  Label  n  Unweighted Freq
-9  -9. Refused  4  0.001
-6  -6. No post-election interview  4  0.001
-1  -1. Inapplicable  6820  0.915
1   Registered at this address  173  0.023
2   Registered at a different address  59  0.008
3   Not currently registered  393  0.053
Total  7453  1.000

V202066
Description: POST: Did R vote in November 2020 election
Question: In talking to people about elections, we often find that a lot of people were not able to vote because they weren’t registered, they were sick, or they just didn’t have time. Which of the following statements best describes you:
Variable class: haven_labelled, vctrs_vctr, double

V202066  Label  n  Unweighted Freq
-9  -9. Refused  7  0.001
-6  -6. No post-election interview  4  0.001
-1  -1. Inapplicable  372  0.050
1   I did not vote (in the election this November)  582  0.078
2   I thought about voting this time, but didn’t  265  0.036
3   I usually vote, but didn’t this time  192  0.026
4   I am sure I voted  6031  0.809
Total  7453  1.000
V202072
Description: POST: Did R vote for President
Question: How about the election for President? Did you vote for a candidate for President?
Variable class: haven_labelled, vctrs_vctr, double

V202072  Label  n  Unweighted Freq
-9  -9. Refused  2  0.000
-6  -6. No post-election interview  4  0.001
-1  -1. Inapplicable  1418  0.190
1   Yes, voted for President  5952  0.799
2   No, didn’t vote for President  77  0.010
Total  7453  1.000

VotedPres2020
Description: POST: Did R vote for President
Question: How about the election for President? Did you vote for a candidate for President?
Variable class: factor

VotedPres2020  n     Unweighted Freq
Yes            5952  0.799
No             77    0.010
NA             1424  0.191
Total          7453  1.000

V202073
Description: POST: For whom did R vote for President
Question: Who did you vote for? [Joe Biden, Donald Trump/Donald Trump, Joe Biden], Jo Jorgensen, Howie Hawkins, or someone else?
Variable class: haven_labelled, vctrs_vctr, double

V202073  Label  n  Unweighted Freq
-9  -9. Refused  53  0.007
-6  -6. No post-election interview  4  0.001
-1  -1. Inapplicable  1497  0.201
1   Joe Biden  3267  0.438
2   Donald Trump  2462  0.330
3   Jo Jorgensen  69  0.009
4   Howie Hawkins  23  0.003
5   Other candidate {SPECIFY}  56  0.008
7   Specified as Republican candidate  1  0.000
8   Specified as Libertarian candidate  3  0.000
11  Specified as don’t know  2  0.000
12  Specified as refused  16  0.002
Total  7453  1.000

V202109x
Description: PRE-POST: SUMMARY: Voter turnout in 2020
Question: Derived from V201024, V202066, V202051
Variable class: haven_labelled, vctrs_vctr, double

V202109x  Label  n  Unweighted Freq
-2  -2. Not reported  7  0.001
0   Did not vote  1039  0.139
1   Voted  6407  0.860
Total  7453  1.000

V202110x
Description: PRE-POST: SUMMARY: 2020 Presidential vote
Question: Derived from V201029, V202073
Variable class: haven_labelled, vctrs_vctr, double

V202110x  Label  n  Unweighted Freq
-9  -9. Refused  81  0.011
-8  -8. Don’t know  2  0.000
-1  -1. Inapplicable  1136  0.152
1   Joe Biden  3509  0.471
2   Donald Trump  2567  0.344
3   Jo Jorgensen  74  0.010
4   Howie Hawkins  24  0.003
5   Other candidate {SPECIFY}  60  0.008
Total  7453  1.000

VotedPres2020_selection
Description: PRE-POST: SUMMARY: 2020 Presidential vote
Question: Derived from V201029, V202073
Variable class: factor

VotedPres2020_selection  n     Unweighted Freq
Biden                    3509  0.471
Trump                    2567  0.344
Other                    158   0.021
NA                       1219  0.164
Total                    7453  1.000

EarlyVote2020
Description: PRE-POST: Voted early for president
Question: Derived from V201025x, VotedPres2020
Variable class: factor

EarlyVote2020  n     Unweighted Freq
Yes            371   0.050
No             5949  0.798
NA             1133  0.152
Total          7453  1.000

B RECS Derived Variable Codebook

The full codebook with the original variables is available at https://www.eia.gov/consumption/residential/data/2020/index.php?view=microdata - “Variable and response codebook”. This codebook includes the variables on the dataset included for download along with this book.
B.1 ADMIN

DOEID
Description: Unique identifier for each respondent

ClimateRegion_BA
Description: Building America Climate Zone

ClimateRegion_BA  n      Unweighted Freq
Mixed-Dry         142    0.008
Mixed-Humid       5579   0.302
Hot-Humid         2545   0.138
Hot-Dry           1577   0.085
Very-Cold         572    0.031
Cold              7116   0.385
Marine            911    0.049
Subarctic         54     0.003
Total             18496  1.000

Urbanicity
Description: 2010 Census Urban Type Code

Urbanicity     n      Unweighted Freq
Urban Area     12395  0.670
Urban Cluster  2020   0.109
Rural          4081   0.221
Total          18496  1.000

B.2 GEOGRAPHY

Region
Description: Census Region

Region     n      Unweighted Freq
Northeast  3657   0.198
Midwest    3832   0.207
South      6426   0.347
West       4581   0.248
Total      18496  1.000

REGIONC
Description: Census Region

REGIONC    n      Unweighted Freq
MIDWEST    3832   0.207
NORTHEAST  3657   0.198
SOUTH      6426   0.347
WEST       4581   0.248
Total      18496  1.000

Division
Description: Census Division, Mountain Division is divided into North and South for RECS purposes

Division            n      Unweighted Freq
New England         1680   0.091
Middle Atlantic     1977   0.107
East North Central  2014   0.109
West North Central  1818   0.098
South Atlantic      3256   0.176
East South Central  1343   0.073
West South Central  1827   0.099
Mountain North      1180   0.064
Mountain South      904    0.049
Pacific             2497   0.135
Total               18496  1.000

STATE_FIPS
Description: State Federal Information Processing System Code
STATE_FIPS: n (Unweighted Freq) — 01: 242 (0.013); 02: 311 (0.017); 04: 495 (0.027); 05: 268 (0.014); 06: 1152 (0.062); 08: 360 (0.019); 09: 294 (0.016); 10: 143 (0.008); 11: 221 (0.012); 12: 655 (0.035); 13: 417 (0.023); 15: 282 (0.015); 16: 270 (0.015); 17: 530 (0.029); 18: 400 (0.022); 19: 286 (0.015); 20: 208 (0.011); 21: 428 (0.023); 22: 311 (0.017); 23: 223 (0.012); 24: 359 (0.019); 25: 552 (0.030); 26: 388 (0.021); 27: 325 (0.018); 28: 168 (0.009); 29: 296 (0.016); 30: 172 (0.009); 31: 189 (0.010); 32: 231 (0.012); 33: 175 (0.009); 34: 456 (0.025); 35: 178 (0.010); 36: 904 (0.049); 37: 479 (0.026); 38: 331 (0.018); 39: 339 (0.018); 40: 232 (0.013); 41: 313 (0.017); 42: 617 (0.033); 44: 191 (0.010); 45: 334 (0.018); 46: 183 (0.010); 47: 505 (0.027); 48: 1016 (0.055); 49: 188 (0.010); 50: 245 (0.013); 51: 451 (0.024); 53: 439 (0.024); 54: 197 (0.011); 55: 357 (0.019); 56: 190 (0.010); Total: 18496 (1.000)

state_postal
Description: State Postal Code
state_postal: n (Unweighted Freq) — AL: 242 (0.013); AK: 311 (0.017); AZ: 495 (0.027); AR: 268 (0.014); CA: 1152 (0.062); CO: 360 (0.019); CT: 294 (0.016); DE: 143 (0.008); DC: 221 (0.012); FL: 655 (0.035); GA: 417 (0.023); HI: 282 (0.015); ID: 270 (0.015); IL: 530 (0.029); IN: 400 (0.022); IA: 286 (0.015); KS: 208 (0.011); KY: 428 (0.023); LA: 311 (0.017); ME: 223 (0.012); MD: 359 (0.019); MA: 552 (0.030); MI: 388 (0.021); MN: 325 (0.018); MS: 168 (0.009); MO: 296 (0.016); MT: 172 (0.009); NE: 189 (0.010); NV: 231 (0.012); NH: 175 (0.009); NJ: 456 (0.025); NM: 178 (0.010); NY: 904 (0.049); NC: 479 (0.026); ND: 331 (0.018); OH: 339 (0.018); OK: 232 (0.013); OR: 313 (0.017); PA: 617 (0.033); RI: 191 (0.010); SC: 334 (0.018); SD: 183 (0.010); TN: 505 (0.027); TX: 1016 (0.055); UT: 188 (0.010); VT: 245 (0.013); VA: 451 (0.024); WA: 439 (0.024); WV: 197 (0.011); WI: 357 (0.019); WY: 190 (0.010); Total: 18496 (1.000)
state_name
Description: State Name
state_name: n (Unweighted Freq) — Alabama: 242 (0.013); Alaska: 311 (0.017); Arizona: 495 (0.027); Arkansas: 268 (0.014); California: 1152 (0.062); Colorado: 360 (0.019); Connecticut: 294 (0.016); Delaware: 143 (0.008); District of Columbia: 221 (0.012); Florida: 655 (0.035); Georgia: 417 (0.023); Hawaii: 282 (0.015); Idaho: 270 (0.015); Illinois: 530 (0.029); Indiana: 400 (0.022); Iowa: 286 (0.015); Kansas: 208 (0.011); Kentucky: 428 (0.023); Louisiana: 311 (0.017); Maine: 223 (0.012); Maryland: 359 (0.019); Massachusetts: 552 (0.030); Michigan: 388 (0.021); Minnesota: 325 (0.018); Mississippi: 168 (0.009); Missouri: 296 (0.016); Montana: 172 (0.009); Nebraska: 189 (0.010); Nevada: 231 (0.012); New Hampshire: 175 (0.009); New Jersey: 456 (0.025); New Mexico: 178 (0.010); New York: 904 (0.049); North Carolina: 479 (0.026); North Dakota: 331 (0.018); Ohio: 339 (0.018); Oklahoma: 232 (0.013); Oregon: 313 (0.017); Pennsylvania: 617 (0.033); Rhode Island: 191 (0.010); South Carolina: 334 (0.018); South Dakota: 183 (0.010); Tennessee: 505 (0.027); Texas: 1016 (0.055); Utah: 188 (0.010); Vermont: 245 (0.013); Virginia: 451 (0.024); Washington: 439 (0.024); West Virginia: 197 (0.011); Wisconsin: 357 (0.019); Wyoming: 190 (0.010); Total: 18496 (1.000)

B.3 WEATHER

HDD65
Description: Heating degree days in 2020, base temperature 65F; Derived from the weighted temperatures of nearby weather stations
N Missing: 0; Minimum: 0; Median: 4396; Maximum: 17383

CDD65
Description: Cooling degree days in 2020, base temperature 65F; Derived from the weighted temperatures of nearby weather stations
N Missing: 0; Minimum: 0; Median: 1179; Maximum: 5534

HDD30YR
Description: Heating degree days, 30-year average 1981-2010, base temperature 65F; Taken from nearest weather station, inoculated with random errors
N Missing: 0; Minimum: 0; Median: 4825; Maximum: 16071

CDD30YR
Description: Cooling degree days, 30-year average 1981-2010, base temperature 65F; Taken from nearest weather station, inoculated with random errors
N Missing: 0; Minimum: 0; Median: 1020; Maximum: 4905

B.4 YOUR HOME

HousingUnitType
Description: Type of housing unit
Question: Which best describes your home?

HousingUnitType             n      Unweighted Freq
Mobile home                 974    0.053
Single-family detached      12319  0.666
Single-family attached      1751   0.095
Apartment: 2-4 Units        1013   0.055
Apartment: 5 or more units  2439   0.132
Total                       18496  1.000

YearMade
Description: Range when housing unit was built
Question: Derived from: In what year was your home built? AND Although you do not know the exact year your home was built, it is helpful to have an estimate. About when was your home built?

YearMade     n      Unweighted Freq
Before 1950  2721   0.147
1950-1959    1685   0.091
1960-1969    1867   0.101
1970-1979    2817   0.152
1980-1989    2435   0.132
1990-1999    2451   0.133
2000-2009    2748   0.149
2010-2015    989    0.053
2016-2020    783    0.042
Total        18496  1.000

TOTSQFT_EN
Description: Total energy-consuming area (square footage) of the housing unit. Includes all main living areas; all basements; heated, cooled, or finished attics; and heated or cooled garages. For single-family housing units this is derived using the respondent-reported square footage (SQFTEST) and adjusted using the “include” variables (e.g., SQFTINCB), where applicable. For apartments and mobile homes this is the respondent-reported square footage. A derived variable rounded to the nearest 10
N Missing: 0; Minimum: 200; Median: 1700; Maximum: 15000

TOTHSQFT
Description: Square footage of the housing unit that is heated by space heating equipment. A derived variable rounded to the nearest 10
N Missing: 0; Minimum: 0; Median: 1520; Maximum: 15000

TOTCSQFT
Description: Square footage of the housing unit that is cooled by air-conditioning equipment or evaporative cooler, a derived variable rounded to the nearest 10
N Missing: 0; Minimum: 0; Median: 1200; Maximum: 14600

ZTOTSQFT_EN
Description: Imputation indicator for SQFTEST

ZTOTSQFT_EN  n      Unweighted Freq
Not imputed  11930  0.645
Imputed      6566   0.355
Total        18496  1.000

ZYearMade
Description: Imputation indicator for YEARMADERANGE

ZYearMade    n      Unweighted Freq
Not imputed  18176  0.983
Imputed      320    0.017
Total        18496  1.000

ZHousingUnitType
Description: Imputation indicator for TYPEHUQ

ZHousingUnitType  n      Unweighted Freq
Not imputed       18496  1
Total             18496  1
B.5 SPACE HEATING

SpaceHeatingUsed
Description: Space heating equipment used
Question: Is your home heated during the winter?

SpaceHeatingUsed  n      Unweighted Freq
FALSE             751    0.041
TRUE              17745  0.959
Total             18496  1.000

ZSpaceHeatingUsed
Description: Imputation indicator for HEATHOME

ZSpaceHeatingUsed  n      Unweighted Freq
Not imputed        18474  0.999
Imputed            22     0.001
Total              18496  1.000

B.6 AIR CONDITIONING

ACUsed
Description: Air conditioning equipment used
Question: Is any air conditioning equipment used in your home?

ACUsed  n      Unweighted Freq
FALSE   2325   0.126
TRUE    16171  0.874
Total   18496  1.000

ZACUsed
Description: Imputation indicator for AIRCOND

ZACUsed      n      Unweighted Freq
Not imputed  18448  0.997
Imputed      48     0.003
Total        18496  1.000

ZACBehavior
Description: Imputation indicator for COOLCNTL

ZACBehavior     n      Unweighted Freq
Not imputed     15819  0.855
Imputed         352    0.019
Not applicable  2325   0.126
Total           18496  1.000

B.7 THERMOSTAT

HeatingBehavior
Description: Winter temperature control method
Question: Which of the following best describes how your household controls the indoor temperature during the winter?

HeatingBehavior  n  Unweighted Freq
Set one temp and leave it  7806  0.422
Manually adjust at night/no one home  4654  0.252
Programmable or smart thermostat automatically adjusts the temperature  3310  0.179
Turn on or off as needed  1491  0.081
No control  438  0.024
Other  46  0.002
NA  751  0.041
Total  18496  1.000

WinterTempDay
Description: Winter thermostat setting or temperature in home when someone is home during the day
Question: During the winter, what is your home’s typical indoor temperature when someone is home during the day?
N Missing: 751; Minimum: 50; Median: 70; Maximum: 90

WinterTempAway
Description: Winter thermostat setting or temperature in home when no one is home during the day
Question: During the winter, what is your home’s typical indoor temperature when no one is inside your home during the day?
N Missing: 751; Minimum: 50; Median: 68; Maximum: 90

WinterTempNight
Description: Winter thermostat setting or temperature in home at night
Question: During the winter, what is your home’s typical indoor temperature inside your home at night?
N Missing: 751; Minimum: 50; Median: 68; Maximum: 90

ACBehavior
Description: Summer temperature control method
Question: Which of the following best describes how your household controls the indoor temperature during the summer?

ACBehavior  n  Unweighted Freq
Set one temp and leave it  6738  0.364
Manually adjust at night/no one home  3637  0.197
Programmable or smart thermostat automatically adjusts the temperature  2638  0.143
Turn on or off as needed  2746  0.148
No control  409  0.022
Other  3  0.000
NA  2325  0.126
Total  18496  1.000

SummerTempDay
Description: Summer thermostat setting or temperature in home when someone is home during the day
Question: During the summer, what is your home’s typical indoor temperature when someone is home during the day?
N Missing: 2325; Minimum: 50; Median: 72; Maximum: 90

SummerTempAway
Description: Summer thermostat setting or temperature in home when no one is home during the day
Question: During the summer, what is your home’s typical indoor temperature when no one is inside your home during the day?
N Missing: 2325; Minimum: 50; Median: 74; Maximum: 90
SummerTempNight
Description: Summer thermostat setting or temperature in home at night
Question: During the summer, what is your home’s typical indoor temperature inside your home at night?
N Missing: 2325; Minimum: 50; Median: 72; Maximum: 90

ZHeatingBehavior
Description: Imputation indicator for HEATCNTL

ZHeatingBehavior  n      Unweighted Freq
Not imputed       17395  0.940
Imputed           350    0.019
Not applicable    751    0.041
Total             18496  1.000

ZWinterTempAway
Description: Imputation indicator for TEMPGONE

ZWinterTempAway  n      Unweighted Freq
Not imputed      16840  0.910
Imputed          905    0.049
Not applicable   751    0.041
Total            18496  1.000

ZSummerTempAway
Description: Imputation indicator for TEMPGONEAC

ZSummerTempAway  n      Unweighted Freq
Not imputed      15240  0.824
Imputed          931    0.050
Not applicable   2325   0.126
Total            18496  1.000

ZWinterTempDay
Description: Imputation indicator for TEMPHOME

ZWinterTempDay  n      Unweighted Freq
Not imputed     17382  0.940
Imputed         363    0.020
Not applicable  751    0.041
Total           18496  1.000

ZSummerTempDay
Description: Imputation indicator for TEMPHOMEAC

ZSummerTempDay  n      Unweighted Freq
Not imputed     15658  0.847
Imputed         513    0.028
Not applicable  2325   0.126
Total           18496  1.000

ZWinterTempNight
Description: Imputation indicator for TEMPNITE

ZWinterTempNight  n      Unweighted Freq
Not imputed       17207  0.930
Imputed           538    0.029
Not applicable    751    0.041
Total             18496  1.000

ZSummerTempNight
Description: Imputation indicator for TEMPNITEAC

ZSummerTempNight  n      Unweighted Freq
Not imputed       15497  0.838
Imputed           674    0.036
Not applicable    2325   0.126
Total             18496  1.000

B.8 WEIGHTS

NWEIGHT
Description: Final Analysis Weight
N Missing: 0; Minimum: 437.9; Median: 6119; Maximum: 29279

NWEIGHT1-NWEIGHT60
Description: Final Analysis Weight for replicates 1 through 60. Every replicate weight has N Missing = 0 and Minimum = 0; the medians and maximums are:

Replicate  Median  Maximum
1          6136    30015
2          6151    29422
3          6151    29431
4          6153    29494
5          6134    30039
6          6147    29419
7          6135    29586
8          6151    29499
9          6139    29845
10         6163    29635
11         6140    29681
12         6160    29849
13         6142    29843
14         6154    30184
15         6145    29970
16         6133    29825
17         6126    30606
18         6155    29689
19         6153    29336
20         6139    30274
21         6135    29766
22         6149    29791
23         6148    30126
24         6136    29946
25         6150    30445
26         6136    29893
27         6125    30030
28         6149    29599
29         6146    30136
30         6149    29895
31         6144    29604
32         6159    29310
33         6148    29408
34         6139    29564
35         6141    30437
36         6149    27896
37         6133    30596
38         6139    30130
39         6147    29262
40         6144    30344
41         6153    29594
42         6137    29938
43         6157    29878
44         6148    29896
45         6149    29729
46         6152    29103
47         6150    30070
48         6139    29343
49         6146    29590
50         6159    30027
51         6150    29247
52         6154    29445
53         6156    30131
54         6151    29439
55         6143    29216
56         6153    29203
57         6138    29819
58         6137    29818
59         6144    29606
60         6140    29818
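NWEIGHT and the 60 replicate weights above are used together for variance estimation, which Chapter 10 covers in detail. As a minimal sketch only (assuming the recs_2020 data from {srvyrexploR}; the jackknife type and 59/60 scale factor are assumptions to confirm against the RECS documentation), a replicate design might be specified as:

library(dplyr)
library(srvyr)
library(srvyrexploR)

# Minimal sketch of a jackknife replicate-weight design built from
# NWEIGHT and the replicate weights NWEIGHT1-NWEIGHT60
recs_des <- recs_2020 %>%
  as_survey_rep(
    weights = NWEIGHT,
    repweights = NWEIGHT1:NWEIGHT60,
    type = "JK1",      # assumed jackknife replication type
    scale = 59 / 60,   # assumed scale factor; verify in the RECS docs
    mse = TRUE
  )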
B.9 CONSUMPTION AND EXPENDITURE

BTUEL
Description: Total electricity use, in thousand Btu, 2020, including self-generation of solar power
N Missing: 0; Minimum: 143.3; Median: 31890; Maximum: 628155

DOLLAREL
Description: Total electricity cost, in dollars, 2020
N Missing: 0; Minimum: -889.5; Median: 1258; Maximum: 15680

ZBTUEL
Description: Imputation flag for total electricity use

ZBTUEL  n  Unweighted Freq
Not imputed  15965  0.863
Imputed amount and cost  2138  0.116
Imputed only amount for SOLAR=1 cases  393  0.021
Total  18496  1.000

BTUNG
Description: Total natural gas use, in thousand Btu, 2020
N Missing: 0; Minimum: 0; Median: 22012; Maximum: 1134709

DOLLARNG
Description: Total natural gas cost, in dollars, 2020
N Missing: 0; Minimum: 0; Median: 313.9; Maximum: 8155

ZBTUNG
Description: Imputation flag for total natural gas use

ZBTUNG          n      Unweighted Freq
Not imputed     8823   0.477
Imputed         2331   0.126
Not applicable  7342   0.397
Total           18496  1.000

BTULP
Description: Total propane use, in thousand Btu, 2020
N Missing: 0; Minimum: 0; Median: 0; Maximum: 364215

DOLLARLP
Description: Total propane cost, in dollars, 2020
N Missing: 0; Minimum: 0; Median: 0; Maximum: 6621

ZBTULP
Description: Imputation flag for total propane use

ZBTULP          n      Unweighted Freq
Not imputed     896    0.048
Imputed         1103   0.060
Not applicable  16497  0.892
Total           18496  1.000

BTUFO
Description: Total fuel oil/kerosene use, in thousand Btu, 2020
N Missing: 0; Minimum: 0; Median: 0; Maximum: 426268

DOLLARFO
Description: Total fuel oil/kerosene cost, in dollars, 2020
N Missing: 0; Minimum: 0; Median: 0; Maximum: 7004

ZBTUFO
Description: Imputation flag for total fuel oil/kerosene use

ZBTUFO          n      Unweighted Freq
Not imputed     626    0.034
Imputed         607    0.033
Not applicable  17263  0.933
Total           18496  1.000

BTUWOOD
Description: Total wood use, in thousand Btu, 2020
N Missing: 0; Minimum: 0; Median: 0; Maximum: 500000

ZBTUWOOD
Description: Imputation flag for total wood use

ZBTUWOOD        n      Unweighted Freq
Not imputed     1730   0.094
Imputed         244    0.013
Not applicable  16522  0.893
Total           18496  1.000

TOTALBTU
Description: Total usage including electricity, natural gas, propane, and fuel oil, in thousand Btu, 2020
N Missing: 0; Minimum: 1182; Median: 74180; Maximum: 1367548

TOTALDOL
Description: Total cost including electricity, natural gas, propane, and fuel oil, in dollars, 2020
N Missing: 0; Minimum: -150.5; Median: 1793; Maximum: 20043
C Importing survey data into R

To analyze a survey, we need to import the survey data into R. This process is often referred to as importing, loading, or reading in data. Survey files come in different formats depending on the software used to create them. One of the many advantages of R is its flexibility in handling various data formats, regardless of their file extensions. Here are examples of common public-use survey file formats we may encounter:

Delimiter-separated text files
Excel spreadsheets in .xls or .xlsx format
R native .rda files
Stata datasets in .dta format
SAS datasets in .sas format
SPSS datasets in .sav format
Application Programming Interfaces (APIs), often in JSON format
Data stored in databases

This appendix guides analysts through the process of importing these various types of survey data into R.

C.1 Importing delimiter-separated files into R

Delimiter-separated files use specific characters, known as delimiters, to separate values within the file. For example, CSV (comma-separated values) files use commas as delimiters, while TSV (tab-separated values) files use tabs. These file formats are widely used because of their simplicity and compatibility with various software applications.

The {readr} package, part of the tidyverse ecosystem, offers efficient ways to import delimiter-separated files into R. It provides several advantages, including automatic data type detection and flexible handling of missing values, depending on one’s survey research needs. The {readr} package includes functions for:

read_csv(): This function is specifically designed to read CSV files.
read_tsv(): Use this function for tab-separated values (TSV) files.
read_delim(): This function can handle a broader range of delimiter-separated files, including CSV and TSV. Specify the delimiter using the delim argument.
read_fwf(): This function is useful for importing fixed-width files, where columns have predetermined widths, and values are aligned in specific positions.
read_table(): Use this function when dealing with whitespace-separated files, such as those with spaces or multiple spaces as delimiters.
read_log(): This function can read and parse web log files.

The syntax for read_csv() is:

read_csv(
  file,
  col_names = TRUE,
  col_types = NULL,
  col_select = NULL,
  id = NULL,
  locale = default_locale(),
  na = c("", "NA"),
  comment = "",
  trim_ws = TRUE,
  skip = 0,
  n_max = Inf,
  guess_max = min(1000, n_max),
  name_repair = "unique",
  num_threads = readr_threads(),
  progress = show_progress(),
  show_col_types = should_show_types(),
  skip_empty_rows = TRUE,
  lazy = should_read_lazy()
)

The arguments are:

file: the path to the CSV file to import
col_names: a value of TRUE will import the first row of the file as column names rather than as part of the data frame. A value of FALSE will create automated column names. Alternatively, we can provide a vector of column names.
col_types: by default, R will infer the column variable types. We can also provide a column specification using list() or cols(); for example, use col_types = cols(.default = "c") to read all the columns as characters. Alternatively, we can use a string to specify the variable types for each column.
col_select: the columns to include in the results
id: a column for storing the file path. This is useful for keeping track of the input file when importing multiple CSVs at a time.
locale: the location-specific defaults for the file
na: a character vector of values to interpret as missing
comment: a character vector of values to interpret as comments
trim_ws: a value of TRUE will trim leading and trailing white space
skip: number of lines to skip before importing the data
n_max: maximum number of lines to read
guess_max: maximum number of lines used for guessing column types
name_repair: whether to check column names. By default, the column names are unique.
num_threads: the number of processing threads to use for initial parsing and lazy reading of data
progress: a value of TRUE displays a progress bar
show_col_types: a value of TRUE displays the column types
skip_empty_rows: a value of TRUE will ignore blank rows
lazy: a value of TRUE will read values lazily

The other functions share a similar syntax to read_csv(). To find more details, run ?? followed by the function name. For example, run ??read_delim in the Console for additional information.

In the example below, we use {readr} to load a CSV file named ‘anes_timeseries_2020_csv_20220210.csv’ into an R object called anes_csv. The read_csv() function imports the file and stores the data in the anes_csv object. We can then use this object for further analysis.

library(readr)

anes_csv <- read_csv("data/anes_timeseries_2020_csv_20220210.csv")
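The col_select and col_types arguments are especially handy when we only need part of a large file. As a quick illustration (the choice of variables here is ours, not part of the original example), we could import just the case ID, weight, and age summary variables and set their types explicitly:

library(readr)

# Sketch: read three ANES columns, forcing the case ID to character
anes_subset <- read_csv(
  "data/anes_timeseries_2020_csv_20220210.csv",
  col_select = c(V200001, V200010b, V201507x),
  col_types = cols(
    V200001 = col_character(),  # case ID
    V200010b = col_double(),    # full sample post-election weight
    V201507x = col_integer()    # respondent age summary
  )
)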
C.2 Loading Excel files into R

Excel, a widely used spreadsheet software program created by Microsoft, is a common file format in survey research. We can load Excel spreadsheets into the R environment using the {readxl} package. The package supports both the legacy .xls files and the modern .xlsx format.

To load Excel data into R, we can use the read_excel() function from the {readxl} package. This function offers a range of customizable options for the import process. Let’s explore the syntax:

read_excel(
  path,
  sheet = NULL,
  range = NULL,
  col_names = TRUE,
  col_types = NULL,
  na = "",
  trim_ws = TRUE,
  skip = 0,
  n_max = Inf,
  guess_max = min(1000, n_max),
  progress = readxl_progress(),
  .name_repair = "unique"
)

The arguments are:

path: the path to the Excel file to import
sheet: the name or index of the sheet (sometimes called tabs) within the Excel file
range: the range of cells to import (for example, “P15:T87”)
col_names: indicates whether the first row of the dataset contains column names
col_types: specify the data types of columns
na: define the representation of missing values (for example, NULL)
trim_ws: controls whether leading and trailing whitespaces should be trimmed
skip and n_max: enable skipping rows and limit the number of rows imported
guess_max: sets the maximum number of rows used for data type guessing
progress: specifies a progress bar for large imports
.name_repair: determines how column names are repaired if they are not valid

In the code example below, we load an Excel spreadsheet named ‘anes_timeseries_2020_csv_20220210.xlsx’ into R. The resulting data is saved as a tibble in the anes_excel object, ready for further analysis.

library(readxl)

anes_excel <- read_excel(path = "data/anes_timeseries_2020_csv_20220210.xlsx")
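The sheet and range arguments are useful when a workbook holds more than the data itself (cover sheets, notes, and so on). The sheet name and cell range below are hypothetical, purely to illustrate the syntax:

library(readxl)

# Sketch: read cells A1:D100 from a hypothetical sheet named "data"
anes_excel_subset <- read_excel(
  path = "data/anes_timeseries_2020_csv_20220210.xlsx",
  sheet = "data",
  range = "A1:D100"
)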
C.3 Importing Stata, SAS, and SPSS files into R

The {haven} package, also from the tidyverse ecosystem, imports various proprietary data formats: Stata .dta files, SPSS .sav files, and SAS .sas7bdat and .sas7bcat files. One of the notable strengths of the {haven} package is its ability to handle multiple proprietary formats within a unified framework. It offers dedicated functions for each supported proprietary format, making it straightforward to import data regardless of the program. Here, we introduce read_dta() for Stata files, read_sav() for SPSS files, and read_sas() for SAS files.

C.3.1 Syntax

Let's explore the syntax for importing Stata .dta files using haven::read_dta():

```r
read_dta(
  file,
  encoding = NULL,
  col_select = NULL,
  skip = 0,
  n_max = Inf,
  .name_repair = "unique"
)
```

The arguments are:

- file: the path to the proprietary data file to import
- encoding: specifies the character encoding of the data file
- col_select: select specific columns for import
- skip and n_max: control the number of rows skipped and the maximum number of rows imported
- .name_repair: determines how column names are repaired if they are not valid

The syntax for read_sav() is similar to read_dta():

```r
read_sav(
  file,
  encoding = NULL,
  user_na = FALSE,
  col_select = NULL,
  skip = 0,
  n_max = Inf,
  .name_repair = "unique"
)
```

The arguments are:

- file: the path to the proprietary data file to import
- encoding: specifies the character encoding of the data file
- col_select: select specific columns for import
- user_na: a value of TRUE reads variables with user-defined missing labels into labelled_spss() objects
- skip and n_max: control the number of rows skipped and the maximum number of rows imported
- .name_repair: determines how column names are repaired if they are not valid

The syntax for importing SAS files with read_sas() is as follows:

```r
read_sas(
  data_file,
  catalog_file = NULL,
  encoding = NULL,
  catalog_encoding = encoding,
  col_select = NULL,
  skip = 0L,
  n_max = Inf,
  .name_repair = "unique"
)
```

The arguments are:

- data_file: the path to the proprietary data file to import
- catalog_file: the path to the catalog file to import
- encoding: specifies the character encoding of the data file
- catalog_encoding: specifies the character encoding of the catalog file
- col_select: select specific columns for import
- skip and n_max: control the number of rows skipped and the maximum number of rows imported
- .name_repair: determines how column names are repaired if they are not valid

In the code examples below, we demonstrate how to load Stata, SPSS, and SAS files into R using the respective {haven} functions. The resulting data are stored in the anes_dta, anes_sav, and anes_sas objects as tibbles, ready for use in R.

Stata:

```r
library(haven)

anes_dta <- read_dta(system.file("extdata",
                                 "anes_2020_stata_example.dta",
                                 package = "srvyrexploR"))
```

SPSS:

```r
library(haven)

anes_sav <- read_sav(file = "data/anes_timeseries_2020_spss_20220210.sav")
```

SAS:

```r
library(haven)

anes_sas <- read_sas(data_file = "data/anes_timeseries_2020_sas_20220210.sas7bdat")
```
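For large files, the col_select and n_max arguments can limit what is read before committing to a full import. A minimal sketch, assuming we only want to preview two columns and the first 100 rows; the column choice is illustrative.

```r
library(haven)

# Preview a Stata file: read only two columns and the first 100 rows
# (the selected columns are examples, not a recommendation)
anes_preview <- read_dta(
  system.file("extdata", "anes_2020_stata_example.dta", package = "srvyrexploR"),
  col_select = c(V200001, V200002),
  n_max = 100
)
```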
C.3.2 Working with labeled data

Stata, SPSS, and SAS files often contain labeled variables and values. These labels provide descriptive information about categorical data, making it easier to understand and analyze. When importing data from Stata, SPSS, or SAS, preserving these labels is essential for maintaining data fidelity.

Consider a variable like 'Education Level' with coded values (e.g., 1, 2, 3). Without labels, these codes can be cryptic. However, with labels ('High School Graduate,' 'Bachelor's Degree,' 'Master's Degree'), the data becomes more informative and easier to work with.

With the {haven} package, we can import and work with labeled data from Stata, SPSS, and SAS files. The package uses a special class of data called haven_labelled to store labeled variables. When a dataset label is defined in Stata, it is stored in the 'label' attribute of the tibble when imported, ensuring that the information is not lost.

We can use functions like select(), glimpse(), and is.labelled() to inspect the imported data and verify whether variables are labeled. Take a look at the ANES Stata file. Notice that categorical variables are marked with a type of <dbl+lbl>. This notation indicates that these variables are labeled.

```r
library(dplyr)

anes_dta %>%
  select(1:6) %>%
  glimpse()
```

```
## Rows: 7,453
## Columns: 6
## $ V200001  <dbl> 200015, 200022, 200039, 200046, 200053, 200060, 20008…
## $ V200002  <dbl+lbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ V200010b <dbl> 1.0057, 1.1635, 0.7687, 0.5210, 0.9658, 0.2347, 0.440…
## $ V200010d <dbl> 9, 26, 41, 29, 23, 37, 7, 37, 32, 41, 22, 7, 38, 21, …
## $ V200010c <dbl> 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1,…
## $ V201006  <dbl+lbl> 2, 3, 2, 3, 2, 1, 2, 3, 2, 2, 2, 2, 2, 1, 2, 1, 1…
```

We can confirm this label status using the haven::is.labelled() function.

```r
haven::is.labelled(anes_dta$V200002)
```

```
## [1] TRUE
```

To explore the labels further, we can use the attributes() function. This function provides insights into both the variable labels ($label) and the associated value labels ($labels).

```r
attributes(anes_dta$V200002)
```

```
## $label
## [1] "Mode of interview: pre-election interview"
##
## $format.stata
## [1] "%10.0g"
##
## $class
## [1] "haven_labelled" "vctrs_vctr"     "double"
##
## $labels
##     1. Video 2. Telephone       3. Web
##            1            2            3
```

When we import a labeled dataset using {haven}, the result is a tibble containing both the data and the label information. However, this is meant to be an intermediary data structure, not the final data format for analysis. Instead, we should convert it into a regular R data frame before continuing our data workflow. There are two primary methods to achieve this conversion: (1) convert to factors or (2) remove the labels.

Option 1: Convert the vector into a factor

Factors are native R data types for working with categorical data. They consist of integer values that correspond to character values, known as levels. Below is a dummy example of factors. Printing factors shows the four different levels in the data: strongly agree, agree, disagree, and strongly disagree.

```r
response <- c("strongly agree", "agree", "agree", "disagree")

response_levels <- c("strongly agree", "agree", "disagree", "strongly disagree")

factors <- factor(response, levels = response_levels)

factors
```

```
## [1] strongly agree agree          agree          disagree
## Levels: strongly agree agree disagree strongly disagree
```

Factors are integer vectors, though they may look like character strings. We can confirm by looking at the vector's structure:

```r
glimpse(factors)
```

```
##  Factor w/ 4 levels "strongly agree",..: 1 2 2 3
```

R's factors differ from Stata, SPSS, or SAS's labeled vectors. However, we can convert labeled variables into factors using the as_factor() function.

```r
anes_dta %>%
  transmute(V200002 = as_factor(V200002))
```

```
## # A tibble: 7,453 × 1
##    V200002
##    <fct>
##  1 3. Web
##  2 3. Web
##  3 3. Web
##  4 3. Web
##  5 3. Web
##  6 3. Web
##  7 3. Web
##  8 3. Web
##  9 3. Web
## 10 3. Web
## # ℹ 7,443 more rows
```
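The levels argument of haven's as_factor() controls how the factor levels are built from the labels. A minimal sketch; the output column names here are hypothetical.

```r
library(haven)
library(dplyr)

# levels = "values" makes the underlying codes the factor levels;
# levels = "both" combines each code with its label
anes_dta %>%
  transmute(mode_code = as_factor(V200002, levels = "values"),
            mode_both = as_factor(V200002, levels = "both"))
```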
The as_factor() function can be applied to all columns in a data frame or to individual ones. Below, we convert all <dbl+lbl> columns into factors.

```r
anes_dta_factor <- anes_dta %>%
  as_factor()

anes_dta_factor %>%
  select(1:6) %>%
  glimpse()
```

```
## Rows: 7,453
## Columns: 6
## $ V200001  <dbl> 200015, 200022, 200039, 200046, 200053, 200060, 20008…
## $ V200002  <fct> 3. Web, 3. Web, 3. Web, 3. Web, 3. Web, 3. Web, 3. We…
## $ V200010b <dbl> 1.0057, 1.1635, 0.7687, 0.5210, 0.9658, 0.2347, 0.440…
## $ V200010d <dbl> 9, 26, 41, 29, 23, 37, 7, 37, 32, 41, 22, 7, 38, 21, …
## $ V200010c <dbl> 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1,…
## $ V201006  <fct> 2. Somewhat interested, 3. Not much interested, 2. So…
```

Option 2: Strip the labels

The second option is to remove the labels altogether, converting the labeled data into a regular R data frame. To remove, or 'zap,' the labels from our tibble, we can use the {haven} package's zap_label() and zap_labels() functions. This approach removes the labels but retains the data values in their original form.

The ANES Stata file columns contain variable labels. Using purrr's map(), we can review the labels with attr(). In the example below, we list the first two variables and their labels. For instance, the label for V200002 is "Mode of interview: pre-election interview."

```r
purrr::map(anes_dta, ~ attr(.x, "label")) %>%
  head(2)
```

```
## $V200001
## [1] "2020 Case ID"
##
## $V200002
## [1] "Mode of interview: pre-election interview"
```

Use zap_label() to remove the variable labels but retain the value labels. Notice that the variable label for V200001 now returns NULL, while the value labels are retained.

```r
zap_label(anes_dta) %>%
  purrr::map(~ attr(.x, "label")) %>%
  head(2)
```

```
## $V200001
## NULL
##
## $V200002
##     1. Video 2. Telephone       3. Web
##            1            2            3
```

To remove the value labels, use zap_labels(). Notice that the previous <dbl+lbl> columns are now <dbl>.

```r
zap_labels(anes_dta) %>%
  select(1:6) %>%
  glimpse()
```

```
## Rows: 7,453
## Columns: 6
## $ V200001  <dbl> 200015, 200022, 200039, 200046, 200053, 200060, 20008…
## $ V200002  <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ V200010b <dbl> 1.0057, 1.1635, 0.7687, 0.5210, 0.9658, 0.2347, 0.440…
## $ V200010d <dbl> 9, 26, 41, 29, 23, 37, 7, 37, 32, 41, 22, 7, 38, 21, …
## $ V200010c <dbl> 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1,…
## $ V201006  <dbl> 2, 3, 2, 3, 2, 1, 2, 3, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1,…
```

While it is important to convert labeled datasets into regular R data frames for working in R, the labels themselves often contain valuable information that provides context and meaning to the survey variables. To aid with interpretability and documentation, consider creating a data dictionary from the labeled dataset. A data dictionary is a reference document that provides detailed information about the variables and values of a survey.

The {labelled} package offers a convenient function, generate_dictionary(), that creates data dictionaries directly from a labeled dataset. This function extracts variable labels, value labels, and other metadata and organizes them into a structured document that we can browse and reference throughout our analysis.

Let's create a data dictionary from the ANES Stata dataset as an example:

```r
library(labelled)

dictionary <- generate_dictionary(anes_dta)
```

Once we've generated the data dictionary, we can take a look at the V200002 variable and see the label, column type, number of missing entries, and associated values.

```r
dictionary %>%
  filter(variable == "V200002")
```

```
##  pos variable label                          col_type missing
##  2   V200002  Mode of interview: pre-electi~ dbl+lbl        0
##
##  values
##  [1] 1. Video
##  [2] 2. Telephone
##  [3] 3. Web
```
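To share the dictionary outside of R, we can flatten its list-columns and write it to a file. A minimal sketch, assuming the dictionary stores value labels in list-columns; the output path is hypothetical.

```r
library(dplyr)
library(readr)

# Collapse any list-columns (e.g., per-variable value labels) into
# semicolon-separated strings so the dictionary can be written as a CSV
dictionary %>%
  mutate(across(where(is.list), ~ sapply(.x, paste, collapse = "; "))) %>%
  write_csv("output/anes_dictionary.csv")
```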
C.3.3 Labeled missing data values

In survey data analysis, dealing with missing values is a crucial aspect of data preparation. Stata, SPSS, and SAS files each have their own methods for handling missing values.

- Stata has "extended" missing values, .A through .Z.
- SAS has "special" missing values, .A through .Z and ._.
- SPSS has per-column "user" missing values. Each column can declare up to three distinct values or a range of values (plus one distinct value) that should be treated as missing.

SAS and Stata use a concept known as 'tagged' missing values, which extends R's regular NA. A 'tagged' missing value is essentially an NA with an additional single-character label. These values behave identically to regular NA in standard R operations while preserving the informative tag associated with the missing value. Here is an example from NORC at the University of Chicago's 2018 General Social Survey.

```r
head(gss_dta$HEALTH)
```

```
#> <labelled<double>[6]>: condition of health
#> [1]     2     1 NA(i) NA(i)     1     2
#>
#> Labels:
#>  value     label
#>      1 excellent
#>      2      good
#>      3      fair
#>      4      poor
#>  NA(d)        DK
#>  NA(i)       IAP
#>  NA(n)        NA
```

In contrast, SPSS uses a different approach called 'user-defined values' to denote missing values. Each column in an SPSS dataset can have up to three distinct values designated as missing or a specified range of missing values. To model these additional user-defined missing values, {haven} provides the labelled_spss() subclass of labelled(). When we import SPSS data using {haven}, it ensures that user-defined missing values are correctly handled. We can work with these data in R while preserving the unique missing value conventions from SPSS. Here is what the GSS SPSS data look like when loaded with {haven}.

```r
head(gss_sps$HEALTH)
```

```
#> <labelled_spss<double>[6]>: Condition of health
#> [1] 2 1 0 0 1 2
#> Missing values: 0, 8, 9
#>
#> Labels:
#>  value     label
#>      0       IAP
#>      1 EXCELLENT
#>      2      GOOD
#>      3      FAIR
#>      4      POOR
#>      8        DK
#>      9        NA
```

C.4 Importing data from APIs into R

In addition to working with data saved as files, we may also need to retrieve data through Application Programming Interfaces (APIs). APIs provide a structured way to access data hosted on external servers and import it directly into R for analysis. To access these data, we need to understand how to construct API requests. Each API has unique endpoints, parameters, and authentication requirements. Pay attention to:

- Endpoints: These are URLs that point to specific data or services.
- Parameters: Information we pass to the API to customize the request (e.g., date ranges, filters), illustrated in the sketch at the end of this section.
- Authentication: APIs may require API keys or tokens for access.
- Rate limits: APIs may have usage limits, so be aware of any rate limits or quotas.

Typically, we begin by making a GET request to an API endpoint. The {httr2} package allows us to generate and process HTTP requests. We can make the GET request by pointing to the URL that contains the data we would like.

```r
library(httr2)

api_url <- "https://api.example.com/survey-data"

response <- request(api_url) %>%
  req_perform()
```

Once we make the request, we obtain the data as the response. The data often come in JSON format. We can extract and parse the data using the {jsonlite} package, allowing us to work with them in R. The fromJSON() function, shown below, converts JSON data to an R object.

```r
library(jsonlite)

survey_data <- fromJSON(resp_body_string(response))
```

Note that these are dummy examples. Please review the documentation to understand how to make requests from your specific API.
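Many APIs also expect query parameters and an authentication token with each request. Here is a minimal sketch using {httr2} request helpers; the endpoint, parameter names, and token environment variable are all hypothetical.

```r
library(httr2)

# Add query parameters and a bearer token to the request
# (URL, parameter names, and token variable are placeholders)
response <- request("https://api.example.com/survey-data") %>%
  req_url_query(year = 2020, state = "NC") %>%
  req_auth_bearer_token(Sys.getenv("EXAMPLE_API_TOKEN")) %>%
  req_perform()
```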
R offers several packages that simplify API access by providing ready-to-use functions for popular APIs. These packages are called "wrappers," as they "wrap" the API to make it easier to use. For example, the {tidycensus} package used in this book simplifies access to U.S. Census data, allowing us to retrieve data with R commands instead of writing complex API requests. If we are interested in the population (B01003_001) of each census tract in North Carolina from the 2020 ACS, we would use the get_acs() function as in the code below. Behind the scenes, get_acs() makes a GET request to the Census API, and the tidycensus functions convert the response into an R-friendly format.

```r
library(tidycensus)

census_data <- get_acs(
  geography = "tract",
  variables = "B01003_001",
  year = 2020,
  state = "NC"
)
```

To discover whether there is an R package that directly interfaces with a specific survey or data source, search for "[survey] R wrapper" or "[data source] R package" online.

C.5 Accessing databases in R

Databases provide a secure and organized solution as the volume and complexity of data grow. We can access, manage, and update data stored in databases in a systematic way. Because of how the data are organized, teams can draw from the same source and obtain any metadata that would be helpful for analysis.

There are various ways of working with databases in RStudio. We can connect to different databases through the Connections Pane in the top right of the IDE. We can also use packages like {DBI} and {odbc} to access database tables in R files. Here is an example script connecting to a database:

```r
con <- DBI::dbConnect(
  odbc::odbc(),
  Driver    = "[your driver's name]",
  Server    = "[your server's path]",
  UID       = rstudioapi::askForPassword("Database user"),
  PWD       = rstudioapi::askForPassword("Database password"),
  Database  = "[your database's name]",
  Warehouse = "[your warehouse's name]",
  Schema    = "[your schema's name]"
)
```

The {dbplyr} and {dplyr} packages allow us to make queries and run data analysis entirely using {dplyr} syntax. All of the code can be written in R, so we do not have to switch between R and SQL to explore the data. Here is some sample code:

```r
q1 <- tbl(con, "bank") %>%
  group_by(month_idx, year, month) %>%
  summarise(
    subscribe = sum(ifelse(term_deposit == "yes", 1, 0)),
    total = n()
  )

show_query(q1)
```

When we are ready to bring the results into R, collect() executes the query and returns a tibble. Be sure to check the documentation to configure a database connection.

C.6 Importing data from other formats

R also offers dedicated packages such as {googlesheets4} for Google Sheets or {qualtRics} for Qualtrics. With less common or proprietary file formats, the broader data science community can often provide guidance. Online resources like Stack Overflow and dedicated forums like Posit Community are valuable sources of information for importing data into R.
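As one illustration, {googlesheets4} can read a publicly shared Google Sheet directly into R. A minimal sketch; the sheet URL is a placeholder.

```r
library(googlesheets4)

# For a publicly shared sheet, skip the OAuth prompt
gs4_deauth()

# The URL is a placeholder for an actual shared Google Sheet
survey_sheet <- read_sheet("https://docs.google.com/spreadsheets/d/EXAMPLE_ID")
```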