From 13ceea68dfe379edc4afabc743b910d8c686a058 Mon Sep 17 00:00:00 2001 From: Rebecca Powell Date: Tue, 23 Apr 2024 21:34:57 -0400 Subject: [PATCH] One voice (#128) * Making data plural * Standardize A/C format * Standardize cross-tab format * change final section header in chapter 9 to not be "summary" to match all other chapters. * Removing "you" language. * Adjusting tense "we will.." to just "we..." * Remove markdown comments * Changing from target population to population of interest. * Updates to ch1 from one voice review * Edits to ch02 from one-voice * Ch03 edits from one-voice review * Ch04 updates from one-voice * Fix broken reference link in ch04. * Ch05 edits from one-voice * Ch06 edits from one-voice * Ch07 edits from one-voice * Ch08 edits from one-voice * Ch09 edits from one-voice * Ch10 edits from one-voice * Ch11 edits from one-voice * Ch12 edits from one-voice * Ch13 edits from one-voice * Ch14 edits from one-voice * Appendix A edits from one-voice * Adding blank line, to add a comment. * Fixing reference type for Scott2007 to have author show up in bibliography. * Adding spaces at ends of lines to add comment. * Fixing typo in formula in ch7. * Adding space to end of line to add a comment. * Adding space at end of line to add comment. * Fix ref to C10 * SZ full book review (#129) * Change interaction example (#130) * IV one voice review --------- Co-authored-by: Stephanie Zimmer Co-authored-by: Isabella Velasquez --- 01-introduction.Rmd | 69 ++++---- 02-overview-surveys.Rmd | 96 +++++------ 03-survey-data-documentation.Rmd | 40 +++-- 04-set-up.Rmd | 97 ++++++------ 05-descriptive-analysis.Rmd | 186 +++++++++++++--------- 06-statistical-testing.Rmd | 144 +++++++++-------- 07-modeling.Rmd | 171 ++++++++++---------- 08-communicating-results.Rmd | 83 +++++----- 09-reproducible-data.Rmd | 59 +++---- 10-sample-designs-replicate-weights.Rmd | 201 +++++++++++------------- 11-missing-data.Rmd | 83 +++++----- 12-successful-survey-data-analysis.Rmd | 50 +++--- 13-ncvs-vignette.Rmd | 199 ++++++++++++++++------- 14-ambarom-vignette.Rmd | 36 +++-- 89-Appendix-DataImport.Rmd | 110 +++++++------ 93-AppendixD.Rmd | 24 ++- 99-references.Rmd | 8 +- book.bib | 56 ++++++- index.Rmd | 3 - renv.lock | 10 +- 20 files changed, 970 insertions(+), 755 deletions(-) diff --git a/01-introduction.Rmd b/01-introduction.Rmd index c5b57085..6d1fe84e 100644 --- a/01-introduction.Rmd +++ b/01-introduction.Rmd @@ -4,11 +4,11 @@ # Introduction {#c01-intro} -Surveys are valuable tools for gathering information about a population, and are used by researchers, governments, and businesses alike to better understand public opinion and behaviors. For example, a non-profit group may analyze societal trends to measure their impact, government agencies may study behaviors to inform policy, or companies may seek to learn customer product preferences to refine business strategy. With survey data, we can explore the world around us. +Surveys are valuable tools for gathering information about a population. Researchers, governments, and businesses use surveys to better understand public opinion and behaviors. For example, a non-profit group may analyze societal trends to measure their impact, government agencies may study behaviors to inform policy, or companies may seek to learn customer product preferences to refine business strategy. With survey data, we can explore the world around us. -Surveys are often conducted with a sample of the population. 
Therefore, in order to use the survey data to understand the population, we use weights to adjust the survey results for unequal probabilities of selection, non-response, and post-stratification. These adjustments ensure the sample accurately represents the population of interest [@gard2023weightsdef]. To account for the intricate nature of the survey design, analysts rely on statistical software such as SAS, Stata, SUDAAN, and R. +Surveys are often conducted with a sample of the population. Therefore, to use the survey data to understand the population, we use weights to adjust the survey results for unequal probabilities of selection, non-response, and post-stratification. These adjustments ensure the sample accurately represents the population of interest [@gard2023weightsdef]. To account for the intricate nature of the survey design, analysts rely on statistical software such as SAS, Stata, SUDAAN, and R. -In this book, we focus on R to introduce survey analysis. Our goal is to provide a comprehensive guide for individuals new to survey analysis but with some familiarity with statistics and R programming. We use a combination of the {survey} and {srvyr} packages and present the code following best practices from the tidyverse and assume weights have already been calculated and are available [@R-srvyr; @lumley2010complex; @tidyverse2019]. +In this book, we focus on R to introduce survey analysis. Our goal is to provide a comprehensive guide for individuals new to survey analysis but with some familiarity with statistics and R programming. We use a combination of the {survey} and {srvyr} packages and present the code following best practices from the tidyverse [@R-srvyr; @lumley2010complex; @tidyverse2019]. ## Survey analysis in R @@ -19,15 +19,15 @@ The {survey} package was released on the [Comprehensive R Archive Network (CRAN) * Variances by Taylor linearization or by replicate weights, including balance repeated replication, jackknife, bootstrap, multistage bootstrap, or user-supplied methods * Hypothesis testing for means, proportions, and other parameters -The {srvyr} package builds on the {survey} package by providing wrappers for functions that align with the tidyverse philosophy. This is our motivation for using and recommending this package. We find that the {srvyr} package is user-friendly for those familiar with the tidyverse packages in R. +The {srvyr} package builds on the {survey} package by providing wrappers for functions that align with the tidyverse philosophy. This is our motivation for using and recommending the {srvyr} package. We find that it is user-friendly for those familiar with the tidyverse packages in R. -For example, while many functions in the {survey} package use variables as formulas, the {srvyr} package uses tidy selection to pass variable names, a common feature in the tidyverse [@R-tidyselect]. Users of the tidyverse are likely familiar with the magrittr pipe operator (`%>%`), which seamlessly works with functions from the {srvyr} package. Moreover, several common functions from {dplyr}, such as `filter()`, `mutate()`, and `summarize()`, can be applied to survey objects [@R-dplyr]. This enables users to streamline their analysis workflow and leverage the benefits of both the {srvyr} and {tidyverse} packages. +For example, while many functions in the {survey} package access variables through formulas, the {srvyr} package uses tidy selection to pass variable names, a common feature in the tidyverse [@R-tidyselect]. 
Users of the tidyverse are also likely familiar with the magrittr pipe operator (`%>%`), which seamlessly works with functions from the {srvyr} package. Moreover, several common functions from {dplyr}, such as `filter()`, `mutate()`, and `summarize()`, can be applied to survey objects [@R-dplyr]. This enables users to streamline their analysis workflow and leverage the benefits of both the {srvyr} and {tidyverse} packages. -While the {srvyr} package offers many advantages, there is one notable limitation: it doesn't fully incorporate the modeling capabilities of the {survey} package into tidy wrappers. When discussing modeling and hypothesis testing, we primarily rely on the {survey} package. However, we guide you on how to apply the pipe operator to these functions to maintain clarity and consistency in your analyses. +While the {srvyr} package offers many advantages, there is one notable limitation: it doesn't fully incorporate the modeling capabilities of the {survey} package into tidy wrappers. When discussing modeling and hypothesis testing, we primarily rely on the {survey} package. However, we provide information on how to apply the pipe operator to these functions to maintain clarity and consistency in analyses. ## What to expect {#what-to-expect} -This book covers many aspects of survey design and analysis, from understanding how to create design objects to conducting descriptive analysis, statistical tests, and models. We emphasize coding best practices and effective presentation techniques while using real-world data and practical examples to help you gain proficiency in survey analysis. +This book covers many aspects of survey design and analysis, from understanding how to create design objects to conducting descriptive analysis, statistical tests, and models. We emphasize coding best practices and effective presentation techniques while using real-world data and practical examples to help readers gain proficiency in survey analysis. 
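To give a flavor of that workflow before diving in, here is a minimal sketch of the piped {srvyr} style described above. It is an editor's illustration rather than code from the book: it uses the `api` example data shipped with the {survey} package instead of the book's datasets, so the design variables shown (`stype`, `pw`, `fpc`) and the outcome (`api00`) are specific to that example.

```r
library(survey)  # provides the api example data used below
library(srvyr)
library(dplyr)

data(api)

# Declare the design once: strata, weights, and finite population correction
strat_design <- apistrat %>%
  as_survey_design(strata = stype, weights = pw, fpc = fpc)

# Familiar dplyr-style verbs then work on the design object,
# with survey_mean() producing design-based estimates
strat_design %>%
  group_by(stype) %>%
  summarize(api00_mean = survey_mean(api00, vartype = "ci"))
```

Once the design is declared, the weights and design information travel with the object, so the grouped summary above produces design-based estimates and standard errors rather than naive means.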
Below is a summary of each chapter: @@ -35,52 +35,58 @@ Below is a summary of each chapter: - Overview of survey design processes - References for more in-depth knowledge - **Chapter \@ref(c03-survey-data-documentation) - Survey data documentation**: - - Guide to survey documentation + - Guide to survey documentation types + - How to read survey documentation - **Chapter \@ref(c04-getting-started) - Getting started**: - Installation of packages - Introduction to the {srvyrexploR} package and its analytic datasets - Outline of the survey analysis process - Comparison between the {dplyr} and {srvyr} packages - **Chapter \@ref(c05-descriptive-analysis) - Descriptive analyses**: - - Calculation of point estimates, standard errors, confidence intervals, and design effects + - Calculation of point estimates + - Estimation of standard errors and confidence intervals + - Calculation of design effects - **Chapter \@ref(c06-statistical-testing) - Statistical testing**: - Statistical testing methods - Comparison of means and proportions - Goodness of fit tests, tests of independence, and tests of homogeneity - **Chapter \@ref(c07-modeling) - Modeling**: + - Overview of model formula specifications - Linear regression, ANOVA, and logistic regression modeling - **Chapter \@ref(c08-communicating-results) - Communication of results**: - Strategies for communicating survey results - Tools and guidance for creating publishable tables and graphs - **Chapter \@ref(c09-reprex-data) - Reproducible research**: - - Various tools and methods for achieving reproducibility + - Tools and methods for achieving reproducibility + - Resources for reproducible research - **Chapter \@ref(c10-sample-designs-replicate-weights) - Sample designs and replicate weights**: - - Description of common sampling designs and how to specify in R - - Description of replicate weight methods and how to specify in R + - Overview of common sampling designs + - Replicate weight methods + - How to specify survey designs in R - **Chapter \@ref(c11-missing-data) - Missing data**: - Overview of missing data in surveys - Approaches to dealing with missing data - **Chapter \@ref(c12-recommendations) - Successful survey analysis recommendations**: - Tips for successful analysis - - Debugging skills + - Recommendations for debugging - **Chapter \@ref(c13-ncvs-vignette) - National Crime Victimization Survey Vignette**: - Vignette on analyzing National Crime Victimization Survey (NCVS) data - - Illustrates analysis requiring multiple files for victimization rates + - Illustration of analysis requiring multiple files for victimization rates - **Chapter \@ref(c14-ambarom-vignette) - AmericasBarometer Vignette**: - Vignette on analyzing AmericasBarometer survey data - - Includes making choropleth maps with survey estimates + - Creation of choropleth maps with survey estimates -The majority of chapters contain code that you can follow. Each of these chapters starts with a "set-up" section, which includes the code needed to load the packages and datasets. We then provide the main idea of the chapter and examples of how to use the functions. Most chapters conclude with exercises to work through. We provide the solutions to the exercises in the online version of the book, available at [tidy-survey-r.github.io](https://tidy-survey-r.github.io/). +The majority of chapters contain code that readers can follow. Each of these chapters starts with a "Prerequisites" section, which includes the code needed to load the packages and datasets used in the chapter. 
We then provide the main idea of the chapter and examples of how to use the functions. Most chapters conclude with exercises to work through. We provide the solutions to the exercises in the [online version of the book](https://tidy-survey-r.github.io/tidy-survey-book/).

-While we provide a brief overview of survey methodology and statistical theory, this book is not intended to be the sole resource for these topics. We reference other materials throughout the book and encourage readers to seek those out for more information.
+While we provide a brief overview of survey methodology and statistical theory, this book is not intended to be the sole resource for these topics. We reference other materials and encourage readers to seek them out for more information.

## Prerequisites

-To get the most of our this book, we assume that you have already conducted a survey and have the data or obtained a microdata file. Microdata, also known as respondent-level or row-level data, differs from summarized data typically found in tables. It contains individual survey responses, along with analysis weights and design variables such as strata or clusters.
+To get the most out of this book, we assume a survey has already been conducted and readers have obtained a microdata file. Microdata, also known as respondent-level or row-level data, differ from summarized data typically found in tables. They contain individual survey responses, along with analysis weights and design variables such as strata or clusters.

-Additionally, the survey data should already include weights and design variables. These are required to accurately calculate unbiased estimates. The concepts and techniques discussed in this book will help you to extract meaningful insights from your survey data, but will not cover how to create weights in the first place as this is a separate complex topic. If you do not already have weights created for the survey data you are using, we recommend reviewing other resources focused on weight creation such as @Valliant2018weights.
+Additionally, the survey data should already include weights and design variables. These are required to accurately calculate unbiased estimates. The concepts and techniques discussed in this book help readers to extract meaningful insights from survey data, but do not cover how to create weights as this is a separate complex topic. If weights are not already created for the survey data, we recommend reviewing other resources focused on weight creation such as @Valliant2018weights.

-This book is tailored for analysts already familiar with R and the tidyverse but who may be new to complex survey analysis in R. We anticipate that readers of this book can:
+This book is tailored for analysts already familiar with R and the tidyverse, but who may be new to complex survey analysis in R. We anticipate that readers of this book can:

* Install R and their Integrated Development Environment (IDE) of choice, such as RStudio
* Install and load packages from CRAN and GitHub repositories
@@ -89,39 +95,39 @@ This book is tailored for analysts already familiar with R and the tidyverse but
* Understand fundamental tidyverse concepts such as tidy/long/wide data, tibbles, the magrittr pipe (`%>%`), and tidy selection
* Use the tidyverse packages to wrangle, tidy, and visualize data

-If these concepts or skills are new to you, we recommend starting with introductory resources to cover these topics before reading this book. 
R for Data Science [@wickham2023r4ds] is a beginner-friendly guide for getting started in data science using R. It offers guidance on preliminary installation steps and basic R syntax, and it introduces tidyverse concepts and packages. +If these concepts or skills are new, we recommend starting with introductory resources to cover these topics before reading this book. R for Data Science [@wickham2023r4ds] is a beginner-friendly guide for getting started in data science using R. It offers guidance on preliminary installation steps, basic R syntax, and tidyverse concepts and packages. ## Datasets used in this book -We work with two key datasets throughout the book: the Residential Energy Consumption Survey [RECS -- @recs-2020-tech] and the American National Election Studies [ANES -- @debell]. We introduce and demonstrate the loading and preparation of these datasets in Chapter \@ref(c04-getting-started). +We work with two key datasets throughout the book: the Residential Energy Consumption Survey [RECS -- @recs-2020-tech] and the American National Election Studies [ANES -- @debell]. We introduce the loading and preparation of these datasets in Chapter \@ref(c04-getting-started). ## Conventions Throughout the book, we use the following typographical conventions: * Package names are surrounded by curly brackets: {srvyr} -* Function names are in constant width text format and include parentheses: `survey_mean()` -* Object and variable names are in constant width text format: `anes_des` +* Function names are in constant-width text format and include parentheses: `survey_mean()` +* Object and variable names are in constant-width text format: `anes_des` ## Getting help We recommend first trying to resolve errors and issues independently using the tips provided in **Chapter \@ref(c12-recommendations)**. -If you have questions or face issues while working through the book, please report them to its [GitHub repository](https://github.com/tidy-survey-r/tidy-survey-book). - There are several community forums for asking questions, including: -* Posit Community: -* R for Data Science Slack Community: -* Stack Overflow: +* [Posit Community](https://forum.posit.co/) +* [R for Data Science Slack Community](https://rfordatasci.com/) +* [Stack Overflow](https://stackoverflow.com/) + +Please report any bugs and issues to the book's [GitHub repository](https://github.com/tidy-survey-r/tidy-survey-book/issues). ## Acknowledgements -We would like to thank Holly Cast, Greg Freedman Ellis, Joe Murphy, and Sheila Saia for their reviews of the initial draft. Their detailed and honest feedback helped to make this book considerably better, and we are grateful for their input. Additionally, this book started from two short courses. The first at the Annual Conference for the American Association for Public Opinion Research (AAPOR) and the second as a series of webinars for the Midwest Association of Public Opinion Research (MAPOR). We would like to also thank those that assisted us by moderating breakout rooms and answering questions from attendees: Greg Freedman Ellis, Raphael Nishimura, and Benjamin Schneider. +We would like to thank Holly Cast, Greg Freedman Ellis, Joe Murphy, and Sheila Saia for their reviews of the initial draft. Their detailed and honest feedback helped improve this book, and we are grateful for their input. Additionally, this book started with two short courses. 
The first was at the Annual Conference for the American Association for Public Opinion Research (AAPOR) and the second was a series of webinars for the Midwest Association of Public Opinion Research (MAPOR.) We would like to also thank those who assisted us by moderating breakout rooms and answering questions from attendees: Greg Freedman Ellis, Raphael Nishimura, and Benjamin Schneider. ## Colophon -This book was written in [bookdown](http://bookdown.org/) using [RStudio](http://www.rstudio.com/ide/). The complete source is available on GitHub: . +This book was written in [bookdown](http://bookdown.org/) using [RStudio](http://www.rstudio.com/ide/). The complete source is available on [GitHub](https://github.com/tidy-survey-r/tidy-survey-book). This version of the book was built with `r R.version.string` and with the packages listed in Table \@ref(tab:intro-packages-tab). @@ -133,6 +139,7 @@ This version of the book was built with `r R.version.string` and with the packag library(prettyunits) library(DiagrammeR) library(tidyverse) +library(tidycensus) library(survey) library(srvyr) library(srvyrexploR) diff --git a/02-overview-surveys.Rmd b/02-overview-surveys.Rmd index bfc64fa1..a404f6a4 100644 --- a/02-overview-surveys.Rmd +++ b/02-overview-surveys.Rmd @@ -2,9 +2,9 @@ ## Introduction -Developing surveys to gather accurate information about populations often involves a intricate and time-intensive process. Researchers can spend months, or even years, developing the study design, questions, and other methods for a single survey to ensure high-quality data is collected. +Developing surveys to gather accurate information about populations involves an intricate and time-intensive process. Researchers can spend months, or even years, developing the study design, questions, and other methods for a single survey to ensure high-quality data is collected. -Prior to analyzing survey data, we recommend understanding the entire survey life cycle. This understanding can provide a better insight into what types of analyses should be conducted on the data. The *survey life cycle* consists of the necessary stages to execute a survey project successfully. Each stage influences the survey's timing, costs, and feasibility, consequently impacting the data collected and how we should analyze it. Figure \@ref(fig:overview-diag) shows a high level view of the survey process and this chapter gives an overview of each step. +Before analyzing survey data, we recommend understanding the entire survey life cycle. This understanding can provide better insight into what types of analyses should be conducted on the data. The *survey life cycle* consists of the necessary stages to execute a survey project successfully. Each stage influences the survey's timing, costs, and feasibility, consequently impacting the data collected and how we should analyze it. Figure \@ref(fig:overview-diag) shows a high-level overview of the survey process. ```{r} #| label: overview-diag @@ -36,7 +36,7 @@ graph TD ``` -The survey life cycle starts with a *research topic or question of interest* (e.g., what impact does childhood trauma have on health outcomes later in life). Researchers typically review existing data sources to determine if data are already available that can address this question, as drawing from available resources can result in a reduced burden on respondents, cheaper research costs, and faster research outcomes. 
However, if existing data cannot answer the nuances of the research question, a survey can be used to capture the exact data that the researcher needs through a questionnaire, or a set of questions. +The survey life cycle starts with a *research topic or question of interest* (e.g., what impact does childhood trauma have on health outcomes later in life.) Drawing from available resources can result in a reduced burden on respondents, cheaper research costs, and faster research outcomes. Therefore, we recommend reviewing existing data sources to determine if data that can address this question are already available. However, if existing data cannot answer the nuances of the research question, we can capture the exact data we need through a questionnaire, or a set of questions. To gain a deeper understanding of survey design and implementation, we recommend reviewing several pieces of existing literature in detail [e.g., @biemer2003survqual; @Bradburn2004; @dillman2014mode; @groves2009survey; @Tourangeau2000psych; @valliant2013practical]. @@ -44,9 +44,9 @@ To gain a deeper understanding of survey design and implementation, we recommend Throughout this book, we use public-use datasets from different surveys, including the American National Election Survey (ANES), the Residential Energy Consumption Survey (RECS), the National Crime Victimization Survey (NCVS), and the AmericasBarometer surveys. -As mentioned above, researchers should look for existing data that can provide insights into their research questions before embarking on a new survey. One of the greatest sources of data is the government. For example, in the U.S., we can get data directly from the various statistical agencies like with RECS and NCVS. Other countries often have data available through official statistics offices, such as the Office for National Statistics in the United Kingdom. +As mentioned above, we should look for existing data that can provide insights into our research questions before embarking on a new survey. One of the greatest sources of data is the government. For example, in the U.S., we can get data directly from the various statistical agencies such as the U.S. Energy Information Administration or Bureau of Justice Statistics. Other countries often have data available through official statistics offices, such as the Office for National Statistics in the United Kingdom. -In addition to government data, many researchers will make their data publicly available through repositories such as the [Inter-university Consortium for Political and Social Research (ICPSR) variable search](https://www.icpsr.umich.edu/web/pages/ICPSR/ssvd/) or the [Odum Institute Data Archive](https://odum.unc.edu/archive/). Searching these repositories or other compiled lists (e.g., [Analyze Survey Data for Free](https://asdfree.com)) can be an efficient way to identify surveys with questions related to the researcher's topic of interest. +In addition to government data, many researchers make their data publicly available through repositories such as the [Inter-university Consortium for Political and Social Research (ICPSR)](https://www.icpsr.umich.edu/web/pages/ICPSR/ssvd/) or the [Odum Institute Data Archive](https://odum.unc.edu/archive/). Searching these repositories or other compiled lists (e.g., [Analyze Survey Data for Free](https://asdfree.com)) can be an efficient way to identify surveys with questions related to our research topic. 
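Public-use files from these repositories are typically distributed as SAS, Stata, or SPSS files. As a rough sketch (the file path below is hypothetical, and {haven} is only one of several import options), such a file can be read into R before any design information is declared:

```r
library(haven)
library(dplyr)

# Hypothetical file path: a public-use microdata file downloaded from a
# repository, distributed here as a Stata (.dta) file
raw_dat <- read_dta("data/public_use_microdata.dta")

# Inspect the variables, looking for the analysis weight and design
# variables (e.g., strata and cluster identifiers) named in the documentation
glimpse(raw_dat)
```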
## Pre-survey planning {#pre-survey-planning} @@ -55,28 +55,28 @@ There are multiple things to consider when starting a survey. *Errors* are the d Generally, survey researchers consider there to be seven main sources of error that fall under either Representation and Measurement [@groves2009survey]: - **Representation** - - **Coverage Error**: A mismatch between the *population of interest* (also known as the target population or study population) and the *sampling frame*, the list from which the sample is drawn. + - **Coverage Error**: A mismatch between the *population of interest* and the *sampling frame*, the list from which the sample is drawn. - **Sampling Error**: Error produced when selecting a *sample*, the subset of the population, from the sampling frame. This error is due to randomization, and we discuss how to quantify this error in Chapter \@ref(c10-sample-designs-replicate-weights). There is no sampling error in a census as there is no randomization. The sampling error measures the difference between all potential samples under the same sampling method. - - **Nonresponse Error**: Differences between those who responded and did not respond to the survey (unit nonresponse) or a given question (item nonresponse). + - **Nonresponse Error**: Differences between those who responded and did not respond to the survey (unit nonresponse) or a given question (item nonresponse.) - **Adjustment Error**: Error introduced during post-survey statistical adjustments. - **Measurement** - - **Validity**: A mismatch between the topic of interest and the question(s) used to collect that information. + - **Validity**: A mismatch between the research topic and the question(s) used to collect that information. - **Measurement Error**: A mismatch between what the researcher asked and how the respondent answered. - - **Processing Error**: Edits by the researcher to responses provided by the respondent (e.g., adjustments to data based on illogical responses). + - **Processing Error**: Edits by the researcher to responses provided by the respondent (e.g., adjustments to data based on illogical responses.) Almost every survey has errors. Researchers attempt to conduct a survey that reduces the *total survey error*, or the accumulation of all errors that may arise throughout the survey life cycle. By assessing these different types of errors together, researchers can seek strategies to maximize the overall survey quality and improve the reliability and validity of results [@tse-doc]. However, attempts to reduce individual sources errors (and therefore total survey error) come at the price of time and money. For example: - **Coverage Error Tradeoff**: Researchers can search for or create more accurate and updated sampling frames, but they can be difficult to construct or obtain. - **Sampling Error Tradeoff**: Researchers can increase the sample size to reduce sampling error; however, larger samples can be expensive and time-consuming to field. - **Nonresponse Error Tradeoff**: Researchers can increase or diversify efforts to improve survey participation but this may be resource-intensive while not entirely removing nonresponse bias. -- **Adjustment Error Tradeoff**: *Weighting* is a statistical technique used to adjust the contribution of individual survey responses to the final survey estimates. It is typically done to make the sample more representative of the target population. 
However, if researchers do not carefully execute the adjustments or base them on inaccurate information, they can introduce new biases, leading to less accurate estimates. +- **Adjustment Error Tradeoff**: *Weighting* is a statistical technique used to adjust the contribution of individual survey responses to the final survey estimates. It is typically done to make the sample more representative of the population of interest. However, if researchers do not carefully execute the adjustments or base them on inaccurate information, they can introduce new biases, leading to less accurate estimates. - **Validity Error Tradeoff**: Researchers can increase validity through a variety of ways, such as using established scales or collaborating with a psychometrician during survey design to pilot and evaluate questions. However, doing so lengthens the amount of time and resources needed to complete survey design. -- **Measurement Error Tradeoff**: Reseachers can use techniques such as questionnaire testing and cognitive interviewing to ensure respondents are answering questions as expected. However, these activities also require time and resources to complete. +- **Measurement Error Tradeoff**: Researchers can use techniques such as questionnaire testing and cognitive interviewing to ensure respondents are answering questions as expected. However, these activities require time and resources to complete. - **Processing Error Tradeoff**: Researchers can impose rigorous data cleaning and validation processes. However, this requires supervision, training, and time. The challenge for survey researchers is to find the optimal tradeoffs among these errors. They must carefully consider ways to reduce each error source and total survey error while balancing their study's objectives and resources. -For survey analysts, understanding the decisions that researchers took to minimize these error sources can impact how results are interpreted. The remainder of this chapter dives into critical considerations for survey development. We explore how to consider each of these sources of error and how these error sources can inform the interpretations of the data. +For survey analysts, understanding the decisions that researchers took to minimize these error sources can impact how results are interpreted. The remainder of this chapter explores critical considerations for survey development. We explore how to consider each of these sources of error and how these error sources can inform the interpretations of the data. ## Study design {#overview-design} @@ -86,34 +86,34 @@ From formulating methodologies to choosing an appropriate sampling frame, the st The set or group we want to survey is known as the *population of interest* or the *target population*. The population of interest could be broad, such as “all adults age 18+ living in the U.S.” or a specific population based on a particular characteristic or location. For example, we may want to know about "adults aged 18-24 who live in North Carolina" or "eligible voters living in Illinois." -However, a *sampling frame* with contact information is needed to survey individuals in these populations of interest. If researchers are looking at eligible voters, the sampling frame could be the voting registry for a given state or area. If researchers are looking at more board target populations like all adults in the United States, the sampling frame is likely imperfect. In these cases, a full list of individuals in the United States is not available for a sampling frame. 
Instead, researchers may choose to use a sampling frame of mailing addresses and send the survey to households, or they may choose to use random digit dialing (RDD) and call random phone numbers (that may or may not be assigned, connected, and working).
+However, a *sampling frame* with contact information is needed to survey individuals in these populations of interest. If we are looking at eligible voters, the sampling frame could be the voting registry for a given state or area. If we are looking at broader populations of interest, like all adults in the United States, the sampling frame is likely imperfect. In these cases, a full list of individuals in the United States is not available for a sampling frame. Instead, we may choose to use a sampling frame of mailing addresses and send the survey to households, or we may choose to use random digit dialing (RDD) and call random phone numbers (that may or may not be assigned, connected, and working.)

-These imperfect sampling frames can result in *coverage error* where there is a mismatch between the target population and the list of individuals researchers can select. For example, if a researcher is looking to obtain estimates for "all adults aged 18+ living in the U.S.", a sampling frame of mailing addresses will miss specific types of individuals, such as the homeless, transient populations, and incarcerated individuals. Additionally, many households have more than one adult resident, so researchers would need to consider how to get a specific individual to fill out the survey (called *within household selection*) or adjust the target population to report on "U.S. households" instead of "individuals."
+These imperfect sampling frames can result in *coverage error* where there is a mismatch between the population of interest and the list of individuals we can select. For example, if we are looking to obtain estimates for "all adults aged 18+ living in the U.S.", a sampling frame of mailing addresses will miss specific types of individuals, such as the homeless, transient populations, and incarcerated individuals. Additionally, many households have more than one adult resident, so we would need to consider how to get a specific individual to fill out the survey (called *within household selection*) or adjust the population of interest to report on "U.S. households" instead of "individuals."

-Once the researchers have selected the sampling frame, the next step is determining how to select individuals for the survey. In rare cases, researchers may conduct a *census* and survey everyone on the sampling frame. However, the ability to implement a questionnaire at that scale is something only some can do (e.g., government censuses). Instead, researchers typically choose to sample individuals and use weights to estimate numbers in the target population. They can use a variety of different sampling methods, and more information on these can be found in Chapter \@ref(c10-sample-designs-replicate-weights). This decision of which sampling method to use impacts *sampling error* and can be accounted for in weighting.
+Once we have selected the sampling frame, the next step is determining how to select individuals for the survey. In rare cases, we may conduct a *census* and survey everyone on the sampling frame. However, the ability to implement a questionnaire at that scale is something only a few can do (e.g., government censuses.) Instead, we typically choose to sample individuals and use weights to estimate numbers in the population of interest. We can use a variety of different sampling methods, and more information on these can be found in Chapter \@ref(c10-sample-designs-replicate-weights). This decision of which sampling method to use impacts *sampling error* and can be accounted for in weighting.

#### Example: Number of pets in a household {.unnumbered #overview-design-sampdesign-ex}

-Let's use a simple example where a researcher is interested in the average number of pets in a household. Our researcher needs to consider the target population for this study. Specifically, are they interested in all households in a given country or households in a more local area (e.g., city or state)? Let's assume our researcher is interested in the number of pets in a U.S. household with at least one adult (18 years old or older). In this case, a sampling frame of mailing addresses would introduce only a small amount of coverage error as the frame would closely match our target population. Specifically, our researcher would likely want to use the Computerized Delivery Sequence File (CDSF), which is a file of mailing addresses that the United States Postal Service (USPS) creates and covers nearly 100% of U.S. households [@harter2016address]. To sample these households, for simplicity, we use a stratified simple random sample design (see Chapter \@ref(c10-sample-designs-replicate-weights) for more information on sample designs), where we randomly sample households within each state (i.e., we stratify by state).
+Let's use a simple example where we are interested in the average number of pets in a household. We need to consider the population of interest for this study. Specifically, are we interested in all households in a given country or households in a more local area (e.g., city or state)? Let's assume we are interested in the number of pets in a U.S. household with at least one adult (18 years old or older.) In this case, a sampling frame of mailing addresses would introduce only a small amount of coverage error as the frame would closely match our population of interest. Specifically, we would likely want to use the Computerized Delivery Sequence File (CDSF), which is a file of mailing addresses that the United States Postal Service (USPS) creates and covers nearly 100% of U.S. households [@harter2016address]. To sample these households, for simplicity, we use a stratified simple random sample design (see Chapter \@ref(c10-sample-designs-replicate-weights) for more information on sample designs), where we randomly sample households within each state (i.e., we stratify by state.)

Throughout this chapter, we build on this example research question to plan a survey.

### Data collection planning {#overview-design-dcplanning}

-With the sampling design decided, researchers can then decide how to survey these individuals. 
Specifically, the *modes* used for contacting and surveying the sample, how frequently to send reminders and follow-ups, and the overall timeline of the study are four of the major data collection determinations. Traditionally, survey researchers have considered there to be four main modes^[Other modes such as using mobile apps or text messaging can also be considered, but at the time of publication, have smaller reach or are better for longitudinal studies (i.e., surveying the same individuals over many time periods of a single study.)]: - Computer Assisted Personal Interview (CAPI; also known as face-to-face or in-person interviewing) - Computer Assisted Telephone Interview (CATI; also known as phone or telephone interviewing) - Computer Assisted Web Interview (CAWI; also known as web or online interviewing) - Paper and Pencil Interview (PAPI) -Researchers can use a single mode to collect data or multiple modes (also called *mixed-modes*). Using mixed-modes can allow for broader reach and increase response rates depending on the target population [@biemer_choiceplus; @deLeeuw2005; @DeLeeuw_2018]. For example, researchers could both call households to conduct a CATI survey and send mail with a PAPI survey to the household. Using both modes, researchers could gain participation through the mail from individuals who do not pick up the phone to unknown numbers or through the phone from individuals who do not open all of their mail. However, mode effects (where responses differ based on the mode of response) can be present in the data and may need to be considered during analysis. +We can use a single mode to collect data or multiple modes (also called *mixed-modes*.) Using mixed-modes can allow for broader reach and increase response rates depending on the population of interest [@biemer_choiceplus; @deLeeuw2005; @DeLeeuw_2018]. For example, we could both call households to conduct a CATI survey and send mail with a PAPI survey to the household. By using both modes, we could gain participation through the mail from individuals who do not pick up the phone to unknown numbers or through the phone from individuals who do not open all of their mail. However, mode effects (where responses differ based on the mode of response) can be present in the data and may need to be considered during analysis. -When selecting which mode, or modes, to use, understanding the unique aspects of the chosen target population and sampling frame provides insight into how they can best be reached and engaged. For example, if we plan to survey adults aged 18-24 who live in North Carolina, asking them to complete a survey using CATI (i.e., over the phone) would likely not be as successful as other modes like the web. This age group does not talk on the phone as much as other generations and often does not answer their phones for unknown numbers. Additionally, the mode for contacting respondents relies on what information is available in the sampling frame. For example, if our sampling frame includes an email address, we could email our selected sample members to convince them to complete a survey. Alternatively, if the sampling frame is a list of mailing addresses, we could contact sample members with a letter. +When selecting which mode, or modes, to use, understanding the unique aspects of the chosen population of interest and sampling frame provides insight into how they can best be reached and engaged. 
For example, if we plan to survey adults aged 18-24 who live in North Carolina, asking them to complete a survey using CATI (i.e., over the phone) would likely not be as successful as other modes like the web. This age group does not talk on the phone as much as other generations and often does not answer their phones for unknown numbers. Additionally, the mode for contacting respondents relies on what information is available in the sampling frame. For example, if our sampling frame includes an email address, we could email our selected sample members to convince them to complete a survey. Alternatively, if the sampling frame is a list of mailing addresses, we could contact sample members with a letter. It is important to note that there can be a difference between the contact and survey modes. For example, if we have a sampling frame with addresses, we can send a letter to our sample members and provide information on completing a web survey. Another option is using mixed-mode surveys by mailing sample members a paper and pencil survey but also including instructions to complete the survey online. Combining different contact modes and different survey modes can be helpful in reducing *unit nonresponse error*--where the entire unit (e.g., a household) does not respond to the survey at all--as different sample members may respond better to different contact and survey modes. However, when considering which modes to use, it is important to make access to the survey as easy as possible for sample members to reduce burden and unit nonresponse. -Another way to reduce unit nonresponse error is by varying the language of the contact materials [@dillman2014mode]. People are motivated by different things, so constantly repeating the same message may not be helpful. Instead, mixing up the messaging and the type of contact material the sample member receives can increase response rates and reduce the unit nonresponse error. For example, instead of only sending standard letters, researchers could consider sending mailings that invoke "urgent" or "important" thoughts by sending priority letters or using other delivery services like FedEx, UPS, or DHL. +Another way to reduce unit nonresponse error is by varying the language of the contact materials [@dillman2014mode]. People are motivated by different things, so constantly repeating the same message may not be helpful. Instead, mixing up the messaging and the type of contact material the sample member receives can increase response rates and reduce the unit nonresponse error. For example, instead of only sending standard letters, we could consider sending mailings that invoke "urgent" or "important" thoughts by sending priority letters or using other delivery services like FedEx, UPS, or DHL. A study timeline may also determine the number and types of contacts. If the timeline is long, there is plentiful time for follow-ups and diversified messages in contact materials. If the timeline is short, then fewer follow-ups can be implemented. Many studies start with the tailored design method put forth by @dillman2014mode and implement five contacts: @@ -127,7 +127,7 @@ This method is easily adaptable based on the study timeline and needs but provid #### Example: Number of pets in a household {.unnumbered #overview-design-dcplanning-ex} -Let's return to our example of a researcher who wants to know the average number of pets in a household. 
We are using a sampling frame of mailing addresses, so we recommend starting our data collection with letters mailed to households, but later in data collection, we want to send interviewers to the house to conduct an in-person (or CAPI) interview to decrease unit nonresponse error. This means we have two contact modes (paper and in-person). As mentioned above, the survey mode does not have to be the same as the contact mode, so we recommend a mixed-mode study with both Web and CAPI modes. Let's assume we have six months for data collection, so we may want to recommend the following protocol: +Let's return to our example of the average number of pets in a household. We are using a sampling frame of mailing addresses, so we recommend starting our data collection with letters mailed to households, but later in data collection, we want to send interviewers to the house to conduct an in-person (or CAPI) interview to decrease unit nonresponse error. This means we have two contact modes (paper and in-person.) As mentioned above, the survey mode does not have to be the same as the contact mode, so we recommend a mixed-mode study with both Web and CAPI modes. Let's assume we have six months for data collection, so we could recommend the following protocol: Table: Protocol Example for 6-month Web and CAPI Data Collection @@ -143,21 +143,21 @@ Table: Protocol Example for 6-month Web and CAPI Data Collection | 20 | In-Person Visit | --- | CAPI | | 25 | Mail: Letter in large envelope | Survey Closing Notice | Web, but includes a number to call to schedule CAPI | -This is just one possible protocol that we can use that starts respondents with the web (typically done to reduce costs). However, researchers may want to begin in-person data collection earlier during the data collection period or ask their interviewers to attempt more than two visits with a household. +This is just one possible protocol that we can use that starts respondents with the web (typically done to reduce costs.) However, we could begin in-person data collection earlier during the data collection period or ask their interviewers to attempt more than two visits with a household. ### Questionnaire design {#overview-design-questionnaire} -When developing the questionnaire, it can be helpful to first outline the topics to be asked and include the "why" each question or topic is important to the research question(s). This can help researchers better tailor the questionnaire and reduce the number of questions (and thus the burden on the respondent) if topics are deemed irrelevant to the research question. When making these decisions, researchers should also consider questions needed for weighting. While we would love to have everyone in our population of interest answer our survey, this rarely happens. Thus, including questions about demographics in the survey can assist with weighting for *nonresponse errors* (both unit and item nonresponse). Knowing the details of the sampling plan and what may impact *coverage error* and *sampling error* can help researchers determine what types of demographics to include. Thus questionnaire design is done in conjunction with sampling design. +When developing the questionnaire, it can be helpful to first outline the topics to be asked and include the "why" each question or topic is important to the research question(s). This can help us better tailor the questionnaire and reduce the number of questions (and thus the burden on the respondent) if topics are deemed irrelevant to the research question. 
When making these decisions, we should also consider questions needed for weighting. While we would love to have everyone in our population of interest answer our survey, this rarely happens. Thus, including questions about demographics in the survey can assist with weighting for *nonresponse errors* (both unit and item nonresponse.) Knowing the details of the sampling plan and what may impact *coverage error* and *sampling error* can help us determine what types of demographics to include. Thus questionnaire design is typically done in conjunction with sampling design.

-Researchers can benefit from the work of others by using questions from other surveys. Demographic sections such as race, ethnicity, or education borrow questions from a government census or other official surveys. Question banks such as the [Inter-university Consortium for Political and Social Research (ICPSR) variable search](https://www.icpsr.umich.edu/web/pages/ICPSR/ssvd/) can provide additional potential questions.
+We can benefit from the work of others by using questions from other surveys. Demographic sections in surveys, such as race, ethnicity, or education, often borrow questions from a government census or other official surveys. Question banks such as the [Inter-university Consortium for Political and Social Research (ICPSR) variable search](https://www.icpsr.umich.edu/web/pages/ICPSR/ssvd/) can provide additional potential questions.

-If a question does not exist in a question bank, researchers can craft their own. When developing survey questions, researchers should start with the research topic and attempt to write questions that match the concept. The closer the question asked is to the overall concept, the better *validity* there is. For example, if the researcher wants to know how people consume T.V. series and movies but only asks a question about how many T.V.s are in the house, then they would be missing other ways that people watch T.V. series and movies, such as on other devices or at places outside of the home. As mentioned above, researchers can employ techniques to increase the validity of their questionnaires. For example, *questionnaire testing* involves piloting the survey instrument to identify and fix potential issues before conducting the main survey. Additionally, researchers could conduct *cognitive interviews* -- a technique where researchers walk through the survey with participants, encouraging them to speak their thoughts out loud to uncover how they interpret and understand survey questions.
+If a question does not exist in a question bank, we can craft our own. When developing survey questions, we should start with the research topic and attempt to write questions that match the concept. The closer the question asked is to the overall concept, the better the *validity*. For example, if we want to know how people consume T.V. series and movies but only ask a question about how many T.V.s are in the house, then we would be missing other ways that people watch T.V. series and movies, such as on other devices or at places outside of the home. As mentioned above, we can employ techniques to increase the validity of our questionnaires. For example, *questionnaire testing* involves piloting the survey instrument to identify and fix potential issues before conducting the main survey. 
Additionally, we could conduct *cognitive interviews* -- a technique where we walk through the survey with participants, encouraging them to speak their thoughts out loud to uncover how they interpret and understand survey questions. -Additionally, when designing questions, researchers should consider the mode for the survey and adjust the language appropriately. In self-administered surveys (e.g., web or mail), respondents can see all the questions and response options, but that is not the case in interviewer-administered surveys (e.g., CATI or CAPI). With interviewer-administered surveys, the response options must be read aloud to the respondents, so the question may need to be adjusted to create a better flow to the interview. Additionally, with self-administered surveys, because the respondents are viewing the questionnaire, the formatting of the questions is even more critical to ensure accurate measurement. Incorrect formatting or wording can result in *measurement error*, so following best practices or using existing validated questions can reduce error. There are multiple resources to help researchers draft questions for different modes [e.g., @Bradburn2004; @dillman2014mode; @Fowler1989; @Tourangeau2004spacing]. +Additionally, when designing questions, we should consider the mode for the survey and adjust the language appropriately. In self-administered surveys (e.g., web or mail), respondents can see all the questions and response options, but that is not the case in interviewer-administered surveys (e.g., CATI or CAPI.) With interviewer-administered surveys, the response options must be read aloud to the respondents, so the question may need to be adjusted to create a better flow to the interview. Additionally, with self-administered surveys, because the respondents are viewing the questionnaire, the formatting of the questions is even more critical to ensure accurate measurement. Incorrect formatting or wording can result in *measurement error*, so following best practices or using existing validated questions can reduce error. There are multiple resources to help researchers draft questions for different modes [e.g., @Bradburn2004; @dillman2014mode; @Fowler1989; @Tourangeau2004spacing]. #### Example: Number of pets in a household {.unnumbered #overview-design-questionnaire-ex} -As part of our survey on the average number of pets in a household, researchers may want to know what animal most people prefer to have as a pet. Let's say we have the following question in our survey: +As part of our survey on the average number of pets in a household, we may want to know what animal most people prefer to have as a pet. Let's say we have a question in our survey displayed in Figure \@ref(fig:overview-pet-examp1). ```{r} #| label: overview-pet-examp1 @@ -170,9 +170,9 @@ As part of our survey on the average number of pets in a household, researchers knitr::include_graphics(path="images/PetExample1.png") ``` -This question may have validity issues as it only provides the options of "dogs" and "cats" to respondents, and the interpretation of the data could be incorrect. For example, if we had 100 respondents who answered the question and 50 selected dogs, then the results of this question cannot be "50% of the population prefers to have a dog as a pet," as only two response options were provided. If a respondent taking our survey prefers turtles, they could either be forced to choose a response between these two (i.e., interpret the question as "between dogs and cats, which do you prefer?" 
and result in *measurement error*), or they may not answer the question (which results in *item nonresponse error*). Based on this, the interpretation of this question should be, "When given a choice between dogs and cats, 50% of respondents preferred to have a dog as a pet." +This question may have validity issues as it only provides the options of "dogs" and "cats" to respondents, and the interpretation of the data could be incorrect. For example, if we had 100 respondents who answered the question and 50 selected dogs, then the results of this question cannot be "50% of the population prefers to have a dog as a pet," as only two response options were provided. If a respondent taking our survey prefers turtles, they could either be forced to choose a response between these two (i.e., interpret the question as "between dogs and cats, which do you prefer?" and result in *measurement error*), or they may not answer the question (which results in *item nonresponse error*.) Based on this, the interpretation of this question should be, "When given a choice between dogs and cats, 50% of respondents preferred to have a dog as a pet." -To avoid this issue, researchers should consider these possibilities and adjust the question accordingly. One simple way could be to add an "other" response option to give respondents a chance to provide a different response. The "other" response option could then include a way for respondents to write their other preference. For example, we could rewrite this question as: +To avoid this issue, we should consider these possibilities and adjust the question accordingly. One simple way could be to add an "other" response option to give respondents a chance to provide a different response. The "other" response option could then include a way for respondents to write their other preference. For example, we could rewrite this question as displayed in Figure \@ref(fig:overview-pet-examp2). ```{r} #| label: overview-pet-examp2 @@ -185,57 +185,57 @@ To avoid this issue, researchers should consider these possibilities and adjust knitr::include_graphics(path="images/PetExample2.png") ``` -Researchers can then code the responses from the open-ended box and get a better understanding of the respondent's choice of preferred pet. Interpreting this question becomes easier as researchers no longer need to qualify the results with the choices provided. +We can then code the responses from the open-ended box and get a better understanding of the respondent's choice of preferred pet. Interpreting this question becomes easier as researchers no longer need to qualify the results with the choices provided. -This is a simple example of how the presentation of the question and options can impact the findings. For more complex topics and questions, researchers must thoroughly consider how to mitigate any impacts from the presentation, formatting, wording, and other aspects. As survey analysts, reviewing not only the data but also the wording of the questions is crucial to ensure the results are presented in a manner consistent with the question asked. Chapter \@ref(c03-survey-data-documentation) provides further details on how to review existing survey documentation to inform our analyses. +This is a simple example of how the presentation of the question and options can impact the findings. For more complex topics and questions, we must thoroughly consider how to mitigate any impacts from the presentation, formatting, wording, and other aspects. 
For survey analysts, reviewing not only the data but also the wording of the questions is crucial to ensure the results are presented in a manner consistent with the question asked. Chapter \@ref(c03-survey-data-documentation) provides further details on how to review existing survey documentation to inform our analyses, and Chapter \@ref(c08-communicating-results) goes into more detail on communicating results.

## Data collection {#overview-datacollection}

-Once the data collection starts, researchers try to stick to the data collection protocol designed during pre-survey planning. However, effective researchers also prepare to adjust their plans and adapt as needed to the current progress of data collection [@Schouten2018]. Some extreme examples could be natural disasters that could prevent mailings or interviewers getting to the sample members. This could cause an in-person survey needing to quickly pivot to a self-administered survey, or the field period could be delayed, for example. Others could be smaller in that something newsworthy occurs connected to the survey, so researchers could choose to play this up in communication materials. In addition to these external factors, there could be factors unique to the survey, such as lower response rates for a specific sub-group, so the data collection protocol may need to find ways to improve response rates for that specific group.
+Once the data collection starts, we try to stick to the data collection protocol designed during pre-survey planning. However, effective researchers also prepare to adjust their plans and adapt as needed to the current progress of data collection [@Schouten2018]. Extreme examples include natural disasters that prevent mailings or keep interviewers from reaching the sample members. In such cases, an in-person survey may need to quickly pivot to a self-administered survey, or the field period may need to be delayed. Other changes could be smaller; for example, something newsworthy may occur that is connected to the survey, so we could choose to highlight it in communication materials. In addition to these external factors, there could be factors unique to the survey, such as lower response rates for a specific sub-group, so the data collection protocol may need to be adjusted to improve response rates for that group.

## Post-survey processing {#overview-post}

-After data collection, various activities need to be completed before we can analyze the survey. Multiple decisions made during this post-survey phase can assist researchers in reducing different error sources, such as weighting to account for the sample selection. Knowing the decisions researchers made in creating the final analytic data can impact how analysts use the data and interpret the results.
+After data collection, various activities need to be completed before we can analyze the survey. Multiple decisions made during this post-survey phase can assist us in reducing different error sources, such as weighting to account for the sample selection. Knowing the decisions made in creating the final analytic data can impact how we use the data and interpret the results.

### Data cleaning and imputation {#overview-post-cleaning}

-Post-survey cleaning is one of the first steps researchers do to get the survey responses into a dataset for use by analysts.
Data cleaning can consist of correcting inconsistent data (e.g., with skip pattern errors or multiple questions throughout the survey being consistent with each other), editing numeric entries or open-ended responses for grammar and consistency, or recoding open-ended questions into categories for analysis. There is no universal set of fixed rules that every project must adhere to. Instead, each project or research study should establish its own guidelines and procedures for handling various cleaning scenarios based on its specific objectives.
+Post-survey cleaning is one of the first steps to get the survey responses into an analytic dataset. Data cleaning can consist of correcting inconsistent data (e.g., resolving skip pattern errors or reconciling related questions throughout the survey so they are consistent with each other), editing numeric entries or open-ended responses for grammar and consistency, or recoding open-ended questions into categories for analysis. There is no universal set of fixed rules that every survey must adhere to. Instead, each survey or research study should establish its own guidelines and procedures for handling various cleaning scenarios based on its specific objectives.

-Researchers should use their best judgment to ensure data integrity, and all decisions should be documented and available to those using the data in the analysis. Each decision a researcher makes impacts *processing error*, so often, researchers have multiple people review these rules or recode open-ended data and adjudicate any differences in an attempt to reduce this error.
+We should use our best judgment to ensure data integrity, and all decisions should be documented and available to those using the data in the analysis. Each decision we make impacts *processing error*, so often, multiple people review these rules or recode open-ended data and adjudicate any differences in an attempt to reduce this error.

-Another crucial step in post-survey processing is *imputation*. Often, there is item nonresponse where respondents do not answer specific questions. If the questions are crucial to analysis efforts or the research question, researchers may implement imputation to reduce *item nonresponse error*. Imputation is a technique for replacing missing or incomplete data values with estimated values. However, as imputation is a way of assigning a value to missing data based on an algorithm or model, it can also introduce *processing error*, so researchers should consider the overall implications of imputing data compared to having item nonresponse. There are multiple ways to impute data. We recommend reviewing other resources like @Kim2021 for more information.
+Another crucial step in post-survey processing is *imputation*. Often, there is item nonresponse where respondents do not answer specific questions. If the questions are crucial to analysis efforts or the research question, we may implement imputation to reduce *item nonresponse error*. Imputation is a technique for replacing missing or incomplete data values with estimated values. However, as imputation is a way of assigning values to missing data based on an algorithm or model, it can also introduce *processing error*, so we should consider the overall implications of imputing data compared to having item nonresponse. There are multiple ways to impute data. We recommend reviewing other resources like @Kim2021 for more information.
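To make the recoding of open-ended responses described above more concrete, a minimal sketch of such cleaning rules is shown below. The data frame `pet_data`, the write-in variable `pet_other`, and the chosen categories are hypothetical stand-ins used only for illustration; the example that follows walks through the reasoning behind decisions like these.

```{r}
#| label: overview-recode-sketch
#| eval: false
library(dplyr)
library(stringr)

# Hypothetical sketch: `pet_data` and `pet_other` stand in for a survey
# dataset and its "other, specify" write-in variable
pet_data_coded <- pet_data %>%
  mutate(
    pet_other_clean = str_to_lower(str_trim(pet_other)),
    PetPreference = case_when(
      pet_other_clean == "puppy" ~ "Dog",
      pet_other_clean %in% c("rabbit", "rabit", "bunny") ~ "Bunny or Rabbit",
      is.na(pet_other_clean) ~ NA_character_,
      TRUE ~ "Other"
    )
  )
```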
#### Example: Number of pets in a household {.unnumbered #overview-post-cleaning-ex}

-Let's return to the question we created to ask about [animal preference](#overview-design-questionnaire-ex). The "other specify" invites respondents to specify the type of animal they prefer to have as a pet. If respondents entered answers such as "puppy," "turtle," "rabit," "rabbit," "bunny," "ant farm," "snake," "Mr. Purr," then researchers may wish to categorize these write-in responses to help with analysis. In this example, "puppy" could be assumed to be a reference to a "Dog", and could be recoded there. The misspelling of "rabit" could be coded along with "rabbit" and "bunny" into a single category of "Bunny or Rabbit". These are relatively standard decisions that a researcher could make. The remaining write-in responses could be categorized in a few different ways. "Mr. Purr," which may be someone's reference to their own cat, could be recoded as "Cat", or it could remain as "Other" or some category that is "Unknown". Depending on the number of responses related to each of the others, they could all be combined into a single "Other" category, or maybe categories such as "Reptiles" or "Insects" could be created. Each of these decisions may impact the interpretation of the data, so our researchers should document the types of responses that fall into each of the new categories and any decisions made.
+Let's return to the question we created to ask about [animal preference](#overview-design-questionnaire-ex). The "other" response option invites respondents to specify the type of animal they prefer to have as a pet. If respondents entered answers such as "puppy," "turtle," "rabit," "rabbit," "bunny," "ant farm," "snake," "Mr. Purr," then we may wish to categorize these write-in responses to help with analysis. In this example, "puppy" could be assumed to be a reference to a "Dog" and could be recoded as such. The misspelling of "rabit" could be coded along with "rabbit" and "bunny" into a single category of "Bunny or Rabbit". These are relatively standard decisions that we can make. The remaining write-in responses could be categorized in a few different ways. "Mr. Purr," which may be someone's reference to their own cat, could be recoded as "Cat", or it could remain as "Other" or some category that is "Unknown". Depending on the number of responses related to each of the others, they could all be combined into a single "Other" category, or categories such as "Reptiles" or "Insects" could be created. Each of these decisions may impact the interpretation of the data, so we should document the types of responses that fall into each of the new categories and any decisions made.

### Weighting {#overview-post-weighting}

-We can address some of the error sources identified in the previous sections using *weighting*. During the weighting process, weights are created for each respondent record. These weights allow the survey responses to generalize to the population. A weight, generally, reflects how many units in the population each respondent represents, and, often the weight is constructed such that the sum of the weights is the size of the population.
+We can address some error sources identified in the previous sections using *weighting*. During the weighting process, weights are created for each respondent record. These weights allow the survey responses to generalize to the population. A weight, generally, reflects how many units in the population each respondent represents.
Often, the weight is constructed such that the sum of the weights is the size of the population.

-Weights can address coverage, sampling, and nonresponse errors. Many published surveys include an "analysis weight" variable that combines these adjustments. However, weighting itself can also introduce *adjustment error*, so researchers need to balance which types of errors should be corrected with weighting. The construction of weights is outside the scope of this book, and researchers should reference other materials if interested in constructing their own [@Valliant2018weights]. Instead, this book assumes the survey has been completed, weights are constructed, and data is available to users.
+Weights can address coverage, sampling, and nonresponse errors. Many published surveys include an "analysis weight" variable that combines these adjustments. However, weighting itself can also introduce *adjustment error*, so we need to balance which types of errors should be corrected with weighting. The construction of weights is outside the scope of this book, so we recommend referencing other materials if interested in weight construction [@Valliant2018weights]. Instead, this book assumes the survey has been completed, weights are constructed, and data are available to users.

#### Example: Number of pets in a household {.unnumbered #overview-post-weighting-ex}

-In the simple example of our survey, we decided to obtain a random sample from each state to select our sample members. Knowing this sampling design, our researcher can include selection weights for analysis that account for how the sample members were selected for the survey. Additionally, the sampling frame may have the type of building associated with each address, so we could include the building type as a potential nonresponse weighting variable, along with some interviewer observations that may be related to our research topic of the average number of pets in a household. Combining these weights, we can create an analytic weight that researchers need to use when analyzing the data.
+In the simple example of our survey, we decided to obtain a random sample from each state to select our sample members. Knowing this sampling design, we can include selection weights for analysis that account for how the sample members were selected for the survey. Additionally, the sampling frame may have the type of building associated with each address, so we could include the building type as a potential nonresponse weighting variable, along with some interviewer observations that may be related to our research topic of the average number of pets in a household. Combining these weights, we can create an analytic weight that analysts need to use when analyzing the data.

### Disclosure {#overview-post-disclosure}

-Before data is released publicly, researchers need to ensure that individual respondents can not be identified by the data when confidentiality is required. There are a variety of different methods that can be used. Here we describe a few of the most commonly used:
+Before data are released publicly, we need to ensure that individual respondents cannot be identified by the data when confidentiality is required. There are a variety of methods that can be used; here, we describe a few of the most common:

-- **Data swapping**: Researchers may swap specific data values across different respondents so that it does not impact insights from the data but ensures that specific individuals cannot be identified.
-- **Top/bottom coding**: Researchers may choose top or bottom coding to mask extreme values. For example, researchers may top-code income values such that households with income greater than \$500,000 are coded as "\$500,000 or more" with other incomes being presented as integers between \$0 and \$499,999. This can impact analyses at the tails of the distribution. -- **Coarsening**: Researchers may use coarsening to mask unique values. For example, a survey question may ask for a precise income but the public data may include data as a categorical variable. Another example commonly used in survey practice is to coarsen geographic variables. Data collectors likely know the precise address of sample members but the public data may only include the state or even region of respondents. -- **Perturbation**: Researchers may add random noise to outcomes. As with swapping, this is done so that it does not impact insights from the data but ensures that specific individuals cannot be identified. +- **Data swapping**: We may swap specific data values across different respondents so that it does not impact insights from the data but ensures that specific individuals cannot be identified. +- **Top/bottom coding**: We may choose top or bottom coding to mask extreme values. For example, we may top-code income values such that households with income greater than \$500,000 are coded as "\$500,000 or more," with other incomes being presented as integers between \$0 and \$499,999. This can impact analyses at the tails of the distribution. +- **Coarsening**: We may use coarsening to mask unique values. For example, a survey question may ask for a precise income, but the public data may include income as a categorical variable. Another example commonly used in survey practice is to coarsen geographic variables. Data collectors likely know the precise address of sample members, but the public data may only include the state or even region of respondents. +- **Perturbation**: We may add random noise to outcomes. As with swapping, this is done so that it does not impact insights from the data but ensures that specific individuals cannot be identified. -There is as much art as there is science to the methods used for disclosure. In the survey documentation, researchers will only provide high-level comments about the disclosure and not specific details. This ensures nobody can reverse the disclosure and thus identify individuals. For more information on different disclosure methods, please see @Skinner2009 and the [AAPOR Standards](https://aapor.org/standards-and-ethics/disclosure-standards/). +There is as much art as there is science to the methods used for disclosure. Only high-level comments about the disclosure are provided in the survey documentation, not specific details. This ensures nobody can reverse the disclosure and thus identify individuals. For more information on different disclosure methods, please see @Skinner2009 and the [AAPOR Standards](https://aapor.org/standards-and-ethics/disclosure-standards/). ### Documentation {#overview-post-documentation} -Documentation is a critical step of the survey life cycle. Researchers systematically record all the details, decisions, procedures, and methodologies to ensure transparency, reproducibility, and the overall quality of survey research. +Documentation is a critical step of the survey life cycle. 
We should systematically record all the details, decisions, procedures, and methodologies to ensure transparency, reproducibility, and the overall quality of survey research. Proper documentation allows analysts to understand, reproduce, and evaluate the study's methods and findings. Chapter \@ref(c03-survey-data-documentation) dives into how analysts should use survey data documentation. ## Post-survey data analysis and reporting -After completing the survey life cycle, the data is ready for analysts to use. The rest of this book continues from this point. For more information on the survey life cycle, please explore the references cited throughout this chapter. +After completing the survey life cycle, the data are ready for analysts to use. Chapter \@ref(c04-getting-started) continues from this point. For more information on the survey life cycle, please explore the references cited throughout this chapter. diff --git a/03-survey-data-documentation.Rmd b/03-survey-data-documentation.Rmd index 33e7e43e..6fe36925 100644 --- a/03-survey-data-documentation.Rmd +++ b/03-survey-data-documentation.Rmd @@ -23,18 +23,18 @@ The technical documentation, also known as user guides or methodology/analysis g * **Introduction:** The introduction orients us to the survey. This section provides the project's background, the study's purpose, and the main research questions. * **Study design:** The study design section describes how researchers prepared and administered the survey. - * **Sample:** The sample section describes the sample frame, any known sampling errors, and the limitations of the sample. This section can contain recommendations on how to use sampling weights. Look for weight information, whether the survey design contains strata, clusters/PSUs, or replicate weights. Also look for population sizes, finite population correction, or replicate weight scaling information. Additional detail on sample designs is available in Chapter \@ref(c10-sample-designs-replicate-weights). + * **Sample:** The sample section describes the sample frame, any known sampling errors, and the sample's limitations. This section can contain recommendations on how to use sampling weights. Look for weight information, whether the survey design contains strata, clusters/PSUs, or replicate weights. Also, look for population sizes, finite population correction, or replicate weight scaling information. Additional detail on sample designs is available in Chapter \@ref(c10-sample-designs-replicate-weights). * **Notes on fielding:** Any additional notes on fielding, such as response rates, may be found in the technical documentation. -The technical documentation may include other helpful resources. Some technical documentation includes syntax for SAS, SUDAAN, Stata, and/or R, so we do not have to create this code from scratch. +The technical documentation may include other helpful resources. For example, some technical documentation includes syntax for SAS, SUDAAN, Stata, and/or R, so we do not have to create this code from scratch. ### Questionnaires -A questionnaire is a series of questions used to collect information from people in a survey. It can ask about opinions, behaviors, demographics, or even just numbers like the count of lightbulbs, square footage, or farm size. 
Questionnaires can employ different types of questions, such as closed-ended (e.g., select one or check all that apply), open-ended (e.g., numeric or text), Likert scales (e.g., a 5- or 7-point scale specifying a respondent's level of agreement to a statement), or ranking questions (e.g., a list of options that a respondent ranks by preference). It may randomize the display order of responses or include instructions that help respondents understand the questions. A survey may have one questionnaire or multiple, depending on its scale and scope. +A questionnaire is a series of questions used to collect information from people in a survey. It can ask about opinions, behaviors, demographics, or even just numbers like the count of lightbulbs, square footage, or farm size. Questionnaires can employ different types of questions, such as closed-ended (e.g., select one or check all that apply), open-ended (e.g., numeric or text), Likert scales (e.g., a 5- or 7-point scale specifying a respondent's level of agreement to a statement), or ranking questions (e.g., a list of options that a respondent ranks by preference.) It may randomize the display order of responses or include instructions that help respondents understand the questions. A survey may have one questionnaire or multiple, depending on its scale and scope. -The questionnaire is another important resource for understanding and interpreting the survey data (see Section \@ref(overview-design-questionnaire)), and we should use it alongside any analysis. It provides details about each of the questions asked in the survey, such as question name, question wording, response options, skip logic, randomizations, display specification, mode differences, and the universe (the subset of respondents that were asked a question). +The questionnaire is another important resource for understanding and interpreting the survey data (see Section \@ref(overview-design-questionnaire)), and we should use it alongside any analysis. It provides details about each of the questions asked in the survey, such as question name, question wording, response options, skip logic, randomizations, display specifications, mode differences, and the universe (the subset of respondents who were asked a question.) -Below, in Figure \@ref(fig:understand-que-examp), we show an example from the ANES 2020 questionnaire [@anes-svy]. The figure shows a question's question name (`POSTVOTE_RVOTE`), description (Did R Vote?), full wording of the question and responses, response order, universe, question logic (this question was only asked if `vote_pre` = 0), and other specifications. The section also includes the variable name, which we can link to the codebook. +In Figure \@ref(fig:understand-que-examp), we show an example from the ANES 2020 questionnaire [@anes-svy]. The figure shows the question name (`POSTVOTE_RVOTE`), description (Did R Vote?), full wording of the question and responses, response order, universe, question logic (this question was only asked if `vote_pre` = 0), and other specifications. The section also includes the variable name, which we can link to the codebook. ```{r} #| label: understand-que-examp @@ -56,13 +56,13 @@ The content and structure of questionnaires vary depending on the specific surve knitr::include_graphics(path = "images/questionnaire-example-2.jpg") ``` -We should factor in the details of a survey when conducting our analyses. 
For example, surveys that use various modes (e.g., web and mail) may have differences in question wording or skip logic, as web surveys can include fills or automate skip logic. These variations could warrant separate analyses for each mode. +We should factor in the details of a survey when conducting our analyses. For example, surveys that use various modes (e.g., web and mail) may have differences in question wording or skip logic, as web surveys can include fills or automate skip logic. If large enough, these variations could warrant separate analyses for each mode. ### Codebooks -While a questionnaire provides information about the questions posed to respondents, the codebook explains how the survey data was coded and recorded. It lists details such as variable names, variable labels, variable meanings, codes for missing data, value labels, and value types (whether categorical or continuous, etc.). The codebook helps us understand and use the variables appropriately in our analysis. In particular, the codebook (as opposed to the questionnaire) often includes information on missing data. Note that the term *data dictionary* is sometimes used interchangeably with codebook, but a data dictionary may include more details on the structure and elements of the data. +While a questionnaire provides information about the questions posed to respondents, the codebook explains how the survey data were coded and recorded. It lists details such as variable names, variable labels, variable meanings, codes for missing data, value labels, and value types (whether categorical, continuous, etc.) The codebook helps us understand and use the variables appropriately in our analysis. In particular, the codebook (as opposed to the questionnaire) often includes information on missing data. Note that the term *data dictionary* is sometimes used interchangeably with codebook, but a data dictionary may include more details on the structure and elements of the data. -Figure \@ref(fig:understand-codebook-examp) is a question from the ANES 2020 codebook [@anes-cb]. This section indicates a particular variable's name (`V202066`), question wording, value labels, universe, and associated survey question (`POSTVOTE_RVOTE`). +Figure \@ref(fig:understand-codebook-examp) is a question from the ANES 2020 codebook [@anes-cb]. This section indicates a variable's name (`V202066`), question wording, value labels, universe, and associated survey question (`POSTVOTE_RVOTE`.) ```{r} #| label: understand-codebook-examp @@ -82,7 +82,7 @@ An erratum (singular) or errata (plural) is a document that lists errors found i * Issuing a corrected data table after realizing a typo or mistake in a table cell * Reporting incorrectly programmed skips in an electronic survey where questions are skipped by the respondent when they should not have been -The 2004 ANES dataset released an erratum, notifying analysts to remove a specific row from the data file due to the inclusion of a respondent who should not have been part of the sample. Adhering to an issued erratum helps us increase the accuracy and reliability of analysis. +For example, the 2004 ANES dataset released an erratum, notifying analysts to remove a specific row from the data file due to the inclusion of a respondent who should not have been part of the sample. Adhering to an issued erratum helps us increase the accuracy and reliability of analysis. 
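When an erratum calls for dropping specific records, as in the ANES example above, adhering to it is typically a small filtering step taken before any analysis. The sketch below is purely illustrative: `anes_2004`, `case_id`, and the flagged value are hypothetical stand-ins, as the actual identifiers are listed in the erratum itself.

```{r}
#| label: understand-errata-sketch
#| eval: false
library(dplyr)

# Hypothetical sketch: the erratum documents which record(s) to remove;
# `anes_2004` and `case_id` are placeholder names, not the real variables
flagged_cases <- c("id listed in the erratum")

anes_2004_corrected <- anes_2004 %>%
  filter(!case_id %in% flagged_cases)
```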
### Additional resources @@ -90,11 +90,11 @@ Survey documentation may include additional material, such as interviewer instru ## Missing data coding -For some observations in a dataset, there may be missing data. This can be by design or from nonresponse, and these concepts are detailed in Chapter \@ref(c11-missing-data). In that chapter, we also discuss how to analyze data with missing data. In this section, we discuss how to understand documentation related to missing data. +Some observations in a dataset may have missing data. This can be due to design or nonresponse, and these concepts are detailed in Chapter \@ref(c11-missing-data). In that chapter, we also discuss how to analyze data with missing values. This chapter walks through how to understand documentation related to missing data. -The survey documentation, often the codebook, represents the missing data with a code. The codebook may list different codes depending on why certain data is missing. In the example of variable `V202066` from the ANES (Figure \@ref(fig:understand-codebook-examp)), `-9` represents "Refused," `-7` means that the response was deleted due to an incomplete interview, `-6` means that there is no response because there was no follow-up interview, and `-1` means "Inapplicable" (due to the designed skip pattern). +The survey documentation, often the codebook, represents the missing data with a code. The codebook may list different codes depending on why certain data points are missing. In the example of variable `V202066` from the ANES (Figure \@ref(fig:understand-codebook-examp)), `-9` represents "Refused," `-7` means that the response was deleted due to an incomplete interview, `-6` means that there is no response because there was no follow-up interview, and `-1` means "Inapplicable" (due to a designed skip pattern.) -As another example, there may be a summary variable that describes the missingness of a set of variables - particularly with "select all that apply" or "multiple response" questions. In the National Crime Victimization Survey (NCVS), respondents who are victims of a crime and saw the offender are asked if the offender have a weapon and then asked what the type of weapon was. This part of the questionnaire from 2021 is shown in Figure \@ref(fig:understand-ncvs-weapon-q). +As another example, there may be a summary variable that describes the missingness of a set of variables - particularly with "select all that apply" or "multiple response" questions. In the National Crime Victimization Survey (NCVS), respondents who are victims of a crime and saw the offender are asked if the offender had a weapon and then asked what the type of weapon was. This part of the questionnaire from 2021 is shown in Figure \@ref(fig:understand-ncvs-weapon-q). ```{r} #| label: understand-ncvs-weapon-q @@ -105,7 +105,7 @@ As another example, there may be a summary variable that describes the missingne knitr::include_graphics(path="images/questionnaire-ncvs-weapon.jpg") ``` -The NCVS codebook includes coding for all multiple response variables of a "lead in" variable that summarizes the individual options. For question 23a on the weapon type, the lead in variable is V4050 which is shown in \@ref(fig:understand-ncvs-weapon-cb). This variable is then followed by a set of variables for each weapon type. An example of one of the individual variables from the codebook, the handgun, is shown in \@ref(fig:understand-ncvs-weapon-cb-hg). 
We will dive in more to this example in Chapter \@ref(c11-missing-data) of how to analyze this variable. +The NCVS codebook includes coding for all multiple response variables of a "lead in" variable that summarizes the individual options. For question 23a on the weapon type, the lead-in variable is V4050, which is shown in \@ref(fig:understand-ncvs-weapon-cb). This variable is then followed by a set of variables for each weapon type. An example of one of the individual variables from the codebook, the handgun, is shown in \@ref(fig:understand-ncvs-weapon-cb-hg). We will dive into how to analyze this variable in Chapter \@ref(c11-missing-data). ```{r} #| label: understand-ncvs-weapon-cb @@ -124,21 +124,19 @@ knitr::include_graphics(path="images/codebook-ncvs-weapon-li.jpg") knitr::include_graphics(path="images/codebook-ncvs-weapon-handgun.jpg") ``` -When data is read into R, some values may be system missing, that is they are coded as `NA` even if that is not evident in a codebook. We will discuss in Chapter \@ref(c11-missing-data) how to analyze data with `NA` values and review how R handles missing data in calculations. +When data are read into R, some values may be system missing, that is they are coded as `NA` even if that is not evident in a codebook. We discuss in Chapter \@ref(c11-missing-data) how to analyze data with `NA` values and review how R handles missing data in calculations. ## Example: American National Election Studies (ANES) 2020 survey documentation -Let's look at the survey documentation for the American National Election Studies (ANES) 2020. The survey website is located at [https://electionstudies.org/data-center/2020-time-series-study/](https://electionstudies.org/data-center/2020-time-series-study/). - -Navigating to "User Guide and Codebook" [@anes-cb], we can download the PDF that contains the survey documentation, titled "ANES 2020 Time Series Study Full Release: User Guide and Codebook". Do not be daunted by the 796-page PDF. We will focus on the most critical information. +Let's look at the survey documentation for the American National Election Studies (ANES) 2020 and the documentation from their [website](https://electionstudies.org/data-center/2020-time-series-study/). Navigating to "User Guide and Codebook" [@anes-cb], we can download the PDF that contains the survey documentation, titled "ANES 2020 Time Series Study Full Release: User Guide and Codebook". Do not be daunted by the 796-page PDF. Below, we focus on the most critical information. #### Introduction {-} -The first section in the User Guide explains that the ANES 2020 Times Series Study continues a series of election surveys conducted since 1948. These surveys contain data on public opinion and voting behavior in the U.S. presidential elections. The introduction also includes information about the modes used for data collection (web, live video interviewing, or CATI). Additionally, there is a summary of the number of pre-election interviews (8,280) and post-election re-interviews (7,449). +The first section in the User Guide explains that the ANES 2020 Times Series Study continues a series of election surveys conducted since 1948. These surveys contain data on public opinion and voting behavior in the U.S. presidential elections. The introduction also includes information about the modes used for data collection (web, live video interviewing, or CATI.) Additionally, there is a summary of the number of pre-election interviews (8,280) and post-election re-interviews (7,449.) 
#### Sample design and respondent recruitment {-} -The section "Sample Design and Respondent Recruitment" provides more detail about the survey's sequential mixed-mode design. All three modes were conducted one after another and not at the same time. Additionally, it indicates that for the 2020 survey, they resampled all respondents who participated in 2016 ANES, along with a newly-drawn cross-section: +The section "Sample Design and Respondent Recruitment" provides more detail about the survey's sequential mixed-mode design. All three modes were conducted one after another and not at the same time. Additionally, it indicates that for the 2020 survey, they resampled all respondents who participated in the 2016 ANES, along with a newly drawn cross-section: > The target population for the fresh cross-section was the 231 million non-institutional U.S. citizens aged 18 or older living in the 50 U.S. states or the District of Columbia. @@ -150,7 +148,7 @@ The section "Data Analysis, Weights, and Variance Estimation" includes informati > For analysis of the complete set of cases using pre-election data only, including all cases and representative of the 2020 electorate, use the full sample pre-election weight, **V200010a**. For analysis including post-election data for the complete set of participants (i.e., analysis of post-election data only or a combination of pre- and post-election data), use the full sample post-election weight, **V200010b**. Additional weights are provided for analysis of subsets of the data... -The document provides more information about the variables, summarized in Table \@ref(tab:aneswgts). +The document provides more information about the design variables, summarized in Table \@ref(tab:aneswgts). Table: (\#tab:aneswgts) Weight and variance information for ANES @@ -165,4 +163,4 @@ The user guide mentions a supplemental document called "How to Analyze ANES Surv > The target population for the fresh cross-section was the 231 million non-institutional U.S. citizens aged 18 or older living in the 50 U.S. states or the District of Columbia. -The documentation suggests that the population should equal around 231 million, but this is a very imprecise count. Upon further investigation in the available resources, we can find the methodology file titled "Methodology Report for the ANES 2020 Time Series Study" [@anes-2020-tech]. This file states that we can use the population total from the Current Population Survey (CPS), a monthly survey sponsored by the U.S. Census Bureau and the U.S. Bureau of Labor Statistics. The CPS provides a more accurate population estimate for a specific month. Therefore, we can use the CPS to get the total population number for March 2020, the time in which the ANES was conducted. Chapter \@ref(c04-getting-started) goes into detailed instructions on how to calculate and adjust this value in the data. \ No newline at end of file +The documentation suggests that the population should equal around 231 million, but this is a very imprecise count. Upon further investigation of the available resources, we can find the methodology file titled "Methodology Report for the ANES 2020 Time Series Study" [@anes-2020-tech]. This file states that we can use the population total from the Current Population Survey (CPS), a monthly survey sponsored by the U.S. Census Bureau and the U.S. Bureau of Labor Statistics. The CPS provides a more accurate population estimate for a specific month. 
Therefore, we can use the CPS to get the total population number for March 2020, when the ANES was conducted. Chapter \@ref(c04-getting-started) goes into detailed instructions on how to calculate and adjust this value in the data. diff --git a/04-set-up.Rmd b/04-set-up.Rmd index f72be534..e1907420 100644 --- a/04-set-up.Rmd +++ b/04-set-up.Rmd @@ -3,7 +3,7 @@ # Getting started {#c04-getting-started} ```{r} -#| label: set-up-styler +#| label: setup-styler #| echo: false #| message: false knitr::opts_chunk$set(tidy = 'styler') @@ -13,13 +13,13 @@ library(tidyselect) ## Introduction -This chapter provides an overview of the packages, data, and design objects we use frequently throughout this book. As mentioned in Chapter \@ref(c02-overview-surveys), understanding how a survey was conducted helps us make sense of the results and interpret findings. Therefore, we provide background on the datasets used in examples and exercises. Next, we walk through how to create the survey design objects necessary to begin analysis. Finally, we provide an overview of the {srvyr} package and the steps needed for analysis. If you have questions or face issues while going through the book, please report them in the book's [GitHub repository](https://github.com/tidy-survey-r/tidy-survey-book). +This chapter provides an overview of the packages, data, and design objects we use frequently throughout this book. As mentioned in Chapter \@ref(c02-overview-surveys), understanding how a survey was conducted helps us make sense of the results and interpret findings. Therefore, we provide background on the datasets used in examples and exercises. Next, we walk through how to create the survey design objects necessary to begin an analysis. Finally, we provide an overview of the {srvyr} package and the steps needed for analysis. Please report any bugs and issues while going through the book to the book's [GitHub repository](https://github.com/tidy-survey-r/tidy-survey-book). ## Setup -The Setup section provides details on the required packages and data, as well as the steps for preparing survey design objects. For a streamlined learning experience, we recommend taking the time to walk through the code provided and making sure everything is properly set up. +This section provides details on the required packages and data, as well as the steps for preparing survey design objects. For a streamlined learning experience, we recommend taking the time to walk through the code provided here and making sure everything is properly set up. -### Packages +### Packages {#setup-load-pkgs} We use several packages throughout the book, but let's install and load specific ones for this chapter. Many functions in the examples and exercises are from three packages: {tidyverse}, {survey}, and {srvyr}. If they are not already installed, use the code below. The {tidyverse} and {survey} packages can both be installed from the Comprehensive R Archive Network (CRAN) [@lumley2010complex; @tidyverse2019]. We use the GitHub development version of {srvyr} because of its additional functionality compared to the one on CRAN [@R-srvyr]. Install the package directly from GitHub using the {remotes} package: @@ -52,7 +52,7 @@ library(srvyr) library(srvyrexploR) ``` -The packages {broom}, {gt}, and {gtsummary} play a role in displaying output and creating formatted tables [@R-gt, @R-broom; @gtsummary]. 
Install them with the provided code^[Note: {broom} is already included in the tidyverse, so no separate installation is required]: +The packages {broom}, {gt}, and {gtsummary} play a role in displaying output and creating formatted tables [@R-gt, @R-broom; @gtsummarysjo]. Install them with the provided code^[Note: {broom} is already included in the tidyverse, so no separate installation is required.]: ```{r} #| label: setup-install-extra @@ -90,7 +90,7 @@ After installing this package, load it using the `library()` function: library(censusapi) ``` -Note that the {censusapi} package requires a Census API key, available for free from the [U.S. Census Bureau website](https://api.census.gov/data/key_signup.html) (refer to the package documentation for more information). We recommend storing the Census API key in our R environment instead of directly in the code. After obtaining the API key, save it in your R environment by running `Sys.setenv()`: +Note that the {censusapi} package requires a Census API key, available for free from the [U.S. Census Bureau website](https://api.census.gov/data/key_signup.html) (refer to the package documentation for more information). We recommend storing the Census API key in the R environment instead of directly in the code. To do this, run `Sys.setenv()` after obtaining the API key. ```{r} #| label: setup-census-api-setup @@ -104,7 +104,7 @@ There are a few other packages used in the book in limited frequency. We list th ### Data -As mentioned above, the {srvyrexploR} package contains the datasets used in the book. Once installed and loaded, explore the documentation using the `help()` function. Read the descriptions of the datasets to understand what they contain: +The {srvyrexploR} package contains the datasets used in the book. Once installed and loaded, explore the documentation using the `help()` function. Read the descriptions of the datasets to understand what they contain: ```{r} #| label: setup-datapkg-help @@ -112,13 +112,13 @@ As mentioned above, the {srvyrexploR} package contains the datasets used in the help(package = "srvyrexploR") ``` -This book uses two main datasets: the American National Election Studies [ANES -- @debell] and the Residential Energy Consumption Survey [RECS -- @recs-2020-tech] which are included as `anes_2020` and `recs_2020`, respectively, in the {srvyrexploR} package. +This book uses two main datasets: the American National Election Studies [ANES -- @debell] and the Residential Energy Consumption Survey [RECS -- @recs-2020-tech], which are included as `anes_2020` and `recs_2020` in the {srvyrexploR} package, respectively. #### American National Election Studies (ANES) Data {-} -The ANES is a study that collects data from election surveys dating back to 1948. These surveys contain information on public opinion and voting behavior in U.S. presidential elections and some midterm elections^[In the United States, presidential elections are held in years divisible by four. In other even years, there are elections at the federal level for congress which are referred to as midterm elections as they occur at the middle of the term of a president.]. They cover topics such as party affiliation, voting choice, and level of trust in the government. The 2020 survey, the data we use in the book, was fielded online, through live video interviews, or via computer-assisted telephone interviews (CATI). +ANES is a study that collects data from election surveys dating back to 1948. 
These surveys contain information on public opinion and voting behavior in U.S. presidential elections and some midterm elections^[In the United States, presidential elections are held in years divisible by four. In other even years, there are elections at the federal level for congress which are referred to as midterm elections as they occur at the middle of the term of a president.]. They cover topics such as party affiliation, voting choice, and level of trust in the government. The 2020 survey (data used in this book) was fielded online, through live video interviews, or via computer-assisted telephone interviews (CATI). -When working with new survey data, analysts should review the survey documentation (see Chapter \@ref(c03-survey-data-documentation)) to understand the data collection methods. The original ANES data contains variables starting with `V20` [@debell], so to assist with our analysis throughout the book, we created descriptive variable names. For example, the respondent's age is now in a variable called `Age`, and gender is in a variable called `Gender`. These descriptive variables are included in the {srvyrexploR} package, and Table \@ref(tab:anes-view-tab) displays the list of these renamed variables. A complete overview of all variables can be found in `r if (!knitr:::is_html_output()) 'the online Appendix ('`Appendix \@ref(anes-cb)`r if (!knitr:::is_html_output()) ')'`. +When working with new survey data, we should review the survey documentation (see Chapter \@ref(c03-survey-data-documentation)) to understand the data collection methods. The original ANES data contains variables starting with `V20` [@debell], so to assist with our analysis throughout the book, we created descriptive variable names. For example, the respondent's age is now in a variable called `Age`, and gender is in a variable called `Gender`. These descriptive variables are included in the {srvyrexploR} package, and Table \@ref(tab:anes-view-tab) displays the list of these renamed variables. A complete overview of all variables can be found in `r if (!knitr:::is_html_output()) 'the online Appendix ('`Appendix \@ref(anes-cb)`r if (!knitr:::is_html_output()) ')'`. (ref:anes-view-tab) List of created variables in the ANES Data @@ -156,9 +156,9 @@ From the output, we can see there are `r nrow(anes_2020 %>% select(-matches("^V\ #### Residential Energy Consumption Survey (RECS) Data {-} -RECS is a study that measures energy consumption and expenditure in American households. Funded by the Energy Information Administration, the RECS data are collected through interviews with household members and energy suppliers. These interviews take place in person, over the phone, via mail, and on the web with modes changing over time. The survey has been fielded 14 times between 1950 and 2020. It includes questions about appliances, electronics, heating, air conditioning (A/C), temperatures, water heating, lighting, energy bills, respondent demographics, and energy assistance. +RECS is a study that measures energy consumption and expenditure in American households. Funded by the Energy Information Administration, RECS data are collected through interviews with household members and energy suppliers. These interviews take place in person, over the phone, via mail, and on the web, with modes changing over time. The survey has been fielded 14 times between 1950 and 2020. 
It includes questions about appliances, electronics, heating, air conditioning (A/C), temperatures, water heating, lighting, energy bills, respondent demographics, and energy assistance. -As mentioned above, analysts should read the survey documentation (see Chapter \@ref(c03-survey-data-documentation)) to understand how the data was collected and implemented. Table \@ref(tab:recs-view-tab) displays the list of variables in the RECS data (not including the weights, which start with `NWEIGHT` and will be described in more detail in Chapter \@ref(c10-sample-designs-replicate-weights)). An overview of all variables can be found in `r if (!knitr:::is_html_output()) 'the online Appendix ('`Appendix \@ref(recs-cb)`r if (!knitr:::is_html_output()) ')'`. +We should read the survey documentation (see Chapter \@ref(c03-survey-data-documentation)) to understand how the data were collected and implemented. Table \@ref(tab:recs-view-tab) displays the list of variables in the RECS data (not including the weights, which start with `NWEIGHT` and are described in more detail in Chapter \@ref(c10-sample-designs-replicate-weights)). An overview of all variables can be found in `r if (!knitr:::is_html_output()) 'the online Appendix ('`Appendix \@ref(recs-cb)`r if (!knitr:::is_html_output()) ')'`. (ref:recs-view-tab) List of Variables in the RECS Data @@ -193,19 +193,19 @@ recs_2020 %>% glimpse() ``` -From the output, we can see that there are `r nrow(recs_2020 %>% select(-matches("^NWEIGHT"))) %>% formatC(big.mark = ",")` rows and `r ncol(recs_2020 %>% select(-matches("^NWEIGHT"))) %>% formatC(big.mark = ",")` non-weight variables in the RECS data. This output also indicates that most of the variables are in double (numeric) format (e.g., `TOTSQFT_EN`), with some factor (e.g., `Region`), Boolean (e.g., `ACUsed`), character (e.g., `REGIONC`), and ordinal (e.g., `YearMade`) variables. +From the output, we can see that the RECS data has `r nrow(recs_2020 %>% select(-matches("^NWEIGHT"))) %>% formatC(big.mark = ",")` rows and `r ncol(recs_2020 %>% select(-matches("^NWEIGHT"))) %>% formatC(big.mark = ",")` non-weight variables. This output also indicates that most of the variables are in double (numeric) format (e.g., `TOTSQFT_EN`), with some factor (e.g., `Region`), Boolean (e.g., `ACUsed`), character (e.g., `REGIONC`), and ordinal (e.g., `YearMade`) variables. ### Design objects {#setup-des-obj} -The design object is the backbone for survey analysis. It is where we specify the sampling design, weights, and other necessary information to ensure we account for errors in the data. Before creating the design object, analysts should carefully review the survey documentation to understand how to create the design object for accurate analysis. +The design object is the backbone for survey analysis. It is where we specify the sampling design, weights, and other necessary information to ensure we account for errors in the data. Before creating the design object, we should carefully review the survey documentation to understand how to create the design object for accurate analysis. -In this chapter, we provide details on how to code the design object for the ANES and RECS data used in the book. However, we only provide a high-level overview to get readers started. For a deeper understanding of creating these design objects for a variety of sampling designs, see Chapter \@ref(c10-sample-designs-replicate-weights). 
+In this section, we provide details on how to code the design object for the ANES and RECS data used in the book. However, we only provide a high-level overview to get readers started. For a deeper understanding of creating design objects for a variety of sampling designs, see Chapter \@ref(c10-sample-designs-replicate-weights). -While we recommend conducting exploratory data analysis on the original data before diving into complex survey analysis (see Chapter \@ref(c12-recommendations)), the actual analysis and inference should be performed with the survey design objects instead of the original survey data. For example, the ANES data is called `anes_2020`. If we create a survey design object called `anes_des`, our analyses should begin with `anes_des` and not `anes_2020`. Using the survey design object ensures that our calculations are appropriately accounting for the details of the survey design. +While we recommend conducting exploratory data analysis on the original data before diving into complex survey analysis (see Chapter \@ref(c12-recommendations)), the actual survey analysis and inference should be performed with the survey design objects instead of the original survey data. For example, the ANES data is called `anes_2020`. If we create a survey design object called `anes_des`, our survey analyses should begin with `anes_des` and not `anes_2020`. Using the survey design object ensures that our calculations appropriately account for the details of the survey design. #### American National Election Studies (ANES) Design Object {-} -The ANES documentation [@debell] details the sampling and weighting implications for analyzing the survey data. From this documentation and as noted in Chapter \@ref(c03-survey-data-documentation), the 2020 ANES data is weighted to the sample, not the population. To make generalizations about the population, we need to weigh the data against the full population count. The ANES methodology recommends using the Current Population Survey (CPS) to determine the number of non-institutional U.S. citizens aged 18 or older living in the 50 U.S. states or D.C. in March of 2020. +The ANES documentation [@debell] details the sampling and weighting implications for analyzing the survey data. From this documentation and as noted in Chapter \@ref(c03-survey-data-documentation), the 2020 ANES data are weighted to the sample, not the population. To make generalizations about the population, we need to weigh the data against the full population count. The ANES methodology recommends using the Current Population Survey (CPS) to determine the number of non-institutional U.S. citizens aged 18 or older living in the 50 U.S. states or D.C. in March 2020. We can use the {censusapi} package to obtain the information needed for the survey design object. The `getCensus()` function allows us to retrieve the CPS data for March (`cps/basic/mar`) in 2020 (`vintage = 2020`). Additionally, we extract several variables from the CPS: @@ -236,7 +236,7 @@ cps_state <- cps_state_in %>% In the code above, we include `region = "state"`. The default region type for the CPS data is at the state level. While not required, including the region can be helpful for understanding the geographical context of the data. -In `getCensus()`, we filtered the dataset by specifying the month (`HRMONTH == 3`) and year (`HRYEAR4 == 2020`) of our request. Therefore, we expect that all interviews within our output were conducted during that particular month and year. 
We can confirm that the data is from March 2020 by running the code below: +In `getCensus()`, we filtered the dataset by specifying the month (`HRMONTH == 3`) and year (`HRYEAR4 == 2020`) of our request. Therefore, we expect that all interviews within our output were conducted during that particular month and year. We can confirm that the data are from March 2020 by running the code below: ```{r} #| label: setup-anes-cps-date @@ -260,14 +260,12 @@ To calculate the U.S. population from the filtered data, we sum the person weigh targetpop <- cps_narrow_resp %>% pull(PWSSWGT) %>% sum() -``` -```{r} -#| label: setup-anes-cps-targetpop-print scales::comma(targetpop) ``` -The target population in 2020 is `r scales::comma(targetpop)`. This result gives us what we need to create the survey design object for estimating population statistics. Using the `anes_2020` data, we adjust the weighting variable (`V200010b`) using the target population we just calculated (`targetpop`). We determine the proportion of the total weight for each individual weight (`V200010b / sum(V200010b)`) and then multiply that proportion by the calculated target population. + +The population of interest in 2020 is `r scales::comma(targetpop)`. This result gives us what we need to create the survey design object for estimating population statistics. Using the `anes_2020` data, we adjust the weighting variable (`V200010b`) using the population of interest we just calculated (`targetpop`). We determine the proportion of the total weight for each individual weight (`V200010b / sum(V200010b)`) and then multiply that proportion by the calculated population of interest. ```{r} #| label: setup-anes-adjust @@ -288,13 +286,13 @@ anes_des <- anes_adjwgt %>% anes_des ``` -We can examine this new object to learn more about the survey design, such that the ANES is a "Stratified 1 - level Cluster Sampling design (with replacement) With (101) clusters". Additionally, the output displays the sampling variables and then lists the remaining variables in the dataset. This design object will be used throughout this book to conduct survey analysis. +We can examine this new object to learn more about the survey design, such that the ANES is a "Stratified 1 - level Cluster Sampling design (with replacement) With (101) clusters". Additionally, the output displays the sampling variables and then lists the remaining variables in the dataset. This design object is used throughout this book to conduct survey analysis. #### Residential Energy Consumption Survey (RECS) Design Object {-} -The RECS documentation [@recs-2020-tech] provides information on the survey's sampling and weighting implications for analysis. The documentation shows the 2020 RECS uses Jackknife weights, where the main analytic weight is `NWEIGHT`, and the Jackknife weights are `NWEIGHT1`-`NWEIGHT60`. We can specify these in the weights and repweights arguments in the survey design object code, respectively. +The RECS documentation [@recs-2020-tech] provides information on the survey's sampling and weighting implications for analysis. The documentation shows the 2020 RECS uses Jackknife weights, where the main analytic weight is `NWEIGHT`, and the Jackknife weights are `NWEIGHT1`-`NWEIGHT60`. We can specify these in the `weights` and `repweights` arguments in the survey design object code, respectively. -With Jackknife weights, additional information is required: `type`, `scale`, and `mse`. 
Chapter \@ref(c10-sample-designs-replicate-weights) goes into depth about each of these arguments, but to quickly get started, the documentation lets us know that `type=JK1`, `scale=59/60`, and `mse = TRUE`. We can use the following code to create the survey design object: +With Jackknife weights, additional information is required: `type`, `scale`, and `mse`. Chapter \@ref(c10-sample-designs-replicate-weights) goes into depth about each of these arguments, but to quickly get started, the RECS documentation lets us know that `type=JK1`, `scale=59/60`, and `mse = TRUE`. We can use the following code to create the survey design object: ```{r} #| label: setup-recs-des @@ -311,11 +309,11 @@ recs_des <- recs_2020 %>% recs_des ``` -Viewing this new object provides information about the survey design, such that the RECS is an "unstratified cluster jacknife (JK1) with 60 replicates and MSE variances". Additionally, the output shows the sampling variables (`NWEIGHT1`-`NWEIGHT60`) and then lists the remaining variables in the dataset. This design object will be used throughout this book to conduct survey analysis. +Viewing this new object provides information about the survey design, such that RECS is an "Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances". Additionally, the output shows the sampling variables (`NWEIGHT1`-`NWEIGHT60`) and then lists the remaining variables in the dataset. This design object is used throughout this book to conduct survey analysis. ## Survey analysis process {#survey-analysis-process} -The section above walked through the installation and loading of several packages, introduced the survey data available in the {srvyrexploR} package, and provided context on preparing survey design objects for the ANES and RECS data. Once the survey design objects are created, there is a general process for analyzing data to create estimates with {srvyr} package: +There is a general process for analyzing data to create estimates with {srvyr} package: 1. Create a `tbl_svy` object (a survey object) using: `as_survey_design()` or `as_survey_rep()` @@ -325,7 +323,7 @@ The section above walked through the installation and loading of several package 4. Within `summarize()`, specify variables to calculate, including means, totals, proportions, quantiles, and more -In Section \@ref(setup-des-obj), we follow Step #1 to create the survey design objects for the ANES and RECS data featured in this book. Additional details on how to create design objects can be found in \@ref(c10-sample-designs-replicate-weights). Then, once we have the design object, we can then filter the data to any subpopulation of interest (if needed). It is important to filter the data **after** creating the design object. This ensures that we are accurately accounting for the survey design in our calculations. Finally, we can use `group_by()`, `summarize()`, and other functions from the {survey} and {srvyr} packages to analyze the survey data by estimating means, totals, and so on. +In Section \@ref(setup-des-obj), we follow Step #1 to create the survey design objects for the ANES and RECS data featured in this book. Additional details on how to create design objects can be found in Chapter \@ref(c10-sample-designs-replicate-weights). Then, once we have the design object, we can filter the data to any subpopulation of interest (if needed). It is important to filter the data **after** creating the design object. This ensures that we are accurately accounting for the survey design in our calculations. 
Finally, we can use `group_by()`, `summarize()`, and other functions from the {survey} and {srvyr} packages to analyze the survey data by estimating means, totals, and so on. ## Similarities between {dplyr} and {srvyr} functions {#similarities-dplyr-srvyr} @@ -346,25 +344,26 @@ Let's examine the `towny` object's class. We verify that it is a tibble, as indi class(towny) ``` -All tibbles are data.frames but not all data.frames are tibbles. Compared to data.frames, tibbles have some advantages with the printing behavior being a noticeable advantage. +All tibbles are data.frames, but not all data.frames are tibbles. Compared to data.frames, tibbles have some advantages, with the printing behavior being a noticeable advantage. When working with tidyverse style code, we recommend making all your datasets tibbles for ease of analysis. The {survey} package contains datasets related to the California Academic Performance Index, which measures student performance in schools with at least 100 students in California. We can access these datasets by loading the {survey} package and running `data(api)`. -Let's work with the `apistrat` dataset, a stratified simple random sample of three school types (elementary, middle, high) in each stratum. We can follow the process outlined in Section \@ref(setup-des-obj) to create the survey design object. The sample is stratified by the `stype` variable and the sampling weights are found in the `pw` variable. We can use this information to construct the design object, `dstrata`. +Let's work with the `apistrat` dataset, which is a stratified random sample, stratified by school type (`stype`) with three levels: `E` for elementary school, `M` for middle school, and `H` for high school. We first create the survey design object (see Chapter \@ref(c10-sample-designs-replicate-weights) for more information). The sample is stratified by the `stype` variable, and the sampling weights are found in the `pw` variable. We can use this information to construct the design object, `apistrat_des`. ```{r} #| label: setup-api-surveydata data(api) -dstrata <- apistrat %>% - as_survey_design(strata = stype, weights = pw) +apistrat_des <- apistrat %>% + as_survey_design(strata = stype, + weights = pw) ``` -When we check the class of `dstrata`, it is not a typical `data.frame`. Applying the `as_survey_design()` function transforms the data into a `tbl_svy`, a special class specifically for survey design objects. The {srvyr} package is designed to work with the `tbl_svy` class of objects. +When we check the class of `apistrat_des`, it is not a typical `data.frame`. Applying the `as_survey_design()` function transforms the data into a `tbl_svy`, a special class specifically for survey design objects. The {srvyr} package is designed to work with the `tbl_svy` class of objects. ```{r} #| label: setup-api-class -class(dstrata) +class(apistrat_des) ``` Let's look at how {dplyr} works with regular data frames. The example below calculates the mean and median for the `land_area_km2` variable in the `towny` dataset. @@ -376,41 +375,43 @@ towny %>% area_median = median(land_area_km2)) ``` -In the code below, we calculate the mean and median of the variable `api00` using `dstrata`. Note the similarity in the syntax. When we dig into the {srvyr} functions later, we will show that the outputs share a similar structure. Each group (if present) generates one row of output, but with additional columns. 
By default, the standard error of the statistic is also calculated in addition to the statistic itself. +In the code below, we calculate the mean and median of the variable `api00` using `apistrat_des`. Note the similarity in the syntax. However, the standard error of the statistic is also calculated in addition to the statistic itself. ```{r} #| label: setup-srvyr-examp -dstrata %>% +apistrat_des %>% summarize(api00_mean = survey_mean(api00), api00_med = survey_median(api00)) ``` -The functions in {srvyr} also play nicely with other tidyverse functions. For example, if we wanted to select columns with shared characteristics, we can use {tidyselect} functions such as `starts_with()`, `num_range()`, etc [@R-tidyselect]. In the examples below, we use a combination of `across()` and `starts_with()` to calculate the mean of variables starting with "population" in the `towny` data frame and those beginning with `api` in the `dstrata` survey object. +The functions in {srvyr} also play nicely with other tidyverse functions. For example, if we wanted to select columns with shared characteristics, we can use {tidyselect} functions such as `starts_with()`, `num_range()`, etc. [@R-tidyselect]. In the examples below, we use a combination of `across()` and `starts_with()` to calculate the mean of variables starting with "population" in the `towny` data frame and those beginning with `api` in the `apistrat_des` survey object. ```{r} #| label: setup-dplyr-select towny %>% - summarize(across(starts_with("population"), ~mean(.x, na.rm=TRUE))) + summarize(across(starts_with("population"), + ~mean(.x, na.rm=TRUE))) ``` ```{r} #| label: setup-srvyr-select -dstrata %>% - summarize(across(starts_with("api"), survey_mean)) +apistrat_des %>% + summarize(across(starts_with("api"), + survey_mean)) ``` We have the flexibility to use {dplyr} verbs such as `mutate()`, `filter()`, and `select()` on our survey design object. As mentioned in Section \@ref(survey-analysis-process), these steps should be performed on the survey design object. This ensures our survey design is properly considered in all our calculations. ```{r} #| label: setup-srvyr-mutate -dstrata_mod <- dstrata %>% +apistrat_des_mod <- apistrat_des %>% mutate(api_diff = api00 - api99) %>% filter(stype == "E") %>% select(stype, api99, api00, api_diff, api_students = api.stu) -dstrata_mod +apistrat_des_mod -dstrata +apistrat_des ``` Several functions in {srvyr} must be called within `srvyr::summarize()`, with the exception of `srvyr::survey_count()` and `srvyr::survey_tally()`. This is similar to how `dplyr::count()` and `dplyr::tally()` are not called within `dplyr::summarize()`. The `summarize()` function can be used in conjunction with the `group_by()` function or `by/.by` arguments, which applies the functions on a group-by-group basis to create grouped summaries. @@ -427,7 +428,7 @@ We use a similar setup to summarize data in {srvyr}: ```{r} #| label: setup-srvyr-groupby -dstrata %>% +apistrat_des %>% group_by(stype) %>% summarize(api00_mean = survey_mean(api00), api00_median = survey_median(api00)) @@ -448,13 +449,13 @@ However, the `.by` syntax is not yet available in {srvyr}: ```{r} #| label: setup-srvyr-by-alt #| error: true -dstrata %>% +apistrat_des %>% summarize(api00_mean = survey_mean(api00), api00_median = survey_median(api00), .by=stype) ``` -As mentioned above, {srvyr} functions are meant for `tbl_svy` objects. Attempting to perform data manipulation on non-`tbl_svy` objects, like the `towny` example shown below, will result in an error. 
Running the code will let you know what the issue is: `Survey context not set`. +As mentioned above, {srvyr} functions are meant for `tbl_svy` objects. Attempting to manipulate data on non-`tbl_svy` objects, like the `towny` example shown below, results in an error. Running the code lets us know what the issue is: `Survey context not set`. ```{r} #| label: setup-nsobj-error @@ -463,7 +464,7 @@ towny %>% summarize(area_mean = survey_mean(land_area_km2)) ``` -A few functions in {srvyr} have counterparts in {dplyr}, such as `srvyr::summarize()` and `srvyr::group_by()`. Unlike {srvyr}-specific verbs, {srvyr} recognizes these parallel functions if applied to a non-survey object. Instead of causing an error, the package will provide the equivalent output from {dplyr}: +A few functions in {srvyr} have counterparts in {dplyr}, such as `srvyr::summarize()` and `srvyr::group_by()`. Unlike {srvyr}-specific verbs, {srvyr} recognizes these parallel functions if applied to a non-survey object. Instead of causing an error, the package provides the equivalent output from {dplyr}: ```{r} #| label: setup-nsobj-noerr @@ -471,4 +472,4 @@ towny %>% srvyr::summarize(area_mean = mean(land_area_km2)) ``` -Because this book focuses on survey analysis, most of our pipes will stem from a survey object. When we load the {dplyr} and {srvyr} packages, the functions will automatically figure out the class of data and use the appropriate one from {dplyr} or {srvyr}. Therefore, we do not need to include the namespace for each function (e.g., `srvyr::summarize()`). +Because this book focuses on survey analysis, most of our pipes stem from a survey object. When we load the {dplyr} and {srvyr} packages, the functions automatically figure out the class of data and use the appropriate one from {dplyr} or {srvyr}. Therefore, we do not need to include the namespace for each function (e.g., `srvyr::summarize()`). diff --git a/05-descriptive-analysis.Rmd b/05-descriptive-analysis.Rmd index e5a455b1..0ac64549 100644 --- a/05-descriptive-analysis.Rmd +++ b/05-descriptive-analysis.Rmd @@ -25,7 +25,7 @@ library(srvyrexploR) library(broom) ``` -We will be using data from ANES and RECS described in Chapter \@ref(c04-getting-started). As a reminder, here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter \@ref(c04-getting-started) for more information). +We are using data from ANES and RECS described in Chapter \@ref(c04-getting-started). As a reminder, here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter \@ref(c04-getting-started) for more information.) ```{r} #| label: desc-anes-des @@ -63,33 +63,33 @@ recs_des <- recs_2020 %>% ## Introduction -Descriptive analyses, such as basic counts, cross-tabulations, or means, are one of the first steps in making sense of our survey results. By reviewing the findings, analysts can glean insight into the data, the underlying population, and any unique aspects of the data or population. For example, if only 10% of the survey respondents are male, it could indicate a unique population, a potential error or bias, an intentional survey sampling method, or other factors. 
Additionally, descriptive analyses allow analysts to provide summaries like means, proportions, or other measures to make estimates about the population. These analyses lay the groundwork for the next steps of running statistical tests or developing models. +Descriptive analyses, such as basic counts, cross-tabulations, or means, are among the first steps in making sense of our survey results. By reviewing the findings, we can glean insight into the data, the underlying population, and any unique aspects of the data or population. For example, if only 10% of the survey respondents are male, it could indicate a unique population, a potential error or bias, an intentional survey sampling method, or other factors. Additionally, descriptive analyses allow us to provide summaries like means, proportions, or other measures to make estimates about the population. These analyses lay the groundwork for the next steps of running statistical tests or developing models. -We will discuss many different types of descriptive analyses in this chapter. However, it is important to know what type of data we are working with and which statistics are appropriate. In survey data, we typically consider data as one of four main types: +We discuss many different types of descriptive analyses in this chapter. However, it is important to know what type of data we are working with and which statistics are appropriate. In survey data, we typically consider data as one of four main types: * Categorical/nominal data: variables with levels or descriptions that cannot be ordered, such as the region of the country (North, South, East, and West) * Ordinal data: variables that can be ordered, such as those from a Likert scale (strongly disagree, disagree, agree, and strongly agree) * Discrete data: variables that are counted or measured, such as number of children * Continuous data, variables that are measured and whose values can lie anywhere on an interval, such as income -This chapter will discuss how to analyze *measures of distribution* (e.g., cross-tabulations), *central tendency* (e.g., means), *relationship* (e.g., ratios), and *dispersion* (e.g., standard deviation) using functions from the {srvyr} package [@R-srvyr]. +This chapter discusses how to analyze *measures of distribution* (e.g., cross-tabulations), *central tendency* (e.g., means), *relationship* (e.g., ratios), and *dispersion* (e.g., standard deviation) using functions from the {srvyr} package [@R-srvyr]. -**Measures of distribution** describe how often an event or response occurs. These measures include counts and totals. We will cover the following functions: +**Measures of distribution** describe how often an event or response occurs. These measures include counts and totals. We cover the following functions: * Count of observations (`survey_count()` and `survey_tally()`) * Summation of variables (`survey_total()`) -**Measures of central tendency** find the central (or average) responses. These measures include means and medians. We will cover the following functions: +**Measures of central tendency** find the central (or average) responses. These measures include means and medians. We cover the following functions: * Means and proportions (`survey_mean()` and `survey_prop()`) * Quantiles and medians (`survey_quantile()` and `survey_median()`) -**Measures of relationship** describe how variables relate to each other. These measures include correlations and ratios. 
We will cover the following functions: +**Measures of relationship** describe how variables relate to each other. These measures include correlations and ratios. We cover the following functions: * Correlations (`survey_corr()`) * Ratios (`survey_ratio()`) -**Measures of dispersion** describe how data spreads around the central tendency for continuous variables. These measures include standard deviations and variances. We will cover the following functions: +**Measures of dispersion** describe how data spread around the central tendency for continuous variables. These measures include standard deviations and variances. We cover the following functions: * Variances and standard deviations (`survey_var()` and `survey_sd()`) @@ -100,15 +100,15 @@ To incorporate each of these survey functions, recall the general process for su 3. Specify domains of analysis using `srvyr::group_by()`, if needed. 4. Analyze the data with survey-specific functions. -This chapter will walk through how to apply the survey functions in Step 4. Note that unless otherwise specified, our estimates will be weighted as a result of setting up the survey design object. +This chapter walks through how to apply the survey functions in Step 4. Note that unless otherwise specified, our estimates are weighted as a result of setting up the survey design object. -To look at the data by different subgroups, we can choose to filter and/or group the data. It is very important that we filter and group the data only *after* creating the design object. This ensures that the results accurately reflect the survey design. If we filter or group data before creating the survey design object, the data for those cases is not included in the survey design information and estimations of the variance, leading to inaccurate results. +To look at the data by different subgroups, we can choose to filter and/or group the data. It is very important that we filter and group the data only *after* creating the design object. This ensures that the results accurately reflect the survey design. If we filter or group data before creating the survey design object, the data for those cases are not included in the survey design information and estimations of the variance, leading to inaccurate results. -For the sake of simplicity, we've removed cases with missing values in the examples below. If you want a more detailed explanation on how to handle missing data, please refer to Chapter \@ref(c11-missing-data). +For the sake of simplicity, we've removed cases with missing values in the examples below. For a more detailed explanation of how to handle missing data, please refer to Chapter \@ref(c11-missing-data). ## Counts and cross-tabulations -Using `survey_count()` and `survey_tally()`, we can calculate the estimated population counts for a given variable or combination of variables. These summaries, often referred to as cross-tabulations or crosstabs, are applied to categorical data. They help in estimating counts of the population size for different groups based on the survey data. +Using `survey_count()` and `survey_tally()`, we can calculate the estimated population counts for a given variable or combination of variables. These summaries, often referred to as cross-tabulations or cross-tabs, are applied to categorical data. They help in estimating counts of the population size for different groups based on the survey data. 
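
To see what the weights contribute before getting into the syntax, here is a minimal sketch contrasting an unweighted count of the responding housing units with a design-based estimate of the population, using the `recs_2020` data and the `recs_des` design object created earlier in this chapter (output not shown):

```r
# Unweighted: counts the responding housing units in the sample
recs_2020 %>%
  count(Region)

# Design-based: estimates the number of housing units in the population,
# with a standard error for each estimate
recs_des %>%
  survey_count(Region)
```

The unweighted version describes only the sample, while `survey_count()` uses the weights and design information to estimate the population total and its standard error.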
### Syntax {#desc-count-syntax} @@ -136,7 +136,7 @@ The arguments are: * `.drop`: whether to drop empty groups * `vartype`: type(s) of variation estimate to calculate including any of `c("se", "ci", "var", "cv")`, defaults to `se` (standard error) (see \@ref(desc-count-syntax) for more information) -To generate a count or crosstabs by different variables, we include them in the (`...`) argument. This argument can take any number of variables and will break down the counts by all combinations of the provided variables. This is similar to `dplyr::count()`. To obtain an estimate of the overall population, we can exclude any variables from the (`...`) argument or use the `survey_tally()` function. While the `survey_tally()` function has a similar syntax to the `survey_count()` function, it does not include the (`...`) or the `.drop` arguments: +To generate a count or cross-tabs by different variables, we include them in the (`...`) argument. This argument can take any number of variables and breaks down the counts by all combinations of the provided variables. This is similar to `dplyr::count()`. To obtain an estimate of the overall population, we can exclude any variables from the (`...`) argument or use the `survey_tally()` function. While the `survey_tally()` function has a similar syntax to the `survey_count()` function, it does not include the (`...`) or the `.drop` arguments: ```r survey_tally( @@ -155,7 +155,7 @@ Both functions include the `vartype` argument with four different values: * Output has a column with the variable name specified in the `name` argument with a suffix of "_se" * `ci`: confidence interval * The lower and upper limits of a confidence interval - * Output has a column with the variable name specified in the `name` argument with a suffix of "_low" and "_upp" + * Output has two columns with the variable name specified in the `name` argument with a suffix of "_low" and "_upp" * By default, this is a 95% confidence interval but can be changed by using the argument level and specifying a number between 0 and 1. For example, `level=0.8` would produce a 80% confidence interval. * `var`: variance * The estimated variance of the estimate @@ -167,13 +167,13 @@ Both functions include the `vartype` argument with four different values: The confidence intervals are always calculated using a symmetric t-distribution based method, given by the formula: $$ \text{estimate} \pm t^*_{df}\times SE$$ -where $t^*_{df}$ is the critical value from a t-distribution based on the confidence level and the degrees of freedom. By default, the degrees of freedom are based on the design or number of replicates, but they can be specified using the `df` argument. For survey design objects, the degrees of freedom are calculated as the number of PSUs minus the number of strata. For replicate-based objects, the degrees of freedom are calculated as one less than the rank of the matrix of replicate weight, where the number of replicates is typically the rank. Note that specifying `df = Inf` is equivalent to using a normal (z-based) confidence interval -- this is the default in {survey}. These variability types are the same for most of the survey functions, and we will provide examples using different variability types throughout this chapter. +where $t^*_{df}$ is the critical value from a t-distribution based on the confidence level and the degrees of freedom. By default, the degrees of freedom are based on the design or number of replicates, but they can be specified using the `df` argument. 
For survey design objects, the degrees of freedom are calculated as the number of primary sampling units (PSUs or clusters) minus the number of strata (see Chapter \@ref(c10-sample-designs-replicate-weights) for more information on PSUs, strata, and sample designs.) For replicate-based objects, the degrees of freedom are calculated as one less than the rank of the matrix of replicate weights, where the number of replicates is typically the rank. Note that specifying `df = Inf` is equivalent to using a normal (z-based) confidence interval -- this is the default in {survey}. These variability types are the same for most of the survey functions, and we provide examples using different variability types throughout this chapter.

### Examples

#### Example 1: Estimated population count {.unnumbered}

-If we want to obtain the estimated number of households in the U.S. (the target population) using the Residential Energy Consumption Survey (RECS) data, we can use `survey_count()`. If we do not specify any variables in the `survey_count()` function, it will output the estimated population count (`n`) and its corresponding standard error (`n_se`).
+If we want to obtain the estimated number of households in the U.S. (the population of interest) using the Residential Energy Consumption Survey (RECS) data, we can use `survey_count()`. If we do not specify any variables in the `survey_count()` function, it outputs the estimated population count (`n`) and its corresponding standard error (`n_se`.)

```{r}
#| label: desc-count-overall
@@ -200,7 +200,7 @@ recs_des %>%
   survey_tally()
```

-#### Example 2: Estimated counts by subgroups (crosstabs) {.unnumbered}
+#### Example 2: Estimated counts by subgroups (cross-tabs) {.unnumbered}

To calculate the estimated number of observations for specific subgroups, such as Region and Division, we can include the variables of interest in the `survey_count()` function. In the example below, we calculate the estimated number of housing units by region and division. The argument `name =` in `survey_count()` allows us to change the name of the count variable in the output from the default `n` to `N`.

@@ -223,9 +223,9 @@ recs_des %>%
   ))
```

-When we run the crosstab, we see there are an estimated `r .est_pop_div %>% filter(Division=="New England") %>% pull(N)` housing units in the New England Division.
+When we run the cross-tab, we see there are an estimated `r .est_pop_div %>% filter(Division=="New England") %>% pull(N)` housing units in the New England Division. 
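
As a brief sketch of the `vartype` argument described in Section \@ref(desc-count-syntax), we could request a confidence interval for this cross-tab in place of the default standard error; the output would then contain `N_low` and `N_upp` columns (not run):

```r
recs_des %>%
  survey_count(Region, Division, name = "N", vartype = "ci")
```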
-The code will result in an error if we try to use the `survey_count()` syntax with `survey_tally()`: +The code results in an error if we try to use the `survey_count()` syntax with `survey_tally()`: ```{r} #| label: desc-tally-group-bad @@ -234,7 +234,7 @@ recs_des %>% survey_tally(Region, Division, name = "N") ``` -Use a `group_by()` function prior to using `survey_tally()` to successfully run the crosstab: +Use a `group_by()` function prior to using `survey_tally()` to successfully run the cross-tab: ```{r} #| label: desc-tally-group-good @@ -275,7 +275,7 @@ The arguments are: #### Example 1: Estimated population count {.unnumbered} -To calculate a population count estimate with `survey_total()`, we leave the argument `x` empty as shown in the example below: +To calculate a population count estimate with `survey_total()`, we leave the argument `x` empty, as shown in the example below: ```{r} #| label: desc-tot-nox @@ -283,7 +283,7 @@ recs_des %>% summarize(Tot = survey_total()) ``` -The estimated number of households in the U.S. is `r scales::comma(recs_des %>% summarize(Tot = survey_total()) %>% pull(Tot))`. Note that this result obtained from `recs_des %>% summarize(survey_total())` is equivalent to the ones from the `survey_count()` and `survey_tally()` functions. However, the `survey_total()` function is called within `summarize`, whereas `survey_count()` and `survey_tally()` are not. +The estimated number of households in the U.S. is `r scales::comma(recs_des %>% summarize(Tot = survey_total()) %>% pull(Tot))`. Note that this result obtained from `survey_total()` is equivalent to the ones from the `survey_count()` and `survey_tally()` functions. However, the `survey_total()` function is called within `summarize()`, whereas `survey_count()` and `survey_tally()` are not. #### Example 2: Overall summation of continuous variables {.unnumbered} @@ -336,13 +336,11 @@ recs_des %>% ))) ``` -The survey results estimate that households in the Northeast spent `r .elbil_reg %>% filter(Region=="Northeast") %>% pull(elec_bill)` with a confidence interval of (`r .elbil_reg %>% filter(Region=="Northeast") %>% pull(elec_bill_low)`, `r .elbil_reg %>% filter(Region=="Northeast") %>% pull(elec_bill_upp)`) on electricity in 2020 while households in the South spent an estimated `r .elbil_reg %>% filter(Region=="South") %>% pull(elec_bill)` with a confidence interval of (`r .elbil_reg %>% filter(Region=="Northeast") %>% pull(elec_bill_low)`, `r .elbil_reg %>% filter(Region=="South") %>% pull(elec_bill_upp)`). - -As we calculate these numbers, we may notice that the confidence interval of the South is larger than those of other regions. This implies that we have less certainty about the true value of electricity spending in the South. A larger confidence interval could be due to a variety of factors, such as a wider range of electricity spending in the South. We could try to analyze smaller regions within the South to identify areas that are contributing to more variability. Descriptive analyses serve as a valuable starting point for more in-depth exploration and analysis. 
+The survey results estimate that households in the Northeast spent `r .elbil_reg %>% filter(Region=="Northeast") %>% pull(elec_bill)` with a confidence interval of (`r .elbil_reg %>% filter(Region=="Northeast") %>% pull(elec_bill_low)`, `r .elbil_reg %>% filter(Region=="Northeast") %>% pull(elec_bill_upp)`) on electricity in 2020, while households in the South spent an estimated `r .elbil_reg %>% filter(Region=="South") %>% pull(elec_bill)` with a confidence interval of (`r .elbil_reg %>% filter(Region=="South") %>% pull(elec_bill_low)`, `r .elbil_reg %>% filter(Region=="South") %>% pull(elec_bill_upp)`.) ## Means and proportions {#desc-meanprop} -Means and proportions form the backbone of many research studies. These estimates are often the first things we look for when reviewing research on a given topic. The `survey_mean()` and `survey_prop()` functions calculate means and proportions while taking into account the survey design elements. The `survey_mean()` function should be used on continuous variables of survey data, while the `survey_prop()` function should be used on categorical variables. These topics are grouped together because a proportion is a mean of a logical (Boolean) variable. +Means and proportions form the foundation of many research studies. These estimates are often the first things we look for when reviewing research on a given topic. The `survey_mean()` and `survey_prop()` functions calculate means and proportions while taking into account the survey design elements. The `survey_mean()` function should be used on continuous variables of survey data, while the `survey_prop()` function should be used on categorical variables. ### Syntax {#desc-meanprop-syntax} @@ -365,7 +363,8 @@ survey_prop( vartype = c("se", "ci", "var", "cv"), level = 0.95, proportion = TRUE, - prop_method = c("logit", "likelihood", "asin", "beta", "mean", "xlogit"), + prop_method = + c("logit", "likelihood", "asin", "beta", "mean", "xlogit"), deff = FALSE, df = NULL ) @@ -382,18 +381,18 @@ Both functions have the following arguments and defaults: There are two main differences in the syntax. The `survey_mean()` function includes the first argument `x`, representing the variable or expression on which the mean should be calculated. The `survey_prop()` does not have an argument to include the variables directly. Instead, prior to `summarize()`, we must use the `group_by()` function to specify the variables of interest for `survey_prop()`. For `survey_mean()`, including a `group_by()` function allows us to obtain the means by different groups. -The other main difference is with the `proportion` argument. The `survey_mean()` function can be used to calculate both means and proportions. Its `proportion` argument defaults to `FALSE`, indicating it is used for calculating means. If we wish to calculate a proportion using `survey_mean()`, we will need to set the `proportion` argument to `TRUE`. In the `survey_prop()` function, the `proportion` argument defaults to `TRUE` because the function is specifically designed for calculating proportions. +The other main difference is with the `proportion` argument. The `survey_mean()` function can be used to calculate both means and proportions. Its `proportion` argument defaults to `FALSE`, indicating it is used for calculating means. If we wish to calculate a proportion using `survey_mean()`, we need to set the `proportion` argument to `TRUE`. 
In the `survey_prop()` function, the `proportion` argument defaults to `TRUE` because the function is specifically designed for calculating proportions. -In section \@ref(desc-count-syntax), we provide an overview of different variability types. The confidence interval used for most measures, such as means and counts, is referred to as a Wald-type interval. However, for proportions, a Wald-type interval with a symmetric t-based confidence interval may not provide accurate coverage, especially when dealing with small sample sizes or proportions "near" 0 or 1. We can use other methods to calculate confidence intervals, which we specify using the `prop_method` option in `survey_prop()`. The options include: +In Section \@ref(desc-count-syntax), we provide an overview of different variability types. The confidence interval used for most measures, such as means and counts, is referred to as a Wald-type interval. However, for proportions, a Wald-type interval with a symmetric t-based confidence interval may not provide accurate coverage, especially when dealing with small sample sizes or proportions "near" 0 or 1. We can use other methods to calculate confidence intervals, which we specify using the `prop_method` option in `survey_prop()`. The options include: * `logit`: fits a logistic regression model and computes a Wald-type interval on the log-odds scale, which is then transformed to the probability scale. This is the default method. * `likelihood`: uses the (Rao-Scott) scaled chi-squared distribution for the log-likelihood from a binomial distribution. * `asin`: uses the variance-stabilizing transformation for the binomial distribution, the arcsine square root, and then back-transforms the interval to the probability scale * `beta`: uses the incomplete beta function with an effective sample size based on the estimated variance of the proportion. * `mean`: the Wald-type interval ($\pm t_{df}^*\times SE$) - * `xlogit`: uses a logit transformation of the proportion, calculates a Wald-type interval, and then back-transforms to the probability scale. This method is the same as those used in SUDAAN and SPSS. + * `xlogit`: uses a logit transformation of the proportion, calculates a Wald-type interval, and then back-transforms to the probability scale. This method is the same as those used by default in SUDAAN and SPSS. -Each option will yield slightly different confidence interval bounds when dealing with proportions. Please note that when working with `survey_mean()`, we do not need to specify a method unless the `proportion` argument is `TRUE`. If `proportion` is `FALSE`, it calculates a symmetric `mean` type of confidence interval. +Each option yields slightly different confidence interval bounds when dealing with proportions. Please note that when working with `survey_mean()`, we do not need to specify a method unless the `proportion` argument is `TRUE`. If `proportion` is `FALSE`, it calculates a symmetric `mean` type of confidence interval. ### Examples @@ -403,6 +402,7 @@ If we are interested in obtaining the proportion of people in each region in the ```{r} #| label: desc-p-ex1 +#| message: false recs_des %>% group_by(Region) %>% summarize(p = survey_prop()) @@ -419,7 +419,7 @@ recs_des %>% `r .preg %>% filter(Region=="Northeast") %>% pull(p) %>% signif(3)`% of the households are in the Northeast, `r .preg %>% filter(Region=="Midwest") %>% pull(p) %>% signif(3)`% in the Midwest, and so on. Note that the proportions in column `p` add up to one. 
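
If we also want a measure of uncertainty for these proportions, a minimal sketch of requesting confidence intervals uses the `vartype` argument together with one of the `prop_method` options described above (here, the beta method instead of the default logit method). The point estimates match those above, with `p_low` and `p_upp` columns added (not run):

```r
recs_des %>%
  group_by(Region) %>%
  summarize(p = survey_prop(vartype = "ci",
                            prop_method = "beta"))
```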
-The `survey_prop()` function is essentially the same as using `survey_mean()` with a categorical variable and without specifying a numeric variable in the `x` argument. The following code will give us the same results as above: +The `survey_prop()` function is essentially the same as using `survey_mean()` with a categorical variable and without specifying a numeric variable in the `x` argument. The following code gives us the same results as above: ```{r} #| label: desc-p-ex2 @@ -430,7 +430,7 @@ recs_des %>% #### Example 2: Conditional proportions {.unnumbered} -We can also obtain proportions by more than one variable. In the following example, we look at the proportion of housing units by Region and whether air conditioning is used (`ACUsed`).^[Question text: Is any air conditioning equipment used in your home?] +We can also obtain proportions by more than one variable. In the following example, we look at the proportion of housing units by Region and whether air conditioning (A/C) is used (`ACUsed`.)^[Question text: Is any air conditioning equipment used in your home?] ```{r} #| label: desc-pmulti-ex1 @@ -439,7 +439,7 @@ recs_des %>% summarize(p = survey_prop()) ``` -When specifying multiple variables, the proportions are conditional. In the results above, notice that the proportions sum to 1 within each region. This can be interpreted as the proportion of housing units with air conditioning *within* each region. For example, in the Northeast region, approximately `r scales::percent(recs_des %>% group_by(Region, ACUsed) %>% summarize(p = survey_prop()) %>% filter(Region == "Northeast", ACUsed == "FALSE") %>% pull(p), accuracy = 0.1)` of housing units don't have air conditioning, while around `r scales::percent(recs_des %>% group_by(Region, ACUsed) %>% summarize(p = survey_prop()) %>% filter(Region == "Northeast", ACUsed == "TRUE") %>% pull(p), accuracy = 0.1)` have air conditioning. +When specifying multiple variables, the proportions are conditional. In the results above, notice that the proportions sum to 1 within each region. This can be interpreted as the proportion of housing units with A/C *within* each region. For example, in the Northeast region, approximately `r scales::percent(recs_des %>% group_by(Region, ACUsed) %>% summarize(p = survey_prop()) %>% filter(Region == "Northeast", ACUsed == "FALSE") %>% pull(p), accuracy = 0.1)` of housing units don't have A/C, while around `r scales::percent(recs_des %>% group_by(Region, ACUsed) %>% summarize(p = survey_prop()) %>% filter(Region == "Northeast", ACUsed == "TRUE") %>% pull(p), accuracy = 0.1)` have A/C. #### Example 3: Joint proportions {.unnumbered} @@ -452,7 +452,7 @@ recs_des %>% summarize(p = survey_prop()) ``` -As noted earlier, we can use both the `survey_prop()` and `survey_mean()` functions, and they will produce the same results. +In this case, all proportions sum to 1, not just within regions. This means that `r scales::percent(recs_des %>% group_by(interact(Region, ACUsed)) %>% summarize(p = survey_prop()) %>% filter(Region == "Northeast", ACUsed == "TRUE") %>% pull(p), accuracy = 0.1)` of the population lives in the Northeast and has A/C. As noted earlier, we can use both the `survey_prop()` and `survey_mean()` functions, and they produce the same results. 
#### Example 4: Overall mean {.unnumbered} @@ -506,7 +506,7 @@ Households from the West spent approximately `r .elbill_mn_reg %>% filter(Region To better understand the distribution of a continuous variable like income, we can calculate quantiles at specific points. For example, computing estimates of the quartiles (25%, 50%, 75%) helps us understand how income is spread across the population. We use the `survey_quantile()` function to calculate quantiles in survey data. -Medians are useful for finding the midpoint of a continuous distribution when the data is skewed, as medians are less affected by outliers than means. The median is the same as the 50th percentile, meaning the value where 50% of the data is higher and 50% is lower. Because medians are a special, common case of quantiles, we have a dedicated function called `survey_median()` for calculating the median in survey data. Alternatively, we can use the `survey_quantile()` function with the `quantiles` argument set to `0.5` to achieve the same result. +Medians are useful for finding the midpoint of a continuous distribution when the data are skewed, as medians are less affected by outliers compared to means. The median is the same as the 50th percentile, meaning the value where 50% of the data are higher and 50% are lower. Because medians are a special, common case of quantiles, we have a dedicated function called `survey_median()` for calculating the median in survey data. Alternatively, we can use the `survey_quantile()` function with the `quantiles` argument set to `0.5` to achieve the same result. ### Syntax @@ -519,7 +519,8 @@ survey_quantile( na.rm = FALSE, vartype = c("se", "ci", "var", "cv"), level = 0.95, - interval_type = c("mean", "beta", "xlogit", "asin", "score", "quantile"), + interval_type = + c("mean", "beta", "xlogit", "asin", "score", "quantile"), qrule = c("math", "school", "shahvaish", "hf1", "hf2", "hf3", "hf4", "hf5", "hf6", "hf7", "hf8", "hf9"), df = NULL @@ -530,7 +531,8 @@ survey_median( na.rm = FALSE, vartype = c("se", "ci", "var", "cv"), level = 0.95, - interval_type = c("mean", "beta", "xlogit", "asin", "score", "quantile"), + interval_type = + c("mean", "beta", "xlogit", "asin", "score", "quantile"), qrule = c("math", "school", "shahvaish", "hf1", "hf2", "hf3", "hf4", "hf5", "hf6", "hf7", "hf8", "hf9"), df = NULL @@ -544,17 +546,17 @@ The arguments available in both functions are: * `vartype`: type(s) of variation estimate to calculate, defaults to `se` (standard error) * `level`: a number or a vector indicating the confidence level, defaults to 0.95 * `interval_type`: method for calculating a confidence interval - * `qrule`: rule for defining quantiles. The default is the lower end of the quantile interval ("math"). The midpoint of the quantile interval is the "school" rule. "hf1" to "hf9" are weighted analogs to type=1 to 9 in `quantile()`. "shahvaish" corresponds to a rule proposed by @shahvaish. See `vignette("qrule", package="survey")` for more information. + * `qrule`: rule for defining quantiles. The default is the lower end of the quantile interval ("math".) The midpoint of the quantile interval is the "school" rule. "hf1" to "hf9" are weighted analogs to type=1 to 9 in `quantile()`. "shahvaish" corresponds to a rule proposed by @shahvaish. See `vignette("qrule", package="survey")` for more information. 
* `df`: (for `vartype = 'ci'`), a numeric value indicating degrees of freedom for the t-distribution The only difference between `survey_quantile()` and `survey_median()` is the inclusion of the `quantiles` argument in the `survey_quantile()` function. This argument takes a vector with values between 0 and 1 to indicate which quantiles to calculate. For example, if we wanted the quartiles of a variable, we would provide `quantiles = c(0.25, 0.5, 0.75)`. While we can specify quantiles of 0 and 1, which represent the minimum and maximum, this is not recommended. It only returns the minimum and maximum of the respondents and cannot be extrapolated to the population as there is no valid definition of standard error. -In Section \@ref(desc-count-syntax), we provide an overview of the different variability types. The interval used in confidence intervals for most measures, such as means and counts, is referred to as a Wald-type interval. However, this is not always the most accurate interval for quantiles. Similar to confidence intervals for proportions, quantiles have various interval types including asin, beta, mean, and xlogit (see Section \@ref(desc-meanprop-syntax)). Quantiles also have two more methods available: +In Section \@ref(desc-count-syntax), we provide an overview of the different variability types. The interval used in confidence intervals for most measures, such as means and counts, is referred to as a Wald-type interval. However, this is not always the most accurate interval for quantiles. Similar to confidence intervals for proportions, quantiles have various interval types, including asin, beta, mean, and xlogit (see Section \@ref(desc-meanprop-syntax).) Quantiles also have two more methods available: * `score`: the Francisco and Fuller confidence interval based on inverting a score test (only available for design-based survey objects and not replicate-based objects) * `quantile`: based on the replicates of the quantile. This is not valid for jackknife-type replicates but is available for bootstrap and BRR replicates. -One note with the `score` method is that when there are numerous ties in the data, this method may produce confidence intervals that do not contain the estimate. When dealing with a high propensity for ties (e.g., many respondents have the same age), it is recommended to use another method. SUDAAN, for example, uses the `score` method but adds noise to the values to prevent issues. The documentation in the {survey} package indicates in general, the `score` method may have poorer performance compared to the beta and logit intervals [@lumley2010complex]. +One note with the `score` method is that when there are numerous ties in the data, this method may produce confidence intervals that do not contain the estimate. When dealing with a high propensity for ties (e.g., many respondents are the same age), it is recommended to use another method. SUDAAN, for example, uses the `score` method but adds noise to the values to prevent issues. The documentation in the {survey} package indicates, in general, that the `score` method may have poorer performance compared to the beta and logit intervals [@lumley2010complex]. ### Examples @@ -564,6 +566,15 @@ Quantiles provide insights into the distribution of a variable. 
Let's look into ```{r} #| label: desc-quantile-oa +#| eval: FALSE +recs_des %>% + summarize(elec_bill = survey_quantile(DOLLAREL, + quantiles = c(0.25, .5, 0.75))) +``` + +```{r} +#| label: desc-quantile-oa-print +#| echo: FALSE recs_des %>% summarize(elec_bill = survey_quantile(DOLLAREL, quantiles = c(0.25, .5, 0.75))) %>% @@ -584,7 +595,7 @@ recs_des %>% ) ``` -The output above shows the values for the three quartiles and their respective standard errors: the 25th percentile is `r .elbill_quant %>% pull(elec_bill_q25)` with a standard error of `r .elbill_quant %>% pull(elec_bill_q25_se)`, the 50th percentile (median) is `r .elbill_quant %>% pull(elec_bill_q50)` with a standard error of `r .elbill_quant %>% pull(elec_bill_q50_se)`, and the 75th percentile is `r .elbill_quant %>% pull(elec_bill_q75)` with a standard error of `r .elbill_quant %>% pull(elec_bill_q75_se)`. +The output above shows the values for the three quartiles of electric bill costs and their respective standard errors: the 25th percentile is `r .elbill_quant %>% pull(elec_bill_q25)` with a standard error of `r .elbill_quant %>% pull(elec_bill_q25_se)`, the 50th percentile (median) is `r .elbill_quant %>% pull(elec_bill_q50)` with a standard error of `r .elbill_quant %>% pull(elec_bill_q50_se)`, and the 75th percentile is `r .elbill_quant %>% pull(elec_bill_q75)` with a standard error of `r .elbill_quant %>% pull(elec_bill_q75_se)`. #### Example 2: Quartiles by subgroup {.unnumbered} @@ -592,6 +603,16 @@ We can estimate the quantiles of electric bills by region by using the `group_by ```{r} #| label: desc-quantile-reg +#| eval: false +recs_des %>% + group_by(Region) %>% + summarize(elec_bill = survey_quantile(DOLLAREL, + quantiles = c(0.25, .5, 0.75))) +``` + +```{r} +#| label: desc-quantile-reg-print +#| echo: false recs_des %>% group_by(Region) %>% summarize(elec_bill = survey_quantile(DOLLAREL, @@ -615,11 +636,11 @@ recs_des %>% ``` -The 25th percentile for the Northeast region is `r .elbill_quant_gp %>% filter(Region=="Northeast") %>% pull(elec_bill_q25)` while it is `r .elbill_quant_gp %>% filter(Region=="South") %>% pull(elec_bill_q25)` for the South. +The 25th percentile for the Northeast region is `r .elbill_quant_gp %>% filter(Region=="Northeast") %>% pull(elec_bill_q25)`, while it is `r .elbill_quant_gp %>% filter(Region=="South") %>% pull(elec_bill_q25)` for the South. #### Example 3: Minimum and maximum {.unnumbered} -As mentioned in the syntax section, we can specify quantiles of `0` (minimum) and `1` (maximum) and R will calculate these values. However, these are only the minimum and maximum values in the data, and there is not enough information to determine their standard errors: +As mentioned in the syntax section, we can specify quantiles of `0` (minimum) and `1` (maximum), and R calculates these values. However, these are only the minimum and maximum values in the data, and there is not enough information to determine their standard errors: ```{r} #| label: desc-quantile-minmax @@ -641,7 +662,7 @@ recs_des %>% ) ``` -The minimum cost of electricity in the dataset is `r .elbill_minmax %>% pull(elec_bill_q00)` while the maximum is `r .elbill_minmax %>% pull(elec_bill_q100)`, but the standard error is shown as `NaN` and 0, respectively. Notice that the minimum cost is a negative number which may be surprising but some housing units with solar power sell their energy back to the grid and make money which is recorded as a negative expenditure. 
+The minimum cost of electricity in the dataset is `r .elbill_minmax %>% pull(elec_bill_q00)` while the maximum is `r .elbill_minmax %>% pull(elec_bill_q100)`, but the standard error is shown as `NaN` and 0, respectively. Notice that the minimum cost is a negative number. This may be surprising, but some housing units with solar power sell their energy back to the grid and earn money, which is recorded as a negative expenditure. #### Example 4: Overall median {.unnumbered} @@ -663,7 +684,7 @@ recs_des %>% ))) ``` -Nationally, the median household spent `r pull(.elbill_med, elec_bill)` in 2020. This is the same result as we obtained using the `survey_quantile()` function. Interestingly, the average electric bill for households that we calculated in section \@ref(desc-meanprop) is `r pull(.elbill_mn, elec_bill)`, but the estimated median electric bill is `r pull(.elbill_med, elec_bill)` indicating the distribution is likely right-skewed. +Nationally, the median household spent `r pull(.elbill_med, elec_bill)` in 2020. This is the same result as we obtained using the `survey_quantile()` function. Interestingly, the average electric bill for households that we calculated in Section \@ref(desc-meanprop) is `r pull(.elbill_mn, elec_bill)`, but the estimated median electric bill is `r pull(.elbill_med, elec_bill)`, indicating the distribution is likely right-skewed. #### Example 5: Medians by subgroup {.unnumbered} @@ -687,7 +708,7 @@ recs_des %>% ))) ``` -Households from the Northeast spent `r .elbill_med_reg %>% filter(Region=="Northeast") %>% pull(elec_bill)` on electricity, and in the South, they spent an average of `r .elbill_med_reg %>% filter(Region=="South") %>% pull(elec_bill)`. +We estimate that households in the Northeast spent a median of `r .elbill_med_reg %>% filter(Region=="Northeast") %>% pull(elec_bill)` on electricity, and in the South, they spent a median of `r .elbill_med_reg %>% filter(Region=="South") %>% pull(elec_bill)`. ## Ratios @@ -735,6 +756,19 @@ Suppose we wanted to find the ratio of dollars spent on liquid propane per unit ```{r} #| label: desc-ratio-1 +#| eval: false +recs_des %>% + summarize( + DOLLARLP_Tot = survey_total(DOLLARLP, vartype = NULL), + BTULP_Tot = survey_total(BTULP, vartype = NULL), + DOL_BTU_Rat = survey_ratio(DOLLARLP, BTULP), + DOL_BTU_Avg = survey_mean(DOLLARLP / BTULP, na.rm = TRUE) + ) +``` + +```{r} +#| label: desc-ratio-1-print +#| echo: false recs_des %>% summarize( DOLLARLP_Tot = survey_total(DOLLARLP, vartype = NULL), @@ -771,7 +805,7 @@ rat <- pull(.rat_out, DOL_BTU_Rat) %>% signif(3) avg <- pull(.rat_out, DOL_BTU_Avg) %>% signif(3) ``` -The ratio of the total spent on liquid propane to the total consumption was `r rat`, but the average rate was `r avg`. With a bit of calculation, we can show that the ratio is the ratio of the totals `DOLLARLP_Tot`/`BTULP_Tot`=`r num`/`r den`=`r rat`. Although the ratio can be calculated manually in this manner, the standard error requires the use of the `survey_ratio()` function. The average can be interpreted as the average rate paid by a household. +The ratio of the total spent on liquid propane to the total consumption was `r rat`, but the average rate was `r avg`. With a bit of calculation, we can show that the ratio is the ratio of the totals `DOLLARLP_Tot`/`BTULP_Tot`=`r num`/`r den`=`r rat`. Although the estimated ratio can be calculated manually in this manner, the standard error requires the use of the `survey_ratio()` function. 
The average can be interpreted as the average rate paid by a household. #### Example 2: Ratios by subgroup {.unnumbered} @@ -785,11 +819,11 @@ recs_des %>% arrange(DOL_BTU_Rat) ``` -Although not a formal statistical test, it appears that the cost ratios for liquid propane are the lowest in the Midwest (`r round(recs_des %>% group_by(Region) %>% summarize(DOL_BTU_Rat = survey_ratio(DOLLARLP, BTULP)) %>% filter(Region == "Midwest") %>% pull(DOL_BTU_Rat), 4)`). +Although not a formal statistical test, it appears that the cost ratios for liquid propane are the lowest in the Midwest (`r round(recs_des %>% group_by(Region) %>% summarize(DOL_BTU_Rat = survey_ratio(DOLLARLP, BTULP)) %>% filter(Region == "Midwest") %>% pull(DOL_BTU_Rat), 4)`.) ## Correlations -The correlation is a measure of the linear relationship between two continuous variables, which ranges between -1 and 1. The most commonly used method is Pearson's correlation (referred to as correlation henceforth). A sample correlation for a simple random sample is calculated as follows: +The correlation is a measure of the linear relationship between two continuous variables, which ranges between -1 and 1. The most commonly used method is Pearson's correlation (referred to as correlation henceforth.) A sample correlation for a simple random sample is calculated as follows: $$\frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum (x_i-\bar{x})^2} \sqrt{\sum(y_i-\bar{y})^2}} $$ @@ -823,7 +857,7 @@ The arguments are: #### Example 1: Overall correlation {.unnumbered} -We can calculate the correlation between total square footage of homes (`TOTSQFT_EN`)^[Question text: What is the square footage of your home?] and electricity consumption (`BTUEL`)^[BTUEL is derived from the supplier side component of the survey where `BTUEL` represents the electricity consumption in British thermal units (Btus) converted from kilowatt hours (kWh) in a year]. +We can calculate the correlation between the total square footage of homes (`TOTSQFT_EN`)^[Question text: What is the square footage of your home?] and electricity consumption (`BTUEL`.)^[BTUEL is derived from the supplier side component of the survey where `BTUEL` represents the electricity consumption in British thermal units (Btus) converted from kilowatt hours (kWh) in a year.] ```{r} #| label: desc-corr-1 @@ -832,11 +866,11 @@ recs_des %>% summarize(SQFT_Elec_Corr = survey_corr(TOTSQFT_EN, BTUEL)) ``` -The correlation between total square footage of homes and electricity consumption is `r recs_des %>% summarize(SQFT_Elec_Corr = survey_corr(TOTSQFT_EN, BTUEL)) %>% pull(SQFT_Elec_Corr) %>% round(3)`, indicating a moderate positive relationship. +The correlation between the total square footage of homes and electricity consumption is `r recs_des %>% summarize(SQFT_Elec_Corr = survey_corr(TOTSQFT_EN, BTUEL)) %>% pull(SQFT_Elec_Corr) %>% round(3)`, indicating a moderate positive relationship. #### Example 2: Correlations by subgroup {.unnumbered} -Like with other statistics, we can explore the correlation between total square footage and electricity consumption based on subgroups, such as whether air conditioning is used (`ACUsed`). +We can explore the correlation between total square footage and electricity consumption based on subgroups, such as whether air conditioning (A/C) is used (`ACUsed`.) 
```{r} #| label: desc-corr-2 @@ -846,7 +880,7 @@ recs_des %>% summarize(SQFT_Elec_Corr = survey_corr(TOTSQFT_EN, DOLLAREL)) ``` -For homes without air conditioning, there is a moderate positive correlation between total square footage with electricity consumption (`r recs_des %>% group_by(ACUsed) %>% summarize(SQFT_Elec_Corr = survey_corr(TOTSQFT_EN, DOLLAREL)) %>% filter(ACUsed == FALSE) %>% pull(SQFT_Elec_Corr) %>% round(3)`). For homes with air conditioning, the correlation of `r recs_des %>% group_by(ACUsed) %>% summarize(SQFT_Elec_Corr = survey_corr(TOTSQFT_EN, DOLLAREL)) %>% filter(ACUsed == TRUE) %>% pull(SQFT_Elec_Corr) %>% round(3)` indicates a stronger positive correlation between total square footage and electricity consumption. +For homes without A/C, there is a small positive correlation between total square footage with electricity consumption (`r recs_des %>% group_by(ACUsed) %>% summarize(SQFT_Elec_Corr = survey_corr(TOTSQFT_EN, DOLLAREL)) %>% filter(ACUsed == FALSE) %>% pull(SQFT_Elec_Corr) %>% round(3)`.) For homes with A/C, the correlation of `r recs_des %>% group_by(ACUsed) %>% summarize(SQFT_Elec_Corr = survey_corr(TOTSQFT_EN, DOLLAREL)) %>% filter(ACUsed == TRUE) %>% pull(SQFT_Elec_Corr) %>% round(3)` indicates a stronger positive correlation between total square footage and electricity consumption. ## Standard deviation and variance @@ -893,7 +927,7 @@ recs_des %>% sd_elbill = survey_sd(DOLLAREL)) ``` -We may encounter a warning related to a deprecation in the underlying calculations performed by the `survey_var()` function. This warning is a result of changes in the way R handles recycling in vectorized operations. The results are still valid. They give an estimate of the population variance of electricity bills (`var_elbill`), the standard error of that variance (`var_elbill_se`), and the estimated population standard deviation of electricity bills (`sd_elbill`). Note that no standard error is associated with the standard deviation - this is the only estimate that does not include a standard error. +We may encounter a warning related to deprecated underlying calculations performed by the `survey_var()` function. This warning is a result of changes in the way R handles recycling in vectorized operations. The results are still valid. They give an estimate of the population variance of electricity bills (`var_elbill`), the standard error of that variance (`var_elbill_se`), and the estimated population standard deviation of electricity bills (`sd_elbill`.) Note that no standard error is associated with the standard deviation - this is the only estimate that does not include a standard error. #### Example 2: Variability by subgroup {.unnumbered} @@ -950,9 +984,9 @@ It is estimated that American residential households spent an average of `r .elb ### Subpopulation analysis -We mentioned using `filter()` to subset a survey object for analysis. This operation should be done after creating the survey design object. In rare circumstances, subsetting data before creating the object can lead to incorrect variability estimates. This may occur if subsetting removes an entire Primary Sampling Unit (PSU; see Chapter \@ref(c10-sample-designs-replicate-weights) for more information on PSUs and sample designs). +We mentioned using `filter()` to subset a survey object for analysis. This operation should be done after creating the survey design object. 
Subsetting data before creating the object can lead to incorrect variability estimates if subsetting removes an entire Primary Sampling Unit (PSU; see Chapter \@ref(c10-sample-designs-replicate-weights) for more information on PSUs and sample designs.) -Suppose we want estimates of the average amount spent on natural gas among housing units using natural gas (based on the variable `BTUNG`)^[`BTUNG` is derived from the supplier side component of the survey where `BTUNG` represents the natural gas consumption in British thermal units (Btus) in a year]. We first filter records to only include records where `BTUNG > 0` and then find the average amount of money spent. +Suppose we want estimates of the average amount spent on natural gas among housing units using natural gas (based on the variable `BTUNG`.)^[`BTUNG` is derived from the supplier side component of the survey where `BTUNG` represents the natural gas consumption in British thermal units (Btus) in a year.] We first filter records to only include records where `BTUNG > 0` and then find the average amount spent. ```{r} #| label: desc-subpop @@ -962,9 +996,7 @@ recs_des %>% vartype = c("se", "ci"))) ``` -The estimated average amount spent on natural gas is `r recs_des %>% filter(BTUNG > 0) %>% summarize(NG_mean = survey_mean(DOLLARNG, vartype = c("se", "ci"))) %>% mutate(NG_mean =round(NG_mean )) %>% pull(NG_mean) %>% scales::dollar()`. - -Note that applying the filter to include only housing units that use natural gas yields a higher mean than when not applying the filter. This is because including housing units that do not use natural gas introduces many $0 amounts, impacting the mean calculation. +The estimated average amount spent on natural gas among households that use natural gas is `r recs_des %>% filter(BTUNG > 0) %>% summarize(NG_mean = survey_mean(DOLLARNG, vartype = c("se", "ci"))) %>% mutate(NG_mean =round(NG_mean )) %>% pull(NG_mean) %>% scales::dollar()`. Let's compare this to the mean when we do not filter. ```{r} #| label: desc-subpop-2 @@ -973,11 +1005,11 @@ recs_des %>% vartype = c("se", "ci"))) ``` -Based on this calculation, the estimated average amount spent on natural gas is `r recs_des %>% summarize(NG_mean = survey_mean(DOLLARNG, vartype = c("se", "ci"))) %>% mutate(NG_mean =round(NG_mean )) %>% pull(NG_mean) %>% scales::dollar()`. +Based on this calculation, the estimated average amount spent on natural gas is `r recs_des %>% summarize(NG_mean = survey_mean(DOLLARNG, vartype = c("se", "ci"))) %>% mutate(NG_mean =round(NG_mean )) %>% pull(NG_mean) %>% scales::dollar()`. Note that applying the filter to include only housing units that use natural gas yields a higher mean than when not applying the filter. This is because including housing units that do not use natural gas introduces many $0 amounts, impacting the mean calculation. ### Design effects {#desc-deff} -The design effect measures how the precision of an estimate is influenced by the sampling design. In other words, it measures how much more or less statistically efficient the survey design is compared to a simple random sample (SRS). It is computed by taking the ratio of the estimate's variance under the design at hand to the estimate's variance under a simple random sample without replacement. A design effect less than 1 indicates that the design is *more* statistically efficient than an SRS design, which is rare but possible in a stratified sampling design where the outcome correlates with the stratification variable(s). 
A design effect greater than 1 indicates that the design is *less* statistically efficient than a SRS design. From a design effect, we can calculate the effective sample size as follows: +The design effect measures how the precision of an estimate is influenced by the sampling design. In other words, it measures how much more or less statistically efficient the survey design is compared to a simple random sample (SRS.) It is computed by taking the ratio of the estimate's variance under the design at hand to the estimate's variance under a simple random sample without replacement. A design effect less than 1 indicates that the design is *more* statistically efficient than an SRS design, which is rare but possible in a stratified sampling design where the outcome correlates with the stratification variable(s). A design effect greater than 1 indicates that the design is *less* statistically efficient than a SRS design. From a design effect, we can calculate the effective sample size as follows: $$n_{eff}=\frac{n}{D_{eff}} $$ @@ -999,7 +1031,7 @@ For the values less than 1 (`BTUEL_deff` and `BTUFO_deff`), the results suggest ### Creating summary rows -When using `group_by()` in analysis, the results are returned with a row for each group or combination of groups. Often, we want both the breakdowns by group and a summary row for the estimate representing the entire population. For example, we may want the average electricity consumption by region *and* nationally. The {srvyr} package has the convenient `cascade()` function, which adds summary rows for the total of a group. It is used in place of `summarize()` and has similar functionalities along with some additional features. +When using `group_by()` in analysis, the results are returned with a row for each group or combination of groups. Often, we want both the breakdowns by group and a summary row for the estimate representing the entire population. For example, we may want the average electricity consumption by region *and* nationally. The {srvyr} package has the convenient `cascade()` function, which adds summary rows for the total of a group. It is used instead of `summarize()` and has similar functionalities along with some additional features. #### Syntax {.unnumbered} @@ -1020,11 +1052,11 @@ where the arguments are: * `.data`: A `tbl_svy` object * `...`: Name-value pairs of summary functions (same as the `summarize()` function) * `.fill`: Value to fill in for group summaries (defaults to `NA`) -* `.fill_level_top`: When filling factor variables, whether to put the value '.fill' in the first position (defaults to FALSE, placing it in the bottom). +* `.fill_level_top`: When filling factor variables, whether to put the value '.fill' in the first position (defaults to FALSE, placing it in the bottom.) #### Example {.unnumbered} -First, let's look at an example where we calculate the average household electricity cost and. Then, we build on it to examine the features of the `cascade()` function. In the first example below, we calculate the average household energy cost `DOLLAREL_mn` using `survey_mean()` without modifying any of the argument defaults in the function: +First, let's look at an example where we calculate the average household electricity cost. Then, we build on it to examine the features of the `cascade()` function. 
In the first example below, we calculate the average household energy cost `DOLLAREL_mn` using `survey_mean()` without modifying any of the argument defaults in the function: ```{r} #| label: desc-casc-ex1 @@ -1058,7 +1090,7 @@ recs_des %>% ) ``` -We can see the estimated average electricity bills by regions: `r .ebill_reg_cascade %>% filter(Region=="Northeast") %>% pull(DOLLAREL_mn)` for the Northeast, `r .ebill_reg_cascade %>% filter(Region=="South") %>% pull(DOLLAREL_mn)` for the South, and so on. The last row where `Region = NA` is the national average electricity bill, `r .ebill_reg_cascade %>% filter(is.na(Region)) %>% pull(DOLLAREL_mn)`. However, naming the national "region" as `NA` is not very informative. We can give it a better name using the `.fill` argument. +We can see the estimated average electricity bills by region: `r .ebill_reg_cascade %>% filter(Region=="Northeast") %>% pull(DOLLAREL_mn)` for the Northeast, `r .ebill_reg_cascade %>% filter(Region=="South") %>% pull(DOLLAREL_mn)` for the South, and so on. The last row, where `Region = NA`, is the national average electricity bill, `r .ebill_reg_cascade %>% filter(is.na(Region)) %>% pull(DOLLAREL_mn)`. However, naming the national "region" as `NA` is not very informative. We can give it a better name using the `.fill` argument. ```{r} #| label: desc-casc-ex3 @@ -1087,11 +1119,11 @@ While the results remain the same, the table is now easier to interpret. Often, we are interested in a summary statistic across many variables. Useful tools include the `across()` function in {dplyr}, shown a few times above, and the `map()` function in {purrr}. -The `across()` function allows you to apply the same function to multiple columns within `summarize()`. This works well with all functions shown above, except for `survey_prop()`. In a later example, we will tackle summarizing multiple proportions. +The `across()` function applies the same function to multiple columns within `summarize()`. This works well with all functions shown above, except for `survey_prop()`. In a later example, we tackle summarizing multiple proportions. #### Example 1: `across()` {.unnumbered} -Suppose we want to calculate the total and average consumption, along with coefficients of variation (CV), for each fuel type. These include the reported consumption of electricity (`BTUEL`), natural gas (`BTUNG`), liquid propane (`BTULP`), fuel oil (`BTUFO`), and wood (`BTUWOOD`), as mentioned in the section on design effects. We can take advantage of the fact that these are the only variables that start with "BTU" by selecting them with `starts_with("BTU")` in the `across()` function. For each selected column (`.x`), `across()` creates a list of two functions to be applied: `survey_total()` to calculate the total and `survey_mean()` to calculate the mean, along with their CV (`vartype = "cv"`). Finally, `.unpack = "{outer}.{inner}"` specifies that the resulting column names are a concatenation of the variable name, followed by Total or Mean, and then "coef" or "cv". +Suppose we want to calculate the total and average consumption, along with coefficients of variation (CV), for each fuel type. These include the reported consumption of electricity (`BTUEL`), natural gas (`BTUNG`), liquid propane (`BTULP`), fuel oil (`BTUFO`), and wood (`BTUWOOD`), as mentioned in the section on design effects. We can take advantage of the fact that these are the only variables that start with "BTU" by selecting them with `starts_with("BTU")` in the `across()` function. 
For each selected column (`.x`), `across()` creates a list of two functions to be applied: `survey_total()` to calculate the total and `survey_mean()` to calculate the mean, along with their CV (`vartype = "cv"`.) Finally, `.unpack = "{outer}.{inner}"` specifies that the resulting column names are a concatenation of the variable name, followed by Total or Mean, and then "coef" or "cv". ```{r} #| label: desc-multi-1 @@ -1143,7 +1175,7 @@ consumption_ests_long %>% #### Example 2: Proportions with `across()` {.unnumbered} -As mentioned earlier, proportions do not work as well directly with the `across()` method. If we want the proportion of houses with air conditioning and the proportion of houses with heating, we require two separate `group_by()` statements as shown below: +As mentioned earlier, proportions do not work as well directly with the `across()` method. If we want the proportion of houses with air conditioning (A/C) and the proportion of houses with heating, we require two separate `group_by()` statements as shown below: ```{r} #| label: desc-multip-1 @@ -1156,9 +1188,9 @@ recs_des %>% summarize(p = survey_prop()) ``` -We estimate `r scales::percent(recs_des %>% group_by(ACUsed) %>% summarize(p = survey_prop()) %>% filter(ACUsed == TRUE) %>% pull(p), accuracy = 0.1)` of households have air conditioning and `r scales::percent(recs_des %>% group_by(SpaceHeatingUsed) %>% summarize(p = survey_prop()) %>% filter(SpaceHeatingUsed == TRUE) %>% pull(p), accuracy = 0.1)` have heating. +We estimate `r scales::percent(recs_des %>% group_by(ACUsed) %>% summarize(p = survey_prop()) %>% filter(ACUsed == TRUE) %>% pull(p), accuracy = 0.1)` of households have A/C and `r scales::percent(recs_des %>% group_by(SpaceHeatingUsed) %>% summarize(p = survey_prop()) %>% filter(SpaceHeatingUsed == TRUE) %>% pull(p), accuracy = 0.1)` have heating. -If we are *only* interested in the `TRUE` outcomes, that is, the proportion of households that have air conditioning and the proportion that have heating, we can simplify the code. Applying `survey_mean()` to a logical variable is the same as using `survey_prop()`, as shown below: +If we are *only* interested in the `TRUE` outcomes, that is, the proportion of households that have A/C and the proportion that have heating, we can simplify the code. Applying `survey_mean()` to a logical variable is the same as using `survey_prop()`, as shown below: ```{r} #| label: desc-multip-2 @@ -1169,7 +1201,7 @@ cool_heat_tab <- recs_des %>% cool_heat_tab ``` -Note that the estimates are the same with those obtained using the separate `group_by()` statements. As before, we can use `pivot_longer()` to structure the table in a more suitable format for distribution. +Note that the estimates are the same as those obtained using the separate `group_by()` statements. As before, we can use `pivot_longer()` to structure the table in a more suitable format for distribution. ```{r} #| label: desc-multip-3 @@ -1183,7 +1215,7 @@ cool_heat_tab %>% #### Example 3: `purrr::map()` {.unnumbered} -Loops are a common tool when dealing with repetitive calculations. The {purrr} package provides the `map()` functions which, like a loop, allow you to perform the same task across different elements [@R-purrr]. In our case, we may want to calculate proportions from the same design multiple times. A straightforward approach is to design the calculation for one variable, build a function based on that, and then apply it iteratively for the rest of the variables. 
+Loops are a common tool when dealing with repetitive calculations. The {purrr} package provides the `map()` functions, which, like a loop, allow us to perform the same task across different elements [@R-purrr]. In our case, we may want to calculate proportions from the same design multiple times. A straightforward approach is to design the calculation for one variable, build a function based on that, and then apply it iteratively for the rest of the variables. Suppose we want to create a table that shows the proportion of people who express trust in their government (`TrustGovernment`)^[Question: How often can you trust the federal government in Washington to do what is right? (Always, most of the time, about half the time, some of the time, or never / Never, some of the time, about half the time, most of the time, or always)?] as well as those that trust in people (`TrustPeople`)^[Question: Generally speaking, how often can you trust other people? (Always, most of the time, about half the time, some of the time, or never / Never, some of the time, about half the time, most of the time, or always)? ]. @@ -1234,11 +1266,11 @@ c("TrustGovernment", "TrustPeople") %>% list_rbind() ``` -In addition to our results above, we can also see the output for `TrustPeople`. While we estimate `r scales::percent(anes_des %>% drop_na(TrustGovernment) %>% group_by(TrustGovernment) %>% summarize(p = survey_prop()) %>% mutate(Variable = "TrustGovernment") %>% rename(Answer = TrustGovernment) %>% select(Variable, everything()) %>% filter(Answer == "Always") %>% pull(p), accuracy = 0.01)` of people always trust the government, `r scales::percent(anes_des %>% drop_na(TrustPeople) %>% group_by(TrustPeople) %>% summarize(p = survey_prop()) %>% mutate(Variable = "TrustPeople") %>% rename(Answer = TrustPeople) %>% select(Variable, everything()) %>% filter(Answer == "Always") %>% pull(p), accuracy = 0.01)` always trust people. +In addition to our results above, we can also see the output for `TrustPeople`. While we estimate that `r scales::percent(anes_des %>% drop_na(TrustGovernment) %>% group_by(TrustGovernment) %>% summarize(p = survey_prop()) %>% mutate(Variable = "TrustGovernment") %>% rename(Answer = TrustGovernment) %>% select(Variable, everything()) %>% filter(Answer == "Always") %>% pull(p), accuracy = 0.01)` of people always trust the government, `r scales::percent(anes_des %>% drop_na(TrustPeople) %>% group_by(TrustPeople) %>% summarize(p = survey_prop()) %>% mutate(Variable = "TrustPeople") %>% rename(Answer = TrustPeople) %>% select(Variable, everything()) %>% filter(Answer == "Always") %>% pull(p), accuracy = 0.01)` always trust people. ## Exercises -The exercises use the design objects `anes_des` and `recs_des` provided in the Prerequisites box in the beginning of the chapter. +The exercises use the design objects `anes_des` and `recs_des` provided in the Prerequisites box at the beginning of the chapter. 1. How many females have a graduate degree? Hint: the variables `Gender` and `Education` will be useful. @@ -1252,8 +1284,8 @@ The exercises use the design objects `anes_des` and `recs_des` provided in the P 6. What is the median temperature people set their thermostats to at night during the winter? Hint: The variable `WinterTempNight` indicates the temperature that people set their temperature in the winter at night. -7. People sometimes set their temperature differently over different seasons and during the day. 
What median temperatures do people set their thermostat to in the summer and winter, both during the day and at night? Include confidence intervals. Hint: Use the variables `WinterTempDay`, `WinterTempNight`, `SummerTempDay`, and `SummerTempNight`. +7. People sometimes set their temperature differently over different seasons and during the day. What median temperatures do people set their thermostats to in the summer and winter, both during the day and at night? Include confidence intervals. Hint: Use the variables `WinterTempDay`, `WinterTempNight`, `SummerTempDay`, and `SummerTempNight`. 8. What is the correlation between the temperature that people set their temperature at during the night and during the day in the summer? -9. What is the 1st, 2nd, and 3rd quartile of the amount of money spent on energy by Building America (BA) climate zone? Hint: `TOTALDOL` indicates the total amount spent on all fuel, and `ClimateRegion_BA` indicates the BA climate zones. \ No newline at end of file +9. What is the 1st, 2nd, and 3rd quartile of money spent on energy by Building America (BA) climate zone? Hint: `TOTALDOL` indicates the total amount spent on all fuel, and `ClimateRegion_BA` indicates the BA climate zones. \ No newline at end of file diff --git a/06-statistical-testing.Rmd b/06-statistical-testing.Rmd index 94f916dc..a0747420 100644 --- a/06-statistical-testing.Rmd +++ b/06-statistical-testing.Rmd @@ -27,7 +27,7 @@ library(gt) library(prettyunits) ``` -We will be using data from ANES and RECS described in Chapter \@ref(c04-getting-started). As a reminder, here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter \@ref(c04-getting-started) for more information). +We are using data from ANES and RECS described in Chapter \@ref(c04-getting-started). As a reminder, here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter \@ref(c04-getting-started) for more information.) ```{r} #| label: stattest-anes-des @@ -65,39 +65,39 @@ recs_des <- recs_2020 %>% ## Introduction -When analyzing results from a survey, the point estimates described in Chapter \@ref(c05-descriptive-analysis) help us understand the data at a high level. Still, researchers and the public often want to make comparisons between different groups. These comparisons are calculated through statistical testing. +When analyzing survey results, the point estimates described in Chapter \@ref(c05-descriptive-analysis) help us understand the data at a high level. Still, we often want to make comparisons between different groups. These comparisons are calculated through statistical testing. The general idea of statistical testing is the same for data obtained through surveys and data obtained through other methods, where we compare the point estimates and variance estimates of each statistic to see if statistically significant differences exist. However, statistical testing for complex surveys involves additional considerations due to the need to account for the sampling design in order to obtain accurate variance estimates. -Statistical testing, also called hypothesis testing, involves declaring a null and alternative hypothesis. A null hypothesis is denoted as $H_0$ and the alternative hypothesis is denoted as $H_A$. 
The null hypothesis is the default assumption in that there are no differences in the data, or that the data is operating under "standard" behaviors. On the other hand, the alternative hypothesis is the break from the "standard" and what we are trying to determine if the data supports. +Statistical testing, also called hypothesis testing, involves declaring a null and alternative hypothesis. A null hypothesis is denoted as $H_0$ and the alternative hypothesis is denoted as $H_A$. The null hypothesis is the default assumption in that there are no differences in the data, or that the data are operating under "standard" behaviors. On the other hand, the alternative hypothesis is the break from the "standard" and what we are trying to determine if the data support this alternative hypothesis. -Let's review an example outside of survey data. If we are flipping a coin, a null hypothesis would be that the coin is fair and that each side has an equal chance of being flipped. In other words, the probability of the coin landing on each side is 1/2. Whereas an alternative hypothesis could be that the coin is unfair and that one side has a higher probability of being flipped (e.g., a probability of 1/4 to get heads, but a probability of 3/4 to get tails). We write this set of hypotheses as: +Let's review an example outside of survey data. If we are flipping a coin, a null hypothesis would be that the coin is fair and that each side has an equal chance of being flipped. In other words, the probability of the coin landing on each side is 1/2. Whereas an alternative hypothesis could be that the coin is unfair and that one side has a higher probability of being flipped (e.g., a probability of 1/4 to get heads but a probability of 3/4 to get tails.) We write this set of hypotheses as: - $H_0: \rho_{heads} = \rho_{tails}$, where $\rho_{x}$ is the probability of flipping the coin and having it land on heads ($\rho_{heads}$) or tails ($\rho_{tails}$) - $H_A: \rho_{heads} \neq \rho_{tails}$ -When we conduct hypothesis testing, the statistical models calculate a p-value, which shows how likely we are to observe the data if the null hypothesis is true. If the p-value (a probability between 0 and 1) is small, we have strong evidence to reject the null hypothesis as it is unlikely to see the data we are observing if the null hypothesis is true. However, if the p-value is large, we say we do not have evidence to reject the null hypothesis. The size of the p-value for this cut off is determined by type 1 error known as $\alpha$. A common type 1 error value for statistical testing is to use $\alpha = 0.05$.^[For more information on statistical testing, we recommend reviewing introduction to statistics textbooks.] It is common for explanations of statistical testing to refer to confidence level. The confidence level is the inverse of the type 1 error. Thus, if $\alpha = 0.05$, the confidence level would be 95%. +When we conduct hypothesis testing, the statistical models calculate a p-value, which shows how likely we are to observe the data if the null hypothesis is true. If the p-value (a probability between 0 and 1) is small, we have strong evidence to reject the null hypothesis as it is unlikely to see the data we observe if the null hypothesis is true. However, if the p-value is large, we say we do not have evidence to reject the null hypothesis. The size of the p-value for this cut-off is determined by Type 1 error known as $\alpha$. 
A common Type 1 error value for statistical testing is to use $\alpha = 0.05$.^[For more information on statistical testing, we recommend reviewing introduction to statistics textbooks.] It is common for explanations of statistical testing to refer to confidence level. The confidence level is the inverse of the Type 1 error. Thus, if $\alpha = 0.05$, the confidence level would be 95%. -The functions in the {survey} package allow for the correct estimation of the variances. This chapter will cover the following statistical tests with survey data and the following functions from the {survey} package[@lumley2010complex]: +The functions in the {survey} package allow for the correct estimation of the variances. This chapter covers the following statistical tests with survey data and the following functions from the {survey} package[@lumley2010complex]: -* Comparison of proportions `svyttest()` -* Comparison of means `svyttest()` -* Goodness of fit tests `svygofchisq()` -* Tests of independence `svychisq()` -* Tests of homogeneity `svychisq()` +* Comparison of proportions (`svyttest()`) +* Comparison of means (`svyttest()`) +* Goodness of fit tests (`svygofchisq()`) +* Tests of independence (`svychisq()`) +* Tests of homogeneity (`svychisq()`) ## Dot notation {#dot-notation} -Up to this point, we have shown functions that use wrappers from the {srvyr} package. This means that the functions work with tidyverse syntax. However, the functions in this chapter do not have wrappers in the {srvyr} package and are instead used directly from the {survey} package. Therefore, the design object is *not* the first argument, and to use these functions with the magrittr pipe (`%>%`) and tidyverse syntax, we will need to use dot (`.`) notation^[This could change in the future if another package is built or {srvyr} is expanded to work with {tidymodels} packages but no such plans are known at this time.] +Up to this point, we have shown functions that use wrappers from the {srvyr} package. This means that the functions work with tidyverse syntax. However, the functions in this chapter do not have wrappers in the {srvyr} package and are instead used directly from the {survey} package. Therefore, the design object is *not* the first argument, and to use these functions with the magrittr pipe (`%>%`) and tidyverse syntax, we need to use dot (`.`) notation.^[This could change in the future if another package is built or {srvyr} is expanded to work with {tidymodels} packages but no such plans are known at this time.] -Functions that work with the magrittr pipe (`%>%`) have the data as the first argument. When we run a function with the pipe, it automatically places anything to the left of the pipe into the first argument of the function to the right of the pipe. For example, if we wanted to take the `mtcars` data and filter to cars with six cylinders, we can write the code in at least four different ways: +Functions that work with the magrittr pipe (`%>%`) have the dataset as the first argument. When we run a function with the pipe, it automatically places anything to the left of the pipe into the first argument of the function to the right of the pipe. For example, if we wanted to take the `towny` data from the {gt} package and filter to municipalities with the Census Subdivision Type of "city", we can write the code in at least four different ways: -1. `filter(mtcars, cyl == 6)` -2. `mtcars %>% filter(cyl == 6)` -3. `mtcars %>% filter(., cyl == 6)` -4. `mtcars %>% filter(.data = ., cyl == 6)` +1. 
`filter(towny, csd_type == "city")` +2. `towny %>% filter(csd_type == "city")` +3. `towny %>% filter(., csd_type == "city")` +4. `towny %>% filter(.data = ., csd_type == "city")` -Each of these lines of code will produce the same output since the argument that takes the data is in the first spot in `filter()`. The first two are probably familiar to those who have worked with the tidyverse. The third option functions the same way as the second one but is explicit that `mtcars` goes into the first argument, and the fourth option indicates that `mtcars` is going into the named argument of `.data`. Here, we are telling R to take what's on the left side of the pipe (`mtcars`) and pipe it into the spot with the dot (`.`)---the first argument. +Each of these lines of code produces the same output since the argument that takes the dataset is in the first spot in `filter()`. The first two are probably familiar to those who have worked with the tidyverse. The third option functions the same way as the second one but is explicit that `towny` goes into the first argument, and the fourth option indicates that `towny` is going into the named argument of `.data`. Here, we are telling R to take what is on the left side of the pipe (`towny`) and pipe it into the spot with the dot (`.`)---the first argument. In functions that are not part of the tidyverse, the data argument may not be in the first spot. For example, in `svyttest()`, the data argument is in the second spot, which means we need to place the dot (`.`) in the second spot and not the first. For example: @@ -115,7 +115,7 @@ svydata_des %>% svyttest(design = ., x ~ y) ``` -However, the following code will not work as the `svyttest()` function expects the formula as the first argument when arguments are not named: +However, the following code does not work as the `svyttest()` function expects the formula as the first argument when arguments are not named: ```r svydata_des %>% @@ -124,19 +124,19 @@ svydata_des %>% ## Comparison of proportions and means {#stattest-ttest} -We use t-tests to compare two proportions or means. T-tests allow us to determine if one proportion or mean is statistically different from another. They are commonly used to determine if a single estimate differs from a known value (e.g., 0 or 50%) or to compare two group means (e.g., North versus South). Comparing a single estimate to a known value is called a *one sample t-test*, and we can set up the hypothesis test as follows: +We use t-tests to compare two proportions or means. T-tests allow us to determine if one proportion or mean is statistically different from another. They are commonly used to determine if a single estimate differs from a known value (e.g., 0 or 50%) or to compare two group means (e.g., North versus South.) Comparing a single estimate to a known value is called a *one sample t-test*, and we can set up the hypothesis test as follows: - $H_0: \mu = 0$ where $\mu$ is the mean outcome and $0$ is the value we are comparing it to - $H_A: \mu \neq 0$ -For comparing two estimates, this is called a *two-sample t-test* and we can set up the hypothesis test as follows: +For comparing two estimates, this is called a *two-sample t-test*. We can set up the hypothesis test as follows: - $H_0: \mu_1 = \mu_2$ where $\mu_i$ is the mean outcome for group $i$ - $H_A: \mu_1 \neq \mu_2$ -Two sample t-tests can also be *paired* or *unpaired*. 
If the data come from two different populations (e.g., North versus South), the t-test run will be an *unpaired* or *independent samples* t-test. *Paired* t-tests occur when the data come from the same population. This is commonly seen with data from the same population in two different time periods (e.g., before and after an intervention). +Two sample t-tests can also be *paired* or *unpaired*. If the data come from two different populations (e.g., North versus South), the t-test run is an *unpaired* or *independent samples* t-test. *Paired* t-tests occur when the data come from the same population. This is commonly seen with data from the same population in two different time periods (e.g., before and after an intervention.) -The difference between t-tests with non-survey data and survey data is based on the underlying variance estimation difference. Chapter \@ref(c10-sample-designs-replicate-weights) provides a detailed overview of the math behind the mean and sampling error calculations for various sample designs. The functions in the {survey} package will account for these nuances, provided the design object is correctly defined. +The difference between t-tests with non-survey data and survey data is based on the underlying variance estimation difference. Chapter \@ref(c10-sample-designs-replicate-weights) provides a detailed overview of the math behind the mean and sampling error calculations for various sample designs. The functions in the {survey} package account for these nuances, provided the design object is correctly defined. ### Syntax {#stattest-ttest-syntax} @@ -144,7 +144,7 @@ When we do not have survey data, we can use the `t.test()` function from the {st - We need to use the survey design object instead of the original data frame - We can only use a formula and not separate x and y data - - The confidence level cannot be specified and will always be set to 95%. However, we will show examples of how the confidence level can be changed after running the `svyttest()` function by using the `confint()` function. + - The confidence level cannot be specified and is always be set to 95%. However, we show examples of how the confidence level can be changed after running the `svyttest()` function by using the `confint()` function. Here is the syntax for the `svyttest()` function: @@ -166,14 +166,14 @@ The `formula` argument can take several different forms depending on what we are 1. **One-sample t-test:** a. **Comparison to 0:** `var ~ 0`, where `var` is the measure of interest, and we compare it to the value `0`. For example, we could test if the population mean of household debt is different from `0` given the sample data collected. - b. **Comparison to a different value:** `var - value ~ 0`, where `var` is the measure of interest and `value` is what we are comparing to. For example, we could test if the proportion of the population that has blue eyes is different from `25%` by using `var - 0.25 ~ 0`. Note that specifying the formula as `var ~ 0.25` is not equivalent and will result in a syntax error. + b. **Comparison to a different value:** `var - value ~ 0`, where `var` is the measure of interest and `value` is what we are comparing to. For example, we could test if the proportion of the population that has blue eyes is different from `25%` by using `var - 0.25 ~ 0`. Note that specifying the formula as `var ~ 0.25` is not equivalent and results in a syntax error. 2. **Two-sample t-test:** a. 
**Unpaired:** - **2 level grouping variable:** `var ~ groupVar`, where `var` is the measure of interest and `groupVar` is a variable with two categories. For example, we could test if the average age of the population who voted for president in 2020 differed from the age of people who did not vote. In this case, age would be used for `var`, and a binary variable indicating voting activity would be the `groupVar`. - **3+ level grouping variable:** `var ~ groupVar == level`, where `var` is the measure of interest, `groupVar` is the categorical variable, and `level` is the category level to isolate. For example, we could test if the test scores in one classroom differed from all other classrooms where `groupVar` would be the variable holding the values for classroom IDs and `level` is the classroom ID we want to compare to the others. - b. **Paired:** `var_1 - var_2 ~ 0`, where `var_1` is the first variable of interest and `var_2` is the second variable of interest. For example, we could test if test scores on a subject differed between the start and the end of a course so `var_1` would be the test score at the beginning of the course and `var_2` would be the score at the end of the course. + b. **Paired:** `var_1 - var_2 ~ 0`, where `var_1` is the first variable of interest and `var_2` is the second variable of interest. For example, we could test if test scores on a subject differed between the start and the end of a course, so `var_1` would be the test score at the beginning of the course, and `var_2` would be the score at the end of the course. -The `na.rm` argument defaults to `FALSE`, which means if any data is missing, the t-test will not compute. Throughout this chapter, we will always set `na.rm = TRUE`, but before analyzing the survey data, review the notes provided in Chapter \@ref(c03-survey-data-documentation) to better understand how to handle missing data. +The `na.rm` argument defaults to `FALSE`, which means if any data values are missing, the t-test does not compute. Throughout this chapter, we always set `na.rm = TRUE`, but before analyzing the survey data, review the notes provided in Chapter \@ref(c11-missing-data) to better understand how to handle missing data. Let's walk through a few examples using the ANES and RECS data. @@ -219,19 +219,21 @@ recs_des %>% The result is the same in both methods, so we see that the average temperature U.S. households set their thermostat to in the summer at night is `r signif(ttest_ex1$estimate + 68,3)`$^\circ$F. Looking at the output from `svyttest()`, the t-statistic is `r signif(ttest_ex1$statistic, 3)`, and the p-value is $`r pretty_p_value(ttest_ex1[["p.value"]])`$, indicating that the average is statistically different from 68$^\circ$F at an $\alpha$ level of $0.05$. -If we want an 80% confidence interval for the test statistic, we can use the function `confint()` to change the confidence level. Below, we print both the original 95% confidence interval and the 80% confidence interval: +If we want an 80% confidence interval for the test statistic, we can use the function `confint()` to change the confidence level. Below, we print the default confidence interval (95%), the confidence interval explicitly specifying the level as 95%, and the 80% confidence interval. The default confidence level is 95%, and when we specify this level, R returns a vector with both row and column names. 
However, when we specify any other confidence level, an unnamed vector is returned, with the first element being the lower bound and the second element being the upper bound of the confidence interval. ```{r} #| label: stattest-ttest-ex1-ci80 -confint(ttest_ex1, level = 0.95) -confint(ttest_ex1, level = 0.8) +confint(ttest_ex1) +confint(ttest_ex1, level = 0.95) +confint(ttest_ex1, level = 0.8) ``` + In this case, neither confidence interval contains 0, and we draw the same conclusion from either that the average temperature households set their thermostat in the summer at night is significantly higher than 68$^\circ$F. #### Example 2: One-sample t-test for proportion {.unnumbered #stattest-ttest-ex2} -RECS asked respondents if they use any air conditioning (AC) in their home.^[Is any air conditioning equipment used in your home?] In our data, we call this variable `ACUsed`. Let's look at the proportion of U.S. households that use AC in their homes using the `survey_prop()` function we learned in Chapter \@ref(c05-descriptive-analysis). +RECS asked respondents if they use air conditioning (A/C) in their home.^[Is any air conditioning equipment used in your home?] In our data, we call this variable `ACUsed`. Let's look at the proportion of U.S. households that use A/C in their homes using the `survey_prop()` function we learned in Chapter \@ref(c05-descriptive-analysis). ```{r} #| label: stattest-ttest-acused @@ -242,9 +244,9 @@ acprop <- recs_des %>% acprop ``` -Based on this, `r signif((acprop %>% filter(ACUsed==TRUE) %>% pull(p))*100,3)`% of U.S. households use AC in their homes. If we wanted to know if this differs from 90%, we could set up our hypothesis as follows: +Based on this, `r signif((acprop %>% filter(ACUsed==TRUE) %>% pull(p))*100,3)`% of U.S. households use A/C in their homes. If we wanted to know if this differs from 90%, we could set up our hypothesis as follows: -- $H_0: p = 0.90$ where $p$ is the proportion of the U.S. households that use AC in their homes +- $H_0: p = 0.90$ where $p$ is the proportion of U.S. households that use A/C in their homes - $H_A: p \neq 0.90$ To conduct this in R, we use the `svyttest()` function as follows: @@ -268,7 +270,7 @@ The output from the `svyttest()` function can be a bit hard to read. Using the ` tidy(ttest_ex2) ``` -The 'tidied' output can also be piped into the {gt} package to create a table ready for publication. We go over the {gt} package in Chapter \@ref(c08-communicating-results). The function `pretty_p_value()` comes from the {prettyunits} package and converts numeric p-values to characters and, by default prints four decimal places and displays any p-value less than 0.0001 as `"<0.0001"` though another minimum display p-value can be specified [@R-prettyunits]. +The 'tidied' output can also be piped into the {gt} package to create a table ready for publication. We go over the {gt} package in Chapter \@ref(c08-communicating-results). The function `pretty_p_value()` comes from the {prettyunits} package and converts numeric p-values to characters and, by default, prints four decimal places and displays any p-value less than 0.0001 as `"<0.0001"` though another minimum display p-value can be specified [@R-prettyunits]. ```{r} #| label: stattest-ttest-ex2-gt @@ -279,7 +281,7 @@ tidy(ttest_ex2) %>% fmt_number() ``` -(ref:stattest-ttest-ex2-gt-tab) One-sample t-test output for estimates of U.S. 
households use AC in their homes differing from 90%, RECS 2020 +(ref:stattest-ttest-ex2-gt-tab) One-sample t-test output for estimates of U.S. households use A/C in their homes differing from 90%, RECS 2020 ```{r} #| label: stattest-ttest-ex2-gt-tab @@ -293,13 +295,13 @@ tidy(ttest_ex2) %>% print_gt_book(knitr::opts_current$get()[["label"]]) ``` -The estimate differs from Example 1 in that the estimate is not displaying \(\mu - 0.90\) but rather \(\mu\), or the difference between the U.S. households that use AC and the proportion we are comparing to. We can see that there is a difference of `r signif(ttest_ex2$estimate*100,3)` percentage points. Additionally, the t-statistic value in the `statistic` column is `r signif(ttest_ex2$statistic,3)`, and the p-value is `r pretty_p_value(ttest_ex2$p.value)`. These results indicate that the fewer than 90% of U.S. households use AC in their homes. +The estimate differs from Example 1 in that the estimate does not display \(p - 0.90\) but rather \(p\), or the difference between the U.S. households that use A/C and the proportion we are comparing to. We can see that there is a difference of `r signif(ttest_ex2$estimate*100,3)` percentage points. Additionally, the t-statistic value in the `statistic` column is `r signif(ttest_ex2$statistic,3)`, and the p-value is `r pretty_p_value(ttest_ex2$p.value)`. These results indicate that fewer than 90% of U.S. households use A/C in their homes. #### Example 3: Unpaired two-sample t-test {.unnumbered #stattest-ttest-ex3} -Two additional variables in the RECS data are the electric bill cost (`DOLLAREL`) and whether the house used AC or not (`ACUsed`).^[Is any air conditioning equipment used in your home?] If we want to know if the U.S. households that used AC had higher electrical bills compared to those that did not, we could set up the hypothesis as follows: +Two additional variables in the RECS data are the electric bill cost (`DOLLAREL`) and whether the house used A/C or not (`ACUsed`.)^[Is any air conditioning equipment used in your home?] If we want to know if the U.S. households that used A/C had higher electrical bills compared to those that did not, we could set up the hypothesis as follows: -- $H_0: \mu_{AC} = \mu_{noAC}$ where $\mu_{AC}$ is the electrical bill cost for U.S. households that used AC and $\mu_{noAC}$ is the electrical bill cost for U.S. households that did not use AC +- $H_0: \mu_{AC} = \mu_{noAC}$ where $\mu_{AC}$ is the electrical bill cost for U.S. households that used A/C and $\mu_{noAC}$ is the electrical bill cost for U.S. households that did not use A/C - $H_A: \mu_{AC} \neq \mu_{noAC}$ Let's take a quick look at the data to see the format the data are in: @@ -330,7 +332,7 @@ tidy(ttest_ex3) %>% fmt_number() ``` -(ref:stattest-ttest-ex3-gt-tab) Unpaired two-sample t-test output for estimates of U.S. households electrical bills by AC use, RECS 2020 +(ref:stattest-ttest-ex3-gt-tab) Unpaired two-sample t-test output for estimates of U.S. households electrical bills by A/C use, RECS 2020 ```{r} #| label: stattest-ttest-ex3-gt-tab @@ -344,11 +346,11 @@ tidy(ttest_ex3) %>% print_gt_book(knitr::opts_current$get()[["label"]]) ``` -The results indicate that the difference in electrical bills for those that used AC and those that did not is, on average, \$`r round(ttest_ex3$estimate,2)`. The difference appears to be statistically significant as the t-statistic is `r signif(ttest_ex3$statistic, 3)` and the p-value is $`r pretty_p_value(ttest_ex3[["p.value"]])`$. 
Households that used AC spent, on average, $`r round(ttest_ex3[["estimate"]], 2) %>% unname()` more in 2020 on electricity than households without AC. +The results indicate that the difference in electrical bills for those who used A/C and those who did not is, on average, \$`r round(ttest_ex3$estimate,2)`. The difference appears to be statistically significant as the t-statistic is `r signif(ttest_ex3$statistic, 3)` and the p-value is $`r pretty_p_value(ttest_ex3[["p.value"]])`$. Households that used A/C spent, on average, $`r round(ttest_ex3[["estimate"]], 2) %>% unname()` more in 2020 on electricity than households without A/C. #### Example 4: Paired two-sample t-test {.unnumbered #stattest-ttest-ex4} -Let's say we want to test whether the temperature that U.S. households set their thermostat at night differs depending on the season (comparing summer^[During the summer, what is your home’s typical indoor temperature inside your home at night?] and winter^[During the winter, what is your home’s typical indoor temperature inside your home at night?] temperatures). We could set up the hypothesis as follows: +Let's say we want to test whether the temperature at which U.S. households set their thermostat at night differs depending on the season (comparing summer^[During the summer, what is your home’s typical indoor temperature inside your home at night?] and winter^[During the winter, what is your home’s typical indoor temperature inside your home at night?] temperatures.) We could set up the hypothesis as follows: - $H_0: \mu_{summer} = \mu_{winter}$ where $\mu_{summer}$ is the temperature that U.S. households set their thermostat to during summer nights, and $\mu_{winter}$ is the temperature that U.S. households set their thermostat to during winter nights - $H_A: \mu_{summer} \neq \mu_{winter}$ @@ -388,13 +390,13 @@ tidy(ttest_ex4) %>% print_gt_book(knitr::opts_current$get()[["label"]]) ``` -U.S. households set their thermostat on average `r signif(ttest_ex4$estimate,2)`$^\circ$F warmer in summer nights than winter nights, which is statistically significant (t = `r signif(ttest_ex4$statistic, 3)`, p-value = $`r pretty_p_value(ttest_ex4[["p.value"]])`$). +U.S. households set their thermostat on average `r signif(ttest_ex4$estimate,2)`$^\circ$F warmer in summer nights than winter nights, which is statistically significant (t = `r signif(ttest_ex4$statistic, 3)`, p-value = $`r pretty_p_value(ttest_ex4[["p.value"]])`$.) ## Chi-square tests {#stattest-chi} Chi-square tests ($\chi^2$) allow us to examine multiple proportions using a goodness-of-fit test, a test of independence, or a test of homogeneity. These three tests have the same $\chi^2$ distributions but with slightly different underlying assumptions. -First, **goodness-of-fit** tests are used when comparing *observed* data to *expected* data. For example, this could be used to determine if respondent demographics (the observed data in the sample) match known population information (the expected data). In this case, we can set up the hypothesis test as follows: +First, **goodness-of-fit** tests are used when comparing *observed* data to *expected* data. For example, this could be used to determine if respondent demographics (the observed data in the sample) match known population information (the expected data.) 
In this case, we can set up the hypothesis test as follows: - $H_0: p_1 = \pi_1, ~ p_2 = \pi_2, ~ ..., ~ p_k = \pi_k$ where $p_i$ is the observed proportion for category $i$, $\pi_i$ is expected proportion for category $i$, and $k$ is the number of categories - $H_A:$ at least one level of $p_i$ does not match $\pi_i$ @@ -409,7 +411,7 @@ Third, **tests of homogeneity** are used to compare two distributions to see if - $H_0: p_{1a} = p_{1b}, ~ p_{2a} = p_{2b}, ~ ..., ~ p_{ka} = p_{kb}$ where $p_{ia}$ is the observed proportion of category $i$ for subgroup $a$, $p_{ib}$ is the observed proportion of category $i$ for subgroup $a$ and $k$ is the number of categories - $H_A:$ at least one category of $p_{ia}$ does not match $p_{ib}$ -As with t-tests, the difference between using $\chi^2$ tests with non-survey data and survey data is based on the underlying variance estimation. The functions in the {survey} package will account for these nuances, provided the design object is correctly defined. For basic variance estimation formulas for different survey design types, refer to Chapter \@ref(c10-sample-designs-replicate-weights). +As with t-tests, the difference between using $\chi^2$ tests with non-survey data and survey data is based on the underlying variance estimation. The functions in the {survey} package account for these nuances, provided the design object is correctly defined. For basic variance estimation formulas for different survey design types, refer to Chapter \@ref(c10-sample-designs-replicate-weights). ### Syntax {#stattest-chi-syntax} @@ -433,11 +435,11 @@ svygofchisq(formula, The arguments are: * `formula`: Formula specifying a single factor variable -* `p`: Vector of probabilities for the categories of the factor in the correct order. If they probabilities do not sum to 1, they will be rescaled to sum to 1. +* `p`: Vector of probabilities for the categories of the factor in the correct order. If the probabilities do not sum to 1, they are rescaled to sum to 1. * `design`: Survey design object * ...: Other arguments to pass on, such as `na.rm` -Based on the order of the arguments, we again must use the dot `(.)` notation if we pipe in the survey design object or explicitly name the arguments as described in Section \@ref(dot-notation). For the goodness of fit tests, the formula will be a single variable `formula = ~var` as we compare the observed data from this variable to the expected data. The expected probabilities are then entered in the `p` argument and need to be a vector of the same length as the number of categories in the variable. For example, if we want to know if the proportion of males and females matches a distribution of 30/70, then the sex variable (with two categories) would be used `formula = ~SEX`, and the proportions would be included as `p = c(.3, .7)`. It is important to note that the variable entered into the formula should be formatted as either a factor or a character. The examples below provide more detail and tips on how to make sure the levels match up correctly. +Based on the order of the arguments, we again must use the dot `(.)` notation if we pipe in the survey design object or explicitly name the arguments as described in Section \@ref(dot-notation). For the goodness of fit tests, the formula is a single variable `formula = ~var` as we compare the observed data from this variable to the expected data. 
The expected probabilities are then entered in the `p` argument and need to be a vector of the same length as the number of categories in the variable. For example, if we want to know if the proportion of males and females matches a distribution of 30/70, then the sex variable (with two categories) would be used `formula = ~SEX`, and the proportions would be included as `p = c(.3, .7)`. It is important to note that the variable entered into the formula should be formatted as either a factor or a character. The examples below provide more detail and tips on how to make sure the levels match up correctly. For tests of homogeneity and independence, the `svychisq()` function should be used. The syntax is as follows: @@ -460,11 +462,11 @@ The arguments are: There are six statistics that are accepted in this formula. For tests of homogeneity (when comparing cross-tabulations), the `F` or `Chisq` statistics should be used.^[These two statistics can also be used for goodness of fit tests if the `svygofchisq()` function is not used.] The `F` statistic is the default and uses the Rao-Scott second-order correction. This correction is designed to assist with complicated sampling designs (i.e., those other than a simple random sample) [@Scott2007]. The `Chisq` statistic is an adjusted version of the Pearson $\chi^2$ statistic. The version of this statistic in the `svychisq()` function compares the design effect estimate from the provided survey data to what the $\chi^2$ distribution would have been if the data came from a simple random sampling. -For tests of independence, the `Wald` and `adjWald` are recommended as they provide a better adjustment for variable comparisons [@lumley2010complex]. If the data has a small number of primary sampling units (PSUs) compared to the degrees of freedom, then the `adjWald` statistic should be used to account for this. The `lincom` and `saddlepoint` statistics are available for more complicated data structures. +For tests of independence, the `Wald` and `adjWald` are recommended as they provide a better adjustment for variable comparisons [@lumley2010complex]. If the data have a small number of primary sampling units (PSUs) compared to the degrees of freedom, then the `adjWald` statistic should be used to account for this. The `lincom` and `saddlepoint` statistics are available for more complicated data structures. -The formula argument will always be one-sided, unlike the `svyttest()` function. The two variables of interest should be included with a plus sign: `formula = ~ var_1 + var_2`. As with the `svygofchisq()` function, the variables entered into the formula should be formatted as either a factor or a character. +The formula argument is always one-sided, unlike the `svyttest()` function. The two variables of interest should be included with a plus sign: `formula = ~ var_1 + var_2`. As with the `svygofchisq()` function, the variables entered into the formula should be formatted as either a factor or a character. -Additionally, as with the t-test function, both `svygofchisq()` and `svychisq()` have the `na.rm` argument. If any data is missing, the $\chi^2$ tests will assume that `NA` is a category and include it in the calculation. Throughout this chapter, we will always set `na.rm = TRUE`, but before analyzing the survey data, review the notes provided in Chapter \@ref(c03-survey-data-documentation) to better understand how to handle missing data. +Additionally, as with the t-test function, both `svygofchisq()` and `svychisq()` have the `na.rm` argument. 
If any data values are missing, the $\chi^2$ tests assume that `NA` is a category and include it in the calculation. Throughout this chapter, we always set `na.rm = TRUE`, but before analyzing the survey data, review the notes provided in Chapter \@ref(c11-missing-data) to better understand how to handle missing data. ### Examples {#stattest-chi-examples} @@ -472,9 +474,9 @@ Let's walk through a few examples using the ANES data. #### Example 1: Goodness of fit test {.unnumbered #stattest-chi-ex1} -ANES asked respondents about their highest education level.^[What is the highest level of school you have completed or the highest degree you have received?] Based on the data from the 2020 American Community Survey (ACS) 5-year estimates^[Data was pulled from data.census.gov using the S1501 Education Attainment 2020: ACS 5-Year Estimates Subject Tables], the education distribution of those aged 18+ in the United States (among the 50 states and District of Columbia) is as follows: +ANES asked respondents about their highest education level.^[What is the highest level of school you have completed or the highest degree you have received?] Based on the data from the 2020 American Community Survey (ACS) 5-year estimates^[Data was pulled from data.census.gov using the S1501 Education Attainment 2020: ACS 5-Year Estimates Subject Tables.], the education distribution of those aged 18+ in the United States (among the 50 states and District of Columbia) is as follows: - - 11% had less than High School degree + - 11% had less than a High School degree - 27% had a High School degree - 29% had some college or associate's degree - 33% had a bachelor's degree or higher @@ -494,7 +496,7 @@ anes_des %>% summarize(p = survey_mean()) ``` -Based on this output, we can see that we have different levels than the ACS data provides. Specifically, the education data from ANES has two levels for Bachelor's Degree or Higher (Bachelor's and Graduate), so these two categories need to be collapsed into a single category to match the ACS data. For this, among other methods, we can use the {forcats} package from the tidyverse [@R-forcats]. The package's `fct_collapse()` function helps us create a new variable by collapsing categories into a single one. Then, we will use the `svygofchisq()` function to compare the ANES data to the ACS data where we specify the updated design object, the formula using the collapsed education variable, the ACS estimates for education levels as p, and removing NA values. +Based on this output, we can see that we have different levels from the ACS data. Specifically, the education data from ANES include two levels for Bachelor's Degree or Higher (Bachelor's and Graduate), so these two categories need to be collapsed into a single category to match the ACS data. For this, among other methods, we can use the {forcats} package from the tidyverse [@R-forcats]. The package's `fct_collapse()` function helps us create a new variable by collapsing categories into a single one. Then, we use the `svygofchisq()` function to compare the ANES data to the ACS data, where we specify the updated design object, the formula using the collapsed education variable, the ACS estimates for education levels as p, and removing NA values. 
```{r} #| label: stattest-chi-ex1 @@ -520,7 +522,7 @@ chi_ex1 <- anes_des_educ %>% chi_ex1 ``` -The output from the `svygofchisq()` indicates that at least one proportion from ANES does not match the ACS data ($\chi^2 =$ `r prettyNum(chi_ex1$statistic, big.mark=",")`; p-value `r pretty_p_value(chi_ex1[["p.value"]])`). To get a better idea of the differences, we can use the `expected` output along with `survey_mean()` to create a comparison table: +The output from the `svygofchisq()` indicates that at least one proportion from ANES does not match the ACS data ($\chi^2 =$ `r prettyNum(chi_ex1$statistic, big.mark=",")`; p-value `r pretty_p_value(chi_ex1[["p.value"]])`.) To get a better idea of the differences, we can use the `expected` output along with `survey_mean()` to create a comparison table: ```{r} #| label: stattest-chi-ex1-table @@ -535,11 +537,11 @@ ex1_table <- anes_des_educ %>% ex1_table ``` -This output includes our expected proportions from the ACS that we provided the `svygofchisq()` function along with the output of the observed proportions and their confidence intervals. This table shows that the "High school" and "Post HS" categories have nearly identical proportions but that the other two categories are slightly different. Looking at the confidence intervals, we can see that the ANES data skews to include fewer people in the "Less than HS" category and more people in the "Bachelor or Higher" category. This may be easier to see if we plot this. The code below uses the tabular output to create Figure \@ref(fig:stattest-chi-ex1-graph). +This output includes our expected proportions from the ACS that we provided the `svygofchisq()` function along with the output of the observed proportions and their confidence intervals. This table shows that the "High school" and "Post HS" categories have nearly identical proportions but that the other two categories are slightly different. Looking at the confidence intervals, we can see that the ANES data skew to include fewer people in the "Less than HS" category and more people in the "Bachelor or Higher" category. This may be easier to see if we plot this. The code below uses the tabular output to create Figure \@ref(fig:stattest-chi-ex1-graph). ```{r} #| label: stattest-chi-ex1-graph -#| fig.cap: Expected and observed proportions of education, showing the confidence intervals for the expected proportions and whether the observed proportions lie within them. +#| fig.cap: Expected and observed proportions of education with confidence intervals #| fig.alt: Expected and observed proportions of education, showing the confidence intervals for the expected proportions and whether the observed proportions lie within them. The x-axis has labels 'Less than HS', 'High school', 'Post HS', and 'Bachelor or Higher'. The only ones where expected proportion is outside of the intervals is 'Less than HS' and 'Bachelor or Higher'. 
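# Compare the ACS (expected) and ANES (observed) proportions for each
# education level; error bars use the Observed_low / Observed_upp columns,
# which the mutate() step below retains only for the observed (ANES) estimates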
ex1_table %>% @@ -551,11 +553,13 @@ ex1_table %>% mutate( Observed_low = if_else(Names == "Observed", Observed_low, NA_real_), Observed_upp = if_else(Names == "Observed", Observed_upp, NA_real_), - Names = if_else(Names == "Observed", "ANES (observed)", "ACS (expected)") + Names = if_else(Names == "Observed", + "ANES (observed)", "ACS (expected)") ) %>% ggplot(aes(x = Education, y = Proportion, color = Names)) + geom_point(alpha = 0.75, size = 2) + - geom_errorbar(aes(ymin = Observed_low, ymax = Observed_upp), width = 0.25) + + geom_errorbar(aes(ymin = Observed_low, ymax = Observed_upp), + width = 0.25) + theme_bw() + scale_color_manual(name = "Type", values = book_colors[c(4, 1)]) + theme(legend.position = "bottom", legend.title = element_blank()) @@ -595,7 +599,7 @@ The output from `svychisq()` indicates that the distribution of people's trust i chi_ex2$observed ``` -However, as researchers, we often want to know about the proportions and not just the respondent counts from the survey. There are a couple of different ways that we can do this. The first is using the counts from `chi_ex2$observed` to calculate the proportion. We can then pivot the table to create a cross-tabulation similar to the counts table above. Adding `group_by()` to the code means that we are obtaining the proportions within each level of that variable. In this case, we are looking at the distribution of `TrustGovernment` for each level of `TrustPeople`. The resulting table is shown in Table \@ref(tab:stattest-chi-ex2-prop1-tab). +However, we often want to know about the proportions, not just the respondent counts from the survey. There are a couple of different ways that we can do this. The first is using the counts from `chi_ex2$observed` to calculate the proportion. We can then pivot the table to create a cross-tabulation similar to the counts table above. Adding `group_by()` to the code means that we obtain the proportions within each variable level. In this case, we are looking at the distribution of `TrustGovernment` for each level of `TrustPeople`. The resulting table is shown in Table \@ref(tab:stattest-chi-ex2-prop1-tab). ```{r} #| label: stattest-chi-ex2-prop1 @@ -632,9 +636,9 @@ chi_ex2_table %>% print_gt_book(knitr::opts_current$get()[["label"]]) ``` -In Table \@ref(tab:stattest-chi-ex2-prop1-tab), each column sums to 1. For example, we can say that it is estimated that of people who always trust in people, `r round(chi_ex2$observed[1,1]/sum(chi_ex2$observed[,1])*100, 1)`% also always trust in government based on the top-left cell but `r round(chi_ex2$observed[5,1]/sum(chi_ex2$observed[,1])*100, 1)`% never trust in government. +In Table \@ref(tab:stattest-chi-ex2-prop1-tab), each column sums to 1. For example, we can say that it is estimated that of people who always trust in people, `r round(chi_ex2$observed[1,1]/sum(chi_ex2$observed[,1])*100, 1)`% also always trust in the government based on the top-left cell, but `r round(chi_ex2$observed[5,1]/sum(chi_ex2$observed[,1])*100, 1)`% never trust in the government. -The second option is to use `group_by()` and `survey_mean()` functions to calculate the proportions from the ANES design object. A reminder that with more than one variable listed in the `group_by()` statement, the proportions are within the first variable listed. As mentioned above, we are looking at the distribution of `TrustGovernment` for each level of `TrustPeople`. 
+The second option is to use the `group_by()` and `survey_mean()` functions to calculate the proportions from the ANES design object. Remember that with more than one variable listed in the `group_by()` statement, the proportions are within the first variable listed. As mentioned above, we are looking at the distribution of `TrustGovernment` for each level of `TrustPeople`. ```{r} #| label: stattest-chi-ex2-prop2 chi_ex2_obs <- anes_des %>% @@ -672,9 +676,9 @@ chi_ex2_obs_table %>% print_gt_book(knitr::opts_current$get()[["label"]]) ``` -Both methods produce the same output as the `svychisq()` function does account for the survey design. However, calculating the proportions directly from the design object means we can also obtain the variance information. In this case, the table output displays the survey estimate followed by the confidence intervals. Based on the output, we can see that of those who never trust people, `r round(chi_ex2$observed[5,5]/sum(chi_ex2$observed[,5])*100, 1)`% also never trust the government, while the proportions of never trusting the government are much lower for each of the other levels of trusting people. +Both methods produce the same output as the `svychisq()` function. However, calculating the proportions directly from the design object allows us to obtain the variance information. In this case, the table output displays the survey estimate followed by the confidence intervals. Based on the output, we can see that of those who never trust people, `r round(chi_ex2$observed[5,5]/sum(chi_ex2$observed[,5])*100, 1)`% also never trust the government, while the proportions of never trusting the government are much lower for each of the other levels of trusting people. -We may find it easier to look at these proportions graphically. We can use `ggplot()` and facets to provide an overview as shown below to create Figure \@ref(fig:stattest-chi-ex2-graph): +We may find it easier to look at these proportions graphically. We can use `ggplot()` and facets to provide an overview to create Figure \@ref(fig:stattest-chi-ex2-graph) below: ```{r} #| label: stattest-chi-ex2-graph @@ -685,14 +689,16 @@ chi_ex2_obs %>% mutate(TrustPeople= fct_reorder(str_c("Trust in People:\n", TrustPeople), order(TrustPeople))) %>% - ggplot(aes(x = TrustGovernment, y = Observed, color = TrustGovernment)) + + ggplot( + aes(x = TrustGovernment, y = Observed, color = TrustGovernment)) + facet_wrap( ~ TrustPeople, ncol = 5) + geom_point() + geom_errorbar(aes(ymin = Observed_low, ymax = Observed_upp)) + ylab("Proportion") + xlab("") + theme_bw() + - scale_color_manual(name="Trust in Government", values=book_colors) + + scale_color_manual(name="Trust in Government", + values=book_colors) + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(), legend.position = "bottom") + @@ -770,19 +776,19 @@ chi_ex3_obs_table %>% print_gt_book(knitr::opts_current$get()[["label"]]) ``` -We can see that the age group distribution that voted for Biden and other candidates was younger than those that voted for Trump. For example, of those who voted for Biden, 20.4% were in the 18-29 age group, compared to only 11.4% of those who voted for Trump were in that age group. On the other side, 23.4% of those who voted for Trump were in the 50-59 age group compared to only 15.4% of those who voted for Biden. +We can see that the age group distribution that voted for Biden and other candidates was younger than those that voted for Trump. 
For example, of those who voted for Biden, 20.4% were in the 18-29 age group, compared to only 11.4% of those who voted for Trump were in that age group. Conversely, 23.4% of those who voted for Trump were in the 50-59 age group compared to only 15.4% of those who voted for Biden. ## Exercises {#stattest-exercises} -The exercises use the design objects `anes_des` and `recs_des` as provided in the Prerequisites box in the [beginning of the chapter](#c06-statistical-testing). Here are some exercises for practicing conducting t-tests using `svyttest()`: +The exercises use the design objects `anes_des` and `recs_des` as provided in the Prerequisites box at the [beginning of the chapter](#c06-statistical-testing). Here are some exercises for practicing conducting t-tests using `svyttest()`: -1. Using the RECS data, do more than 50% of U.S. households use AC (`ACUsed`)? +1. Using the RECS data, do more than 50% of U.S. households use A/C (`ACUsed`)? -2. Using the RECS data, does the average temperature that U.S. households set their thermostats to differ between the day and night in the winter (`WinterTempDay` and `WinterTempNight`)? +2. Using the RECS data, does the average temperature at which U.S. households set their thermostats differ between the day and night in the winter (`WinterTempDay` and `WinterTempNight`)? 3. Using the ANES data, does the average age (`Age`) of those who voted for Joseph Biden in 2020 (`VotedPres2020_selection`) differ from those who voted for another candidate? -4. If you wanted to determine if the political party affiliation differed for males and females, what test would you use? +4. If we wanted to determine if the political party affiliation differed for males and females, what test would we use? a. Goodness of fit test (`svygofchisq()`) b. Test of independence (`svychisq()`) diff --git a/07-modeling.Rmd b/07-modeling.Rmd index f4cfd379..6df5c378 100644 --- a/07-modeling.Rmd +++ b/07-modeling.Rmd @@ -26,7 +26,7 @@ library(gt) library(prettyunits) ``` -We will be using data from ANES and RECS described in Chapter \@ref(c04-getting-started). As a reminder, here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter \@ref(c04-getting-started) for more information). +We are using data from ANES and RECS described in Chapter \@ref(c04-getting-started). As a reminder, here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter \@ref(c04-getting-started) for more information.) ```{r} #| label: model-anes-des @@ -50,13 +50,12 @@ For RECS, details are included in the RECS documentation and Chapters \@ref(c04- ```{r} #| label: model-recs-des #| eval: FALSE - recs_des <- recs_2020 %>% as_survey_rep( weights = NWEIGHT, repweights = NWEIGHT1:NWEIGHT60, type = "JK1", - scale = 59/60, + scale = 59 / 60, mse = TRUE ) ``` @@ -64,17 +63,17 @@ recs_des <- recs_2020 %>% ## Introduction {#model-intro} -Modeling data is a way for researchers to investigate the relationship between a single dependent variable and one or more independent variables. This builds upon the analyses conducted in Chapter \@ref(c06-statistical-testing), which looked at the relationships between just two variables. 
For example, in Example 3 in Section \@ref(stattest-ttest-examples), we investigated if there is a relationship between the electrical bill cost and whether or not the household used air-conditioning. However, there are potentially other elements that could go into what the cost of electrical bills are in a household (e.g., outside temperature, desired internal temperature, types and number of appliances, etc.). +Modeling data is a way for researchers to investigate the relationship between a single dependent variable and one or more independent variables. This builds upon the analyses conducted in Chapter \@ref(c06-statistical-testing), which looked at the relationships between just two variables. For example, in Example 3 in Section \@ref(stattest-ttest-examples), we investigated if there is a relationship between the electrical bill cost and whether or not the household used air-conditioning. However, there are potentially other elements that could go into what the cost of electrical bills are in a household (e.g., outside temperature, desired internal temperature, types and number of appliances, etc.) -T-tests only allow us to investigate the relationship of one independent variable at a time, but using models we can look into multiple variables and even explore interactions between these variables. There are several types of models, but in this chapter we will cover Analysis of Variance (ANOVA) and linear regression models following common normal (Gaussian) and logit models. Jonas Kristoffer Lindeløv has an interesting [discussion](https://lindeloev.github.io/tests-as-linear/) of many statistical tests and models being equivalent to a linear model. For example, a one-way ANOVA is a linear model with one categorical independent variable, and a two-sample t-test is an ANOVA where the independent variable has exactly two levels. +T-tests only allow us to investigate the relationship of one independent variable at a time, but using models, we can look into multiple variables and even explore interactions between these variables. There are several types of models, but in this chapter, we cover Analysis of Variance (ANOVA) and linear regression models following common normal (Gaussian) and logit models. Jonas Kristoffer Lindeløv has an interesting [discussion](https://lindeloev.github.io/tests-as-linear/) of many statistical tests and models being equivalent to a linear model. For example, a one-way ANOVA is a linear model with one categorical independent variable, and a two-sample t-test is an ANOVA where the independent variable has exactly two levels. -When modeling data, it is helpful to first create an equation that provides an overview as to what it is that we are modeling. The main structure of these models is as follows: +When modeling data, it is helpful to first create an equation that provides an overview of what we are modeling. The main structure of these models is as follows: $$y_i=\beta_0 +\sum_{i=1}^p \beta_i x_i + \epsilon_i$$ -where $y_i$ is the outcome, $\beta_0$ is an intercept, $x_1, \cdots, x_p$ are the predictors with $\beta_1, \cdots, \beta_p$ as the associated coefficients, and $\epsilon_i$ is the error. Not all models will have all components. For example, some models may not include an intercept ($\beta_0$), may have interactions between different independent variables ($x_i$), or may have different underlying structures for the dependent variable ($y_i$). However, all linear models have the independent variables related to the dependent variable in a linear form. 
+where $y_i$ is the outcome, $\beta_0$ is an intercept, $x_1, \cdots, x_p$ are the predictors with $\beta_1, \cdots, \beta_p$ as the associated coefficients, and $\epsilon_i$ is the error. Not all models have all components. For example, some models may not include an intercept ($\beta_0$), may have interactions between different independent variables ($x_i$), or may have different underlying structures for the dependent variable ($y_i$.) However, all linear models have the independent variables related to the dependent variable in a linear form. -To specify these models in R, the formulas are the same with both survey data and other data. The left side of the formula is the response/dependent variable, and the right side of the formula has the predictor/independent variable(s). There are many symbols used in R to specify the formula. +To specify these models in R, the formulas are the same with both survey data and other data. The left side of the formula is the response/dependent variable, and the right side has the predictor/independent variable(s). There are many symbols used in R to specify the formula. For example, a linear formula mathematically notated as @@ -92,7 +91,7 @@ Table: (\#tab:notation-common) Common symbols in formula notation | : | `x:z` | include the interaction between these variables | | \* | `x*z` | include these variables and the interactions between them | | `^n` | `(x+y+z)^3` | include these variables and all interactions up to n-way | -| I | `I(x-z)` | as-is: include a new variable which is calculated inside the parentheses (e.g., x-z, x*z, x/z are possible claculations that could be done) | +| I | `I(x-z)` | as-is: include a new variable that is calculated inside the parentheses (e.g., x-z, x*z, x/z are possible calculations that could be done) | There are often multiple ways to specify the same formula. For example, consider the following equation using the `mtcars` dataset that is built into R: @@ -114,7 +113,7 @@ Table: (\#tab:notation-diffs) Differences in formulas for `:` and `*` code synta | \* | `mpg ~ cyl*disp*hp` |$$ \begin{aligned} mpg_i= &\beta_0+\beta_1cyl_{i}+\beta_2disp_{i}+\beta_3hp_{i}+\\& \beta_4cyl_{i}disp_{i}+\beta_5cyl_{i}hp_{i}+\beta_6disp_{i}hp_{i}+\\&\beta_7cyl_{i}disp_{i}hp_{i}+\epsilon_i\end{aligned}$$ | -When using non-survey data such as experimental or observational data, researchers will use the `glm()` function for linear models. With survey data, however, we use `svyglm()` from the {survey} package to ensure that we account for the survey design and weights in modeling^[There is some debate about whether weights should be used in regression [@bollen2016weightsreg; @gelman2007weights]. However, for the purposes of providing complete information on how to analyze complex survey data, this chapter will include weights.]. This allows us to generalize a model to the target population and accounts for the fact that the observations in the survey data may not be independent. As discussed in Chapter \@ref(c06-statistical-testing), modeling survey data cannot be directly done in {srvyr}, but can be done in the {survey} package [@lumley2010complex]. In this chapter, we will provide syntax and examples for linear models, including ANOVA, normal linear regression, and logistic regression. For details on other types of regression, including ordinal regression, log-linear models, and survival analysis, refer to @lumley2010complex. 
@lumley2010complex also discusses custom models such as a negative binomial or Poisson model in Appendix E of his book. +When using non-survey data, such as experimental or observational data, researchers use the `glm()` function for linear models. With survey data, however, we use `svyglm()` from the {survey} package to ensure that we account for the survey design and weights in modeling^[There is some debate about whether weights should be used in regression [@bollen2016weightsreg; @gelman2007weights]. However, for the purposes of providing complete information on how to analyze complex survey data, this chapter includes weights.]. This allows us to generalize a model to the population of interest and accounts for the fact that the observations in the survey data may not be independent. As discussed in Chapter \@ref(c06-statistical-testing), modeling survey data cannot be directly done in {srvyr} but can be done in the {survey} package [@lumley2010complex]. In this chapter, we provide syntax and examples for linear models, including ANOVA, normal linear regression, and logistic regression. For details on other types of regression, including ordinal regression, log-linear models, and survival analysis, refer to @lumley2010complex. @lumley2010complex also discusses custom models such as a negative binomial or Poisson model in Appendix E of his book. ## Analysis of variance (ANOVA) @@ -123,7 +122,7 @@ In ANOVA, we are testing whether the mean of an outcome is the same across two o - $H_0: \mu_1 = \mu_2= \dots = \mu_k$ where $\mu_i$ is the mean outcome for group $i$ - $H_A: \text{At least one mean is different}$ -Using the framework, an ANOVA test is also a linear model, we can re-frame the problem as: +An ANOVA test is also a linear model, we can re-frame the problem using the framework as: $$ y_i=\sum_{i=1}^k \mu_i x_i + \epsilon_i$$ @@ -156,11 +155,11 @@ The arguments are: * `na.action`: handling of missing data * `df.resid`: degrees of freedom for Wald tests (optional) - defaults to using `degf(design)-(g-1)` where $g$ is the number of groups -The function `svyglm()` does not have the design as the first argument so the dot (`.`) notation is used to pass it with a pipe (see Chapter \@ref(c06-statistical-testing) for more details). The default for missing data is `na.omit`, this means that we are removing all records with any missing data in either predictors or outcomes from analyses. There are other options for handling missing data and we recommend looking at the help documentation for `na.omit` (run `help(na.omit)` or `?na.omit`) for more information on options to use for `na.action`. For a discussion of how to handle missing data see Chapter \@ref(c11-missing-data). +The function `svyglm()` does not have the design as the first argument so the dot (`.`) notation is used to pass it with a pipe (see Chapter \@ref(c06-statistical-testing) for more details.) The default for missing data is `na.omit`. This means that we are removing all records with any missing data in either predictors or outcomes from analyses. There are other options for handling missing data, and we recommend looking at the help documentation for `na.omit` (run `help(na.omit)` or `?na.omit`) for more information on options to use for `na.action`. For a discussion on how to handle missing data, see Chapter \@ref(c11-missing-data). ### Example -Looking at an example will help us discuss the output and how to interpret the results. 
In RECS, respondents are asked what temperature they set their thermostat to during the day and evening when using the air-conditioning during the summer. To analyze this data, we filter the respondents to only those using AC (`ACUsed`). Then if we want to see if there are differences by region, we can use `group_by()`. A descriptive analysis of the temperature at night (`SummerTempNight`) set by region and the sample sizes is displayed below. +Looking at an example helps us discuss the output and how to interpret the results. In RECS, respondents are asked what temperature they set their thermostat to during the day and evening when using the air-conditioning (A/C) during the summer. To analyze these data, we filter the respondents to only those using A/C (`ACUsed`.) Then, if we want to see if there are regional differences, we can use `group_by()`. A descriptive analysis of the temperature at night (`SummerTempNight`) set by region and the sample sizes is displayed below. ```{r} #| label: model-anova-prep @@ -174,7 +173,7 @@ recs_des %>% ) ``` -In the following code, we test whether this temperature varies by region by first using `svyglm()` to run the test and then using `broom::tidy()` to display the output. Note that the temperature setting is set to NA when the household does not use air-conditioning, and since the default handling of NAs is `na.action=na.omit`, records that do not use air-conditioning will not be included in this regression. +In the following code, we test whether this temperature varies by region by first using `svyglm()` to run the test and then using `broom::tidy()` to display the output. Note that the temperature setting is set to NA when the household does not use A/C, and since the default handling of NAs is `na.action=na.omit`, records that do not use A/C are not included in this regression. ```{r} #| label: model-anova-ex @@ -185,9 +184,9 @@ anova_out <- recs_des %>% tidy(anova_out) ``` -In the output above, we can see the estimated coefficients (`estimate`), estimated standard errors of the coefficients (`std.error`), the t-statistic (`statistic`), and the p-value for each coefficient. In this output, the intercept represents the reference value of the Northeast region. The other coefficients indicate the difference in temperature relative to the Northeast region. For example, in the Midwest, temperatures are set, on average, `r tidy(anova_out) %>% filter(term=="RegionMidwest") %>% pull(estimate) %>% signif(3)` (p-value`r tidy(anova_out) %>% filter(term=="RegionMidwest") %>% pull(p.value) %>% pretty_p_value()`) degrees higher than in the Northeast during summer nights and each region sets their thermostats at significantly higher temperatures than the Northeast. +In the output above, we can see the estimated coefficients (`estimate`), estimated standard errors of the coefficients (`std.error`), the t-statistic (`statistic`), and the p-value for each coefficient. In this output, the intercept represents the reference value of the Northeast region. The other coefficients indicate the difference in temperature relative to the Northeast region. For example, in the Midwest, temperatures are set, on average, `r tidy(anova_out) %>% filter(term=="RegionMidwest") %>% pull(estimate) %>% signif(3)` (p-value`r tidy(anova_out) %>% filter(term=="RegionMidwest") %>% pull(p.value) %>% pretty_p_value()`) degrees higher than in the Northeast during summer nights, and each region sets their thermostats at significantly higher temperatures than the Northeast. 
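The coefficient-level output compares each region only with the reference level. To test the overall ANOVA null hypothesis stated above, that the mean temperature is the same in every region, we can run a single Wald test across all of the `Region` terms. Below is a minimal sketch of that test using `regTermTest()` from the {survey} package (the same function used later in this chapter to test interaction terms), assuming the `anova_out` object fit above is available:

```{r}
# Joint Wald test of the overall Region effect
# (H0: mean summer nighttime temperature is equal across all regions),
# assuming the anova_out object fit with svyglm() above
regTermTest(anova_out, ~Region)
```
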
-If we wanted to change the reference value we would reorder the factor before modeling using the function `relevel()` from {stats} or using one of many factor ordering functions in {forcats} such as `fct_relevel()` or `fct_infreq()`. For example, if we wanted the reference level to be the Midwest region, we could use the following code. Note the usage of the `gt()` function on top of `tidy()` to print a nice looking output table [@R-gt; @R-broom] - we will go over more usage of the {gt} package in Chapter \@ref(c08-communicating-results). +If we wanted to change the reference value, we would reorder the factor before modeling using the function `relevel()` from {stats} or using one of many factor ordering functions in {forcats} such as `fct_relevel()` or `fct_infreq()`. For example, if we wanted the reference level to be the Midwest region, we could use the following code. Note the usage of the `gt()` function on top of `tidy()` to print a nice-looking output table [@R-gt; @R-broom] (see Chapter \@ref(c08-communicating-results) for more information on the {gt} package.) ```{r} #| label: model-anova-ex-relevel @@ -221,11 +220,11 @@ tidy(anova_out_relevel) %>% ``` -This output now has the coefficients indicating the difference in temperature relative to the Midwest region. For example, in the Northeast, temperatures are set, on average, `r tidy(anova_out_relevel) %>% filter(term=="RegionNortheast") %>% pull(estimate) %>% signif(3)` (p-value`r tidy(anova_out_relevel) %>% filter(term=="RegionNortheast") %>% pull(p.value) %>% pretty_p_value()`) degrees lower than in the Midwest during summer nights and each region sets their thermostats at significantly lower temperatures than the Midwest. This is the reverse from what we saw in the prior model as we are still comparing the same two regions, just from different reference points. +This output now has the coefficients indicating the difference in temperature relative to the Midwest region. For example, in the Northeast, temperatures are set, on average, `r tidy(anova_out_relevel) %>% filter(term=="RegionNortheast") %>% pull(estimate) %>% signif(3)` (p-value`r tidy(anova_out_relevel) %>% filter(term=="RegionNortheast") %>% pull(p.value) %>% pretty_p_value()`) degrees lower than in the Midwest during summer nights, and each region sets their thermostats at significantly lower temperatures than the Midwest. This is the reverse of what we saw in the prior model, as we are still comparing the same two regions, just from different reference points. ## Normal linear regression -Normal linear regression is a more generalized method than ANOVA where we fit a model of a continuous outcome with any number of categorical or continuous predictors whereas ANOVA only has categorical predictors and is similarly specified as: +Normal linear regression is a more generalized method than ANOVA, where we fit a model of a continuous outcome with any number of categorical or continuous predictors (whereas ANOVA only has categorical predictors) and is similarly specified as: ```{=tex} \begin{equation} @@ -233,17 +232,17 @@ y_i=\beta_0 +\sum_{i=1}^p \beta_i x_i + \epsilon_i \end{equation} ``` -where $y_i$ is the outcome, $\beta_0$ is an intercept, $x_1, \cdots, x_n$ are the predictors with $\beta_1, \cdots, \beta_p$ as the associated coefficients, and $\epsilon_i$ is the error. +where $y_i$ is the outcome, $\beta_0$ is an intercept, $x_1, \cdots, x_p$ are the predictors with $\beta_1, \cdots, \beta_p$ as the associated coefficients, and $\epsilon_i$ is the error. 
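As a purely illustrative instance of this general form, the single-predictor model fit in Example 1 below, which relates household electricity expenditure to square footage in RECS, can be written as:

$$\text{DOLLAREL}_i = \beta_0 + \beta_1 \text{TOTSQFT\_EN}_i + \epsilon_i$$
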
Assumptions in normal linear regression using survey data include: - The residuals ($\epsilon_i$) are normally distributed, but there is not an assumption of independence, and the correlation structure is captured in the survey design object - There is a linear relationship between the outcome variable and the independent variables - - The residuals are homoscedastic, that is, the error term is the same across all values of independent variables + - The residuals are homoscedastic; that is, the error term is the same across all values of independent variables ### Syntax -The syntax for this regression uses the same function as ANOVA, but can have more than one variable listed on the right-hand side of the formula: +The syntax for this regression uses the same function as ANOVA but can have more than one variable listed on the right-hand side of the formula: ``` r des_obj %>% @@ -266,13 +265,14 @@ As discussed in Section \@ref(model-intro), the formula on the right-hand side c ### Examples -#### Example 1: Linear regression with single variable {.unnumbered} -On RECS, we can obtain information on the square footage of homes and the electric bills. We assume that square footage is related to the amount of money spent on electricity and examine a model for this. Before any modeling, we first plot the data to determine whether it is reasonable to assume a linear relationship. In Figure \@ref(fig:model-plot-sf-elbill), each hexagon represents the weighted count of households in the bin, and we can see a general positive linear trend (as the square footage increases so does the amount of money spent on electricity). +#### Example 1: Linear regression with a single variable {.unnumbered} + +On RECS, we can obtain information on the square footage of homes and the electric bills. We assume that square footage is related to the amount of money spent on electricity and examine a model for this. Before any modeling, we first plot the data to determine whether it is reasonable to assume a linear relationship. In Figure \@ref(fig:model-plot-sf-elbill), each hexagon represents the weighted count of households in the bin, and we can see a general positive linear trend (as the square footage increases, so does the amount of money spent on electricity.) ```{r} #| label: model-plot-sf-elbill #| fig.cap: Relationship between square footage and dollars spent on electricity, RECS 2020 -#| fig.alt: Hex chart where each hexagon represents a number of housing units at a point. x-axis is 'Total square footage' ranging from 0 to 7,500 and y-axis is 'Amount spent on electricity' ranging from $0 to 8,000. The trend is relatively linear and positve. A high concentration of points have square footage between 0 and 2,500 square feet as well as between electricity expenditure between $0 and 2,000 +#| fig.alt: Hex chart where each hexagon represents a number of housing units at a point. x-axis is 'Total square footage' ranging from 0 to 7,500 and y-axis is 'Amount spent on electricity' ranging from $0 to 8,000. The trend is relatively linear and positive. A high concentration of points have square footage between 0 and 2,500 square feet as well as between electricity expenditure between $0 and 2,000 #| echo: TRUE #| warning: FALSE recs_2020 %>% @@ -294,7 +294,7 @@ recs_2020 %>% theme_minimal() ``` -Given that the plot shows a potential increasing relationship between square footage and electricity expenditure, fitting a model will allow us to determine if the relationship is statistically significant. 
The model is fit below with electricity expenditure as the outcome. +Given that the plot shows a potentially increasing relationship between square footage and electricity expenditure, fitting a model allows us to determine if the relationship is statistically significant. The model is fit below with electricity expenditure as the outcome. ```{r} #| label: model-slr-examp @@ -329,19 +329,21 @@ tidy(m_electric_sqft) %>% -In the output above, we can see the estimated coefficients (`estimate`), estimated standard errors of the coefficients (`std.error`), the t-statistic (`statistic`), and the p-value for each coefficient. In these results, we can say that, on average, for every additional square foot of house size, the electricity bill increases by `r (tidy(m_electric_sqft) %>% filter(term=="TOTSQFT_EN") %>% pull(estimate) %>% signif(3))*100` cents and that square footage is significantly associated with electricity expenditure (p-value`r tidy(m_electric_sqft) %>% filter(term=="TOTSQFT_EN") %>% pull(p.value) %>% pretty_p_value()`). +In the output above, we can see the estimated coefficients (`estimate`), estimated standard errors of the coefficients (`std.error`), the t-statistic (`statistic`), and the p-value for each coefficient. In these results, we can say that, on average, for every additional square foot of house size, the electricity bill increases by `r (tidy(m_electric_sqft) %>% filter(term=="TOTSQFT_EN") %>% pull(estimate) %>% signif(2))*100` cents, and that square footage is significantly associated with electricity expenditure (p-value`r tidy(m_electric_sqft) %>% filter(term=="TOTSQFT_EN") %>% pull(p.value) %>% pretty_p_value()`.) -This is a very simple model, and there are likely many more factors related to electricity expenditure, including the type of cooling, number of appliances, location, and more. However, starting with one variable models can help researchers understand what potential relationships there are between variables before fitting more complex models. Often researchers start with known relationships before building models to determine what impact additional variables have on the model. +This is a straightforward model, and there are likely many more factors related to electricity expenditure, including the type of cooling, number of appliances, location, and more. However, starting with one-variable models can help researchers understand what potential relationships there are between variables before fitting more complex models. Often, researchers start with known relationships before building models to determine what impact additional variables have on the model. #### Example 2: Linear regression with multiple variables and interactions {.unnumbered} -In the following example, a model is fit to predict electricity expenditure, including Census region (factor/categorical), urbanicity (factor/categorical), square footage (double/numeric), and whether air-conditioning is used (logical/categorical) with all two-way interactions also included. In this example, we are choosing to fit this model without an intercept (using `-1` in the formula). This will result in an intercept estimate for each region instead of a single intercept for all data. + +In the following example, a model is fit to predict electricity expenditure, including Census region (factor/categorical), urbanicity (factor/categorical), square footage (double/numeric), and whether air-conditioning (A/C) is used (logical/categorical) with all two-way interactions also included. 
In this example, we are choosing to fit this model without an intercept (using `-1` in the formula.) This results in an intercept estimate for each region instead of a single intercept for all data. ```{r} #| label: model-lmr-examp m_electric_multi <- recs_des %>% svyglm( design = ., - formula = DOLLAREL ~ (Region + Urbanicity + TOTSQFT_EN + ACUsed)^2 - 1, + formula = + DOLLAREL ~ (Region + Urbanicity + TOTSQFT_EN + ACUsed)^2 - 1, na.action = na.omit ) ``` @@ -379,9 +381,9 @@ urb_reg_test <- regTermTest(m_electric_multi, ~Urbanicity:Region) urb_reg_test ``` -This output indicates there is a significant interaction between urbanicity and region (p-value=`r pretty_p_value(urb_reg_test[["p"]])`). +This output indicates there is a significant interaction between urbanicity and region (p-value=`r pretty_p_value(urb_reg_test[["p"]])`.) -To examine the predictions, residuals, and more from the model, the function `augment()` from {broom} can be used. The `augment()` function will return a tibble with the independent and dependent variables and other fit statistics. The `augment()` function has not been specifically written for objects of class `svyglm`, and as such, a warning will be displayed indicating this at this time. As it was not written exactly for this class of objects, a little tweaking needs to be done after using `augment()`. To obtain the standard error of the predicted values (`.se.fit`) we need to use the `attr()` function on the predicted values (`.fitted`) created by `augment()`. Additionally, the predicted values created are outputted as a `svrep` type of data. If we want to plot the predicted values, we need to use `as.numeric()` to get the predicted values into a numeric format to work with. However, it is important to note that this adjustment must be completed **after** the standard error adjustment. +To examine the predictions, residuals, and more from the model, the function `augment()` from {broom} can be used. The `augment()` function returns a tibble with the independent and dependent variables and other fit statistics. The `augment()` function has not been specifically written for objects of class `svyglm`, and as such, a warning is displayed indicating this at this time. As it was not written exactly for this class of objects, a little tweaking needs to be done after using `augment()`. To obtain the standard error of the predicted values (`.se.fit`), we need to use the `attr()` function on the predicted values (`.fitted`) created by `augment()`. Additionally, the predicted values created are outputted with a type of `svrep`. If we want to plot the predicted values, we need to use `as.numeric()` to get the predicted values into a numeric format to work with. However, it is important to note that this adjustment must be completed **after** the standard error adjustment. ```{r} #| label: model-aug-examp-se @@ -399,10 +401,10 @@ These results can then be used in a variety of ways, including examining residua ```{r} #| label: model-aug-examp-plot #| fig.cap: Residual plot of electric cost model with covariates Region, Urbanicity, TOTSQFT\_EN, and ACUsed -#| fig.alt: Residual scatter plot with a x-axis of 'Fitted value of electricity cost' ranging between approximately $0 and $4,000 and a y-axis with the 'Residual of model' ranging from approximatley -$3,000 to $5,000. The points create a slight megaphone shape with largest residuals in the middle of the x-range. A red line is drawn horizontally at y=0. 
+#| fig.alt: Residual scatter plot with a x-axis of 'Fitted value of electricity cost' ranging between approximately $0 and $4,000 and a y-axis with the 'Residual of model' ranging from approximately -$3,000 to $5,000. The points create a slight megaphone shape with largest residuals in the middle of the x-range. A red line is drawn horizontally at y=0. fitstats %>% ggplot(aes(x = .fitted, .resid)) + - geom_point() + + geom_point(alpha=.1) + geom_hline(yintercept = 0, color = "red") + theme_minimal() + xlab("Fitted value of electricity cost") + @@ -412,7 +414,7 @@ fitstats %>% ``` -Additionally, `augment()` can be used to predict outcomes for data not used in modeling. Perhaps, we would like to predict the energy expenditure for a home in an urban area in the south that uses air-conditioning and is 2,500 square feet. To do this, we first make a tibble including that additional data and then use the `newdata` argument in the `augment()` function. As before, to obtain the standard error of the predicted values we need to use the `attr()` function. +Additionally, `augment()` can be used to predict outcomes for data not used in modeling. Perhaps we would like to predict the energy expenditure for a home in an urban area in the south that uses air-conditioning and is 2,500 square feet. To do this, we first make a tibble including that additional data and then use the `newdata` argument in the `augment()` function. As before, to obtain the standard error of the predicted values, we need to use the `attr()` function. ```{r} #| label: model-predict-new-dat @@ -444,9 +446,9 @@ In the above example, it is predicted that the energy expenditure would be \$`r ## Logistic regression -Logistic regression is used to model binary outcomes such as whether or not someone voted. There are several instances where an outcome may not be originally binary but is collapsed into being binary. For example, given that gender is often asked in surveys with multiple response options and not a binary scale, many researchers now code gender in logistic modeling as cis-male compared to not cis-male. We could also convert a 4-point likert scale that has levels of "Strongly Agree", "Agree", "Disagree", and "Strongly Disagree" to group the agreement levels into one group and disagreement levels into a second group. +Logistic regression is used to model binary outcomes, such as whether or not someone voted. There are several instances where an outcome may not be originally binary but is collapsed into being binary. For example, given that gender is often asked in surveys with multiple response options and not a binary scale, many researchers now code gender in logistic modeling as cis-male compared to not cis-male. We could also convert a 4-point Likert scale that has levels of "Strongly Agree", "Agree", "Disagree", and "Strongly Disagree" to group the agreement levels into one group and disagreement levels into a second group. -Logistic regression is a specific case of the generalized linear model (GLM). A GLM uses a link function to link the response variable to the linear model. If we tried to use a normal linear regression with a binary outcome, many assumptions are not held - namely the response is not continuous. Logistic regression allows us to link a linear model between the covariates and a propensity of an outcome. In logistic regression, the link model is the logit function. Specifically, the model is specified as follows: +Logistic regression is a specific case of the generalized linear model (GLM.) 
A GLM uses a link function to link the response variable to the linear model. If we tried to use a normal linear regression with a binary outcome, many assumptions would not hold, namely, the response would not be continuous. Logistic regression allows us to link a linear model between the covariates and the propensity of an outcome. In logistic regression, the link model is the logit function. Specifically, the model is specified as follows: $$ y_i \sim \text{Bernoulli}(\pi_i)$$ @@ -464,8 +466,8 @@ The Bernoulli distribution is a distribution which has an outcome of 0 or 1 give Assumptions in logistic regression using survey data include: - The outcome variable has two levels - - There is a linear relationship between the independent variables and the log odds ($\log \left(\frac{\pi_i}{1-\pi_i} \right)$) - - The residuals are homoscedastic, that is, the error term is the same across all values of independent variables + - There is a linear relationship between the independent variables and the log odds (the equation for the logit function) + - The residuals are homoscedastic; that is, the error term is the same across all values of independent variables @@ -492,16 +494,17 @@ The arguments are: * `df.resid`: degrees of freedom for Wald tests (optional) - defaults to using `degf(design)-p` where $p$ is the rank of the design matrix * `family`: the error distribution/link function to be used in the model -Note `svyglm()` is the same function used in both ANOVA and normal linear regression. However, we've added the link function quasibinomial. While we can use the binomial link function, it is recommended to use the quasibinomial as our weights may not be integers, and the quasibinomial also allows for overdispersion [@lumley2010complex; @mccullagh1989binary; @R-base]. The quasibinomial family has a default logit link which is what is specified in the equations above. When specifying the outcome variable, it will likely be specified in one of three ways with survey data: +Note `svyglm()` is the same function used in both ANOVA and normal linear regression. However, we've added the link function quasibinomial. While we can use the binomial link function, it is recommended to use the quasibinomial as our weights may not be integers, and the quasibinomial also allows for overdispersion [@lumley2010complex; @mccullagh1989binary; @R-base]. The quasibinomial family has a default logit link, which is specified in the equations above. When specifying the outcome variable, it is likely specified in one of three ways with survey data: - - A two level factor variable where the first level of the factor indicates a "failure" and the second level indicates a "success" + - A two-level factor variable where the first level of the factor indicates a "failure" and the second level indicates a "success" - A numeric variable which is 1 or 0 where 1 indicates a success - A logical variable where TRUE indicates a success ### Examples -#### Example 1: Logistic regression with single variable {.unnumbered} -In the following example, the ANES data is used, and we are modeling whether someone usually has trust in the government^[Question: How often can you trust the federal government in Washington to do what is right?] by who someone voted for president in 2020. As a reminder, the leading candidates were Biden and Trump though people could vote for someone else not in the Democratic or Republican parties. Those votes are all grouped into an "Other" category. 
We first create a binary outcome for trusting in the government by collapsing "Always" and "Most of the time" into a single factor level, and the other response options ("About half the time", "Some of the time", and "Never") into a second factor level. Next, a scatter plot of the raw data is not useful as it is all 0 and 1 outcomes, so instead, we plot a summary of the data. +#### Example 1: Logistic regression with a single variable {.unnumbered} + +In the following example, the ANES data are used, and we are modeling whether someone usually has trust in the government^[Question: How often can you trust the federal government in Washington to do what is right?] by who someone voted for president in 2020. As a reminder, the leading candidates were Biden and Trump, though people could vote for someone else not in the Democratic or Republican parties. Those votes are all grouped into an "Other" category. We first create a binary outcome for trusting in the government by collapsing "Always" and "Most of the time" into a single-factor level, and the other response options ("About half the time", "Some of the time", and "Never") into a second factor level. Next, a scatter plot of the raw data is not useful as it is all 0 and 1 outcomes, so instead, we plot a summary of the data. ```{r} #| label: model-logisticexamp-plot @@ -535,7 +538,7 @@ anes_des_der %>% theme_minimal() ``` -By looking at Figure \@ref(fig:model-logisticexamp-plot) it appears that people who voted for Trump are more likely to say that they usually have trust in the government compared to those who voted for Biden and Other candidates. To determine if this insight is accurate, we next we fit the model. +Looking at Figure \@ref(fig:model-logisticexamp-plot), it appears that people who voted for Trump are more likely to say that they usually have trust in the government compared to those who voted for Biden and Other candidates. To determine if this insight is accurate, we next fit the model. ```{r} #| label: model-logisticexamp-model @@ -568,9 +571,9 @@ tidy(logistic_trust_vote) %>% print_gt_book(knitr::opts_current$get()[["label"]]) ``` -In the output above, we can see the estimated coefficients (`estimate`), estimated standard errors of the coefficients (`std.error`), the t-statistic (`statistic`), and the p-value for each coefficient. This output indicates that respondents who voted for Trump are `r signif(logistic_trust_vote$coefficients[2],3)` times more likely to usually have trust in the government compared to those who voted for Biden (the reference level). +In the output above, we can see the estimated coefficients (`estimate`), estimated standard errors of the coefficients (`std.error`), the t-statistic (`statistic`), and the p-value for each coefficient. This output indicates that respondents who voted for Trump are more likely to usually have trust in the government compared to those who voted for Biden (the reference level.) The coefficient of `r signif(logistic_trust_vote$coefficients[2],3)` represents the increase in the log odds of usually trusting the government. -Sometimes it is easier to talk about the odds instead of the likelihood. To do this, we need to exponentiate the coefficients. We can use the same `tidy()` function, but include the argument `exponentiate = TRUE` to see the odds. +In most cases, it is easier to talk about the odds instead of the log odds. To do this, we need to exponentiate the coefficients. 
We can use the same `tidy()` function but include the argument `exponentiate = TRUE` to see the odds.

```{r}
#| label: model-logisticexamp-model-odds-noeval
@@ -607,9 +610,9 @@ or_other <-
  pull(estimate)
```

-We can interpret this as saying that the odds of usually trusting the government for someone who voted for Trump is `r signif(or_trump*100, 3)`% as likely to trust the government compared to a person who voted for Biden (the reference level). In comparison, a person who voted for neither Biden nor Trump is `r signif(or_other*100, 3)`% as likely to trust the government as someone who voted for Biden.
+We can interpret this as saying that the odds of usually trusting the government for someone who voted for Trump are `r signif(or_trump*100, 3)`% of the odds for a person who voted for Biden (the reference level.) In comparison, the odds for a person who voted for neither Biden nor Trump are `r signif(or_other*100, 3)`% of the odds for someone who voted for Biden.

-As with linear regression, the `augment()` can be used to predict values. By default, the prediction is the link function (logit function in this instance) and not the probability. To predict the probability, add an argument of `type.predict="response"` as demonstrated below:
+As with linear regression, the `augment()` function can be used to predict values. By default, the prediction is on the scale of the link function (the logit, in this instance), not the probability. To predict the probability, add an argument of `type.predict="response"` as demonstrated below:

```{r}
#| label: model-logistic-aug
@@ -625,25 +628,27 @@ logistic_trust_vote %>%
```

-#### Example 2: Interaction effects {.unnumbered}
-Let's look at another example with interaction effects. If we're interested in understanding the demographics of people who voted for Biden among all voters in 2020, we could include `EarlyVote2020` and `Gender` in our model.
+#### Example 2: Interaction effects {.unnumbered}
+Let's look at another example with interaction effects. If we're interested in understanding the demographics of people who voted for Biden among all voters in 2020, we could include an indicator of whether respondents voted early (`EarlyVote2020`) and their income group (`Income7`) in our model.
+
+First, we need to subset the data to 2020 voters and then create an indicator of whether they voted for Biden.

-First we need to subset the data to 2020 voters and then create an indicator for voted for Biden.
```{r}
#| label: model-logisticexamp-biden-ind
-anes_des_ind <- anes_des %>%
+anes_des_ind <- anes_des %>%
  filter(!is.na(VotedPres2020_selection)) %>%
-  mutate(VoteBiden = case_when(VotedPres2020_selection == "Biden"~1,
+  mutate(VoteBiden = case_when(VotedPres2020_selection == "Biden" ~ 1,
                                TRUE ~ 0))
```

-Let's first look at the main effects of gender and early voting behavior.
+Let's first look at the main effects of income grouping and early voting behavior.
+ ```{r} #| label: model-logisticexamp-biden-main log_biden_main <- anes_des_ind %>% mutate(EarlyVote2020 = fct_relevel(EarlyVote2020, "No", after = 0)) %>% svyglm(design = ., - formula = VoteBiden ~ EarlyVote2020 + Gender, + formula = VoteBiden ~ EarlyVote2020 + Income7, family = quasibinomial) ``` @@ -656,7 +661,7 @@ tidy(log_biden_main) %>% fmt_number() ``` -(ref:model-logisticexamp-biden-main-tab) Logistic regression output for predicting voting for Biden given early voting behavior and gender - main effects only, RECS 2020 +(ref:model-logisticexamp-biden-main-tab) Logistic regression output for predicting voting for Biden given early voting behavior and income - main effects only, ANES 2020 ```{r} #| label: model-logisticexamp-biden-main-tab @@ -670,16 +675,15 @@ tidy(log_biden_main) %>% print_gt_book(knitr::opts_current$get()[["label"]]) ``` +This main effect model (see Table \@ref(tab:model-logisticexamp-biden-main-tab)) indicates that people with incomes of \$125,000 or more have a significant negative coefficient `r signif(log_biden_main$coefficients[8],3)` (p-value=`r tidy(log_biden_main) %>% slice(8) %>% pull(p.value) %>% pretty_p_value()`). This indicates that people with incomes of \$125,000 or more were less likely to vote for Biden in the 2020 election compared to people with incomes of \$20,000 or less (reference level). -This main effect model indicates that respondents with who early voted in 2020 are `r signif(log_biden_main$coefficients[2],3)` (p-value=`r tidy(log_biden_main) %>% slice(2) %>% pull(p.value) %>% pretty_p_value()`) times more likely to vote for Biden compared to respondents who did not early vote in the 2020 election (the reference level). We see that gender is also significant with females more likely to vote for Biden compared to males (p-value=`r tidy(log_biden_main) %>% slice(3) %>% pull(p.value) %>% pretty_p_value()`). - -It is possible that there is an interaction between gender and early voting behavior. To determine this we can create a model that includes the interaction effects: +Although early voting behavior was not significant, there may be an interaction between income and early voting behavior. To determine this, we can create a model that includes the interaction effects: ```{r} #| label: model-logisticexamp-biden-int log_biden_int <- anes_des_ind %>% mutate(EarlyVote2020 = fct_relevel(EarlyVote2020, "No", after = 0)) %>% svyglm(design = ., - formula = VoteBiden ~ (EarlyVote2020 + Gender)^2, + formula = VoteBiden ~ (EarlyVote2020 + Income7)^2, family = quasibinomial) ``` @@ -692,7 +696,7 @@ tidy(log_biden_int) %>% fmt_number() ``` -(ref:model-logisticexamp-biden-int-tab) Logistic regression output for predicting voting for Biden given early voting behavior and gender - with interaction, RECS 2020 +(ref:model-logisticexamp-biden-int-tab) Logistic regression output for predicting voting for Biden given early voting behavior and income - with interaction, ANES 2020 ```{r} #| label: model-logisticexamp-biden-int-tab @@ -706,51 +710,56 @@ tidy(log_biden_int) %>% print_gt_book(knitr::opts_current$get()[["label"]]) ``` -The results from the interaction model show that the interaction between early voting behavior and gender is significant. To better understand what this interaction means, we will want to plot the predicted probabilities with an interaction plot. Let's first obtain the predicted probabilities for each possible combination of variables using the `augment()` function. 
+The results from the interaction model (see Table \@ref(tab:model-logisticexamp-biden-int-tab)) show that one interaction between early voting behavior and income is significant. To better understand what this interaction means, we can plot the predicted probabilities with an interaction plot. Let's first obtain the predicted probabilities for each possible combination of variables using the `augment()` function. ```{r} #| label: model-logisticexamp-biden-aug #| warning: false log_biden_pred <- log_biden_int %>% augment(type.predict = "response") %>% - mutate(.se.fit = sqrt(attr(.fitted, "var")), + mutate(.se.fit = sqrt(attr(.fitted, "var")), .fitted = as.numeric(.fitted)) %>% - select(VoteBiden, EarlyVote2020, Gender, .fitted, .se.fit) + select(VoteBiden, EarlyVote2020, Income7, .fitted, .se.fit) ``` -To create an interaction plot, the y-axis will be the predicted probabilities, and one of our x-variables will be on the x-axis and the other will be represented by multiple lines. Figure \@ref(fig:model-logisticexamp-biden-plot) shows the interaction plot with gender on the x-axis and early voting behavior represented by the lines. +The y-axis is the predicted probabilities, one of our x-variables is on the x-axis, and the other is represented by multiple lines. Figure \@ref(fig:model-logisticexamp-biden-plot) shows the interaction plot with early voting behavior on the x-axis and income represented by the lines. ```{r} #| label: model-logisticexamp-biden-plot -#| fig.cap: Interaction Plot of Gender and Early Voting Predicting the Probability of Voting for Biden -#| fig.alt: "Line plot with x-axis as Male and Female (left to right) and y-axis as 'Predicted Probability of Voting for Biden'. There are two lines for early voting indicators with lines being from top to bottom: Did Not Early Vote and Did Early Vote. The line representing did not early vote is roughly parallel with similar predicted probabilities between males and females. For those who did early vote, females have higher predicted probability of voting for Biden than males." - -log_biden_pred %>% - filter(VoteBiden==1) %>% - distinct() %>% - arrange(Gender, EarlyVote2020) %>% - mutate(EarlyVote2020 = fct_reorder2(EarlyVote2020, Gender, .fitted)) %>% - ggplot(aes(x = Gender, y = .fitted, group = EarlyVote2020, - color = EarlyVote2020, linetype = EarlyVote2020)) + +#| fig.cap: Interaction Plot of Early Voting and Income Predicting the Probability of Voting for Biden +#| fig.alt: "Line plot with x-axis as indicator for voted early, with did not early vote on the left and did early vote on the right, and y-axis as 'Predicted Probability of Voting for Biden'. There are seven lines for income groups with lines being from top to bottom: Under $20k, $80k to less than $100k, $40k to less than $60k, $100k to less than $125k, $20k to less than 40k, $125k or more, and $60k to less than $80k. The lines for $40k to less than $60k, $60k to less than $80k, and $125k or more are all relatively flat with the probabilities for did not early vote and did early vote being equivalent. The lines for $20k to less than $40k and $100k to less than $125k have a slight positive slope. The line for less than $20k has a slight negative slope and has overall the highest probability for both levels of early voting. The line for $80k to less than $100k has a large positive slope. This line shows the lowest probability for those who did not early vote, and the second highest probability for those who did early vote." 
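+# The next few lines keep one predicted probability per combination of early
+# voting status and income group, then draw one line per income group across
+# the two early-voting categories.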
+ +log_biden_pred %>% + filter(VoteBiden == 1) %>% + distinct() %>% + arrange(EarlyVote2020, Income7) %>% + ggplot(aes( + x = EarlyVote2020, + y = .fitted, + group = Income7, + color = Income7, + linetype = Income7 + )) + geom_line(linewidth = 1.1) + - scale_color_manual(values = book_colors[c(2,4)]) + + scale_color_manual(values = colorRampPalette(book_colors)(7)) + ylab("Predicted Probability of Voting for Biden") + - labs(color="Voted Early", - linetype="Voted Early") + - coord_cartesian(ylim=c(0,1)) + + labs(x = "Voted Early", + color = "Income", + linetype = "Income") + + coord_cartesian(ylim = c(0, 1)) + guides(fill = "none") + theme_minimal() ``` -From this plot we can see that respondents who indicated a male gender had roughly the same probability of voting for Biden regardless of if they voted early or not. However, females who voted early were more likely to vote for Biden if they voted early than if they did not vote early. +From Figure \@ref(fig:model-logisticexamp-biden-plot), we can see that people who have incomes in most groups (e.g., \$40k to <60k) have roughly the same probability of voting for Biden regardless of whether they voted early or not. However, those with income in the \$100k to < 125k group were more likely to vote for Biden if they voted early than if they did not vote early. -Interactions in models can be difficult to understand from the coefficients alone. Using these interaction plots can help others understand the nuances of the results, and often can become even more helpful with more than two levels in a given factor (e.g., education or race/ethnicity). +Interactions in models can be difficult to understand from the coefficients alone. Using these interaction plots can help others understand the nuances of the results. ## Exercises 1. The type of housing unit may have an impact on energy expenses. Is there any relationship between housing unit type (`HousingUnitType`) and total energy expenditure (`TOTALDOL`)? First, find the average energy expenditure by housing unit type as a descriptive analysis and then do the test. The reference level in the comparison should be the housing unit type that is most common. -2. Does temperature play a role in electricity expenditure? Cooling degree days are a measure of how hot a place is. CDD65 for a given day indicates the number of degrees Fahrenheit warmer than 65°F (18.3°C) it is in a location. On a day that averages 65°F and below, CDD65=0. While a day that averages 85°F (29.4°C) would have CDD65=20 because it is 20 degrees Fahrenheit warmer [@eia-cdd]. For each day in the year, this is summed to give an indicator of how hot the place is throughout the year. Similarly, HDD65 indicates the days colder than 65°F. Can energy expenditure be predicted using these temperature indicators along with square footage? Is there a significant relationship? Include main effects and two-way interactions. +2. Does temperature play a role in electricity expenditure? Cooling degree days are a measure of how hot a place is. CDD65 for a given day indicates the number of degrees Fahrenheit warmer than 65°F (18.3°C) it is in a location. On a day that averages 65°F and below, CDD65=0, while a day that averages 85°F (29.4°C) would have CDD65=20 because it is 20 degrees Fahrenheit warmer [@eia-cdd]. Each day in the year is summed up to indicate how hot the place is throughout the year. Similarly, HDD65 indicates the days colder than 65°F. Can energy expenditure be predicted using these temperature indicators along with square footage? 
Is there a significant relationship? Include main effects and two-way interactions. 3. Continuing with our results from question 2, create a plot between the actual and predicted expenditures and a residual plot for the predicted expenditures. diff --git a/08-communicating-results.Rmd b/08-communicating-results.Rmd index c4885d47..6b2f8196 100644 --- a/08-communicating-results.Rmd +++ b/08-communicating-results.Rmd @@ -27,7 +27,7 @@ library(gt) library(gtsummary) ``` -We will be using data from ANES as described in Chapter \@ref(c04-getting-started). As a reminder, here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter \@ref(c04-getting-started) for more information). +We are using data from ANES as described in Chapter \@ref(c04-getting-started). As a reminder, here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter \@ref(c04-getting-started) for more information.) ```{r} #| label: results-anes-des @@ -53,58 +53,58 @@ After finishing the analysis and modeling, we proceed to the important task of c Before beginning any dissemination of results, consider questions such as: - - How will we present results? Examples include a website, print, or other media. Based on the media type, we might limit or enhance the use of graphical representation. - - What is the audience's familiarity with the study and/or data? Audiences can range from the general public to data experts. If we anticipate limited knowledge about the study, we should provide detailed descriptions (we discuss recommendations later in the chapter). - - What are we trying to communicate? It could be summary statistics, trends, patterns, or other insights. Tables might suit summary statistics, while plots are better at conveying trends and patterns. + - How are we presenting results? Examples include a website, print, or other media. Based on the media type, we might limit or enhance the use of graphical representation. + - What is the audience's familiarity with the study and/or data? Audiences can range from the general public to data experts. If we anticipate limited knowledge about the study, we should provide detailed descriptions (we discuss recommendations later in the chapter.) + - What are we trying to communicate? It could be summary statistics, trends, patterns, or other insights. Tables may suit summary statistics, while plots are better at conveying trends and patterns. - Is the audience accustomed to interpreting plots? If not, include explanatory text to guide them on how to interpret the plots effectively. - What is the audience's statistical knowledge? If the audience does not have a strong statistics background, provide text on standard errors, confidence intervals, and other estimate types to enhance understanding. ## Describing results through text -As analysts, our emphasis is often on the data, and communicating results can sometimes be overlooked. First, we need to identify the appropriate information to share with our audience. Chapters \@ref(c02-overview-surveys) and \@ref(c03-survey-data-documentation) provide insights into factors we need to consider during analysis, and they remain relevant when presenting results to others. 
+As analysts, we often emphasize the data, and communicating results can sometimes be overlooked. To be effective communicators, we need to identify the appropriate information to share with our audience. Chapters \@ref(c02-overview-surveys) and \@ref(c03-survey-data-documentation) provide insights into factors we need to consider during analysis, and they remain relevant when presenting results to others. ### Methodology -If we are using existing data, methodologically-sound surveys will provide documentation about how the survey was fielded, the questionnaires, and other necessary information for analyses. For example, the survey's methodology reports should include the population of interest, sampling procedures, response rates, questionnaire documentation, weighting, and a general overview of disclosure statements. Many American organizations follow the American Association for Public Opinion Research's (AAPOR) [Transparency Initiative](https://aapor.org/standards-and-ethics/transparency-initiative). The AAPOR Transparency Initiative requires organizations to include specific details in their methodology, making it clear how we can and should analyze the results. Being transparent about these methods is vital for the scientific rigor of the field. +If we are using existing data, methodologically-sound surveys provide documentation about how the survey was fielded, the questionnaires, and other necessary information for analyses. For example, the survey's methodology reports should include the population of interest, sampling procedures, response rates, questionnaire documentation, weighting, and a general overview of disclosure statements. Many American organizations follow the American Association for Public Opinion Research's (AAPOR) [Transparency Initiative.](https://aapor.org/standards-and-ethics/transparency-initiative) The AAPOR Transparency Initiative requires organizations to include specific details in their methodology, making it clear how we can and should analyze and interpret the results. Being transparent about these methods is vital for the scientific rigor of the field. The details provided in Chapter \@ref(c02-overview-surveys) about the survey process should be shared with the audience when presenting the results. When using publicly-available data, like the examples in this book, we can often link to the methodology report in our final output. We should also provide high-level information for the audience to quickly grasp the context around the findings. For example, we can mention when and where the study was conducted, the population's age range, or other contextual details. This information helps the audience understand how generalizable the results are. -Providing this material is especially important when there's no methodology report available for the analyzed data. For example, if a researcher conducted a new survey for a specific purpose, we should document and present all the pertinent information during the analysis and reporting process. Adhering to the AAPOR Transparency Initiative guidelines is a reliable method to guarantee that all essential information is communicated to the audience. +Providing this material is especially important when no methodology report is available for the analyzed data. For example, if we conducted a new survey for a specific purpose, we should document and present all the pertinent information during the analysis and reporting process. 
Adhering to the AAPOR Transparency Initiative guidelines is a reliable method to guarantee that all essential information is communicated to the audience. ### Analysis -Along with the survey methodology and weight calculations, we should also share our approach to preparing, cleaning, and analyzing the data. For example, in Chapter \@ref(c06-statistical-testing), we compared education distributions from the ANES survey to the American Community Survey (ACS). To make the comparison, we had to collapse education categories provided in the ANES data to match the ACS. The process for this particular example may seem straightforward (like combining Bachelor's and Graduate Degrees into a single category), but there are multiple ways to deal with the data. Our choice is just one of many. We should document both the original ANES question and response options and the steps we took to match it with ACS data. This transparency helps clarify our analysis to our audience. +Along with the survey methodology and weight calculations, we should also share our approach to preparing, cleaning, and analyzing the data. For example, in Chapter \@ref(c06-statistical-testing), we compared education distributions from the ANES survey to the American Community Survey (ACS.) To make the comparison, we had to collapse the education categories provided in the ANES data to match the ACS. The process for this particular example may seem straightforward (like combining Bachelor's and Graduate Degrees into a single category), but there are multiple ways to deal with the data. Our choice is just one of many. We should document both the original ANES question and response options and the steps we took to match them with ACS data. This transparency helps clarify our analysis to our audience. -Missing data is another instance where we want to be unambigious and upfront with our audience. In this book, numerous examples and exercises remove missing data, as this is often the easiest way to handle them. However, there are circumstances where missing data holds substantive importance, and excluding them could introduce bias (see Chapter \@ref(c11-missing-data)). Being transparent about our handling of missing data is important to maintaining the integrity of our analysis and ensuring a comprehensive understanding of the results. +Missing data is another instance where we want to be unambiguous and upfront with our audience. In this book, numerous examples and exercises remove missing data, as this is often the easiest way to handle them. However, there are circumstances where missing data holds substantive importance, and excluding them could introduce bias (see Chapter \@ref(c11-missing-data).) Being transparent about our handling of missing data is important to maintaining the integrity of our analysis and ensuring a comprehensive understanding of the results. ### Results While tables and graphs are commonly used to communicate results, there are instances where text can be more effective in sharing information. Narrative details, such as context around point estimates or model coefficients, can go a long way in improving our communication. We have several strategies to effectively convey the significance of the data to the audience through text. -First, we can highlight important data points in a sentence using plain language. For example, if we were looking at election polling data conducted before an election, we could say something like: +First, we can highlight important data elements in a sentence using plain language. 
For example, if we were looking at election polling data conducted before an election, we could say: > As of [DATE], an estimated XX% of registered U.S. voters say they will vote for [CANDIDATE NAME] for president in the [YEAR] general election. This sentence provides key pieces of information in a straightforward way: - 1. **[DATE]**: Given that polling data is time-specific, providing the date of reference lets the audience know when this data was valid. - 2. **Registered U.S. voters**: This tells the audience who we surveyed, letting them know the target population. + 1. **[DATE]**: Given that polling data are time-specific, providing the date of reference lets the audience know when these data were valid. + 2. **Registered U.S. voters**: This tells the audience who we surveyed, letting them know the population of interest. 3. **XX%**: This part provides the estimated percentage of people voting for a specific candidate for a specific office. 4. **[YEAR] general election**: As with the bullet above, adding this gives more context about the election type and year. The estimate would take on a different meaning if we changed it to a *primary* election instead of a *general* election. We also included the word "estimated." When presenting aggregate survey results, we have errors around each estimate. We want to convey this uncertainty rather than talk in absolutes. Words like "estimated," "on average," or "around" can help communicate this uncertainty to the audience. Instead of saying 'XX%,' we can also say 'XX% (+/- Y%)' to show the margin of error. Confidence intervals can also be incorporated into the text to assist readers. -Second, providing context and discussing the *meaning* behind a point estimate can help the audience glean some insight into why the data is important. For example, when comparing two values, it can be helpful to highlight if there are statistically significant differences and explain the impact and relevance of this information. This is where we, as analysts, should to do our best to be mindful of biases and present the facts logically. +Second, providing context and discussing the *meaning* behind a point estimate can help the audience glean some insight into why the data are important. For example, when comparing two values, it can be helpful to highlight if there are statistically significant differences and explain the impact and relevance of this information. This is where we should do our best to be mindful of biases and present the facts logically. -Keep in mind how we discuss these findings can greatly influence how the audience interprets them. If we include speculation, using phrases like "the authors speculate" or "these findings may indicate" relays the uncertainty around the notion while still lending a plausible solution. Additionally, we can present alternative viewpoints or competing discussion points to explain the uncertainty in the results. +Keep in mind how we discuss these findings can greatly influence how the audience interprets them. If we include speculation, phrases like "the authors speculate" or "these findings may indicate" relays the uncertainty around the notion while still lending a plausible solution. Additionally, we can present alternative viewpoints or competing discussion points to explain the uncertainty in the results. ## Visualizing data -Although discussing key findings in the text is important, presenting large amounts of data is often more digestible for the audience in tables or visualizations. 
Effectively combining text, tables, and graphs can be powerful in communicating results. This section provides examples of using the {gt}, {gtsummary}, and {ggplot2} packages to enhance the dissemination of results [@R-gt; @gtsummary; @ggplot22016]. +Although discussing key findings in the text is important, presenting large amounts of data in tables or visualizations is often more digestible for the audience. Effectively combining text, tables, and graphs can be powerful in communicating results. This section provides examples of using the {gt}, {gtsummary}, and {ggplot2} packages to enhance the dissemination of results [@R-gt; @gtsummarysjo; @ggplot2wickham]. ### Tables -Tables are a great way to provide a large amount of data when individual data points need to be examined. However, it is important to present tables in a reader-friendly format. Numbers should align, rows and columns should be easy to follow, and the table size should not compromise readability. Using key visualization techniques, we can create tables that are informative and nice to look at. Many packages create easy-to-read tables (e.g., {kable} \+ {kableExtra}, {gt}, {gtsummary}, {DT}, {formattable}, {flextable}, {reactable}). While we will focus on {gt} here, we encourage learning about others as they may have additional helpful features. We appreciate the flexibility, ability to use pipes (e.g., `%>%`), and numerous extensions of the {gt} package. Please note, at this time, {gtsummary} needs additional features to be widely used for survey analysis, particularly due to its lack of ability to work with replicate designs. We provide one example using {gtsummary} and hope it evolves into a more comprehensive tool over time. +Tables are a great way to provide a large amount of data when individual data points need to be examined. However, it is important to present tables in a reader-friendly format. Numbers should align, rows and columns should be easy to follow, and the table size should not compromise readability. Using key visualization techniques, we can create tables that are informative and nice to look at. Many packages create easy-to-read tables (e.g., {kable} \+ {kableExtra}, {gt}, {gtsummary}, {DT}, {formattable}, {flextable}, {reactable}.) We appreciate the flexibility, ability to use pipes (e.g., `%>%`), and numerous extensions of the {gt} package. While we focus on {gt} here, we encourage learning about others as they may have additional helpful features. Please note, at this time, {gtsummary} needs additional features to be widely used for survey analysis, particularly due to its lack of ability to work with replicate designs. We provide one example using {gtsummary} and hope it evolves into a more comprehensive tool over time. #### Transitioning {srvyr} output to a {gt} table {#results-gt} @@ -122,9 +122,9 @@ trust_gov The default output generated by R may work for initial viewing inside our IDE or when creating basic output in an R Markdown or Quarto document. However, when presenting these results in other publications, such as the print version of this book or with other formal dissemination modes, modifying the display can improve our reader's experience. -Looking at the output from `trust_gov`, a couple of improvements are obvious: (1) switching to percentages instead of proportions and (2) using the variable names as column headers. The {gt} package is a good tool for implementing better labeling and creating publishable tables. 
Let's walk through some code as we make a few changes to improve the table's usefulness.
+Looking at the output from `trust_gov`, a couple of improvements stand out: (1) switching to percentages instead of proportions and (2) replacing the variable names used as column headers with informative labels. The {gt} package is a good tool for implementing better labeling and creating publishable tables. Let's walk through some code as we make a few changes to improve the table's usefulness.

-First, we initiate the table with the `gt()` function. Next, we use the argument `rowname_col()` to designate the `TrustGovernment` column as the labels for each row (called the table "stub"). We apply the `cols_label()` function to create informative column labels instead of variable names, and then the `tab_spanner()` function to add a label across multiple columns. In this case, we label all columns except the stub with "Trust in Government, 2020". We then format the proportions into percentages with the `fmt_percent()` function and reduce the number of decimals shown with `decimals = 1`. Finally, the `tab_caption()` function adds a table title for HTML version of the book. We can use the caption for cross-referencing in R Markdown, Quarto, and bookdown, as well as adding it to the list of tables in the book.
+First, we initiate the formatted table with the `gt()` function on the `trust_gov` tibble previously created. Next, we use the argument `rowname_col()` to designate the `TrustGovernment` column as the label for each row (called the table "stub".) We apply the `cols_label()` function to create informative column labels instead of variable names and then the `tab_spanner()` function to add a label across multiple columns. In this case, we label all columns except the stub with "Trust in Government, 2020". We then format the proportions into percentages with the `fmt_percent()` function and reduce the number of decimals shown to one with `decimals = 1`. Finally, the `tab_caption()` function adds a table title for the HTML version of the book. We can use the caption for cross-referencing in R Markdown, Quarto, and bookdown, as well as adding it to the list of tables in the book. These changes are all seen in Table \@ref(tab:results-table-gt1-tab).

```{r}
#| label: results-table-gt1
@@ -155,7 +155,7 @@ trust_gov_gt %>%
  print_gt_book(knitr::opts_current$get()[["label"]])
```

-We can add a few more enhancements, such as a title, a data source note, and a footnote with the question information, using the functions `tab_header()`, `tab_source_note()`, and `tab_footnote()`. If having the percentage sign in both the header and the cells seems redundant, we can opt for `fmt_number()` instead of `fmt_percent()` and scale the number by 100 with `scale_by = 100`.
+We can add a few more enhancements, such as a title (which is different from a caption^[The function `tab_caption()` is intended for use in R Markdown, Quarto, or bookdown, where it enables cross-referencing and places the caption with the table according to the output type, whereas the function `tab_header()` adds a title or subtitle within the table object itself and works in any context, including Shiny or GitHub-flavored Markdown, but without cross-referencing.]), a data source note, and a footnote with the question information, using the functions `tab_header()`, `tab_source_note()`, and `tab_footnote()`.
If having the percentage sign in both the header and the cells seems redundant, we can opt for `fmt_number()` instead of `fmt_percent()` and scale the number by 100 with `scale_by = 100`. The resulting table is displayed in Table \@ref(tab:results-table-gt2-tab). ```{r} #| label: results-table-gt2 @@ -193,13 +193,13 @@ trust_gov_gt2 %>% The {gtsummary} package simultaneously summarizes data and creates publication-ready tables. Initially designed for clinical trial data, it has been extended to include survey analysis in certain capacities. At this time, it is only compatible with survey objects using Taylor's Series Linearization and not replicate methods. While it offers a restricted set of summary statistics, the following are available for categorical variables: - `{n}` frequency - - `{N}` denominator, or cohort size - - `{p}` percentage + - `{N}` denominator, or respondent population + - `{p}` proportion (stylized as a percentage by default) - `{p.std.error}` standard error of the sample proportion - `{deff}` design effect of the sample proportion - `{n_unweighted}` unweighted frequency - `{N_unweighted}` unweighted denominator - - `{p_unweighted}` unweighted formatted percentage + - `{p_unweighted}` unweighted formatted proportion (stylized as a percentage by default) The following summary statistics are available for continuous variables: @@ -211,10 +211,10 @@ The following summary statistics are available for continuous variables: - `{var}` variance - `{min}` minimum - `{max}` maximum - - `{p##}` any integer percentile, where `##` is an integer from 0 to 100 + - `{p#}` any integer percentile, where `#` is an integer from 0 to 100 - `{sum}` sum -In the following example, we will build a table using {gtsummary}, similar to the table in the {gt} example. The main function we use is `tbl_svysummary()`. In this function, we include the variables we want to analyze in the `include` argument and define the statistics we want to display in the `statistic` argument. To specify the statistics, we apply the syntax from the {glue} package, where we enclose the variables we want to insert within curly brackets. We must specify the desired statistics using the names listed above. For example, to specify that we want the proportion followed by the standard error of the proportion in parentheses, we use `{p} ({p.std.error})`. +In the following example, we build a table using {gtsummary}, similar to the table in the {gt} example. The main function we use is `tbl_svysummary()`. In this function, we include the variables we want to analyze in the `include` argument and define the statistics we want to display in the `statistic` argument. To specify the statistics, we apply the syntax from the {glue} package, where we enclose the variables we want to insert within curly brackets. We must specify the desired statistics using the names listed above. For example, to specify that we want the proportion followed by the standard error of the proportion in parentheses, we use `{p} ({p.std.error})`. Table \@ref(tab:results-gts-ex-1-tab) displays the resulting table. 
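The same glue syntax accepts any of the statistics listed above. As a quick sketch (assuming the `anes_des` design object created earlier; this chunk is not run), we could instead request unweighted counts alongside the default percentage styling:

```{r}
#| label: results-gts-statistic-sketch
#| eval: false
# A sketch of an alternative statistic specification (not run): unweighted
# counts paired with weighted percentages for a categorical variable
anes_des %>%
  tbl_svysummary(include = TrustGovernment,
                 statistic = list(all_categorical() ~ "{n_unweighted} ({p}%)"))
```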
```{r}
#| label: results-gts-ex-1
@@ -229,7 +229,7 @@ anes_des_gtsum <- anes_des %>%

anes_des_gtsum
```

-(ref:results-gts-ex-1-tab) Example of gtsummary table with trust in government estimates
+(ref:results-gts-ex-1-tab) Example of {gtsummary} table with trust in government estimates

```{r}
#| label: results-gts-ex-1-tab
@@ -240,7 +240,7 @@ anes_des_gtsum %>%
  print_gt_book(knitr::opts_current$get()[["label"]])
```

-The default table includes the weighted number of missing (or Unknown) records. The standard error is reported as a proportion, while the proportion is styled as a percentage. In the next step, we remove the Unknown category by setting the missing argument to "no" and format the standard error as a percentage using the `digits` argument. To improve the table for publication, we provide a more polished label for the "TrustGovernment" variable using the `label` argument.
+The default table (shown in Table \@ref(tab:results-gts-ex-1-tab)) includes the weighted number of missing (or Unknown) records. The standard error is reported as a proportion, while the proportion is styled as a percentage. In the next step, we remove the Unknown category by setting the missing argument to "no" and format the standard error as a percentage using the `digits` argument. To improve the table for publication, we provide a more polished label for the "TrustGovernment" variable using the `label` argument. The resulting table is displayed in Table \@ref(tab:results-gts-ex-2-tab).

```{r}
#| label: results-gts-ex-2
@@ -260,7 +260,7 @@ anes_des_gtsum2 <- anes_des %>%

anes_des_gtsum2
```

-(ref:results-gts-ex-2-tab) Example of gtsummary table with trust in government estimates with labeling and digits options
+(ref:results-gts-ex-2-tab) Example of {gtsummary} table with trust in government estimates with labeling and digits options

```{r}
#| label: results-gts-ex-2-tab
@@ -271,7 +271,7 @@ anes_des_gtsum2 %>%
  print_gt_book(knitr::opts_current$get()[["label"]])
```

-To exclude the term "Characteristic" and the estimated population size, we can modify the header using the`modify_header()` function to update the `label`. Further adjustments can be made based on personal preferences, organizational guidelines, or other style guides. If we prefer having the standard error in the header, similar to the {gt} table, instead of in the footnote (the {gtsummary} default), we can make these changes by specifying `stat_0` in the `modify_header()` function. Additionally, using `modify_footnote()` with `update = everything() ~ NA` removes the standard error from the footnote.
+Table \@ref(tab:results-gts-ex-2-tab) is closer to our ideal output, but we still want to make a few changes. To exclude the term "Characteristic" and the estimated population size (N), we can modify the header using the `modify_header()` function to update the `label`. Further adjustments can be made based on personal preferences, organizational guidelines, or other style guides. If we prefer having the standard error in the header, similar to the {gt} table, instead of in the footnote (the {gtsummary} default), we can make these changes by specifying `stat_0` in the `modify_header()` function. Additionally, using `modify_footnote()` with `update = everything() ~ NA` removes the standard error from the footnote.
After transforming the object into a gt table using `as_gt()`, we can add footnotes and a title using the same methods explained in Section \@ref(results-gt). This updated table is displayed in Table \@ref(tab:results-gts-ex-3-tab). ```{r} #| label: results-gts-ex-3 @@ -303,7 +303,7 @@ anes_des_gtsum3 <- anes_des %>% anes_des_gtsum3 ``` -(ref:results-gts-ex-3-tab) Example of gtsummary table with trust in government estimates with more labeling options and context +(ref:results-gts-ex-3-tab) Example of {gtsummary} table with trust in government estimates with more labeling options and context ```{r} #| label: results-gts-ex-3-tab @@ -314,7 +314,7 @@ anes_des_gtsum3 %>% print_gt_book(knitr::opts_current$get()[["label"]]) ``` -We can also include continuous variables in the table. Below, we add a summary of the age variable by updating the `include`, `statistic`, and `digits` arguments. +We can also include summaries of more than one variable in the table. These variables can be either categorical or continuous. In the following code and Table \@ref(tab:results-gts-ex-4-tab), we add the mean age by updating the `include`, `statistic`, and `digits` arguments. ```{r} #| label: results-gts-ex-4 @@ -335,7 +335,8 @@ anes_des_gtsum4 <- anes_des %>% modify_header(label = " ", stat_0 = "% (s.e.)") %>% as_gt() %>% - tab_header("American voter's trust in the federal government, 2020") %>% + tab_header( + "American voter's trust in the federal government, 2020") %>% tab_source_note("American National Election Studies, 2020") %>% tab_footnote( "Question text: How often can you trust the federal government @@ -351,7 +352,7 @@ anes_des_gtsum4 <- anes_des %>% anes_des_gtsum4 ``` -(ref:results-gts-ex-4-tab) Example of gtsummary table with trust in government estimates and average age +(ref:results-gts-ex-4-tab) Example of {gtsummary} table with trust in government estimates and average age ```{r} #| label: results-gts-ex-4-tab @@ -362,7 +363,7 @@ anes_des_gtsum4 %>% print_gt_book(knitr::opts_current$get()[["label"]]) ``` -With {gtsummary}, we can also calculate statistics by different groups. Let's modify the previous example to analyze data on whether a respondent voted for president in 2020. We update the `by` argument and refine the header. +With {gtsummary}, we can also calculate statistics by different groups. Let's modify the previous example (displayed in Table \@ref(tab:results-gts-ex-4-tab) to analyze data on whether a respondent voted for president in 2020. We update the `by` argument and refine the header. The resulting table is displayed in Table \@ref(tab:results-gts-ex-5-tab). ```{r} #| label: results-gts-ex-5 @@ -401,7 +402,7 @@ anes_des_gtsum5 <- anes_des %>% anes_des_gtsum5 ``` -(ref:results-gts-ex-5-tab) Example of gtsummary table with trust in government estimates by voting status +(ref:results-gts-ex-5-tab) Example of {gtsummary} table with trust in government estimates by voting status ```{r} #| label: results-gts-ex-5-tab @@ -416,9 +417,9 @@ anes_des_gtsum5 %>% Survey analysis can yield an abundance of printed summary statistics and models. Even with the most careful analysis, interpreting the results can be overwhelming. This is where charts and plots play a key role in our work. By transforming complex data into a visual representation, we can recognize patterns, relationships, and trends with greater ease. -R has numerous packages for creating compelling and insightful charts. In this section, we will focus on {ggplot2}, a member of the {tidyverse} collection of packages. 
Known for its power and flexibility, {ggplot2} is an invaluable tool for creating a wide range of data visualizations [@ggplot22016]. +R has numerous packages for creating compelling and insightful charts. In this section, we focus on {ggplot2}, a member of the {tidyverse} collection of packages. Known for its power and flexibility, {ggplot2} is an invaluable tool for creating a wide range of data visualizations [@ggplot2wickham]. -The {ggplot2} package follows the "grammar of graphics," a framework that incrementally adds layers of chart components. This approach allows us to customize visual elements such as scales, colors, labels, and annotations to enhance the clarity of our results. After creating the survey design object, we can modify it to include additional outcomes and calculate estimates for our desired data points. Below, we create a binary variable `TrustGovernmentUsually`, which is `TRUE` when `TrustGovernment` is "Always" or "Most of the time" and `FALSE` otherwise. Then, we calculate the percentage of people who usually trust the government based on their vote in the 2020 presidential election (`VotedPres2020_selection`). We remove the cases where people did not vote or did not indicate their choice. +The {ggplot2} package follows the "grammar of graphics," a framework that incrementally adds layers of chart components. This approach allows us to customize visual elements such as scales, colors, labels, and annotations to enhance the clarity of our results. After creating the survey design object, we can modify it to include additional outcomes and calculate estimates for our desired data points. Below, we create a binary variable `TrustGovernmentUsually`, which is `TRUE` when `TrustGovernment` is "Always" or "Most of the time" and `FALSE` otherwise. Then, we calculate the percentage of people who usually trust the government based on their vote in the 2020 presidential election (`VotedPres2020_selection`.) We remove the cases where people did not vote or did not indicate their choice. ```{r} #| label: results-anes-prep @@ -442,7 +443,7 @@ anes_des_der <- anes_des %>% anes_des_der ``` -Now, we can begin creating our chart with {ggplot2}. First, we set up our plot with `ggplot()`. Next, we define the data points to be displayed using aesthetics, or `aes`. Aesthetics represent the visual properties of the objects in the plot. In the example below, we map the `x` variable to `VotedPres2020_selection` from the dataset and the `y` variable to `pct_trust`. Finally, we specify the type of plot with `geom_*()`, in this case, `geom_bar()`. The resulting plot is displayed in Figure \@ref(fig:results-plot1). +Now, we can begin creating our chart with {ggplot2}. First, we set up our plot with `ggplot()`. Next, we define the data points to be displayed using aesthetics, or `aes`. Aesthetics represent the visual properties of the objects in the plot. In the following example, we create a bar chart of the percentage of people who usually trust the government by who they voted for in the 2020 election. To do this, we want to have who they voted for on the x-axis (`VotedPres2020_selection`) and the percent they usually trust the government on the y-axis (`pct_trust`.) We specify these variables in `ggplot()` and then indicate we want a bar chart with `geom_bar()`. The resulting plot is displayed in Figure \@ref(fig:results-plot1). 
```{r} #| label: results-plot1 @@ -456,7 +457,7 @@ p <- anes_des_der %>% p ``` -This is a great starting point: we observe that a higher percentage of people stating they usually trust the government among those who voted for Trump compared to those who voted for Biden or other candidates. Now, what if we want to introduce color to better differentiate the three groups? We can add `fill` under `aesthetics`, indicating that we want to use distinct values of `VotedPres2020_selection` to color the bars. In this instance, Biden and Trump will be displayed in different colors. +This is a great starting point: it appears that a higher percentage of people state they usually trust the government among those who voted for Trump compared to those who voted for Biden or other candidates. Now, what if we want to introduce color to better differentiate the three groups? We can add `fill` under `aesthetics`, indicating that we want to use distinct colors for each value of `VotedPres2020_selection`. In this instance, Biden and Trump are displayed in different colors (shades in the print version of this book) in Figure \@ref(fig:results-plot2). ```{r} #| label: results-plot2 @@ -471,7 +472,7 @@ pcolor <- anes_des_der %>% pcolor ``` -Let's say we wanted to follow proper statistical analysis practice and incorporate variability in our plot. We can add another geom, `geom_errorbar()`, to display the confidence intervals on top of our existing `geom_bar()` layer. We can add the layer using a plus sign `+`. +Let's say we wanted to follow proper statistical analysis practice and incorporate variability in our plot. We can add another geom, `geom_errorbar()`, to display the confidence intervals on top of our existing `geom_bar()` layer. We can add the layer using a plus sign `+`. The resulting graph is displayed in Figure \@ref(fig:results-plot3). ```{r} #| label: results-plot3 @@ -489,7 +490,7 @@ pcol_error <- anes_des_der %>% pcol_error ``` -We can continue adding to our plot until we achieve our desired look. For example, we can eliminate the color legend as it doesn't contribute meaningful information with `guides(fill = "none")`. We can specify specific colors for `fill` using `scale_fill_manual()`. Inside the function, we provide a vector of values corresponding to the colors in our plot. These values are hexadecimal (hex) color codes, denoted by a leading pound sign `#` followed by six letters or numbers. The hex code `#0b3954` used below is a dark blue. There are many tools online that help pick hex codes, such as [htmlcolorcodes.com/](https://htmlcolorcodes.com/). +We can continue adding to our plot until we achieve our desired look. For example, we can eliminate the color legend as it doesn't contribute meaningful information with `guides(fill = "none")`. We can also specify specific colors for `fill` using `scale_fill_manual()`. Inside this function, we provide a vector of values corresponding to the colors in our plot. These values are hexadecimal (hex) color codes, denoted by a leading pound sign `#` followed by six letters or numbers. The hex code `#0b3954` used below is dark blue. There are many tools online that help pick hex codes, such as [htmlcolorcodes.com](https://htmlcolorcodes.com/). Additionally, Figure \@ref(fig:results-plot4) incorporates better labels for the x and y axes (`xlab()`, `ylab()`), a title (`labs(title=)`), and a footnote with the data source (`labs(caption=)`.) 
```{r} #| label: results-plot4 @@ -516,4 +517,4 @@ pfull <- pfull ``` -What we've explored in this section are just the foundational aspects of {ggplot2}, and the capabilities of this package extend far beyond what we've covered. Advanced features such as annotation, faceting, and theming allow for more sophisticated and customized visualizations. The book @ggplot22016 is a comprehensive guide to learning more about this powerful tool. \ No newline at end of file +What we've explored in this section are just the foundational aspects of {ggplot2}, and the capabilities of this package extend far beyond what we've covered. Advanced features such as annotation, faceting, and theming allow for more sophisticated and customized visualizations. The ggplot2 book by @ggplot2wickham is a comprehensive guide to learning more about this powerful tool. \ No newline at end of file diff --git a/09-reproducible-data.Rmd b/09-reproducible-data.Rmd index 8cec0031..b802ad92 100644 --- a/09-reproducible-data.Rmd +++ b/09-reproducible-data.Rmd @@ -9,18 +9,18 @@ knitr::opts_chunk$set(tidy = 'styler') ## Introduction -Reproducing a data analysis's results is a crucial aspect of any research. First, reproducibility serves as a form of quality assurance. If we pass an analysis project to another person, they should be able to run the entire project from start to finish and obtain the same results. They can critically assess the methodology and code while detecting potential errors. Another goal of reproducibility is enabling the verification of our analysis. When someone else is able to check our results, it ensures the integrity of the analyses by determining that the conclusions are not dependent on a particular person running the code or workflow on a particular day or in a particular environment. +Reproducing results is a crucial aspect of any research. First, reproducibility serves as a form of quality assurance. If we pass an analysis project to another person, they should be able to run the entire project from start to finish and obtain the same results. They can critically assess the methodology and code while detecting potential errors. Another goal of reproducibility is enabling the verification of our analysis. When someone else is able to check our results, it ensures the integrity of the analyses by determining that the conclusions are not dependent on a particular person running the code or workflow on a particular day or in a particular environment. Not only is reproducibility a key component in ethical and accurate research, but it is also a requirement for many scientific journals. For example, the Journal of Survey Statistics and Methodology (JSSAM) and Public Opinion Quarterly (POQ) require authors to make code, data, and methodology transparent and accessible to other researchers who wish to verify or build on existing work. Reproducible research requires that the key components of analysis are available, discoverable, documented, and shared with others. 
The four main components that we should consider are: - **Code**: source code used for data cleaning, analysis, modeling, and reporting - - **Data**: raw data used in the workflow, or if data is sensitive or proprietary, as much data as possible that would allow others to run our workflow (e.g., access to a restricted use file (RUF)) + - **Data**: raw data used in the workflow, or if data are sensitive or proprietary, as much data as possible that would allow others to run our workflow or provide details on how to access the data (e.g., access to a restricted use file (RUF)) - **Environment**: environment of the project, including the R version, packages, operating system, and other dependencies used in the analysis - - **Methodology**: analysis methodology, including rationale behind decisions, interpretations, and assumptions + - **Methodology**: survey and analysis methodology, including rationale behind sample, questionnaire and analysis decisions, interpretations, and assumptions -In Chapter \@ref(c08-communicating-results), we briefly mention how each of these is important to include in the methodology report and when communicating the findings of a study. However, to be transparent and effective researchers, we need to ensure we not only discuss these through text but also provide files and additional information when requested. Often, when starting a project, analysts will dive into the data and make decisions as they go without full documentation, which can be challenging if we need to go back and make changes or understand even what we did a few months ago. It benefits other analysts and potentially our future selves to better document everything from the start. The good news is that many tools, practices, and project management techniques make survey analysis projects easy to reproduce. For best results, analysts should decide which techniques and tools will be used before starting a project (or very early on). +In Chapter \@ref(c08-communicating-results), we briefly mention how each of these is important to include in the methodology report and when communicating the findings of a study. However, to be transparent and effective researchers, we need to ensure we not only discuss these through text but also provide files and additional information when requested. Often, when starting a project, we may be eager to dive into the data and make decisions as we go without full documentation. This can be challenging if we need to go back and make changes or understand even what we did a few months ago. It benefits other analysts and potentially our future selves to document everything from the start. The good news is that many tools, practices, and project management techniques make survey analysis projects easy to reproduce. For best results, we should decide which techniques and tools to use before starting a project (or very early on.) This chapter covers some of our suggestions for tools and techniques we can use in projects. This list is not comprehensive but aims to provide a starting point for those looking to create a reproducible workflow. @@ -28,7 +28,7 @@ This chapter covers some of our suggestions for tools and techniques we can use We recommend a project-based workflow for analysis projects as described by @wickham2023r4ds. A project-based workflow maintains a "source of truth" for our analyses. It helps with file system discipline by putting everything related to a project in a designated folder. 
Since all associated files are in a single location, they are easy to find and organize. When we reopen the project, we can recreate the environment in which we originally ran the code to reproduce our results.

-The RStudio IDE has built-in support for projects. When we create a project in RStudio, it creates a `.Rproj` file that store settings specific to that project. Once we have created a project, we can create folders that help us organize our workflow. For example, a project directory could look like this:
+The RStudio IDE has built-in support for projects. When we create a project in RStudio, it creates an `.Rproj` file that stores settings specific to that project. Once we have created a project, we can create folders that help us organize our workflow. For example, a project directory could look like this:

```
| anes_analysis/
@@ -51,11 +51,11 @@ The RStudio IDE has built-in support for projects. When we create a project in R
|   anes_report.pdf
```

-In a project-based workflow, all paths are relative and, by default, relative to the project’s folder. By using relative paths, others can open and run our files even if their directory configuration differs from ours. The {here} package enables easy file referencing, and we can start with using the `here::here()` function to build the path for loading or saving data [@R-here]. Below, we ask R to read the CSV file `anes_2020.csv` in the project directory's `data` folder:
+In a project-based workflow, all paths are relative and, by default, relative to the folder where the `.Rproj` file is located. By using relative paths, others can open and run our files even if their directory configuration differs from ours (e.g., Mac and Windows users have different directory path structures.) The {here} package enables easy file referencing, and we can start by using the `here::here()` function to build the path for loading or saving data [@R-here]. Below, we ask R to read the CSV file `anes2020_clean.csv` in the project directory's `data` folder:

```{r}
-#| eval: false
#| label: reprex-project-file-example
+#| eval: false

anes <- read_csv(here::here("data", "anes2020_clean.csv"))
```

@@ -64,7 +64,7 @@ The combination of projects and the {here} package keep all associated files in

## Functions and packages

-We may find ourselves repeating ourselves in our script, and the chances of errors increases whenever we copy and paste our code. By creating a function, we can create a consistent set of commands that reduce the likelihood of mistakes. Functions also organize our code, improve the code readability, and allow others to execute the same commands. Throughout this book, we have created functions, such as in Chapter \@ref(c13-ncvs-vignette), to run sequences of rename, filter, group_by, and summarize statements across different variables. The function helps us avoid overlooking necessary steps.
+We may find ourselves repeating ourselves in our script, and the chance of errors increases whenever we copy and paste our code. By creating a function, we can create a consistent set of commands that reduce the likelihood of mistakes. Functions also organize our code, improve the code readability, and allow others to execute the same commands. For example, in Chapter \@ref(c13-ncvs-vignette), we create a function to run sequences of `rename()`, `filter()`, `group_by()`, and `summarize()` statements across different variables. Creating functions helps us avoid overlooking necessary steps.

A package is made up of a collection of functions. 
If we find ourselves sharing functions with others to replicate the same series of commands in a separate project, creating a package can be a useful tool for sharing the code along with data and documentation. @@ -72,17 +72,17 @@ A package is made up of a collection of functions. If we find ourselves sharing Often, a survey analysis project produces a lot of code. Keeping track of the latest version can become challenging as files evolve throughout a project. If a team of analysts is working on the same script, someone may use an outdated version, resulting in incorrect results or redundant work. -Version control systems like Git can help alleviate these pains. Git is a system that helps track changes in computer files. Analysts can use Git to follow code evaluation and manage asynchronous work. With Git, it is easy to see any changes made in a script, revert changes, and resolve differences between code versions (called conflicts). +Version control systems like Git can help alleviate these pains. Git is a system that tracks changes in files. We can use Git to follow code evaluation and manage asynchronous work. With Git, it is easy to see any changes made in a script, revert changes, and resolve differences between code versions (called conflicts.) -Services such as GitHub or GitLab provide hosting and sharing of files as well as version control with Git. For example, we can visit the GitHub repository for this book ([https://github.com/tidy-survey-r/tidy-survey-book](https://github.com/tidy-survey-r/tidy-survey-book)) and see the files that build the book, when they were committed to the repository, and the history of modifications over time. +Services such as GitHub or GitLab provide hosting and sharing of files as well as version control with Git. For example, we can visit the [GitHub repository for this book](https://github.com/tidy-survey-r/tidy-survey-book) and see the files that build the book, when they were committed to the repository, and the history of modifications over time. In addition to code scripts, platforms like GitHub can store data and documentation. They provide a way to maintain a history of data modifications through versioning and timestamps. By saving the data and documentation alongside the code, it becomes easier for others to refer to and access everything they need in one place. -Using version control in analysis projects makes collaboration and maintenance more manageable. For connecting Git with R, we recommend @git-w-R. +Using version control in analysis projects makes collaboration and maintenance more manageable. To connect Git with R, we recommend referencing the book [Happy Git and GitHub for the useR](https://happygitwithr.com/) [@git-w-R]. ## Package management with {renv} -Ensuring reproducibility involves not only using version control of code, but also managing the versions of packages. If two people run the same code but use different versions of a package, the results might differ because of changes in those packages. For example, this book currently uses a version of the {srvyr} package from GitHub and not from CRAN. This is because the version of {srvyr} on CRAN has some bugs (errors) that result in incorrect calculations. The version on GitHub has corrected these errors, so we have asked readers to install the GitHub version to obtain the same results. +Ensuring reproducibility involves not only using version control of code but also managing the versions of packages. 
If two people run the same code but use different package versions, the results might differ because of changes to those packages. For example, this book currently uses a version of the {srvyr} package from GitHub and not from CRAN. This is because the version of {srvyr} on CRAN has some bugs (errors) that result in incorrect calculations. The version on GitHub has corrected these errors, so we have asked readers to install the GitHub version to obtain the same results.

One way to handle different package versions is with the {renv} package. This package allows researchers to set the versions for each package used and manage package dependencies. Specifically, {renv} creates isolated, project-specific environments that record the packages and their versions used in the code. When initiated by a new user, {renv} checks whether the installed packages are consistent with the recorded version for the project. If not, it installs the appropriate versions so that others can replicate the project's environment to rerun the code and obtain consistent results [@R-renv].

@@ -92,9 +92,9 @@ Just as different versions of packages can introduce discrepancies or compatibil

## Workflow management with {targets}

-With complex studies involving multiple code files and dependencies, it is important to ensures each step is executed in the intended sequence. We can do this manually, e.g., numbering files to indicate the order or providing detailed documentation on the order. Alternatively, we can automate the process so the code flows sequentially. Making sure that the code runs in the correct order helps ensure that the research is reproducible. Anyone should be able to pick up the set of scripts and get the same results by following the workflow.
+With complex studies involving multiple code files and dependencies, it is important to ensure each step is executed in the intended sequence. We can do this manually, e.g., by numbering files to indicate the order or providing detailed documentation on the order. Alternatively, we can automate the process so the code flows sequentially. Making sure that the code runs in the correct order helps ensure that the research is reproducible. Anyone should be able to pick up the set of scripts and get the same results by following the workflow.

-The {targets} package is growing as a popular workflow manager that documents, automates, and executes complex data workflows with multiple steps and dependencies. With this package, we first define the order of execution for our code, and then it will consistently execute the code in that order each time it is run. One beneficial feature of {targets} is that if you change code later in the workflow, only the affected code and its downstream targets (i.e., the subsequent code files) are re-executed when we change a script. The {targets} package also provides interactive progress monitoring and reporting, allowing us to track the status and progress of our analysis pipeline [@targets2021].
+The {targets} package is growing as a popular workflow manager that documents, automates, and executes complex data workflows with multiple steps and dependencies. With this package, we first define the order of execution for our code, and then it consistently executes the code in that order each time it is run. One beneficial feature of {targets} is that if code changes later in the workflow, only the affected code and its downstream targets (i.e., the subsequent code files) are re-executed. 
The {targets} package also provides interactive progress monitoring and reporting, allowing us to track the status and progress of our analysis pipeline [@targetslandau].

## Documentation with Quarto and R Markdown

@@ -106,7 +106,7 @@ Quarto and R Markdown documents also allow users to re-execute the underlying co

Another useful feature of Quarto and R Markdown is the ability to reduce repetitive code by parameterizing the files. Parameters can control various aspects of the analysis, such as dates, geography, or other analysis variables. We can define and modify these parameters to explore different scenarios or inputs. For example, suppose we start by creating a document that provides survey analysis results for North Carolina but then later decide we want to look at another state. In that case, we can define a `state` parameter and rerun the same analysis for a state like Washington without having to edit the code throughout the document.

-Parameters can be defined in the header or code chunks of our Quarto or R Markdown documents and easily be modified and documented. We reduce errors that may occur by manually editing code throughout the script, and offer a flexible way for others to replicate the analysis and explore variations.
+Parameters can be defined in the header or code chunks of our Quarto or R Markdown documents and easily modified and documented. Using parameters, we reduce the errors that can occur when manually editing code throughout the script and offer a flexible way for others to replicate the analysis and explore variations.

## Other tips for reproducibility

@@ -114,30 +114,33 @@ Parameters can be defined in the header or code chunks of our Quarto or R Markdo

Some tasks in survey analysis require randomness, such as imputation, model training, or creating random samples. By default, the random numbers generated by R change each time we rerun the code, making it difficult to reproduce the same results. By "setting the seed," we can control the randomness and ensure that the random numbers remain consistent whenever we rerun the code. Others can use the same seed value to reproduce our random numbers and achieve the same results.

-In R, we can use the `set.seed()` function to control the randomness in our code. Set a seed value by providing an integer to the function:
+In R, we can use the `set.seed()` function to control the randomness in our code. We set a seed value by providing an integer in the function argument. The following code chunk sets a seed using `999`, then runs a random number function (`runif()`) to get five random numbers from a uniform distribution.

-```r
+```{r}
+#| label: reprex-set-seed
set.seed(999)
-
runif(5)
```

-The `runif()` function generates five random numbers from a uniform distribution. Since the seed is set to `999`, running `runif()` multiple times will always produce the same sequence:
+Since the seed is set to `999`, running `runif(5)` multiple times always produces the same output:

-```
-[1] 0.38907138 0.58306072 0.09466569 0.85263123 0.78674676
+```{r}
+#| label: reprex-runif
+#| echo: false
+set.seed(999)
+runif(5)
```

-The choice of the seed number is up to the analyst. For example, this could be the date (`20240102`) or time of day (`1056`) when the analysis was first conducted, a phone number (`8675309`), or the first few numbers that come to mind (`369`). As long as the seed is set for a given analysis, the actual number is up to the analyst to decide. It is important to note that `set.seed()` should be used *before* random number generation. 
It would be unethical to run an analysis over and over to choose a seed that produces the result you want. Run it once per program, and the seed will be applied to the entire script. We recommend setting the seed at the beginning of a script, where libraries are loaded. +The choice of the seed number is up to the analyst. For example, this could be the date (`20240102`) or time of day (`1056`) when the analysis was first conducted, a phone number (`8675309`), or the first few numbers that come to mind (`369`.) As long as the seed is set for a given analysis, the actual number is up to the analyst to decide. It is important to note that `set.seed()` should be used **before** random number generation. Run it once per program, and the seed is applied to the entire script. We recommend setting the seed at the beginning of a script, where libraries are loaded. ### Descriptive names and labels -Using descriptive variable names or labeling data can also assist with reproducible research. For example, in the ANES data, the variable names in the raw data all start with `V20` and are a string of numbers. To make things easier to reproduce, we opted to change the variable names to be more descriptive of what they contained (e.g., `Age`). This can also be done with the data values themselves. One way to accomplish this is by creating factors for categorical data, which can ensure that we know that a value of `1` really means `Female`, for example. There are other ways of handling this, such as attaching labels to the data instead of recoding variables to be descriptive (see Chapter \@ref(c11-missing-data)). As with random number seeds, the exact method is up to the analyst, but providing this information can help ensure our research is reproducible. +Using descriptive variable names or labeling data can also assist with reproducible research. For example, in the ANES data, the variable names in the raw data all start with `V20` and are a string of numbers. To make things easier to reproduce in this book, we opted to change the variable names to be more descriptive of what they contained (e.g., `Age`.) This can also be done with the data values themselves. One way to accomplish this is by creating factors for categorical data, which can ensure that we know that a value of `1` really means `Female`, for example. There are other ways of handling this, such as attaching labels to the data instead of recoding variables to be descriptive (see Chapter \@ref(c11-missing-data).) As with random number seeds, the exact method is up to the analyst, but providing this information can help ensure our research is reproducible. -## Summary +## Additional resources -We can promote accuracy and verification of results by making our analysis reproducible. There are various tools and guides available to help you achieve reproducibility in your work, a few of which were described in this chapter. Here are additional resources to explore: +We can promote accuracy and verification of results by making our analysis reproducible. There are various tools and guides available to help achieve reproducibility in analysis work, a few of which were described in this chapter. 
Here are additional resources to explore: -* R for Data Science chapter on project-based workflows: [https://r4ds.hadley.nz/workflow-scripts.html#projects](https://r4ds.hadley.nz/workflow-scripts.html#projects) -* Building reproducible analytical pipelines with R by Bruno Rodrigues: [https://raps-with-r.dev/](https://raps-with-r.dev/) -* Posit Solutions Site page on reproducible environments: [https://solutions.posit.co/envs-pkgs/environments/](https://solutions.posit.co/envs-pkgs/environments/) +* [R for Data Science chapter on project-based workflows](https://r4ds.hadley.nz/workflow-scripts.html#projects) +* [Building reproducible analytical pipelines with R](https://raps-with-r.dev/) +* [Posit Solutions Site page on reproducible environments](https://solutions.posit.co/envs-pkgs/environments/) diff --git a/10-sample-designs-replicate-weights.Rmd b/10-sample-designs-replicate-weights.Rmd index 00ea7b00..318cd74a 100644 --- a/10-sample-designs-replicate-weights.Rmd +++ b/10-sample-designs-replicate-weights.Rmd @@ -25,21 +25,21 @@ library(srvyr) library(srvyrexploR) ``` -To help explain the different types of sample designs, this chapter will use the `api` and `scd` data that are included in the {survey} package [@lumley2010complex]: +To help explain the different types of sample designs, this chapter uses the `api` and `scd` data that are included in the {survey} package [@lumley2010complex]: ```{r} #| label: samp-setup-surveydata data(api) data(scd) ``` -This chapter also uses data from the Residential Energy Consumption Survey (RECS) - both 2015 and 2020, which are included in the {srvyrexploR} package as `recs_2015` and `recs_2020`, respectively [@R-srvyrexploR]. +This chapter uses data from the Residential Energy Consumption Survey (RECS) - both 2015 and 2020, so we load the RECS data from the {srvyrexploR} package using their object names `recs_2015` and `recs_2020`, respectively [@R-srvyrexploR]. ::: ## Introduction The primary reason for using packages like {survey} and {srvyr} is to account for the sampling design or replicate weights into estimates [@R-srvyr; @lumley2010complex]. By incorporating the sampling design or replicate weights, precision estimates (e.g., standard errors and confidence intervals) are appropriately calculated. -In this chapter, we will introduce common sampling designs and common types of replicate weights, the mathematical methods for calculating estimates and standard errors for a given sampling design, and the R syntax to specify the sampling design or replicate weights. While we will show the math behind the estimates, the functions in these packages will do the calculation. To deeply understand the math and the derivation, refer to @pennstate506, @sarndal2003model, @wolter2007introduction, or @fuller2011sampling (these are listed in order of increasing statistical rigorousness). +In this chapter, we introduce common sampling designs and common types of replicate weights, the mathematical methods for calculating estimates and standard errors for a given sampling design, and the R syntax to specify the sampling design or replicate weights. While we show the math behind the estimates, the functions in these packages handle the calculation. To deeply understand the math and the derivation, refer to @pennstate506, @sarndal2003model, @wolter2007introduction, or @fuller2011sampling (these are listed in order of increasing statistical rigorousness.) 
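
For a concrete preview of what this chapter builds toward, the hedged sketch below specifies a design and computes one weighted estimate; the names `dat`, `stratvar`, `psuvar`, `wtvar`, and `outcome` are placeholders rather than variables from any dataset used in this book.

```r
library(srvyr)

# Specify the design once: strata, clusters (PSUs), and weights
prev_des <- dat %>%
  as_survey_design(strata = stratvar,
                   ids = psuvar,
                   weights = wtvar)

# Estimation functions then account for the design automatically
prev_des %>%
  summarize(outcome_mean = survey_mean(outcome))
```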
The general process for estimation in the {srvyr} package is to: @@ -55,17 +55,17 @@ This chapter includes details on the first step - creating the survey object. On ## Common sampling designs -A sampling design is the method used to draw a sample. Both logistical and statistical elements are considered when developing a sampling design. When specifying a sampling design in R, the levels of sampling are specified along with the weights. The weight for each record is constructed so that the particular record represents that many units in the population. For example, in a survey of 6th-grade students in the United States, the weight associated with each responding student reflects how many 6th grade students across the country that record represents. Generally, the weights represent the inverse of the probability of selection such that the sum of the weights corresponds to the total population size, although some studies may have the sum of the weights equal to the number of respondent records. +A sampling design is the method used to draw a sample. Both logistical and statistical elements are considered when developing a sampling design. When specifying a sampling design in R, we specify the levels of sampling along with the weights. The weight for each record is constructed so that the particular record represents that many units in the population. For example, in a survey of 6th-grade students in the United States, the weight associated with each responding student reflects how many 6th grade students across the country that record represents. Generally, the weights represent the inverse of the probability of selection, such that the sum of the weights corresponds to the total population size, although some studies may have the sum of the weights equal to the number of respondent records. Some common terminology across the designs are: - **sample size**, generally denoted as $n$, is the number of units selected to be sampled - - **population size**, generally denoted as $N$, is the number of units in the target population + - **population size**, generally denoted as $N$, is the number of units in the population of interest - **sampling frame**, the list of units from which the sample is drawn (see Chapter \@ref(c02-overview-surveys) for more information) ### Simple random sample without replacement -The simple random sample (SRS) without replacement is a sampling design where a fixed sample size is selected from a sampling frame, and every possible subsample has an equal probability of selection. Without replacement refers to the fact that once a sampling unit has been selected, it is removed from the sample frame and cannot be selected again. +The simple random sample (SRS) without replacement is a sampling design in which a fixed sample size is selected from a sampling frame, and every possible subsample has an equal probability of selection. Without replacement refers to the fact that once a sampling unit has been selected, it is removed from the sample frame and cannot be selected again. - **Requirements**: The sampling frame must include the entire population. - **Advantages**: SRS requires no information about the units apart from contact information. @@ -85,7 +85,7 @@ $$se(\bar{y})=\sqrt{\frac{s^2}{n}\left( 1-\frac{n}{N} \right)}$$ where $$s^2=\frac{1}{n-1}\sum_{i=1}^n\left(y_i-\bar{y}\right)^2.$$ -and $N$ is the population size. 
This standard error estimate might look very similar to equations in other applications except for the part on the right side of the equation: $1-\frac{n}{N}$. This is called the finite population correction (FPC) factor. If the size of the frame, $N$, is very large in comparison to the sample, the FPC is negligible, so it is often ignored. A common guideline is if the sample is less than 10% of the population, the FPC is negligible. +and $N$ is the population size. This standard error estimate might look very similar to equations in other statistical applications except for the part on the right side of the equation: $1-\frac{n}{N}$. This is called the finite population correction (FPC) factor. If the size of the frame, $N$, is very large in comparison to the sample, the FPC is negligible, so it is often ignored. A common guideline is if the sample is less than 10% of the population, the FPC is negligible. To estimate proportions, we define $x_i$ as the indicator if the outcome is observed. That is, $x_i=1$ if the outcome is observed, and $x_i=0$ if the outcome is not observed for respondent $i$. Then the estimated proportion from an SRS design is: @@ -96,21 +96,21 @@ $$se(\hat{p})=\sqrt{\frac{\hat{p}(1-\hat{p})}{n-1}\left(1-\frac{n}{N}\right)} $$ #### The syntax {-} -If a sample was drawn through SRS and had no nonresponse or other weighting adjustments, in R, specify this design as: +If a sample was drawn through SRS and had no nonresponse or other weighting adjustments, in R, we specify this design as: ```r srs1_des <- dat %>% as_survey_design(fpc = fpcvar) ``` -where `dat` is a tibble or data.frame with the survey data, and `fpcvar` is a variable in the data indicating the sampling frame's size (this variable will have the same value for all cases in an SRS design). If the frame is very large, sometimes the frame size is not provided. In that case, the FPC is not needed, and specify the design as: +where `dat` is a tibble or data.frame with the survey data, and `fpcvar` is a variable in the data indicating the sampling frame's size (this variable has the same value for all cases in an SRS design.) If the frame is very large, sometimes the frame size is not provided. In that case, the FPC is not needed, and we specify the design as: ```r srs2_des <- dat %>% as_survey_design() ``` -If some post-survey adjustments were implemented and the weights are not all equal, specify the design as: +If some post-survey adjustments were implemented and the weights are not all equal, we specify the design as: ```r srs3_des <- dat %>% @@ -122,7 +122,7 @@ where `wtvar` is a variable in the data indicating the weight for each case. Aga #### Example {-} -The {survey} package in R provides some example datasets that we will use throughout this chapter. The documentation provides detailed information about the variables. One of the example datasets we will use is from the Academic Performance Index (API). The API was a program administered by the California Department of Education, and the {survey} package includes a population file (sample frame) of all schools with at least 100 students and several different samples pulled from that data using different sampling methods. For this first example, we will use the `apisrs` dataset, which contains an SRS of 200 schools. For printing purposes, we create a new dataset called `apisrs_slim`, which sorts the data by the school district and school ID and subsets the data to only a few columns. 
The SRS sample data is illustrated below: +The {survey} package in R provides some example datasets that we use throughout this chapter. The documentation provides detailed information about the variables. One of the example datasets we use is from the Academic Performance Index (API.) The API was a program administered by the California Department of Education, and the {survey} package includes a population file (sample frame) of all schools with at least 100 students and several different samples pulled from that data using different sampling methods. For this first example, we use the `apisrs` dataset, which contains an SRS of 200 schools. For printing purposes, we create a new dataset called `apisrs_slim`, which sorts the data by the school district and school ID and subsets the data to only a few columns. The SRS sample data are illustrated below: ```{r} #| label: samp-des-apisrs-display @@ -150,7 +150,7 @@ Variable Name | Description `fpc` | Finite population correction factor (FPC) `pw` | Weight -To create the `tbl_survey` object for this SRS data, the design should be specified as follows: +To create the `tbl_survey` object for the SRS data, we specify the design as: ```{r} #| label: samp-des-apisrs-des @@ -161,7 +161,7 @@ apisrs_des <- apisrs_slim %>% apisrs_des ``` -In the printed design object above, the design is described as an "Independent Sampling design," which is another term for SRS. The ids are specified as `1`, which means there is no clustering (a topic described in Section \@ref(samp-cluster)), the FPC variable is indicated, and the weights are indicated. We can also look at the summary of the design object, and see the distribution of the probabilities (inverse of the weights) along with the population size and a list of the variables in the dataset. +In the printed design object, the design is described as an "Independent Sampling design," which is another term for SRS. The ids are specified as `1`, which means there is no clustering (a topic described in Section \@ref(samp-cluster)), the FPC variable is indicated, and the weights are indicated. We can also look at the summary of the design object (`summary()`), and see the distribution of the probabilities (inverse of the weights) along with the population size and a list of the variables in the dataset. ```{r} #| label: samp-des-apisrs-summary @@ -202,27 +202,27 @@ $$se(\hat{p})=\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} $$ #### The syntax {-} -If we had a sample that was drawn through SRSWR and had no nonresponse or other weighting adjustments, in R, we should specify this design as: +If we had a sample that was drawn through SRSWR and had no nonresponse or other weighting adjustments, in R, we specify this design as: ```r srswr1_des <- dat %>% as_survey_design() ``` -where `dat` is a tibble or data.frame containing our survey data. This syntax is the same as a SRS design, except a finite population correction (FPC) is not included. This is because when you claculate a sample with replacement, the population pool to select from is no longer finite, so a correction is not needed. Therefore, with large populations where the FPC is negligble, the underlying formulas for SRS and SRSWR designs are the same. +where `dat` is a tibble or data.frame containing our survey data. This syntax is the same as a SRS design, except a finite population correction (FPC) is not included. This is because when calculating a sample with replacement, the population pool to select from is no longer finite, so a correction is not needed. 
Therefore, with large populations where the FPC is negligible, the underlying formulas for SRS and SRSWR designs are the same. -If some post-survey adjustments were implemented and the weights are not all equal, specify the design as: +If some post-survey adjustments were implemented and the weights are not all equal, we specify the design as: ```r srswr2_des <- dat %>% as_survey_design(weights = wtvar) ``` -where `wtvar` is the variable for the weight on the data. +where `wtvar` is the variable for the weight of the data. #### Example {-} -The {survey} package does not include an example of SRSWR, so to illustrate this design we need to create an example. We use the api population data provided by the {survey} package `apipop` and select a sample of 200 cases using the `slice_sample()` function from the tidyverse. One of the arguments in the `slice_sample()` function is `replace`. If `replace=TRUE`, then we are conducting a SRSWR. We then calculate selection weights as the inverse of the probability of selection and call this new dataset `apisrswr`. +The {survey} package does not include an example of SRSWR, so to illustrate this design, we need to create an example. We use the api population data provided by the {survey} package `apipop` and select a sample of 200 cases using the `slice_sample()` function from the tidyverse. One of the arguments in the `slice_sample()` function is `replace`. If `replace=TRUE`, then we are conducting a SRSWR. We then calculate selection weights as the inverse of the probability of selection and call this new dataset `apisrswr`. ```{r} #| label: samp-des-apisrs-wr-display @@ -240,7 +240,7 @@ head(apisrswr) ``` -Because this is a SRS design *with replacement*, there will be duplicates in the data. It is important to keep the duplicates in the data for proper estimation, but for reference we can view the duplicates in the example data we just created. +Because this is a SRS design *with replacement*, there may be duplicates in the data. It is important to keep the duplicates in the data for proper estimation, but for reference, we can view the duplicates in the example data we just created. ```{r} #| label: samp-des-apisrs-wr-duplicates @@ -252,7 +252,7 @@ apisrswr %>% ``` -We created a weight variable in this example data, which is the inverse of the probability of selection. To specify the sampling design for `apisrswr`, the following syntax should be used: +We created a weight variable in this example data, which is the inverse of the probability of selection. We specify the sampling design for `apisrswr` as: ```{r} #| label: samp-des-apisrswr-des @@ -263,25 +263,25 @@ apisrswr_des summary(apisrswr_des) ``` -In the output above, the design object and the object summary are shown. Both note that the sampling is done "with replacement" because no FPC was specified. The probabilities, which are derived from the weights, are summarized in the summary. +In the output above, the design object and the object summary are shown. Both note that the sampling is done "with replacement" because no FPC was specified. The probabilities, which are derived from the weights, are summarized in the summary function output. ### Stratified sampling Stratified sampling occurs when a population is divided into mutually exclusive subpopulations (strata), and then samples are selected independently within each stratum. - - **Requirements**: The sampling frame must include the information to divide the population into groups for every unit. 
+ - **Requirements**: The sampling frame must include the information to divide the population into strata for every unit. - **Advantages**: - This design ensures sample representation in all subpopulations. - If the strata are correlated with survey outcomes, a stratified sample has smaller standard errors compared to a SRS sample of the same size. - This results in a more efficient design. - - **Disadvantages**: Auxiliary data may not exist to divide the sampling frame into groups, or the data may be outdated. + - **Disadvantages**: Auxiliary data may not exist to divide the sampling frame into strata, or the data may be outdated. - **Examples**: - - **Example 1**: A population of North Carolina residents could be separated (stratified) into urban and rural areas, and then a SRS of residents from both rural and urban areas is selected independently. This ensures there are residents from both areas in the sample. - - **Example 2**: Law enforcement agencies could be separated (stratified) into the three primary general-purpose categories in the US: local police, sheriff's departments, and state police. A SRS of agencies from each of the three types is then selected independently to ensure all three types of agencies are represented. + - **Example 1**: A population of North Carolina residents could be stratified into urban and rural areas, and then an SRS of residents from both rural and urban areas is selected independently. This ensures there are residents from both areas in the sample. + - **Example 2**: Law enforcement agencies could be stratified into the three primary general-purpose categories in the U.S.: local police, sheriff's departments, and state police. A SRS of agencies from each of the three types is then selected independently to ensure all three types of agencies are represented. #### The math {-} -Let $\bar{y}_h$ be the sample mean for stratum $h$, $N_h$ be the population size of stratum $h$, and $n_h$ be the sample size of stratum $h$. Then the estimate for the population mean under stratified SRS sampling is: +Let $\bar{y}_h$ be the sample mean for stratum $h$, $N_h$ be the population size of stratum $h$, $n_h$ be the sample size of stratum $h$, and $H$ is the total number of strata. Then, the estimate for the population mean under stratified SRS sampling is: $$\bar{y}=\frac{1}{N}\sum_{h=1}^H N_h\bar{y}_h$$ and the estimate of the standard error of $\bar{y}$ is: @@ -289,19 +289,19 @@ and the estimate of the standard error of $\bar{y}$ is: $$se(\bar{y})=\sqrt{\frac{1}{N^2} \sum_{h=1}^H N_h^2 \frac{s_h^2}{n_h}\left(1-\frac{n_h}{N_h}\right)} $$ where -$$s_h^2=\frac{1}{n_h-1}\sum_{i=1}^{n_h}\left(y_{i,h}-\bar{y}_h\right)^2.$$ +$$s_h^2=\frac{1}{n_h-1}\sum_{i=1}^{n_h}\left(y_{i,h}-\bar{y}_h\right)^2$$ -For estimates of proportions, let $\hat{p}_h$ be the estimated proportion in stratum $h$. Then the population proportion estimate is: +For estimates of proportions, let $\hat{p}_h$ be the estimated proportion in stratum $h$. Then, the population proportion estimate is: $$\hat{p}= \frac{1}{N}\sum_{h=1}^H N_h \hat{p}_h$$ -where $H$ is the total number of strata. The standard error of the proportion is: +The standard error of the proportion is: $$se(\hat{p}) = \frac{1}{N} \sqrt{ \sum_{h=1}^H N_h^2 \frac{\hat{p}_h(1-\hat{p}_h)}{n_h-1} \left(1-\frac{n_h}{N_h}\right)}$$ #### The syntax {-} -In addition to the `fpc` and `weights` arguments discussed in the types above, stratified designs requires the addition of the `strata` argument. 
For example, to specify a stratified SRS design in {srvyr} when using the FPC, that is, where the population sizes of the strata are not too large and are known, specify the design as:
+In addition to the `fpc` and `weights` arguments discussed in the types above, stratified designs require the addition of the `strata` argument. For example, to specify a stratified SRS design in {srvyr} when using the FPC, that is, where the population sizes of the strata are not too large and are known, we specify the design as:

```r
stsrs1_des <- dat %>%
@@ -309,7 +309,7 @@ stsrs1_des <- dat %>%
                   strata = stratvar)
```

-where `fpcvar` is a variable on our data that indicates $N_h$ for each row, and `stratavar` is a variable indicating the stratum for each row. You can omit the FPC if it is not applicable. Additionally, we can indicate the weight variable if it is present where `wtvar` is a variable on our data with a numeric weight.
+where `fpcvar` is a variable on our data that indicates $N_h$ for each row, and `stratavar` is a variable indicating the stratum for each row. We can omit the FPC if it is not applicable. Additionally, we can indicate the weight variable if it is present where `wtvar` is a variable on our data with a numeric weight.

```r
stsrs2_des <- dat %>%
@@ -333,7 +333,7 @@ apistrat_slim %>%
  count(stype, fpc)
```

-The FPC is the same for each case within each stratum. This output also shows that 100 elementary schools, 50 middle schools, and 50 high schools were sampled. It is often common for the number of units sampled from each strata to be different based on the goals of the project, or to mirror the size of each strata in the population. This design should be specified as follows:
+The FPC is the same for each case within each stratum. This output also shows that 100 elementary schools, 50 middle schools, and 50 high schools were sampled. It is common for the number of units sampled from each stratum to differ based on the goals of the project or to mirror the size of each stratum in the population. We specify the design as:

```{r}
#| label: samp-des-apistrat-des

apistrat_des <- apistrat_slim %>%
@@ -346,24 +346,24 @@ apistrat_des

summary(apistrat_des)
```

-When printing the object, it is specified as a "Stratified Independent Sampling design," also known as a stratified SRS, and the strata variable is included. Printing the summary we see a distribution of probabilities, as we saw with SRS, but we also see the sample and populations sizes by stratum.
+When printing the object, it is specified as a "Stratified Independent Sampling design," also known as a stratified SRS, and the strata variable is included. Printing the summary, we see a distribution of probabilities, as we saw with SRS, but we also see the sample and population sizes by stratum.

### Clustered sampling {#samp-cluster}

-Clustered sampling occurs when a population is divided into mutually exclusive subgroups called clusters or primary sampling units (PSUs). A random selection of PSUs is sampled, and then another level of sampling is done within these clusters. There can be multiple levels of this selection. Clustered sampling is often used when a list of the entire population is not available, or data collection involves interviewers needing direct contact with respondents.
+Clustered sampling occurs when a population is divided into mutually exclusive subgroups called clusters or primary sampling units (PSUs.) A random selection of PSUs is sampled, and then another level of sampling is done within these clusters. 
There can be multiple levels of this selection. Clustered sampling is often used when a list of the entire population is not available or data collection involves interviewers needing direct contact with respondents. - - **Requirements**: There must be a way to divide the population into clusters. Clusters are commonly structural such as institutions (e.g., schools, prisons) or geography (e.g., states, counties). + - **Requirements**: There must be a way to divide the population into clusters. Clusters are commonly structural, such as institutions (e.g., schools, prisons) or geography (e.g., states, counties.) - **Advantages**: - Clustered sampling is advantageous when data collection is done in person, so interviewers are sent to specific sampled areas rather than completely at random across a country. - - With clustered sampling, a list of the entire population is not necessary. For example, if sampling students, we do not need a list of all students but only a list of all schools. Once the schools are sampled, lists of students can be obtained within the sampled schools. + - With clustered sampling, a list of the entire population is not necessary. For example, if sampling students, we do not need a list of all students, but only a list of all schools. Once the schools are sampled, lists of students can be obtained within the sampled schools. - **Disadvantages**: Compared to a simple random sample for the same sample size, clustered samples generally have larger standard errors of estimates. - **Examples**: - - **Example 1**: Consider a study needing a sample of 6th-grade students in the United States, no list likely exists of all these students. However, it is more likely to obtain a list of schools that have 6th graders, so a study design could select a random sample of schools that have 6th graders. The selected schools can then provide a list of students to do a second stage of sampling where 6th-grade students are randomly sampled within each of the sampled schools. This is a one-stage sample design (the one representing the number of clusters) and will be the type of design we will discuss in the formulas below. - - **Example 2**: Consider a study sending interviewers to households for a survey. This is a more complicated example that requires two levels of clustering (two-stage sample design) to efficiently use interviewers in geographic clusters. First, in the U.S., counties could be selected as the PSU, then Census block groups within counties could be selected as the secondary sampling unit (SSU). Households could then be randomly sampled within the block groups. This type of design is popular for in-person surveys as it reduces the travel necessary for interviewers. + - **Example 1**: Consider a study needing a sample of 6th-grade students in the United States. No list likely exists of all these students. However, it is more likely to obtain a list of schools that enroll 6th graders, so a study design could select a random sample of schools that enroll 6th graders. The selected schools can then provide a list of students to do a second stage of sampling where 6th-grade students are randomly sampled within each of the sampled schools. This is a one-stage sample design (the one representing the number of clusters) and is the type of design we discuss in the formulas below. + - **Example 2**: Consider a study sending interviewers to households for a survey. 
This is a more complicated example that requires two levels of clustering (two-stage sample design) to efficiently use interviewers in geographic clusters. First, in the U.S., counties could be selected as the PSU and then census block groups within counties could be selected as the secondary sampling unit (SSU.) Households could then be randomly sampled within the block groups. This type of design is popular for in-person surveys as it reduces the travel necessary for interviewers.

#### The math {-}

-Consider a survey where a sample of $a$ clusters are sampled from a population of $A$ clusters via SRS. Units within each sampled cluster are sampled via SRS as well. Within each sampled cluster, $i$, there are $B_i$ units and $b_i$ units are sampled via SRS. Let $\bar{y}_{i}$ be the sample mean of cluster $i$. Then, a ratio estimator of the population mean is:
+Consider a survey where a sample of $a$ clusters are sampled from a population of $A$ clusters via SRS. Units within each sampled cluster are sampled via SRS as well. Within each sampled cluster, $i$, there are $B_i$ units in the population and $b_i$ units are sampled via SRS. Let $\bar{y}_{i}$ be the sample mean of cluster $i$. Then, a ratio estimator of the population mean is:

$$\bar{y}=\frac{\sum_{i=1}^a B_i \bar{y}_{i}}{ \sum_{i=1}^a B_i}$$
Note this is a consistent but biased estimator. Often the population size is not known, so this is a method to estimate a mean without knowing the population size. The estimated standard error of the mean is:
@@ -378,12 +378,12 @@ where $\hat{y}_i =B_i\bar{y_i}$ .

The formula for the within-cluster variance ($s_i^2$) is:

-$$s_b^2=\frac{1}{a(b_i-1)} \sum_{j=1}^{b_i} \left(y_{ij}-\bar{y}_i\right)^2$$
+$$s_i^2=\frac{1}{a(b_i-1)} \sum_{j=1}^{b_i} \left(y_{ij}-\bar{y}_i\right)^2$$

where $y_{ij}$ is the outcome for sampled unit $j$ within cluster $i$.

#### The syntax {-}

-Clustered sampling designs require the addition of the `ids` argument which specifies what variables are the cluster levels. To specify a two-stage clustered design without replacement, use the following syntax:
+Clustered sampling designs require the addition of the `ids` argument, which specifies which variables are the cluster levels. To specify a two-stage clustered design without replacement, we specify the design as:

```r
clus2_des <- dat %>%
  as_survey_design(weights = wtvar,
                   ids = c(PSU, SSU),
@@ -392,26 +392,26 @@ clus2_des <- dat %>%
  fpc = c(A, B))
```

-where `PSU` and `SSU` are the variables indicating the PSU and SSU identifiers, and `A` and `B` are the variables indicating the population sizes for each level (i.e., `A` is the number of clusters, and `B` is the number of units within each cluster). Note that `A` will be the same for all records (within a strata), and `B` will be the same for all records within the same cluster.
+where `PSU` and `SSU` are the variables indicating the PSU and SSU identifiers, and `A` and `B` are the variables indicating the population sizes for each level (i.e., `A` is the number of clusters, and `B` is the number of units within each cluster.) Note that `A` is the same for all records, and `B` is the same for all records within the same cluster.

-If clusters were sampled with replacement or from a very large population, a FPC is unnecessary. Additionally, only the first stage of selection is necessary regardless of whether the units were selected with replacement at any stage. 
The subsequent stages of selection are ignored in computation as their contribution to the variance is overpowered by the first stage (see @sarndal2003model or @wolter2007introduction for a more in-depth discussion). Therefore, the syntax below will yield the same estimates in the end:
+If clusters were sampled with replacement or from a very large population, the FPC is unnecessary. Additionally, only the first stage of selection is necessary regardless of whether the units were selected with replacement at any stage. The subsequent stages of selection are ignored in computation as their contribution to the variance is overpowered by the first stage (see @sarndal2003model or @wolter2007introduction for a more in-depth discussion.) Therefore, the two design objects specified below yield the same estimates in the end:

```r
-clus2wra_des <- dat %>%
+clus2ex1_des <- dat %>%
  as_survey_design(weights = wtvar,
                   ids = c(PSU, SSU))

-clus2wrb_des <- dat %>%
+clus2ex2_des <- dat %>%
  as_survey_design(weights = wtvar,
                   ids = PSU)
```

-Note that there is one additional argument that is sometimes necessary which is `nest = TRUE`. This option relabels cluster IDs to enforce nesting within strata. Sometimes, as an example, there may be a cluster `1` and a cluster `2` within each stratum but these are actually different clusters. This option indicates that the repeated use of numbering does not mean it is the same cluster. If this option is not used and there are repeated cluster IDs across different strata, an error will be generated.
+Note that there is one additional argument that is sometimes necessary, which is `nest = TRUE`. This option relabels cluster IDs to enforce nesting within strata. Sometimes, as an example, there may be a cluster `1` within each stratum, but cluster `1` in stratum `1` is a different cluster than cluster `1` in stratum `2`. This option indicates that repeated numbering does not mean it is the same cluster. If this option is not used and there are repeated cluster IDs across different strata, an error is generated.

#### Example {-}

-The `survey` package includes a two-stage cluster sample data, `apiclus2`, in which school districts were sampled, and then a random sample of five schools was selected within each district. For districts with fewer than five schools, all schools were sampled. School districts are identified by `dnum`, and schools are identified by `snum`. The variable `fpc1` indicates how many districts there are in California (`A`), and `fpc2` indicates how many schools were in a given district with at least 100 students (`B`). The data has a row for each school. In the data printed below, there are 757 school districts, as indicated by `fpc1`, and there are nine schools in District 731, one school in District 742, two schools in District 768, and so on as indicated by `fpc2`. For illustration purposes, the object `apiclus2_slim` has been created from `apiclus2`, which subsets the data to only the necessary columns and sorts data.
+The `survey` package includes a two-stage cluster sample data, `apiclus2`, in which school districts were sampled, and then a random sample of five schools was selected within each district. For districts with fewer than five schools, all schools were sampled. School districts are identified by `dnum`, and schools are identified by `snum`. 
The variable `fpc1` indicates how many districts there are in California (the total number of PSUs or `A`), and `fpc2` indicates how many schools were in a given district with at least 100 students (the total number of SSUs or `B`.) The data include a row for each school. In the data printed below, there are 757 school districts, as indicated by `fpc1`, and nine schools in District 731, one school in District 742, two schools in District 768, and so on, as indicated by `fpc2`. For illustration purposes, the object `apiclus2_slim` has been created from `apiclus2`, which subsets the data to only the necessary columns and sorts the data. ```{r} #| label: samp-des-api2clus-dis @@ -424,37 +424,40 @@ apiclus2_slim <- apiclus2_slim ``` -To specify this design in R, the following syntax should be used: +To specify this design in R, we use the following: ```{r} #| label: samp-des-api2clus-des apiclus2_des <- apiclus2_slim %>% - as_survey_design(ids = c(dnum, snum), - fpc = c(fpc1, fpc2), - weights = pw) + as_survey_design( + ids = c(dnum, snum), + fpc = c(fpc1, fpc2), + weights = pw + ) apiclus2_des summary(apiclus2_des) ``` -The design objects are described as "2 - level Cluster Sampling design" and include the ids (cluster), FPC, and weight variables. The summary notes that the sample includes 40 first-level clusters (PSUs), which are school districts, and 126 second-level clusters (SSUs), which are schools. Additionally, the summary includes a numeric summary of the probabilities of selection and the population size (number of PSUs) as 757. +The design objects are described as "2 - level Cluster Sampling design," and include the ids (cluster), FPC, and weight variables. The summary notes that the sample includes 40 first-level clusters (PSUs), which are school districts, and 126 second-level clusters (SSUs), which are schools. Additionally, the summary includes a numeric summary of the probabilities of selection and the population size (number of PSUs) as 757. ## Combining sampling methods {#samp-combo} -SRS, stratified, and clustered designs are the backbone of sampling designs, and the features are often combined in one design. Additionally, rather than using SRS for selection, other sampling mechanisms are commonly used, such as probability proportional to size (PPS), systematic sampling, or selection with unequal probabilities, which are briefly described here. In PPS sampling, a size measure is constructed for each unit (e.g., the population of the PSU or the number of occupied housing units) and then units with larger size measures are more likely to be sampled. Systematic sampling is commonly used to ensure representation across a population. Units are sorted by a feature and then every $k$ units are selected from a random start point so the sample is spread across the population. In addition to PPS, other unequal probabilities of selection may be used. For example, in a study of establishments (e.g., businesses or public institutions) that conducts a survey every year, an establishment that recently participated (e.g., participated last year) may have a reduced chance of selection in a subsequent round to reduce the burden on the establishment. To learn more about sampling designs, refer to @valliant2013practical, @cox2011business, @cochran1977sampling, and @deming1991sample. +SRS, stratified, and clustered designs are the backbone of sampling designs, and the features are often combined in one design. 
Additionally, rather than using SRS for selection, other sampling mechanisms are commonly used, such as probability proportional to size (PPS), systematic sampling, or selection with unequal probabilities, which are briefly described here. In PPS sampling, a size measure is constructed for each unit (e.g., the population of the PSU or the number of occupied housing units), and units with larger size measures are more likely to be sampled. Systematic sampling is commonly used to ensure representation across a population. Units are sorted by a feature, and then every $k$ units is selected from a random start point so the sample is spread across the population. In addition to PPS, other unequal probabilities of selection may be used. For example, in a study of establishments (e.g., businesses or public institutions) that conducts a survey every year, an establishment that recently participated (e.g., participated last year) may have a reduced chance of selection in a subsequent round to reduce the burden on the establishment. To learn more about sampling designs, refer to @valliant2013practical, @cox2011business, @cochran1977sampling, and @deming1991sample. -A common method of sampling is to stratify PSUs, select PSUs within the stratum using PPS selection, and then select units within the PSUs either with SRS or PPS. Reading survey documentation is an important first step in survey analysis to understand the design of the survey we are using and variables necessary to specify the design. Good documentation will highlight the variables necessary to specify the design. This is often found in User's Guides, methodology, analysis guides, or technical documentation (see Chapter \@ref(c03-survey-data-documentation) for more details). +A common method of sampling is to stratify PSUs, select PSUs within the stratum using PPS selection, and then select units within the PSUs either with SRS or PPS. Reading survey documentation is an important first step in survey analysis to understand the design of the survey we are using and variables necessary to specify the design. Good documentation highlights the variables necessary to specify the design. This is often found in the user guide, methodology report, analysis guide, or technical documentation (see Chapter \@ref(c03-survey-data-documentation) for more details.) ### Example {-} -For example, the (2017-2019 National Survey of Family Growth)[ https://www.cdc.gov/nchs/data/nsfg/NSFG-2017-2019-Sample-Design-Documentation-508.pdf] (NSFG) had a stratified multi-stage area probability sample: +For example, the [2017-2019 National Survey of Family Growth](https://www.cdc.gov/nchs/data/nsfg/NSFG-2017-2019-Sample-Design-Documentation-508.pdf) had a stratified multi-stage area probability sample: + 1. In the first stage, PSUs are counties or collections of counties and are stratified by Census region/division, size (population), and MSA status. Within each stratum, PSUs were selected via PPS. 2. In the second stage, neighborhoods were selected within the sampled PSUs using PPS selection. 3. In the third stage, housing units were selected within the sampled neighborhoods. - 4. In the fourth stage, a person was randomly chosen within the selected housing units among eligible persons using unequal probabilities based on the person's age and sex. + 4. In the fourth stage, a person was randomly chosen among eligible persons within the selected housing units using unequal probabilities based on the person's age and sex. 
-The public use file does not include all these levels of selection and instead has pseudo-strata and pseudo-clusters, which are the variables used in R to specify the design. As specified on page 4 of the documentation, the stratum variable is `SEST`, the cluster variable is `SECU`, and the weight variable is `WGT2017_2019`. Thus, to specify this design in R, use the following syntax: +The public use file does not include all these levels of selection and instead has pseudo-strata and pseudo-clusters, which are the variables used in R to specify the design. As specified on page 4 of the documentation, the stratum variable is `SEST`, the cluster variable is `SECU`, and the weight variable is `WGT2017_2019`. Thus, to specify this design in R, we use the following syntax: ```r nsfg_des <- nsfgdata %>% @@ -465,27 +468,27 @@ nsfg_des <- nsfgdata %>% ## Replicate weights -Replicate weights are often included on analysis files instead of, or in addition to, the design variables (strata and PSUs). Replicate weights are used as another method to estimate variability. Often researchers choose to use replicate weights to avoid publishing design variables (strata or clustering variables) as a measure to reduce the risk of disclosure. There are several types of replicate weights, including balanced repeated replication (BRR), Fay's BRR, jackknife, and bootstrap methods. An overview of the process for using replicate weights is as follows: +Replicate weights are often included on analysis files instead of, or in addition to, the design variables (strata and PSUs.) Replicate weights are used as another method to estimate variability. Often, researchers choose to use replicate weights to avoid publishing design variables (strata or clustering variables) as a measure to reduce the risk of disclosure. There are several types of replicate weights, including balanced repeated replication (BRR), Fay's BRR, jackknife, and bootstrap methods. An overview of the process for using replicate weights is as follows: 1. Divide the sample into subsample replicates that mirror the design of the sample 2. Calculate weights for each replicate using the same procedures for the full-sample weight (i.e., nonresponse and post-stratification) 3. Calculate estimates for each replicate using the same method as the full-sample estimate -4. Calculate the estimated variance, which will be proportional to the variance of the replicate estimates +4. Calculate the estimated variance, which is proportional to the variance of the replicate estimates -The different types of replicate weights largely differ between step 1 (how the sample is divided into subsamples) and step 4 (which multiplication factors (scales) are used to multiply the variance). The general format for the standard error is: +The different types of replicate weights largely differ between step 1 (how the sample is divided into subsamples) and step 4 (which multiplication factors (scales) are used to multiply the variance.) The general format for the standard error is: $$ \sqrt{\alpha \sum_{r=1}^R \alpha_r (\hat{\theta}_r - \hat{\theta})^2 }$$ where $R$ is the number of replicates, $\alpha$ is a constant that depends on the replication method, $\alpha_r$ is a factor associated with each replicate, $\hat{\theta}$ is the weighted estimate based on the full sample, and $\hat{\theta}_r$ is the weighted estimate of $\theta$ based on the $r^{\text{th}}$ replicate. 
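To make this formula concrete, here is a minimal base R sketch (with made-up numbers, not taken from any survey in this book) that computes the replicate-based standard error by hand; in practice, the design object functions described below handle this for us:

```r
# Hypothetical full-sample estimate and R = 4 replicate estimates
theta_hat <- 52.3
theta_reps <- c(51.8, 52.9, 52.1, 53.0)

# The constants depend on the replication method; here we assume
# alpha = 1/R and alpha_r = 1 for every replicate (as in BRR)
R <- length(theta_reps)
alpha <- 1 / R
alpha_r <- rep(1, R)

sqrt(alpha * sum(alpha_r * (theta_reps - theta_hat)^2))
```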
-To create the design object for surveys with replicate weights, we use `as_survey_rep()` instead of `as_survey_design()` that we use for the common sampling designs in the sections above. +To create the design object for surveys with replicate weights, we use `as_survey_rep()` instead of `as_survey_design()`, which we use for the common sampling designs in the sections above. ### Balanced Repeated Replication (BRR) method -The BRR method requires a stratified sample design with two PSUs in each stratum. Each replicate is constructed by deleting one PSU per stratum using a Hadamard matrix. For the PSU that is included, the weight is generally multiplied by two but may have other adjustments, such as post-stratification. A Hadamard matrix is a special square matrix with entries of +1 or -1 with mutually orthogonal rows. Hadamard matrices must have one row, two rows, or a multiple of four rows. The size of the Hadamard matrix is determined by the first multiple of 4 greater than or equal to the number of strata. For example, if a survey had 7 strata, the Hadamard matrix would be an $8\times8$ matrix. Additionally, a survey with 8 strata would also have an $8\times8$ Hadamard matrix. The columns in the matrix specify the strata and the rows specify the replicate. In each replicate (row), a +1 means to use the first PSU and a -1 means to use the second PSU in the estimate. For example, here is a $4\times4$ Hadamard matrix: +The BRR method requires a stratified sample design with two PSUs in each stratum. Each replicate is constructed by deleting one PSU per stratum using a Hadamard matrix. For the PSU that is included, the weight is generally multiplied by two but may have other adjustments, such as post-stratification. A Hadamard matrix is a special square matrix with entries of +1 or -1 with mutually orthogonal rows. Hadamard matrices must have one row, two rows, or a multiple of four rows. The size of the Hadamard matrix is determined by the first multiple of 4 greater than or equal to the number of strata. For example, if a survey had seven strata, the Hadamard matrix would be an $8\times8$ matrix. Additionally, a survey with eight strata would also have an $8\times8$ Hadamard matrix. The columns in the matrix specify the strata and the rows specify the replicate. In each replicate (row), a +1 means to use the first PSU and a -1 means to use the second PSU in the estimate. For example, here is a $4\times4$ Hadamard matrix: $$ \begin{array}{rrrr} +1 &+1 &+1 &+1\\ +1&-1&+1&-1\\ +1&+1&-1&-1\\ +1 &-1&-1&+1 \end{array} $$ -In the first replicate (row), all the values are +1, so in each stratum, the first PSU would be used in the estimate. In the second replicate, the first PSU would be used in stratum 1 and 3, while the second PSU would be used in stratum 2 and 4. In the third replicate, the first PSU would be used in stratum 1 and 2, while the second PSU would be used in strata 3 and 4. Finally, in the fourth replicate, the first PSU would be used in strata 1 and 4, while the second PSU would be used in strata 2 and 3. For more information about Hadamard matrices see @wolter2007introduction. Note that supplied BRR weights from a data provider will already incorporate this adjustment, and the {survey} package generates the Hadamard matrix, if necessary for calculating BRR weights so an analyst will not need to provide the matrix. +In the first replicate (row), all the values are +1, so in each stratum, the first PSU would be used in the estimate. 
In the second replicate, the first PSU would be used in strata 1 and 3, while the second PSU would be used in strata 2 and 4. In the third replicate, the first PSU would be used in strata 1 and 2, while the second PSU would be used in strata 3 and 4. Finally, in the fourth replicate, the first PSU would be used in strata 1 and 4, while the second PSU would be used in strata 2 and 3. For more information about Hadamard matrices, see @wolter2007introduction. Note that supplied BRR weights from a data provider already incorporate this adjustment, and the {survey} package generates the Hadamard matrix, if necessary, for calculating BRR weights, so an analyst does not need to create or provide the matrix. #### The math {-} @@ -493,17 +496,17 @@ A weighted estimate for the full sample is calculated as $\hat{\theta}$, and the $$se(\hat{\theta})=\sqrt{\frac{1}{R} \sum_{r=1}^R \left( \hat{\theta}_r-\hat{\theta}\right)^2}$$ -Specifying replicate weights in R requires specifying the type of replicate weights, the main weight variable, the replicate weight variables, and other options. One of the key options is for the mean squared error (MSE). If `mse=TRUE`, variances are computed around the point estimate $(\hat{\theta})$, whereas if `mse=FALSE`, variances are computed around the mean of the replicates $(\bar{\theta})$ instead which looks like this: +Specifying replicate weights in R requires specifying the type of replicate weights, the main weight variable, the replicate weight variables, and other options. One of the key options is for the mean squared error (MSE.) If `mse=TRUE`, variances are computed around the point estimate $(\hat{\theta})$, whereas if `mse=FALSE`, variances are computed around the mean of the replicates $(\bar{\theta})$ instead, which looks like this: $$se(\hat{\theta})=\sqrt{\frac{1}{R} \sum_{r=1}^R \left( \hat{\theta}_r-\bar{\theta}\right)^2}$$ where $$\bar{\theta}=\frac{1}{R}\sum_{r=1}^R \hat{\theta}_r$$ -The default option for `mse` is to use the global option of "survey.replicates.mse" which is set to `FALSE` initially unless a user changes it. To determine if `mse` should be set to `TRUE` or `FALSE`, read the survey documentation. If there is no indication in the survey documentation, for BRR, we recommend setting `mse` to `TRUE` as this is the default in other software (e.g., SAS, SUDAAN). +The default option for `mse` is to use the global option of "survey.replicates.mse" which is set to `FALSE` initially unless a user changes it. To determine if `mse` should be set to `TRUE` or `FALSE`, read the survey documentation. If there is no indication in the survey documentation for BRR, we recommend setting `mse` to `TRUE` as this is the default in other software (e.g., SAS, SUDAAN.) #### The syntax {-} -Replicate weights generally come in groups and are sequentially numbered, such as PWGTP1, PWGTP2, ..., PWGTP80 for the person weights in the American Community Survey (ACS) [@acs-pums-2021] or BRRWT1, BRRWT2, ..., BRRWT96 in the 2015 Residential Energy Consumption Survey (RECS) [@recs-2015-micro]. This makes it easy to use some of the (tidy selection)[https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html] functions in R. +Replicate weights generally come in groups and are sequentially numbered, such as PWGTP1, PWGTP2, ..., PWGTP80 for the person weights in the American Community Survey (ACS) [@acs-pums-2021] or BRRWT1, BRRWT2, ..., BRRWT96 in the 2015 Residential Energy Consumption Survey (RECS) [@recs-2015-micro]. 
This makes it easy to use some of the [tidy selection](https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html) functions in R. -To specify a BRR design, we need to specify the weight variable (`weights`), the replicate weight variables (`repweights`), the type of replicate weights is BRR (`type = BRR`), and whether the mean squared error should be used (`mse = TRUE`) or not (`mse = FALSE`). For example, if a dataset had WT0 for the main weight and had 20 BRR weights indicated WT1, WT2, ..., WT20, we can use the following syntax (both are equivalent): +To specify a BRR design, we need to specify the weight variable (`weights`), the replicate weight variables (`repweights`), the type of replicate weights as BRR (`type = BRR`), and whether the mean squared error should be used (`mse = TRUE`) or not (`mse = FALSE`.) For example, if a dataset had WT0 for the main weight and had 20 BRR weights indicated WT1, WT2, ..., WT20, we can use the following syntax (both are equivalent): ```r brr_des <- dat %>% @@ -519,7 +522,7 @@ brr_des <- dat %>% mse = TRUE) ``` -If a dataset had WT for the main weight and had 20 BRR weights indicated REPWT1, REPWT2, ..., REPWT20, the following syntax could be used (both are equivalent): +If a dataset had WT for the main weight and had 20 BRR weights indicated REPWT1, REPWT2, ..., REPWT20, we can use the following syntax (both are equivalent): ```r brr_des <- dat %>% @@ -535,7 +538,7 @@ brr_des <- dat %>% mse = TRUE) ``` -If the replicate weight variables are in the file consecutively, the following syntax can also be used: +If the replicate weight variables are in the file consecutively, we can also use the following syntax: ```r brr_des <- dat %>% @@ -545,7 +548,7 @@ brr_des <- dat %>% mse = TRUE) ``` -Typically, each replicate weight sums to a value similar to the main weight, as both the replicate weights and the main weight are supposed to provide population estimates. Rarely, an alternative method will be used where the replicate weights have values of 0 or 2 in the case of BRR weights. This would be indicated in the documentation (see Chapter \@ref(c03-survey-data-documentation) for more information on how to understand the provided documentation). In this case, the replicate weights are not combined, and the option `combined_weights = FALSE` should be indicated, as the default value for this argument is TRUE. This specific syntax is shown below: +Typically, each replicate weight sums to a value similar to the main weight, as both the replicate weights and the main weight are supposed to provide population estimates. Rarely, an alternative method is used where the replicate weights have values of 0 or 2 in the case of BRR weights. This would be indicated in the documentation (see Chapter \@ref(c03-survey-data-documentation) for more information on reading documentation.) In this case, the replicate weights are not combined, and the option `combined_weights = FALSE` should be indicated, as the default value for this argument is `TRUE`. This specific syntax is shown below: ```r brr_des <- dat %>% @@ -558,7 +561,7 @@ brr_des <- dat %>% #### Example {-} -The {survey} package includes a data example from Section 12.2 of @levy2013sampling. In this fictional data, two out of five ambulance stations were sampled from each of three emergency service areas (ESAs), thus BRR weights are appropriate with 2 PSUs (stations) sampled in each stratum (ESA). In the code below, BRR weights are created as was done by @levy2013sampling. 
+The {survey} package includes a data example from Section 12.2 of @levy2013sampling. In this fictional data, two out of five ambulance stations were sampled from each of three emergency service areas (ESAs), thus BRR weights are appropriate with 2 PSUs (stations) sampled in each stratum (ESA.) In the code below, we create BRR weights as was done by @levy2013sampling. ```{r} #| label: samp-des-brr-display @@ -573,7 +576,7 @@ scdbrr <- scd %>% scdbrr ``` -To specify the BRR weights, the following syntax is used: +To specify the BRR weights, we use the following syntax: ```{r} #| label: samp-scdbrr-des @@ -588,11 +591,11 @@ scdbrr_des summary(scdbrr_des) ``` -Note that `combined_weights` was specified as `FALSE` because these weights are simply specified as 0 and 2 and do not incorporate the overall weight. When printing the object, the type of replication is noted as Balanced Repeated Replicates, and the replicate weights and the weight variable are specified. Additionally, the summary lists the variables included. +Note that `combined_weights` was specified as `FALSE` because these weights are simply specified as 0 and 2 and do not incorporate the overall weight. When printing the object, the type of replication is noted as Balanced Repeated Replicates, and the replicate weights and the weight variable are specified. Additionally, the summary lists the variables included in the data and design object. ### Fay's BRR method -Fay's BRR method for replicate weights is similar to the BRR method in that it uses a Hadamard matrix to construct replicate weights. However, rather than deleting PSUs for each replicate, with Fay's BRR half of the PSUs have a replicate weight which is the main weight multiplied by $\rho$, and the other half have the main weight multiplied by $(2-\rho)$ where $0 \le \rho < 1$. Note that when $\rho=0$, this is equivalent to the standard BRR weights, and as $\rho$ becomes closer to 1, this method is more similar to jackknife discussed in the next section. To obtain the value of $\rho$, it is necessary to read the survey documentation (see Chapter \@ref(c03-survey-data-documentation)). +Fay's BRR method for replicate weights is similar to the BRR method in that it uses a Hadamard matrix to construct replicate weights. However, rather than deleting PSUs for each replicate, with Fay's BRR, half of the PSUs have a replicate weight, which is the main weight multiplied by $\rho$, and the other half have the main weight multiplied by $(2-\rho)$, where $0 \le \rho < 1$. Note that when $\rho=0$, this is equivalent to the standard BRR weights, and as $\rho$ becomes closer to 1, this method is more similar to jackknife discussed in Section \@ref(samp-jackknife). To obtain the value of $\rho$, it is necessary to read the survey documentation (see Chapter \@ref(c03-survey-data-documentation).) #### The math {-} @@ -602,7 +605,7 @@ $$se(\hat{\theta})=\sqrt{\frac{1}{R (1-\rho)^2} \sum_{r=1}^R \left( \hat{\theta} #### The syntax {-} -The syntax is very similar for BRR and Fay's BRR. To specify a Fay's BRR design, we need to specify the weight variable (`weights`), the replicate weight variables (`repweights`), the type of replicate weights is Fay's BRR (`type = Fay`), whether the mean squared error should be used (`mse = TRUE`) or not (`mse = FALSE`), and Fay's multiplier (`rho`). 
For example, if a dataset had WT0 for the main weight and had 20 BRR weights indicated as WT1, WT2, ..., WT20, and Fay's multiplier is 0.3, use the following syntax: +The syntax is very similar for BRR and Fay's BRR. To specify a Fay's BRR design, we need to specify the weight variable (`weights`), the replicate weight variables (`repweights`), the type of replicate weights as Fay's BRR (`type = Fay`), whether the mean squared error should be used (`mse = TRUE`) or not (`mse = FALSE`), and Fay's multiplier (`rho`.) For example, if a dataset had WT0 for the main weight and had 20 BRR weights indicated as WT1, WT2, ..., WT20, and Fay's multiplier is 0.3, we use the following syntax: ```r fay_des <- dat %>% @@ -615,16 +618,7 @@ fay_des <- dat %>% #### Example {-} -The 2015 RECS [@recs-2015-micro] uses Fay's BRR weights with the final weight as NWEIGHT and replicate weights as BRRWT1 - BRRWT96 and the documentation specifies a Fay's multiplier of 0.5. On the file, DOEID is a unique identifier for each respondent, TOTALDOL is the total cost of energy, TOTSQFT_EN is the total square footage of the residence, and REGOINC is the Census region. We have already pulled in the 2015 RECS data from the {srvyrexploR} package that provides data for this book. To specify the design for the `recs_2015` data, use the following syntax: - -```{r} -#| label: samp-des-recs-2015-read -#| echo: FALSE -#| warning: FALSE -#| message: FALSE -data(recs_2015) -``` - +The 2015 RECS [@recs-2015-micro] uses Fay's BRR weights with the final weight as NWEIGHT and replicate weights as BRRWT1 - BRRWT96, and the documentation specifies a Fay's multiplier of 0.5. On the file, DOEID is a unique identifier for each respondent, TOTALDOL is the total energy cost, TOTSQFT_EN is the total square footage of the residence, and REGOINC is the Census region. We use the 2015 RECS data from the {srvyrexploR} package that provides data for this book (see the prerequisites box at the beginning of this chapter.) To specify the design for the `recs_2015` data, we use the following syntax: ```{r} #| label: samp-des-recs-des @@ -643,13 +637,13 @@ summary(recs_2015_des) ``` -In specifying the design, the `variables` option was also used to include which variables might be used in analyses. This is optional but can make our object smaller and easier to work with. When printing the design object or looking at the summary, the replicate weight type is re-iterated as `Fay's variance method (rho= 0.5) with 96 replicates and MSE variances`, and the variables are included. No weight or probability summary is included in this output as we have seen in some other design objects. +In specifying the design, the `variables` option was also used to include which variables might be used in analyses. This is optional but can make our object smaller and easier to work with. When printing the design object or looking at the summary, the replicate weight type is re-iterated as `Fay's variance method (rho= 0.5) with 96 replicates and MSE variances`, and the variables are included. No weight or probability summary is included in this output, as we have seen in some other design objects. -### Jackknife method +### Jackknife method {#samp-jackknife} -There are three jackknife estimators implemented in {srvyr} - jackknife 1 (JK1), jackknife n (JKn), and jackknife 2 (JK2). The JK1 method can be used for unstratified designs, and replicates are created by removing one PSU at a time so the number of replicates is the same as the number of PSUs. 
If there is no clustering, then the PSU is the ultimate sampling unit (e.g., unit). +There are three jackknife estimators implemented in {srvyr} - jackknife 1 (JK1), jackknife n (JKn), and jackknife 2 (JK2.) The JK1 method can be used for unstratified designs, and replicates are created by removing one PSU at a time so the number of replicates is the same as the number of PSUs. If there is no clustering, then the PSU is the ultimate sampling unit (e.g., students.) -The JKn method is used for stratified designs and requires two or more PSUs per stratum. In this case, each replicate is created by deleting one PSU from a single stratum, so the number of replicates is the number of total PSUs across all strata. The JK2 method is a special case of JKn when there are exactly 2 PSUs sampled per stratum. For variance estimation, scaling constants must also be specified. +The JKn method is used for stratified designs and requires two or more PSUs per stratum. In this case, each replicate is created by deleting one PSU from a single stratum, so the number of replicates is the number of total PSUs across all strata. The JK2 method is a special case of JKn when there are exactly 2 PSUs sampled per stratum. For variance estimation, we also need to specify the scaling constants. #### The math {-} @@ -662,7 +656,7 @@ $$se(\hat{\theta})=\sqrt{\sum_{r=1}^R \alpha_r \left( \hat{\theta}_r-\hat{\theta #### The syntax {-} -To specify the jackknife method, we use the survey documentation to understand the type of jackknife (1, n, or 2) and the multiplier. In the syntax we need to specify the weight variable (`weights`), the replicate weight variables (`repweights`), the type of replicate weights as jackknife 1 (`type = "JK1"`), n (`type = "JKN"`), or 2 (`type = "JK2"`), whether the mean squared error should be used (`mse = TRUE`) or not (`mse = FALSE`), and the multiplier (`scale`). For example, if the survey is a jackknife 1 method with a multiplier of $\alpha_r=(R-1)/R=19/20=0.95$, the dataset has WT0 for the main weight and 20 replicate weights indicated as WT1, WT2, ..., WT20, use the following syntax: +To specify the jackknife method, we use the survey documentation to understand the type of jackknife (1, n, or 2) and the multiplier. In the syntax, we need to specify the weight variable (`weights`), the replicate weight variables (`repweights`), the type of replicate weights as jackknife 1 (`type = "JK1"`), n (`type = "JKN"`), or 2 (`type = "JK2"`), whether the mean squared error should be used (`mse = TRUE`) or not (`mse = FALSE`), and the multiplier (`scale`.) For example, if the survey is a jackknife 1 method with a multiplier of $\alpha_r=(R-1)/R=19/20=0.95$, the dataset has WT0 for the main weight and 20 replicate weights indicated as WT1, WT2, ..., WT20, we use the following syntax: ```r jk1_des <- dat %>% @@ -673,7 +667,7 @@ jk1_des <- dat %>% scale=0.95) ``` -For a jackknife n method, we need to specify the multiplier for all replicates. In this case we use the `rscales` argument to specify each one. The documentation will provide details on what the multipliers ($\alpha_r$) are, and they may be the same for all replicates. For example, consider a case where $\alpha_r=0.1$ for all replicates and the dataset had WT0 for the main weight and had 20 replicate weights indicated as WT1, WT2, ..., WT20. We specify the type as `type = "JKN"`, and the multiplier as `rscales=rep(0.1,20)`: +For a jackknife n method, we need to specify the multiplier for all replicates. 
In this case, we use the `rscales` argument to specify each one. The documentation provides details on what the multipliers ($\alpha_r$) are, and they may be the same for all replicates. For example, consider a case where $\alpha_r=0.1$ for all replicates, and the dataset had WT0 for the main weight and had 20 replicate weights indicated as WT1, WT2, ..., WT20. We specify the type as `type = "JKN"`, and the multiplier as `rscales=rep(0.1,20)`: ```r jkn_des <- dat %>% @@ -686,9 +680,9 @@ jkn_des <- dat %>% #### Example {-} -The 2020 RECS [@recs-2020-micro] uses jackknife weights with the final weight as NWEIGHT and replicate weights as NWEIGHT1 - NWEIGHT60 with a scale of $(R-1)/R=59/60$. On the file, DOEID is a unique identifier for each respondent, TOTALDOL is the total cost of energy, TOTSQFT_EN is the total square footage of the residence, and REGOINC is the Census region. We have already read in the RECS data and created a dataset called `recs_2020` above in the prerequisites. +The 2020 RECS [@recs-2020-micro] uses jackknife weights with the final weight as NWEIGHT and replicate weights as NWEIGHT1 - NWEIGHT60 with a scale of $(R-1)/R=59/60$. On the file, DOEID is a unique identifier for each respondent, TOTALDOL is the total cost of energy, TOTSQFT_EN is the total square footage of the residence, and REGOINC is the Census region. We use the 2020 RECS data from the {srvyrexploR} package that provides data for this book (see the prerequisites box at the beginning of this chapter.) -To specify this design, use the following syntax: +To specify this design, we use the following syntax: ```{r} #| label: samp-des-recs2020-des @@ -720,9 +714,6 @@ recs_des <- recs_2020 %>% scale = 59/60, mse = TRUE ) - - - ``` @@ -730,7 +721,7 @@ When printing the design object or looking at the summary, the replicate weight ### Bootstrap method -In bootstrap resampling, replicates are created by selecting random samples of the PSUs with replacement (SRSWR). If there are $M$ PSUs in the sample, then each replicate will be created by selecting a random sample of $M$ PSUs with replacement. Each replicate is created independently, and the weights for each replicate are adjusted to reflect the population, generally using the same method as how the analysis weight was adjusted. +In bootstrap resampling, replicates are created by selecting random samples of the PSUs with replacement (SRSWR.) If there are $A$ PSUs in the sample, then each replicate is created by selecting a random sample of $A$ PSUs with replacement. Each replicate is created independently, and the weights for each replicate are adjusted to reflect the population, generally using the same method as how the analysis weight was adjusted. #### The math {-} @@ -738,12 +729,12 @@ A weighted estimate for the full sample is calculated as $\hat{\theta}$, and the $$se(\hat{\theta})=\sqrt{\alpha \sum_{r=1}^R \left( \hat{\theta}_r-\hat{\theta}\right)^2}$$ -where $\alpha$ is the scaling constant. Note that the scaling constant ($\alpha$) is provided in the survey documentation as there are many types of bootstrap methods which generate custom scaling constants. +where $\alpha$ is the scaling constant. Note that the scaling constant ($\alpha$) is provided in the survey documentation, as there are many types of bootstrap methods that generate custom scaling constants. 
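Before turning to the syntax, here is a small illustrative sketch of the resampling step described above (the PSU labels are hypothetical and not from the book's data); drawing $A$ PSUs with replacement produces one bootstrap replicate, whose weights would then be adjusted to represent the population:

```r
# Illustrative only: one bootstrap replicate draws A PSUs with replacement
set.seed(1)
psu_ids <- paste0("psu_", 1:5) # hypothetical sample of A = 5 PSUs
sample(psu_ids, size = length(psu_ids), replace = TRUE)
```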
#### The syntax {-}

-To specify a bootstrap method, we need to specify the weight variable (`weights`), the replicate weight variables (`repweights`), the type of replicate weights as bootstrap (`type = "bootstrap"`), whether the mean squared error should be used (`mse = TRUE`) or not (`mse = FALSE`), and the multiplier (`scale`). For example, if a dataset had WT0 for the main weight, 20 bootstrap weights indicated WT1, WT2, ..., WT20, and a multiplier of $\alpha=.02$, use the following syntax:
+To specify a bootstrap method, we need to specify the weight variable (`weights`), the replicate weight variables (`repweights`), the type of replicate weights as bootstrap (`type = "bootstrap"`), whether the mean squared error should be used (`mse = TRUE`) or not (`mse = FALSE`), and the multiplier (`scale`.) For example, if a dataset had WT0 for the main weight, 20 bootstrap weights indicated as WT1, WT2, ..., WT20, and a multiplier of $\alpha=.02$, we use the following syntax:

```r
bs_des <- dat %>%
@@ -757,7 +748,7 @@ bs_des <- dat %>%

#### Example {-}

-Returning to the api example, we are going to create a dataset with bootstrap weights to use as an example. In this example, we construct a one-cluster design with fifty replicate weights.^[We provide the code here for you to replicate this example, but are not focusing on the creation of the weights as that is outside the scope of this book. We recommend you reference @wolter2007introduction for more information on creating bootstrap weights.]
+Returning to the api example, we are going to create a dataset with bootstrap weights to use as an example. In this example, we construct a one-cluster design with fifty replicate weights.^[We provide the code here to replicate this example, but are not focusing on the creation of the weights as that is outside the scope of this book. We recommend referencing @wolter2007introduction for more information on creating bootstrap weights.]

```{r}
#| label: samp-des-genbs
@@ -787,10 +778,10 @@ apiclus1_slim <- bwmata %>%

apiclus1_slim
```

-The output of `apiclus1_slim` includes the same variables we have seen in other api examples (see Table \@ref(tab:apidata)), but now additionally includes bootstrap weights `pw1`, ..., `pw50`. When creating the survey design object, we use the bootstrap weights as the replicate weights. Additionally, with replicate weights we need to include the scale ($\alpha$). For this example we created,
+The output of `apiclus1_slim` includes the same variables we have seen in other api examples (see Table \@ref(tab:apidata)), but now additionally includes bootstrap weights `pw1`, ..., `pw50`. When creating the survey design object, we use the bootstrap weights as the replicate weights. Additionally, with replicate weights we need to include the scale ($\alpha$.) For this example, we created:

-$$\alpha=\frac{M}{(M-1)(R-1)}=\frac{15}{(15-1)*(50-1)}=0.02186589$$
-where $M$ is the average number of PSUs per strata and $R$ is the number of replicates. There is only 1 stratum and the number of clusters/PSUs is 15 so $M=15$.
+$$\alpha=\frac{A}{(A-1)(R-1)}=\frac{15}{(15-1)*(50-1)}=0.02186589$$
+where $A$ is the average number of PSUs per stratum and $R$ is the number of replicates. There is only 1 stratum and the number of clusters/PSUs is 15, so $A=15$.
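As a quick arithmetic check (illustrative only), we can compute this scaling constant directly before passing it to the `scale` argument:

```r
# Verify the bootstrap scaling constant for A = 15 PSUs and R = 50 replicates
A <- 15
R <- 50
A / ((A - 1) * (R - 1))
#> [1] 0.02186589
```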
Using this information, we specify the design object as: ```{r} #| label: samp-des-bsexamp @@ -813,6 +804,6 @@ As with other replicate design objects, when printing the object or looking at t For this chapter, the exercises entail reading public documentation to determine how to specify the survey design. While reading the documentation, be on the lookout for description of the weights and the survey design variables or replicate weights. -1. The National Health Interview Survey (NHIS) is an annual household survey conducted by the National Center for Health Statistics (NCHS). The NHIS includes a wide variety of health topics for adults including health status and conditions, functioning and disability, health care access and health service utilization, health-related behaviors, health promotion, mental health, barriers to care, and community engagement. Like many national in-person surveys, the sampling design is a stratified clustered design with details included in the Survey Description [@nhis-svy-des]. The Survey Description provides information on setting up syntax in SUDAAN, Stata, SPSS, SAS, and R ({survey} package implementation). You have imported the data and the variable containing the data is: `nhis_adult_data`. How would you specify the design using {srvyr} using either `as_survey_design()` or `as_survey_rep()`? +1. The National Health Interview Survey (NHIS) is an annual household survey conducted by the National Center for Health Statistics (NCHS.) The NHIS includes a wide variety of health topics for adults, including health status and conditions, functioning and disability, health care access and health service utilization, health-related behaviors, health promotion, mental health, barriers to care, and community engagement. Like many national in-person surveys, the sampling design is a stratified clustered design with details included in the Survey Description [@nhis-svy-des]. The Survey Description provides information on setting up syntax in SUDAAN, Stata, SPSS, SAS, and R ({survey} package implementation.) We have imported the data and the variable containing the data as `nhis_adult_data`. How would we specify the design using either `as_survey_design()` or `as_survey_rep()`? -2. The General Social Survey is a survey that has been administered since 1972 on social, behavioral, and attitudinal topics. The 2016-2020 GSS Panel codebook provides examples of setting up syntax in SAS and Stata but not R [@gss-codebook]. You have imported the data and the variable containing the data is: `gss_data`. How would you specify the design in R using either `as_survey_design()` or `as_survey_rep()`? \ No newline at end of file +2. The General Social Survey is a survey that has been administered since 1972 on social, behavioral, and attitudinal topics. The 2016-2020 GSS Panel codebook provides examples of setting up syntax in SAS and Stata but not R [@gss-codebook]. We have imported the data and the variable containing the data as: `gss_data`. How would we specify the design in R using either `as_survey_design()` or `as_survey_rep()`? \ No newline at end of file diff --git a/11-missing-data.Rmd b/11-missing-data.Rmd index 46b2bd95..e2d084d6 100644 --- a/11-missing-data.Rmd +++ b/11-missing-data.Rmd @@ -26,7 +26,9 @@ library(haven) library(gt) ``` -We will be using data from ANES and RECS. Here is the code to create the design objects for each to use throughout this chapter. 
For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter \@ref(c03-survey-data-documentation) for more information). + +We are using data from ANES and RECS described in Chapter \@ref(c04-getting-started). As a reminder, here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter \@ref(c04-getting-started) for more information.) + ```{r} #| label: missing-anes-des #| eval: FALSE @@ -62,30 +64,30 @@ recs_des <- recs_2020 %>% ## Introduction -Missing data in surveys refers to situations where participants do not provide complete responses to survey questions. Respondents may not have seen a question by design. Or, they may not respond to a question for various other reasons, such as not wanting to answer a particular question, not understanding the question, or simply forgetting to answer. Missing data is important to consider and account for, as it can introduce bias and reduce the representativeness of the data. This chapter provides an overview of the types of missing data, how to assess missing data in surveys, and how to conduct analysis when missing data is present. Understanding this complex topic can help ensure accurate reporting of survey results and can provide insight into potential changes to the survey design for the future. +Missing data in surveys refer to situations where participants do not provide complete responses to survey questions. Respondents may not have seen a question by design. Or, they may not respond to a question for various other reasons, such as not wanting to answer a particular question, not understanding the question, or simply forgetting to answer. Missing data are important to consider and account for, as it can introduce bias and reduce the representativeness of the data. This chapter provides an overview of the types of missing data, how to assess missing data in surveys, and how to conduct analysis when missing data are present. Understanding this complex topic can help ensure accurate reporting of survey results and provide insight into potential changes to the survey design for the future. ## Missing data mechanisms -There are two main categories that missing data typically fall into: missing by design or unintentional missing data. Missing by design is part of the survey plan and can be more easily incorporated into weights and analyses. Unintentional missing data on the other hand, can lead to bias in survey estimates if not correctly accounted for. Below we provide more information on the types of missing data. +There are two main categories that missing data typically fall into: missing by design and unintentional missing data. Missing by design is part of the survey plan and can be more easily incorporated into weights and analyses. Unintentional missing data on the other hand, can lead to bias in survey estimates if not correctly accounted for. Below we provide more information on the types of missing data. 1. **Missing by design/questionnaire skip logic**: This type of missingness occurs when certain respondents are intentionally directed to skip specific questions based on their previous responses or characteristics. For example, in a survey about employment, if a respondent indicates that they are not employed, they may be directed to skip questions related to their job responsibilities. 
Additionally, some surveys randomize questions or modules so that not all participants respond to all questions. In these instances, respondents would have missing data for the modules not randomly assigned to them.

2. **Unintentional missing data**: This type of missingness occurs when researchers do not intend for there to be missing data on a particular question, for example, if respondents did not finish the survey or refused to answer individual questions. There are three main types of unintentional missing data that each should be considered and handled differently [@mack; @Schafer2002]:

-    a. **Missing completely at random (MCAR)**: The missing data is unrelated to both observed and unobserved data, and the probability of being missing is the same across all cases. For example, if a respondent missed a question because they had to leave the survey early due to an emergency.
+    a. **Missing completely at random (MCAR)**: The missing data are unrelated to both observed and unobserved data, and the probability of being missing is the same across all cases. For example, if a respondent missed a question because they had to leave the survey early due to an emergency.

-    b. **Missing at random (MAR)**: The missing data is related to observed data but not unobserved data, and the probability of being missing is the same within groups. For example, if older respondents choose not to answer specific questions but younger respondents do answer them and we know the respondent's age.
+    b. **Missing at random (MAR)**: The missing data are related to observed data but not unobserved data, and the probability of being missing is the same within groups. For example, if older respondents choose not to answer specific questions but younger respondents do answer them, and we know the respondents' ages.

-    c. **Missing not at random (MNAR)**: The missing data is related to unobserved data, and the probability of being missing varies for reasons we are not measuring. For example, if respondents with depression do not answer a question about depression severity.
+    c. **Missing not at random (MNAR)**: The missing data are related to unobserved data, and the probability of being missing varies for reasons we are not measuring. For example, if respondents with depression do not answer a question about depression severity.

## Assessing missing data

-Before beginning analysis, we should explore the data to determine if there is missing data and what types of missing data are present. Conducting this descriptive analysis can help with analysis and reporting of survey data (see Section \@ref(c12-recommendations)), and can inform the survey design in future studies. For example, large amounts of unexpected missing data may indicate the questions were unclear or difficult to recall. There are several ways to explore missing data which we walk through below. When assessing the missing data, we recommend using a data.frame object and not the survey object as most of the analysis is about patterns of records and weights are not necessary.
+Before beginning an analysis, we should explore the data to determine if there is missing data and what types of missing data are present. Conducting this descriptive analysis can help with the analysis and reporting of survey data (see Section \@ref(c12-recommendations)) and can inform the survey design in future studies. For example, large amounts of unexpected missing data may indicate the questions were unclear or difficult to recall. 
There are several ways to explore missing data, which we walk through below. When assessing the missing data, we recommend using a data.frame object and not the survey object, as most of the analysis is about patterns of records, and weights are not necessary. ### Summarize data -A very rudimentary first exploration is to use the `summary()` function to summarize the data which will illuminate `NA` values in the data. Let's look at a few analytic variables on the ANES 2020 data using `summary()`: +A very rudimentary first exploration is to use the `summary()` function to summarize the data, which illuminates `NA` values in the data. Let's look at a few analytic variables on the ANES 2020 data using `summary()`: ```{r} #| label: missing-anes-summary @@ -95,7 +97,7 @@ anes_2020 %>% summary() ``` -We see that there are NA values in several of the derived variables (those not beginning with "V") and negative values in the original variables (those beginning with "V"). We can also use the `count()` function to get an understanding of the different types of missing data on the original variables. For example, let's look at the count of data for `V202072`, which corresponds to our `VotedPres2020` variable. +We see that there are NA values in several of the derived variables (those not beginning with "V") and negative values in the original variables (those beginning with "V".) We can also use the `count()` function to get an understanding of the different types of missing data on the original variables. For example, let's look at the count of data for `V202072`, which corresponds to our `VotedPres2020` variable. ```{r} #| label: missing-anes-count @@ -104,11 +106,11 @@ anes_2020 %>% count(VotedPres2020,V202072) ``` -Here we can see that there are three types of missing data, and that the majority of them fall under the "Inapplicable" category. This is usually a term associated with data missing due to skip patterns and is considered to be missing data by design. Based on the documentation from ANES [@debell], we can see that this question was only asked to respondents who voted in the election. +Here, we can see that there are three types of missing data, and the majority of them fall under the "Inapplicable" category. This is usually a term associated with data missing due to skip patterns and is considered to be missing data by design. Based on the documentation from ANES [@debell], we can see that this question was only asked to respondents who voted in the election. ### Visualization of missing data -It can be challenging to look at tables for every variable, and instead may be more efficient to view missing data in a graphical format to help narrow in on patterns or unique variables. The {naniar} package is very useful in exploring missing data visually. It provides quick graphics to explore the missingness patterns in the data. We can use the `vis_miss()` function available in both {visdat} and {naniar} packages to view the amount of missing data by variable [@visdat2017; @naniar2023]. +It can be challenging to look at tables for every variable and instead may be more efficient to view missing data in a graphical format to help narrow in on patterns or unique variables. The {naniar} package is very useful in exploring missing data visually. We can use the `vis_miss()` function available in both {visdat} and {naniar} packages to view the amount of missing data by variable (see Figure \@ref(fig:missing-anes-vismiss)) [@visdattierney; @naniar2023]. 
```{r} #| label: missing-anes-vismiss @@ -128,9 +130,9 @@ anes_2020_derived %>% ``` -From this visualization, we can start to get a picture of what questions may be related to each other in terms of missing data. Even if we did not have the informative variable names, we could be able to deduce that `VotedPres2020`, `VotedPres2020_selection`, and `EarlyVote2020` are likely related since their missing data patterns are similar. +From the visualization in Figure \@ref(fig:missing-anes-vismiss), we can start to get a picture of what questions may be connected to each other in terms of missing data. Even if we did not have the informative variable names, we could deduce that `VotedPres2020`, `VotedPres2020_selection`, and `EarlyVote2020` are likely connected since their missing data patterns are similar. -Additionally, we can also look at `VotedPres2016_selection` and see that there is a lot of missing data in that variable. Most likely this is due to a skip pattern, and we can look at further graphics to see how it might be related to other variables. The {naniar} package has multiple visualization functions that can help dive deeper such as the `gg_miss_fct()` function which looks at missing data for all variables by levels of another variable. +Additionally, we can also look at `VotedPres2016_selection` and see that there is a lot of missing data in that variable. The missing data are likely due to a skip pattern, and we can look at other graphics to see how they relate to other variables. The {naniar} package has multiple visualization functions that can help dive deeper, such as the `gg_miss_fct()` function, which looks at missing data for all variables by levels of another variable (see Figure \@ref(fig:missing-anes-ggmissfct).) ```{r} #| label: missing-anes-ggmissfct @@ -149,9 +151,9 @@ anes_2020_derived %>% xlab("Voted for President in 2016") ``` -In this case, we can see that if they did not vote for president in 2016 or did not answer that question, then they were not asked about who they voted for in 2016 (the percentage of missing data if 100%). Additionally, we can see with this graphic, that there is more missing data across all questions if they did not provide an answer to `VotedPres2016`. +In Figure \@ref(fig:missing-anes-ggmissfct), we can see that if respondents did not vote for president in 2016 or did not answer that question, then they were not asked about who they voted for in 2016 (the percentage of missing data is 100%.) Additionally, we can see with Figure \@ref(fig:missing-anes-ggmissfct), that there is more missing data across all questions if they did not provide an answer to `VotedPres2016`. -There are other graphics that work well with numeric data. For example, in the RECS 2020 data we can plot two continuous variables and the missing data associated with it to see if there are any patterns to the missingness. To do this, we can use the `bind_shadow()` function from the {naniar} package. This creates a **nabular** (combination of "na" with "tabular"), which features the original columns followed by the same number of columns with a specific `NA` format. These `NA` columns are indicators of if the value in the original data is missing or not. The example printed below shows how most levels of `HeatingBehavior` are not missing `!NA` in the NA variable of `HeatingBehavior_NA`, but those missing in `HeatingBehavior` are also missing in `HeatingBehavior_NA`. +There are other visualizations that work well with numeric data. 
For example, in the RECS 2020 data, we can plot two continuous variables and the missing data associated with them to see if there are any patterns in the missingness. To do this, we can use the `bind_shadow()` function from the {naniar} package. This creates a **nabular** (combination of "na" with "tabular"), which features the original columns followed by the same number of columns with a specific `NA` format. These `NA` columns are indicators of whether the value in the original data is missing or not. The example printed below shows how most levels of `HeatingBehavior` are not missing (`!NA`) in the NA variable of `HeatingBehavior_NA`, but those missing in `HeatingBehavior` are also missing in `HeatingBehavior_NA`. ```{r} #| label: missing-recs-shadow @@ -166,40 +168,42 @@ recs_2020_shadow %>% count(HeatingBehavior,HeatingBehavior_NA) ``` -We can then use these new variables to plot the missing data along side the actual data. For example, let's plot a histogram of the total electric bill grouped by those that are missing and not missing by heating behavior. +We can then use these new variables to plot the missing data alongside the actual data. For example, let's plot a histogram of the total electric bill grouped by those missing and not missing by heating behavior (see Figure \@ref(fig:missing-recs-hist).) ```{r} #| label: missing-recs-hist #| fig.cap: "Histogram of Energy Cost by Heating Behavior Missing Data" #| fig.alt: "This chart has title 'Histogram of Energy Cost by Heating Behavior Missing Data'. It has x-axis 'Total Energy Cost (Truncated at $5000)' with labels 0, 1000, 2000, 3000, 4000 and 5000. It has y-axis 'Number of Households' with labels 0, 500, 1000 and 1500. There is a legend indicating fill is used to show HeatingBehavior_NA, with 2 levels: !NA shown as very pale blue fill and NA shown as dark blue fill. The chart is a bar chart with 30 vertical bars. These are stacked, as sorted by HeatingBehavior_NA." - -recs_2020_shadow %>% +recs_2020_shadow %>% filter(TOTALDOL < 5000) %>% - ggplot(aes(x=TOTALDOL,fill=HeatingBehavior_NA)) + + ggplot(aes(x = TOTALDOL, fill = HeatingBehavior_NA)) + geom_histogram() + - scale_fill_manual(values = book_colors[c(3, 1)], - labels = c("Present", "Missing"), - name = "Heating Behavior") + + scale_fill_manual( + values = book_colors[c(3, 1)], + labels = c("Present", "Missing"), + name = "Heating Behavior" + ) + theme_minimal() + xlab("Total Energy Cost (Truncated at $5000)") + ylab("Number of Households") + - labs(title = "Histogram of Energy Cost by Heating Behavior Missing Data") + ggtitle("Histogram of Energy Cost by Heating Behavior Missing Data") ``` -This plot indicates that respondents who did not provide a response for the heating behavior question may have a different distribution of total energy cost compared to respondents who did provide a response. This view of the raw data and missingness could indicate some bias in the data. Researchers take these different bias aspects into account when calculating weights and we need to make sure that the weights are incorporated when analyzing the data. +Figure \@ref(fig:missing-recs-hist) indicates that respondents who did not provide a response for the heating behavior question may have a different distribution of total energy cost compared to respondents who did provide a response. This view of the raw data and missingness could indicate some bias in the data. 
Researchers take these different bias aspects into account when calculating weights, and we need to make sure that we incorporate the weights when analyzing the data.

There are many other visualizations that can be helpful in reviewing the data, and we recommend reviewing the {naniar} documentation for more information [@naniar2023].

## Analysis with missing data

-Once we understand the types of missingness, we can begin the analysis of the data. Different missingness types may be handled in different ways. In most publicly available datasets, researchers will have already calculated weights and imputed missing values if deemed necessary. Those interested in learning more about how to calculate weights and impute data for different missing data mechanisms, we recommended @Kim2021 and @Valliant2018weights.
+Once we understand the types of missingness, we can begin the analysis of the data. Different missingness types may be handled in different ways. In most publicly available datasets, researchers have already calculated weights and imputed missing values if necessary. For those interested in learning more about how to calculate weights and impute data for different missing data mechanisms, we recommend @Kim2021 and @Valliant2018weights.

-Even with weights and imputation, missing data will still most likely exist in the data and need to be accounted for in analysis. This section provides an overview on how to recode missing data in R, and how to account for skip patterns in analysis.
+Even with weights and imputation, missing data are most likely still present and need to be accounted for in analysis. This section provides an overview of how to recode missing data in R and how to account for skip patterns in analysis.

### Recoding missing data

-Even within a variable, there can be different reasons for missing data. In publicly released data negative values are often present to provide different meaning for values. For example, in the ANES 2020 data they have the following negative values to represent different types of missing data:
+Even within a variable, there can be different reasons for missing data. In publicly released data, negative values are often present to provide different meanings for values. For example, in the ANES 2020 data, they have the following negative values to represent different types of missing data:
+
 * -9: Refused
 * -8: Don't Know
 * -7: No post-election data, deleted due to incomplete interview
 * -6: No post-election interview
 * -5: Interview breakoff (sufficient partial IW)
 * -4: Technical error
 * -3: Restricted
 * -2: Other missing reason (question specific)
 * -1: Inapplicable

-When we created the derived variables for use in this book, we coded all negative values as `NA` and proceeded to analyze the data. For most cases this is an appropriate approach as long as you filter the data appropriately to account for skip patterns (see next section). However, the {naniar} package does have the option to code special missing values. For example, if we wanted to have two `NA` values, one that indicated the question was missing by design (e.g., due to skip patterns) and one for the other missing categories we can use the `nabular` format to incorporate these with the `recode_shadow()` function.
+When we created the derived variables for use in this book, we coded all negative values as `NA` and proceeded to analyze the data. For most cases, this is an appropriate approach as long as we filter the data appropriately to account for skip patterns (see Section \@ref(missing-skip-patt)). 
However, the {naniar} package does have the option to code special missing values. For example, if we wanted to have two `NA` values, one that indicated the question was missing by design (e.g., due to skip patterns) and one for the other missing categories, we can use the `nabular` format to incorporate these with the `recode_shadow()` function. ```{r} @@ -227,14 +231,13 @@ anes_2020_shadow %>% count(V201103,V201103_NA) ``` -However it is important to note that at the time of publication, there is no easy way to implement `recode_shadow()` to multiple variables at once (e.g., we cannot use the tidyverse feature of `across()`). The example code above only implements this for a single variable, so this would have to be done to all variables of interest manually or in a loop. - +However, it is important to note that at the time of publication, there is no easy way to implement `recode_shadow()` to multiple variables at once (e.g., we cannot use the tidyverse feature of `across()`.) The example code above only implements this for a single variable, so this would have to be done manually or in a loop for all variables of interest. -### Accounting for skip patterns +### Accounting for skip patterns {#missing-skip-patt} -When questions are skipped by design in a survey, it is meaningful that the data is later missing. For example the RECS survey asks people how they control the heat in their home in the winter (`HeatingBehavior`). This is only among those who have heat in their home (`SpaceHeatingUsed`). If no there is no heating equipment used, the value of `HeatingBehavior` is missing. One has several choices when analyzing this data which include 1) only including those with a valid value of `HeatingBehavior` and specifying the universe as those with heat or 2) including those who do not have heat. It is important to specify what population an analysis generalizes to. +When questions are skipped by design in a survey, it is meaningful that the data are later missing. For example, the RECS survey asks people how they control the heat in their homes in the winter (`HeatingBehavior`.) This is only among those who have heat in their home (`SpaceHeatingUsed`.) If no heating equipment was used, the value of `HeatingBehavior` is missing. One has several choices when analyzing these data, which include 1) only including those with a valid value of `HeatingBehavior` and specifying the universe as those with heat, and 2) including those who do not have heat. It is important to specify what population an analysis generalizes to. -Here is example code where we only include those with a valid value of `HeatingBehavior` (choice 1). Note that we use the design object (`recs_des`) then filter to those that are not missing on `HeatingBehavior`. +Here is an example where we only include those with a valid value of `HeatingBehavior` (choice 1.) Note that we use the design object (`recs_des`) and then filter to those that are not missing on `HeatingBehavior`. ```{r} #| label: missing-recs-heatcc @@ -249,7 +252,7 @@ heat_cntl_1 <- recs_des %>% heat_cntl_1 ``` -Here is example code where we include those that do not have heat (choice 2). To help understand what we are looking at we have included the output to show both variables `SpaceHeatingUsed` and `HeatingBehavior`. +Here is an example where we include those that do not have heat (choice 2.) To help understand what we are looking at, we have included the output to show both variables, `SpaceHeatingUsed` and `HeatingBehavior`. 
```{r} #| label: missing-recs-heatpop @@ -279,9 +282,9 @@ pct_2 <- heat_cntl_2 %>% ``` -If we ran the first analysis, we would say that `r pct_1`% **of households with heat** use a programmable or smart thermostat for the heating of their home. While if we used the results from the second analysis, we could say that `r pct_2`% of households use a programmable or smart thermostat for the heating of their home. The distinction of the two statements is bolded for emphasis. Skip patterns often change the universe that we are talking about and need to be carefully examined. +If we ran the first analysis, we would say that `r pct_1`% **of households with heat** use a programmable or smart thermostat for heating of their home. If we used the results from the second analysis, we would say that `r pct_2`% **of households** use a programmable or smart thermostat for heating of their home. The distinction between the two statements is made bold for emphasis. Skip patterns often change the universe we are talking about and need to be carefully examined. -Filtering to the correct universe is important when handling these types of missing data. The `nabular` we created above can also help with this. If we have `NA_skip` values in the shadow, we can make sure that we filter out all of these values and only include relevant missing. To do this with survey data we could first create the `nabular`, then create the design object on that data, and then use the shadow variables to assist with filtering the data. Let's use the `nabular` we created above for ANES 2020 (`anes_2020_shadow`) to create the design object. +Filtering to the correct universe is important when handling these types of missing data. The `nabular` we created above can also help with this. If we have `NA_skip` values in the shadow, we can make sure that we filter out all of these values and only include relevant missing values. To do this with survey data, we could first create the `nabular`, then create the design object on that data, and then use the shadow variables to assist with filtering the data. Let's use the `nabular` we created above for ANES 2020 (`anes_2020_shadow`) to create the design object. ```{r} #| label: missing-anes-shadow-des @@ -299,7 +302,7 @@ anes_des_shadow <- anes_adjwgt_shadow %>% ) ``` -Then we can use this design object to look at the percent of the population that voted for each candidate in 2016 (`V201103`). First, let's look at the percentages without removing any cases: +Then, we can use this design object to look at the percentage of the population that voted for each candidate in 2016 (`V201103`.) First, let's look at the percentages without removing any cases: ```{r} #| label: missing-anes-shadow-ex1 @@ -313,7 +316,7 @@ pres16_select1<-anes_des_shadow %>% pres16_select1 ``` -Next, we will look at the percentages removing only those that were missing due to skip patterns (i.e., they did not receive this question). +Next, we look at the percentages, removing only those missing due to skip patterns (i.e., they did not receive this question.) ```{r} #| label: missing-anes-shadow-ex2 @@ -328,7 +331,7 @@ pres16_select2<-anes_des_shadow %>% pres16_select2 ``` -Finally, we will look at the percentages removing all missing values both due to skip patterns and due to those who refused to answer the question. +Finally, we look at the percentages, removing all missing values both due to skip patterns and due to those who refused to answer the question. 
```{r} #| label: missing-anes-shadow-ex3 @@ -414,4 +417,4 @@ pres16_select2_out<-round(pres16_select2_1*100-pres16_select2_2*100,1) ``` -As Table \@ref(tab:missing-anes-shadow-tab) shows, the results can vary greatly depending on which type of missing data that are removed. If we remove only the skip patterns the margin between the Clinton and Trump is `r pres16_select2_out` percentage points, but if we include all data even including those that did not vote in 2016, the margin is `r pres16_select1_out` percentage points. How we handle the different types of missing values is important for interpretation of the data. +As Table \@ref(tab:missing-anes-shadow-tab) shows, the results can vary greatly depending on which type of missing data are removed. If we remove only the skip patterns the margin between Clinton and Trump is `r pres16_select2_out` percentage points, but if we include all data, even including those that did not vote in 2016, the margin is `r pres16_select1_out` percentage points. How we handle the different types of missing values is important for interpreting the data. diff --git a/12-successful-survey-data-analysis.Rmd b/12-successful-survey-data-analysis.Rmd index fec81494..79fb2785 100644 --- a/12-successful-survey-data-analysis.Rmd +++ b/12-successful-survey-data-analysis.Rmd @@ -23,13 +23,13 @@ library(srvyr) library(srvyrexploR) ``` -To illustrate the importance of data visualization, we will discuss Anscombe's Quartet. The dataset can be replicated by running the code below: +To illustrate the importance of data visualization, we discuss Anscombe's Quartet. The dataset can be replicated by running the code below: ```{r} #| label: recommendations-anscombe-setup anscombe_tidy <- anscombe %>% - mutate(observation = row_number()) %>% - pivot_longer(-observation, names_to = "key", values_to = "value") %>% + mutate(obs = row_number()) %>% + pivot_longer(-obs, names_to = "key", values_to = "value") %>% separate(key, c("variable", "set"), 1, convert = TRUE) %>% mutate(set = c("I", "II", "III", "IV")[set]) %>% pivot_wider(names_from = variable, values_from = value) @@ -61,11 +61,11 @@ example_des <- ## Introduction -The previous chapters in this book aimed to provide the technical skills and knowledge required for running survey analyses. This chapter builds upon the previously mentioned best practices to present a curated set of recommendations for running a *successful* survey analysis. We hope this list equips you with practical insights that assist in producing meaningful and reliable results. +The previous chapters in this book aimed to provide the technical skills and knowledge required for running survey analyses. This chapter builds upon the previously mentioned best practices to present a curated set of recommendations for running a *successful* survey analysis. We hope this list provides practical insights that assist in producing meaningful and reliable results. -## Follow survey analysis process {#recs-survey-process} +## Follow the survey analysis process {#recs-survey-process} -As we first introduced in Chapter \@ref(c04-getting-started) (Section \@ref(survey-analysis-process)), there are four main steps to successfully analyze survey data: +As we first introduced in Chapter \@ref(c04-getting-started), there are four main steps to successfully analyze survey data: 1. 
Create a `tbl_svy` object (a survey object) using: `as_survey_design()` or `as_survey_rep()` @@ -77,17 +77,17 @@ As we first introduced in Chapter \@ref(c04-getting-started) (Section \@ref(surv The order of these steps matters in survey analysis. For example, if we need to subset the data, we must use `filter()` on our data **after** creating the survey design. If we do this before the survey design is created, we may not be correctly accounting for the study design, resulting in incorrect findings. -Additionally, correctly identifying the survey design is one of the most important steps in survey analysis. Knowing the type of sample design (e.g., clustered, stratified) will help ensure the underlying error structure is correctly calculated and weights are correctly used. Reviewing the documentation (see Chapter \@ref(c03-survey-data-documentation)) will help us understand what variables to use from the data. Learning about complex design factors such as clustering, stratification, and weighting is foundational to complex survey analysis, and we recommend that all analysts review Chapter \@ref(c10-sample-designs-replicate-weights) before creating their first design object. +Additionally, correctly identifying the survey design is one of the most important steps in survey analysis. Knowing the type of sample design (e.g., clustered, stratified) helps ensure the underlying error structure is correctly calculated and weights are correctly used. Reviewing the documentation (see Chapter \@ref(c03-survey-data-documentation)) helps us understand what variables to use from the data. Learning about complex design factors such as clustering, stratification, and weighting is foundational to complex survey analysis, and we recommend that all analysts review Chapter \@ref(c10-sample-designs-replicate-weights) before creating their first design object. -Making sure to use the survey analysis functions from the {srvyr} and {survey} packages is also important in survey analysis. For example, using `mean()` and `survey_mean()` on the same data will result in different findings and outputs. Each of the survey functions from {srvyr} and {survey} impacts standard errors and variance, and we cannot treat complex surveys as unweighted simple random samples if we want to produce unbiased estimates [@R-srvyr; @lumley2010complex]. +Making sure to use the survey analysis functions from the {srvyr} and {survey} packages is also important in survey analysis. For example, using `mean()` and `survey_mean()` on the same data results in different findings and outputs. Each of the survey functions from {srvyr} and {survey} impacts standard errors and variance, and we cannot treat complex surveys as unweighted simple random samples if we want to produce unbiased estimates [@R-srvyr; @lumley2010complex]. ## Begin with descriptive analysis -When receiving a fresh batch of data, it's tempting to jump right into running models to find significant results. However, a successful data analyst begins by exploring the dataset. This involves running descriptive analysis on the dataset as a whole, as well as individual variables and combinations of variables. As described in Chapter \@ref(c05-descriptive-analysis), descriptive analyses should always precede statistical analysis to prevent avoidable (and potentially embarrassing) mistakes. +When receiving a fresh batch of data, it is tempting to jump right into running models to find significant results. However, a successful data analyst begins by exploring the dataset. 
Chapter \@ref(c11-missing-data) talks about the importance of reviewing data when examining missing data patterns. In this chapter, we illustrate the value of reviewing all types of data. This involves running descriptive analysis on the dataset as a whole, as well as individual variables and combinations of variables. As described in Chapter \@ref(c05-descriptive-analysis), descriptive analyses should always precede statistical analysis to prevent avoidable (and potentially embarrassing) mistakes. ### Table review -Even before applying weights, consider running cross-tabulations on the raw data. Crosstabs can help us see if any patterns stand out that may be alarming or something worth further investigating. +Even before applying weights, consider running cross-tabulations on the raw data. Cross-tabs can help us see if any patterns stand out that may be alarming or something worth further investigating. For example, let’s explore the example survey dataset introduced in the Prerequisites box, `example_srvy`. We run the code below on the unweighted data to inspect the `gender` variable: @@ -98,13 +98,13 @@ example_srvy %>% summarise(n = n()) ``` -The data shows that males comprise 1 out of 10, or 10%, of the sample. Generally, we assume something close to a 50/50 split between male and female respondents in a population. The sizeable female proportion could indicate either a unique sample or a potential error in the data. If we review the survey documentation and see this was a deliberate part of the design, we can continue our analysis using the appropriate methods. If this was not an intentional choice by the researchers, the results alert us that something may be incorrect in the data or our code, and we can verify if there’s an issue by comparing the results with the weighted means. +The data show that males comprise 1 out of 10, or 10%, of the sample. Generally, we assume something close to a 50/50 split between male and female respondents in a population. The sizable female proportion could indicate either a unique sample or a potential error in the data. If we review the survey documentation and see this was a deliberate part of the design, we can continue our analysis using the appropriate methods. If this was not an intentional choice by the researchers, the results alert us that something may be incorrect in the data or our code, and we can verify if there’s an issue by comparing the results with the weighted means. ### Graphical review Tables provide a quick check of our assumptions, but there is no substitute for graphs and plots to visualize the distribution of data. We might miss outliers or nuances if we scan only summary statistics. -For example, Anscombe's Quartet demonstrates the importance of visualization in analysis. Let's say we have a dataset with x- and y- variables in an object called `anscombe_tidy`. Let's take a look at how the da taset is structured: +For example, Anscombe's Quartet demonstrates the importance of visualization in analysis. Let's say we have a dataset with x- and y- variables in an object called `anscombe_tidy`. Let's take a look at how the dataset is structured: ```{r} #| label: recommendations-anscombe-head @@ -126,7 +126,7 @@ anscombe_tidy %>% ) ``` -These are useful statistics. We can note that the data doesn’t have high variability, and the two variables are strongly correlated. Now, let’s check all the sets (I-IV) in the Anscombe data. Notice anything interesting? +These are useful statistics. 
We can note that the data do not have high variability, and the two variables are strongly correlated. Now, let’s check all the sets (I-IV) in the Anscombe data. Notice anything interesting? ```{r} #| label: recommendations-anscombe-calc-2 @@ -141,10 +141,16 @@ anscombe_tidy %>% ) ``` -The summary results for these four sets are nearly identical! Based on this, we might assume that each distribution is similar. Let's look at a data visualization to see if our assumption is correct. +The summary results for these four sets are nearly identical! Based on this, we might assume that each distribution is similar. Let's look at a graphical visualization to see if our assumption is correct (see Figure \@ref(fig:recommendations-anscombe-plot).) ```{r} #| label: recommendations-anscombe-plot +#| warning: false +#| error: false +#| message: false +#| fig.cap: "Plot of Anscombe's Quartet data and the importance of reviewing data graphically" +#| fig.alt: "This figure shows four plots one for each of Anscombe's sets. The upper left plot is a plot of set I and has a trend line with a slope of 0.5 and an intercept of 3. The data points are distributed evenly around the trend line. The upper right plot is a plot of set II and has the same trend line as set I. The data points are curved around the trend line. The lower left plot is a plot of set III and has the same trend line as set I. The data points closely followly the trend line with one outlier where the y-value for the point is much larger than the others. The lower right plot is a plot of set IV and has the same trend line as set I. The data points all share the same x-value but different y-values with the exception of one data point, which has a much larger value for both y and x values." + ggplot(anscombe_tidy, aes(x, y)) + geom_point() + facet_wrap( ~ set) + @@ -152,9 +158,7 @@ ggplot(anscombe_tidy, aes(x, y)) + theme_minimal() ``` -Although each of the four sets has the same summary statistics and regression line, when reviewing the plots, it becomes apparent that the distributions of the data are not the same at all. Each set of points results in different shapes and distributions. Imagine sharing each set (I-IV) and the corresponding plot with a different colleague. The interpretations and descriptions of the data would be very different even though the statistics are similar. Plotting data can also ensure that we are using the correct analysis method on the data, so understanding the underlying distributions is an important first step. - -With survey data, we may not always have continuous data that we can plot like Anscombe's Quartet. However, if the dataset does contain continuous data or other types of data that would benefit from a visual representation, we recommend taking the time to graph distributions and correlations. +Although each of the four sets has the same summary statistics and regression line, when reviewing the plots (see Figure \@ref(fig:recommendations-anscombe-plot)), it becomes apparent that the distributions of the data are not the same at all. Each set of points results in different shapes and distributions. Imagine sharing each set (I-IV) and the corresponding plot with a different colleague. The interpretations and descriptions of the data would be very different even though the statistics are similar. Plotting data can also ensure that we are using the correct analysis method on the data, so understanding the underlying distributions is an important first step. 
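+For survey data, the same graphical review applies even when the variables are categorical rather than continuous. As a quick sketch with the example dataset from the Prerequisites box (the chunk label and plot choices here are ours, for illustration only), an unweighted bar chart of `gender` complements the cross-tabulation from the table review above and makes the skewed distribution easy to spot:
+
+```{r}
+#| label: recommendations-example-dat-graph
+#| fig.alt: "Bar chart of the unweighted counts of the gender variable in the example data, showing far more female than male respondents"
+# Unweighted bar chart to review the raw distribution of gender
+example_srvy %>%
+  ggplot(aes(x = gender)) +
+  geom_bar() +
+  theme_minimal()
+```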
## Check variable types @@ -166,7 +170,7 @@ example_srvy %>% glimpse() ``` -The output shows that `q_d2_1` is a character variable, but the values of that variable show three options (Very interested / Somewhat interested / Not at all interested). In this case, we will most likely want to change `q_d2_1` to be a factor variable and order the factor levels to indicate that this is an ordinal variable. Here is some code on how we might approach this task using the {forcats} package [@R-forcats]: +The output shows that `q_d2_1` is a character variable, but the values of that variable show three options (Very interested / Somewhat interested / Not at all interested.) In this case, we most likely want to change `q_d2_1` to be a factor variable and order the factor levels to indicate that this is an ordinal variable. Here is some code on how we might approach this task using the {forcats} package [@R-forcats]: ```{r} #| label: recommendations-example-dat-fct @@ -185,7 +189,7 @@ example_srvy_fct %>% count(q_d2_1_fct, q_d2_1) ``` -This example data also includes a column called `region`, which is imported as a number (``). This is a good hint to use the questionnaire and codebook along with the data to find out if the values actually reflect a number or are perhaps a coded categorical variable (see Chapter \@ref(c03-survey-data-documentation) for more details). R will calculate the mean even if it is not appropriate, leading to the common mistake of applying an average to categorical values instead of a proportion function. For example, for ease of coding, we may use the `across()` function to calculate the mean across all numeric variables: +This example dataset also includes a column called `region`, which is imported as a number (``.) This is a good reminder to use the questionnaire and codebook along with the data to find out if the values actually reflect a number or are perhaps a coded categorical variable (see Chapter \@ref(c03-survey-data-documentation) for more details.) R calculates the mean even if it is not appropriate, leading to the common mistake of applying an average to categorical values instead of a proportion function. For example, for ease of coding, we may use the `across()` function to calculate the mean across all numeric variables: ```{r} #| label: recommendations-example-dat-num-calc @@ -194,7 +198,7 @@ example_des %>% summarize(across(where(is.numeric), ~ survey_mean(.x, na.rm = TRUE))) ``` -In this example, if we do not adjust `region` to be a factor variable type, we might accidentally report an average region of `r round(example_des %>% summarize(across(where(is.numeric), ~ survey_mean(.x, na.rm = TRUE))) %>% pull(region), 2)` in our findings which is meaningless. Checking that our variables are appropriate will avoid this pitfall and ensure the measures and models are suitable for the variable type. +In this example, if we do not adjust `region` to be a factor variable type, we might accidentally report an average region of `r round(example_des %>% summarize(across(where(is.numeric), ~ survey_mean(.x, na.rm = TRUE))) %>% pull(region), 2)` in our findings, which is meaningless. Checking that our variables are appropriate avoids this pitfall and ensures the measures and models are suitable for the variable type. 
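+One way to avoid this pitfall is to convert `region` to a factor before creating the design object, just as we did for `q_d2_1` above. The code below is only a sketch: it assumes `region` has four numeric codes, and the labels are placeholders that would need to be replaced with the values documented in the codebook.
+
+```{r}
+#| label: recommendations-example-dat-region-fct
+#| eval: false
+example_srvy %>%
+  mutate(
+    region_fct = factor(
+      region,
+      levels = 1:4, # assumes four numeric codes
+      labels = c("North", "South", "East", "West") # placeholder labels; use the codebook values
+    )
+  ) %>%
+  count(region_fct, region)
+```
+
+If the design object is then rebuilt from the converted data, `region` is no longer picked up by `across(where(is.numeric), ...)`, and we would report proportions for it rather than a mean.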
## Improve debugging skills @@ -225,7 +229,7 @@ example_des %>% svyttest(q_d1~gender) ``` -In this case, we need to remember that with functions from the {survey} packages like `svyttest()`, the design object is not the first argument, and we have to use the dot (`.`) notation (see Chapter \@ref(c06-statistical-testing)). Adding in the named argument of `design=.` will fix this error. +In this case, we need to remember that with functions from the {survey} packages like `svyttest()`, the design object is not the first argument, and we have to use the dot (`.`) notation (see Chapter \@ref(c06-statistical-testing).) Adding in the named argument of `design=.` fixes this error. ```{r} #| label: recommendations-desobj-locfix @@ -247,8 +251,8 @@ The internet also offers many resources for debugging. Searching for a specific ## Think critically about conclusions -Once we have our findings, we need to learn to think critically about our findings. As mentioned in Chapter \@ref(c02-overview-surveys), many aspects of the study design can impact our interpretation of the results, for example, the number and types of response options provided to the respondent or who was asked the question (both thinking about the full sample and any skip patterns). Knowing the overall study design can help us accurately think through what the findings may mean and identify any issues with our analyses. Additionally, we should make sure that our survey design object is correctly defined (see Chapter \@ref(c10-sample-designs-replicate-weights)), carefully consider how we are managing missing data (see Chapter \@ref(c11-missing-data)), and follow statistical analysis procedures such as avoiding model overfitting by using too many variables in our formulas. +Once we have our findings, we need to learn to think critically about our findings. As mentioned in Chapter \@ref(c02-overview-surveys), many aspects of the study design can impact our interpretation of the results, for example, the number and types of response options provided to the respondent or who was asked the question (both thinking about the full sample and any skip patterns.) Knowing the overall study design can help us accurately think through what the findings may mean and identify any issues with our analyses. Additionally, we should make sure that our survey design object is correctly defined (see Chapter \@ref(c10-sample-designs-replicate-weights)), carefully consider how we are managing missing data (see Chapter \@ref(c11-missing-data)), and follow statistical analysis procedures such as avoiding model overfitting by using too many variables in our formulas. These considerations allow us to conduct our analyses and review findings for statistically significant results. It's important to note that even significant results do not mean that they are meaningful or important. A large enough sample can produce statistically significant results. Therefore, we want to look at our results in context, such as comparing them with results from other studies or analyzing them in conjunction with confidence intervals and other measures. -Communicating the results (see Chapter \@ref(c08-communicating-results)) in an unbiased manner is also a critical step in any analysis project. If we present results without error measures or only present results that support our initial hypotheses, we are not thinking critically and may incorrectly represent the data. As survey data analysts, we often interpret the survey data for the public. 
We must ensure that we are the best stewards of the data and work to bring light to meaningful and interesting findings that the public will want and need to know about. \ No newline at end of file +Communicating the results (see Chapter \@ref(c08-communicating-results)) in an unbiased manner is also a critical step in any analysis project. If we present results without error measures or only present results that support our initial hypotheses, we are not thinking critically and may incorrectly represent the data. As survey data analysts, we often interpret the survey data for the public. We must ensure that we are the best stewards of the data and work to bring light to meaningful and interesting findings that the public wants and needs to know about. \ No newline at end of file diff --git a/13-ncvs-vignette.Rmd b/13-ncvs-vignette.Rmd index addbc778..3028816b 100644 --- a/13-ncvs-vignette.Rmd +++ b/13-ncvs-vignette.Rmd @@ -26,20 +26,20 @@ library(srvyrexploR) library(gt) ``` -We will use data from the United States National Crime Victimization Survey (NCVS). These data are available in the {srvyrexploR} package as `ncvs_2021_incident`, `ncvs_2021_household`, and `ncvs_2021_person`. +We use data from the United States National Crime Victimization Survey (NCVS.) These data are available in the {srvyrexploR} package as `ncvs_2021_incident`, `ncvs_2021_household`, and `ncvs_2021_person`. ::: ## Introduction -The NCVS is a household survey sponsored by the Bureau of Justice Statistics (BJS), which collects data on criminal victimization, including characteristics of the crimes, offenders, and victims. Crime types include both household and personal crimes, as well as violent and non-violent crimes. The target population of this survey is all people in the United States age 12 and older living in housing units and noninstitutional group quarters. +The National Crime Victimization Survey (NCVS) is a household survey sponsored by the Bureau of Justice Statistics (BJS), which collects data on criminal victimization, including characteristics of the crimes, offenders, and victims. Crime types include both household and personal crimes, as well as violent and non-violent crimes. The population of interest of this survey is all people in the United States age 12 and older living in housing units and noninstitutional group quarters. -The NCVS has been ongoing since 1992. An earlier survey, the National Crime Survey, was run from 1972 to 1991 [@ncvs_tech_2016]. The survey is administered using a rotating panel. When an address enters the sample, the residents of that address are interviewed every six months for a total of seven interviews. If the initial residents move away from the address during the period, the new residents are included in the survey, as people are not followed when they move. +The NCVS has been ongoing since 1992. An earlier survey, the National Crime Survey, was run from 1972 to 1991 [@ncvs_tech_2016]. The survey is administered using a rotating panel. When an address enters the sample, the residents of that address are interviewed every six months for a total of seven interviews. If the initial residents move away from the address during the period and new residents move in, the new residents are included in the survey, as people are not followed when they move. -NCVS data is publicly available and distributed by Inter-university Consortium for Political and Social Research (ICPSR), with data going back to 1992. 
The vignette in this book will include data from 2021 [@ncvs_data_2021]. The NCVS data structure is complicated, and the User's Guide contains examples for analysis in SAS, SUDAAN, SPSS, and Stata, but not R [@ncvs_user_guide]. This vignette will adapt those examples for R. +NCVS data are publicly available and distributed by Inter-university Consortium for Political and Social Research (ICPSR), with data going back to 1992. The vignette in this book includes data from 2021 [@ncvs_data_2021]. The NCVS data structure is complicated, and the User's Guide contains examples for analysis in SAS, SUDAAN, SPSS, and Stata, but not R [@ncvs_user_guide]. This vignette adapts those examples for R. ## Data structure -The data from ICPSR is distributed with five files, each having its unique identifier indicated: +The data from ICPSR are distributed with five files, each having its unique identifier indicated: - Address Record - `YEARQ`, `IDHH` - Household Record - `YEARQ`, `IDHH` @@ -47,37 +47,37 @@ The data from ICPSR is distributed with five files, each having its unique ident - Incident Record - `YEARQ`, `IDHH`, `IDPER` - 2021 Collection Year Incident - `YEARQ`, `IDHH`, `IDPER` -We will focus on the household, person, and incident files. From these files, we selected a subset of columns for examples to use in this vignette. We have included data in the {srvyexploR} package with a subset of columns, but you can download the complete files at ICPSR [@ncvs_data_2021]. +In this vignette, we focus on the household, person, and incident files and have selected a subset of columns for use in the examples. We have included data in the {srvyexploR} package with this subset of columns, but the complete data files can be downloaded from [ICPSR](https://www.icpsr.umich.edu/web/NACJD/studies/38429). ## Survey notation The NCVS User Guide [@ncvs_user_guide] uses the following notation: * $i$ represents NCVS households, identified on the household-level file with the household identification number `IDHH`. -* $j$ represents NCVS individual respondents within households $i$, identified on the person-level file with the person identification number `IDPER`. -* $k$ represents reporting periods (i.e., `YEARQ`) for households $i$ and individual respondent $j$. +* $j$ represents NCVS individual respondents within household $i$, identified on the person-level file with the person identification number `IDPER`. +* $k$ represents reporting periods (i.e., `YEARQ`) for household $i$ and individual respondent $j$. * $l$ represents victimization records for respondent $j$ in household $i$ and reporting period $k$. Each record on the NCVS incident-level file is associated with a victimization record $l$. -* $D$ represents one or more domain characteristics of interest in the calculation of NCVS estimates. For victimization totals and proportions, domains can be defined on the basis of crime types (e.g., violent crimes, property crimes), characteristics of victims (e.g., age, sex, household income), or characteristics of the victimizations (e.g., victimizations reported to police, victimizations committed with a weapon present). Domains could also be a combination of all of these types of characteristics. For example, in the calculation of victimization rates, domains are defined on the basis of the characteristics of the victims. -* $A_a$ represents the level $a$ of covariate $A$. 
Covariate $A$ is defined in the calculation of victimization proportions and represents the characteristic for which the analyst wants to obtain the distribution of victimizations in domain $D$. +* $D$ represents one or more domain characteristics of interest in the calculation of NCVS estimates. For victimization totals and proportions, domains can be defined on the basis of crime types (e.g., violent crimes, property crimes), characteristics of victims (e.g., age, sex, household income), or characteristics of the victimizations (e.g., victimizations reported to police, victimizations committed with a weapon present.) Domains could also be a combination of all of these types of characteristics. For example, in the calculation of victimization rates, domains are defined on the basis of the characteristics of the victims. +* $A_a$ represents the level $a$ of covariate $A$. Covariate $A$ is defined in the calculation of victimization proportions and represents the characteristic we want to obtain the distribution of victimizations in domain $D$. * $C$ represents the personal or property crime for which we want to obtain a victimization rate. -In this vignette, we will discuss four estimates: +In this vignette, we discuss four estimates: 1. *Victimization totals* estimate the number of criminal victimizations with a given characteristic. As demonstrated below, these can be calculated from any of the data files. The estimated victimization total, $\hat{t}_D$ for domain $D$ is estimated as $$ \hat{t}_D = \sum_{ijkl \in D} v_{ijkl}$$ -where $v_{ijkl}$ is the series-adjusted victimization weight for household $i$, respondent $j$, reporting period $k$, and victimization $l$, that is `WGTVICCY`. +where $v_{ijkl}$ is the series-adjusted victimization weight for household $i$, respondent $j$, reporting period $k$, and victimization $l$, represented in the data as `WGTVICCY`. 2. *Victimization proportions* estimate characteristics among victimizations or victims. Victimization proportions are calculated using the incident data file. The estimated victimization proportion for domain $D$ across level $a$ of covariate $A$, $\hat{p}_{A_a,D}$ is $$ \hat{p}_{A_a,D} =\frac{\sum_{ijkl \in A_a, D} v_{ijkl}}{\sum_{ijkl \in D} v_{ijkl}}.$$ The numerator is the number of incidents with a particular characteristic in a domain, and the denominator is the number of incidents in a domain. -3. *Victimization rates* are estimates of the number of victimizations per 1,000 persons or households in the population^[BJS publishes victimization rates per 1,000, which are also presented in these examples]. Victimization rates are calculated using the household or person-level data files. The estimated victimization rate for crime $C$ in domain $D$ is +3. *Victimization rates* are estimates of the number of victimizations per 1,000 persons or households in the population.^[BJS publishes victimization rates per 1,000, which are also presented in these examples] Victimization rates are calculated using the household or person-level data files. The estimated victimization rate for crime $C$ in domain $D$ is $$\hat{VR}_{C,D}= \frac{\sum_{ijkl \in C,D} v_{ijkl}}{\sum_{ijk \in D} w_{ijk}}\times 1000$$ -where $w_{ijk}$ is the person weight (`WGTPERCY`) or household weight (`WGTHHCY`) for personal and household crimes, respectively. The numerator is the number of incidents in a domain, and the denominator is the number of persons or households in a domain. 
Notice that the weights in the numerator and denominator are different - this is important, and in the syntax and examples below, we will discuss how to make an estimate that involves two weights. +where $w_{ijk}$ is the person weight (`WGTPERCY`) for personal crimes or household weight (`WGTHHCY`) for household crimes. The numerator is the number of incidents in a domain, and the denominator is the number of persons or households in a domain. Notice that the weights in the numerator and denominator are different - this is important, and in the syntax and examples below, we discuss how to make an estimate that involves two weights. 4. *Prevalence rates* are estimates of the percentage of the population (persons or households) who are victims of a crime. These are estimated using the household or person-level data files. The estimated prevalence rate for crime $C$ in domain $D$ is @@ -95,7 +95,7 @@ For victimization rates, we need to know the victimization status for both victi Each record on the incident file represents one victimization, which is not the same as one incident. Some victimizations have several instances that make it difficult for the victim to differentiate the details of these incidents, labeled as "series crimes". Appendix A of the User's Guide indicates how to calculate the series weight in other statistical languages. -Here, we adapt that code for R. Essentially, if a victimization is a series crime, its series weight is top-coded at 10 based on the number of actual victimizations, that is that even if the crime repeatedly occurred more than 10 times, it is counted as 10 times to reduce the influence of extreme outliers. If an incident is a series crime, but the number of occurrences is unknown, the series weight is set to 6. A description of the variables used to create indicators of series and the associated weights is included in Table \@ref(tab:cb-incident). +Here, we adapt that code for R. Essentially, if a victimization is a series crime, its series weight is top-coded at 10 based on the number of actual victimizations, that is, even if the crime occurred more than 10 times, it is counted as 10 times to reduce the influence of extreme outliers. If an incident is a series crime, but the number of occurrences is unknown, the series weight is set to 6. A description of the variables used to create indicators of series and the associated weights is included in Table \@ref(tab:cb-incident). Table: (\#tab:cb-incident) Codebook for incident variables - related to series weight @@ -114,7 +114,7 @@ Table: (\#tab:cb-incident) Codebook for incident variables - related to series w | | | 8 | Residue (invalid data) | | WGTVICCY | Adjusted victimization weight | | Numeric | -We want to create four variables to indicate if an incident is a series crime. First, we create a variable called series using `V4017`, `V4018`, and `V4019` where an incident is considered a series crime if there are 6 or more incidents (`V4107`), the incidents are similar in detail (`V4018`), or there is not enough detail to distinguish the incidents (`V4019`). Next, we top-code the number of incidents (`V4016`) by creating a variable `n10v4016` which is set to 10 if `V4016 > 10`. Finally, we create the series weight using our new top-coded variable and the existing weight. +We want to create four variables to indicate if an incident is a series crime. 
First, we create a variable called `series` using `V4017`, `V4018`, and `V4019` where an incident is considered a series crime if there are 6 or more incidents (`V4107`), the incidents are similar in detail (`V4018`), or there is not enough detail to distinguish the incidents (`V4019`.) Second, we top-code the number of incidents (`V4016`) by creating a variable `n10v4016`, which is set to 10 if `V4016 > 10`. Third, we create the `serieswgt` using the two new variables `series` and `n10v4019` to classify the max series based on missing data and number of incidents. Finally, we create the new weight using our new `serieswgt` variable and the existing weight (`WGTVICCY`.) ```{r} #| label: ncvs-vign-incfile @@ -137,7 +137,7 @@ inc_series <- ncvs_2021_incident %>% ) ``` -The next step in preparing the files for estimation is to create indicators on the victimization file for characteristics of interest. Almost all BJS publications limit the analysis to records where the victimization occurred in the United States, where `V4022` is not equal to 1, and we will do this for all estimates as well. A brief codebook of variables for this task is located in Table \@ref(tab:cb-crimetype) +The next step in preparing the files for estimation is to create indicators on the victimization file for characteristics of interest. Almost all BJS publications limit the analysis to records where the victimization occurred in the United States (where `V4022` is not equal to 1). We do this for all estimates as well. A brief codebook of variables for this task is located in Table \@ref(tab:cb-crimetype) Table: (\#tab:cb-crimetype) Codebook for incident variables - crime type indicators and characteristics @@ -200,7 +200,7 @@ Table: (\#tab:cb-crimetype) Codebook for incident variables - crime type indicat | | | 58 | Completed theft value NA | | | | 59 | Attempted theft | -Using these variables, we will create the following indicators: +Using these variables, we create the following indicators: 1. Property crime - `V4529` >= 31 @@ -274,7 +274,7 @@ inc_ind %>% AAST_Other) ``` -After creating indicators of victimization types and characteristics, the file is summarized, and crimes are summed across persons or households by `YEARQ.` Property crimes (i.e., crimes committed against households, such as household burglary or motor vehicle theft) are summed across households, and personal crimes (i.e., crimes committed against an individual, such as assault, robbery, and personal theft) are summed across persons. The indicators are summed using the `serieswgt`, and the variable `WGTVICCY` needs to be retained for later analysis. +After creating indicators of victimization types and characteristics, the file is summarized, and crimes are summed across persons or households by `YEARQ.` Property crimes (i.e., crimes committed against households, such as household burglary or motor vehicle theft) are summed across households, and personal crimes (i.e., crimes committed against an individual, such as assault, robbery, and personal theft) are summed across persons. The indicators are summed using our created series weight variable (`serieswgt`.) Additionally, the existing weight variable (`WGTVICCY`) needs to be retained for later analysis. ```{r} #| label: ncvs-vign-inc-sum @@ -299,7 +299,7 @@ inc_pers_sums <- .groups = "drop") ``` -Now, we merge the victimization summary files into the appropriate files. 
For any record on the household or person file that is not on the victimization file, the victimization counts are set to 0 after merging. In this step, we will also create the victimization adjustment factor. See 2.2.4 in the User's Guide for details of why this adjustment is created (@ncvs_user_guide). It is calculated as follows: +Now, we merge the victimization summary files into the appropriate files. For any record on the household or person file that is not on the victimization file, the victimization counts are set to 0 after merging. In this step, we also create the victimization adjustment factor. See Section 2.2.4 in the User's Guide for details of why this adjustment is created [@ncvs_user_guide]. It is calculated as follows: $$ A_{ijk}=\frac{v_{ijk}}{w_{ijk}}$$ @@ -309,7 +309,6 @@ where $w_{ijk}$ is the person weight (`WGTPERCY`) for personal crimes or the hou #| label: ncvs-vign-merge-inc-sum #| cache: TRUE -# Set up a list of 0s for each crime type/characteristic to replace NA's hh_z_list <- rep(0, ncol(inc_hh_sums) - 3) %>% as.list() %>% setNames(names(inc_hh_sums)[-(1:3)]) pers_z_list <- rep(0, ncol(inc_pers_sums) - 4) %>% as.list() %>% @@ -332,7 +331,7 @@ A final step in file preparation for the household and person files is creating #### Household variables -For the household file, we create categories for tenure (rental status), urbanicity, income, place size, and region. A codebook of the household variables are located in Table \@ref(tab:cb-hh). +For the household file, we create categories for tenure (rental status), urbanicity, income, place size, and region. A codebook of the household variables is located in Table \@ref(tab:cb-hh). Table: (\#tab:cb-hh) Codebook for household variables @@ -426,7 +425,7 @@ hh_vsum_der %>% count(Region, V2127B) #### Person variables -For the person file, we create categories for sex, race/Hispanic origin, age categories, and marital status. A codebook of the household variables is located in Table \@ref(tab:cb-pers). We also merge the household demographics to the person file as well as the design variables (`V2117` and `V2118`). +For the person file, we create categories for sex, race/Hispanic origin, age categories, and marital status. A codebook of the household variables is located in Table \@ref(tab:cb-pers). We also merge the household demographics to the person file as well as the design variables (`V2117` and `V2118`.) Table: (\#tab:cb-pers) Codebook for person variables @@ -465,7 +464,6 @@ Table: (\#tab:cb-pers) Codebook for person variables ```{r} #| label: ncvs-vign-pers-der -# Set label for usage later NHOPI <- "Native Hawaiian or Other Pacific Islander" pers_vsum_der <- pers_vsum %>% @@ -564,7 +562,7 @@ The tibbles `hh_vsum_slim`, `pers_vsum_slim`, and `inc_analysis` can now be used ## Survey design objects -All the data prep above is necessary to prepare the data for survey analysis. At this point, we can create the design objects and finally begin analysis. We will create three design objects for different types of analysis as they depend on which type of estimate we are creating. For the incident data, the weight of analysis is `NEWWGT`, which we constructed previously. The household and person-level data use `WGTHHCY` and `WGTPERCY`, respectively. For all analyses, `V2117` is the strata variable, and `V2118` is the cluster/PSU variable for analysis. +All the data prep above is necessary to prepare the data for survey analysis. At this point, we can create the design objects and finally begin analysis. 
We create three design objects for different types of analysis as they depend on which type of estimate we are creating. For the incident data, the weight of analysis is `NEWWGT`, which we constructed previously. The household and person-level data use `WGTHHCY` and `WGTPERCY`, respectively. For all analyses, `V2117` is the strata variable, and `V2118` is the cluster/PSU variable for analysis. All this information can be found in the User's Guide [@ncvs_user_guide]. ```{r} #| label: ncvs-vign-desobj @@ -604,36 +602,125 @@ Now that we have prepared our data and created the design objects, we can calcul 3. *Victimization rates* are estimates of the number of victimizations per 1,000 persons or households in the population. -4. Prevalence rates are estimates of the percentage of the population (persons or households) who are victims of a crime. +4. *Prevalence rates* are estimates of the percentage of the population (persons or households) who are victims of a crime. ### Estimation 1: Victimization totals {#vic-tot} -There are two ways to calculate victimization totals. Using the incident design object (`inc_des`) is the most straightforward method, but the person (`pers_des`) and household (`hh_des`) design objects can be used as well if the adjustment factor (`ADJINC_WT`) is incorporated. In the example below, the total number of property and violent victimizations is first calculated using the incident file and then using the household and person design objects. The incident file is smaller, and thus, estimation is faster using that file, but the estimates will be the same as illustrated below: +There are two ways to calculate victimization totals. Using the incident design object (`inc_des`) is the most straightforward method, but the person (`pers_des`) and household (`hh_des`) design objects can be used as well if the adjustment factor (`ADJINC_WT`) is incorporated. In the example below, the total number of property and violent victimizations is first calculated using the incident file and then using the household and person design objects. The incident file is smaller and estimation is faster using that file, but the estimates are the same as illustrated in Table \@ref(tab:ncvs-vign-vt1), Table \@ref(tab:ncvs-vign-vt2a), and Table \@ref(tab:ncvs-vign-vt2b). + +```{r} +#| label: ncvs-vign-victot-examp-calc +#| echo: false +#| warning: false +vt1df <- inc_des %>% + summarize( + Property_Vzn = survey_total(Property, na.rm = TRUE), + Violent_Vzn = survey_total(Violent, na.rm = TRUE) + ) + +vt2adf <- hh_des %>% + summarize(Property_Vzn = survey_total(Property * ADJINC_WT, + na.rm = TRUE + )) + +vt2bdf <- pers_des %>% + summarize(Violent_Vzn = survey_total(Violent * ADJINC_WT, + na.rm = TRUE + )) +``` + + ```{r} #| label: ncvs-vign-victot-examp -vt1 <- inc_des %>% +vt1 <- + inc_des %>% summarize(Property_Vzn = survey_total(Property, na.rm = TRUE), - Violent_Vzn = survey_total(Violent, na.rm = TRUE)) - + Violent_Vzn = survey_total(Violent, na.rm = TRUE)) %>% + gt() %>% + tab_spanner( + label="Property crime", + columns=starts_with("Property") + ) %>% + tab_spanner( + label="Violent crime", + columns=starts_with("Violent") + ) %>% + cols_label( + ends_with("Vzn")~"Total", + ends_with("se")~"S.E." 
+ ) %>% + fmt_number(decimals=0) + vt2a <- hh_des %>% summarize(Property_Vzn = survey_total(Property * ADJINC_WT, - na.rm = TRUE)) + na.rm = TRUE)) %>% + gt() %>% + tab_spanner( + label="Property crime", + columns=starts_with("Property") + ) %>% + cols_label( + ends_with("Vzn")~"Total", + ends_with("se")~"S.E." + ) %>% + fmt_number(decimals=0) vt2b <- pers_des %>% summarize(Violent_Vzn = survey_total(Violent * ADJINC_WT, - na.rm = TRUE)) + na.rm = TRUE)) %>% + gt() %>% + tab_spanner( + label="Violent crime", + columns=starts_with("Violent") + ) %>% + cols_label( + ends_with("Vzn")~"Total", + ends_with("se")~"S.E." + ) %>% + fmt_number(decimals=0) +``` + +(ref:ncvs-vign-vt1) Estimates of total property and violent victimizations with standard errors calculated using the incident design object, 2021 (vt1) + +```{r} +#| label: ncvs-vign-vt1 +#| echo: FALSE +#| warning: FALSE + +vt1 %>% + print_gt_book(knitr::opts_current$get()[["label"]]) +``` + + +(ref:ncvs-vign-vt2a) Estimates of total property victimizations with standard errors calculated using the household design object, 2021 (vt2a) + +```{r} +#| label: ncvs-vign-vt2a +#| echo: FALSE +#| warning: FALSE + +vt2a %>% + print_gt_book(knitr::opts_current$get()[["label"]]) +``` + -vt1 -vt2a -vt2b +(ref:ncvs-vign-vt2b) Estimates of total violent victimizations with standard errors calculated using the person design object, 2021 (vt2b) + +```{r} +#| label: ncvs-vign-vt2b +#| echo: FALSE +#| warning: FALSE + +vt2b %>% + print_gt_book(knitr::opts_current$get()[["label"]]) ``` -The number of victimizations estimated using the incident file is equivalent to the person and household file method. There are `r prettyNum(vt1$Property_Vzn, big.mark=",")` property incidents and `r prettyNum(vt1$Violent_Vzn, big.mark=",")` violent incidents in a six-month period. +The number of victimizations estimated using the incident file is equivalent to the person and household file method. There are an estimated `r prettyNum(vt1df$Property_Vzn, big.mark=",")` property victimizations and `r prettyNum(vt1df$Violent_Vzn, big.mark=",")` violent victimizations in 2021. ### Estimation 2: Victimization proportions {#vic-prop} -Victimization proportions are proportions describing features of a victimization. The key here is that these are questions among victimizations, not among the population. These types of estimates can only be calculated using the incident design object (`inc_des`). +Victimization proportions are proportions describing features of a victimization. The key here is that these are estimates among victimizations, not among the population. These types of estimates can only be calculated using the incident design object (`inc_des`.) 
For example, we could be interested in the percentage of property victimizations reported to the police as shown in the following code with an estimate, the standard error, and 95% confidence interval: @@ -641,7 +728,10 @@ For example, we could be interested in the percentage of property victimizations #| label: ncvs-vign-vic-prop-police prop1 <- inc_des %>% filter(Property) %>% - summarize(Pct = survey_mean(ReportPolice, na.rm = TRUE, proportion=TRUE, vartype=c("se", "ci")) * 100) + summarize(Pct = survey_mean(ReportPolice, + na.rm = TRUE, + proportion=TRUE, + vartype=c("se", "ci")) * 100) prop1 ``` @@ -652,18 +742,19 @@ Or, the percentage of violent victimizations that are in urban areas: #| label: ncvs-vign-vic-prop-urban prop2 <- inc_des %>% filter(Violent) %>% - summarize(Pct = survey_mean(Urbanicity=="Urban", na.rm = TRUE) * 100) + summarize(Pct = survey_mean(Urbanicity=="Urban", + na.rm = TRUE) * 100) prop2 ``` -In 2021, we estimate that `r formatC(prop1$Pct, digits=1, format="f")`% of property crimes were reported to the police and `r formatC(prop2$Pct, digits=1, format="f")`% of violent crimes occurred in urban areas. +In 2021, we estimate that `r formatC(prop1$Pct, digits=1, format="f")`% of property crimes were reported to the police, and `r formatC(prop2$Pct, digits=1, format="f")`% of violent crimes occurred in urban areas. ### Estimation 3: Victimization rates {#vic-rate} -Victimization rates measure the number of victimizations per population. They are not an estimate of the proportion of households or persons who are victimized, which is a prevalence rate described in section \@ref(prev-rate). Victimization rates are estimated using the household (`hh_des`) or person (`pers_des`) design objects depending on the type of crime, and the adjustment factor (`ADJINC_WT`) must be incorporated. We return to the example of property and violent victimizations used in the example for victimization totals (section \@ref(vic-tot)). In the following example, the property victimization totals are calculated as above, as well as the property victimization rate (using `survey_mean()`) and the population size using `survey_total()`. +Victimization rates measure the number of victimizations per population. They are not an estimate of the proportion of households or persons who are victimized, which is a prevalence rate described in Section \@ref(prev-rate). Victimization rates are estimated using the household (`hh_des`) or person (`pers_des`) design objects depending on the type of crime, and the adjustment factor (`ADJINC_WT`) must be incorporated. We return to the example of property and violent victimizations used in the example for victimization totals (Section \@ref(vic-tot).) In the following example, the property victimization totals are calculated as above, as well as the property victimization rate (using `survey_mean()`) and the population size using `survey_total()`. -As mentioned in the introduction, victimization rates use the incident weight in the numerator and the person or household weight in the denominator. This is accomplished by calculating the rates with the weight adjustment (`ADJINC_WT`) multiplied by the estimate of interest. Let's look at an example of property victimization. +Victimization rates use the incident weight in the numerator and the person or household weight in the denominator. This is accomplished by calculating the rates with the weight adjustment (`ADJINC_WT`) multiplied by the estimate of interest. Let's look at an example of property victimization. 
```{r} #| label: ncvs-vign-vic-rate @@ -679,7 +770,7 @@ vr_prop <- hh_des %>% vr_prop ``` -In the output above, we see the estimate for property victimization rate in 2021 was `r formatC(vr_prop$Property_Rate, format="f", digits=1)` per 1,000 households, which is consistent with calculating as the number of victimizations per 1,000 population as demonstrated in the next chunk: +In the output above, we see the estimate for property victimization rate in 2021 was `r formatC(vr_prop$Property_Rate, format="f", digits=1)` per 1,000 households. This is consistent with calculating the number of victimizations per 1,000 population, as demonstrated in the following code output. ```{r} #| label: ncvs-vign-vic-rate-2 @@ -689,7 +780,7 @@ vr_prop %>% mutate(Property_Rate_manual=Property_Vzn/PopSize*1000) ``` -Victimization rates can also be calculated for particular characteristics of the victimization. In the following example, the rate of aggravated assault with no weapon, with a firearm, with a knife, and with another weapon. +Victimization rates can also be calculated based on particular characteristics of the victimization. In the following example, we calculate the rate of aggravated assault with no weapon, a firearm, a knife, and another weapon. ```{r} #| label: ncvs-vign-pers-rates-char @@ -727,7 +818,7 @@ pers_est_df <- bind_rows() ``` -The output from all the estimates is cleanded to create better labels such as going from "RaceHispOrigin" to "Race/Hispanic Origin". Finally, the {gt} package is used to make a publishable table (Table \@ref(tab:ncvs-vign-rates-demo-tab)). Using the functions from the {gt} package, column labels and footnotes are added and estimates are presented to the first decimal place [@R-gt]. +The output from all the estimates is cleaned to create better labels, such as going from "RaceHispOrigin" to "Race/Hispanic Origin". Finally, the {gt} package is used to make a publishable table (Table \@ref(tab:ncvs-vign-rates-demo-tab).) Using the functions from the {gt} package, we add column labels and footnotes and present estimates rounded to the first decimal place [@R-gt]. ```{r} #| label: ncvs-vgn-rates-demo-gt-create @@ -811,7 +902,7 @@ vr_gt %>% ### Estimation 4: Prevalence rates {#prev-rate} -Prevalence rates differ from victimization rates as the numerator is the number of people or households victimized rather than the number of victimizations. To calculate the prevalence rates, we must run another summary of the data by calculating an indicator for whether a person or household is a victim of a particular crime at any point in the year. Below is an example of calculating first the indicator and then the prevalence rate of violent crime and aggravated assault. +Prevalence rates differ from victimization rates as the numerator is the number of people or households victimized rather than the number of victimizations. To calculate the prevalence rates, we must run another summary of the data by calculating an indicator for whether a person or household is a victim of a particular crime at any point in the year. Below is an example of calculating the indicator and then the prevalence rate of violent crime and aggravated assault. 
```{r} #| label: ncvs-vign-prevexamp @@ -854,10 +945,10 @@ prop_tenure <- hh_des %>% prop_tenure ``` -The property victimization rate for rented households is `r prop_tenure %>% filter(Tenure=="Rented") %>% pull(Property_Rate) %>% round(1)` per 1,000 households while the property victimization rate for owned households is `r prop_tenure %>% filter(Tenure=="Owned") %>% pull(Property_Rate) %>% round(1)`, which seem very different especially given the non-overlapping confidence intervals. However, survey data is inheriently non-independent so statistical testing cannot be done by comparing confidence intervals. To conduct the statistical test, we first need to create a variable that we will compare which incorporates the adjusted incident weight (`ADJINC_WT`) and then the test can be conducted as discussed in Chapter \@ref(c06-statistical-testing). +The property victimization rate for rented households is `r prop_tenure %>% filter(Tenure=="Rented") %>% pull(Property_Rate) %>% round(1)` per 1,000 households, while the property victimization rate for owned households is `r prop_tenure %>% filter(Tenure=="Owned") %>% pull(Property_Rate) %>% round(1)`, which seem very different, especially given the non-overlapping confidence intervals. However, survey data are inherently non-independent, so statistical testing cannot be done by comparing confidence intervals. To conduct the statistical test, we first need to create a variable that incorporates the adjusted incident weight (`ADJINC_WT`), and then the test can be conducted on this adjusted variable as discussed in Chapter \@ref(c06-statistical-testing). ```{r} -#| label: ncvs-vgn-prop-stat-test +#| label: ncvs-vign-prop-stat-test prop_tenure_test <- hh_des %>% mutate( Prop_Adj=Property * ADJINC_WT * 1000 @@ -868,12 +959,10 @@ prop_tenure_test <- hh_des %>% na.rm = TRUE ) %>% broom::tidy() - -prop_tenure_test ``` ```{r} -#| label: ncvs-vgn-prop-stat-test-gt +#| label: ncvs-vign-prop-stat-test-gt #| eval: FALSE prop_tenure_test %>% mutate(p.value = pretty_p_value(p.value)) %>% @@ -881,10 +970,10 @@ prop_tenure_test %>% fmt_number() ``` -(ref:ncvs-vgn-prop-stat-test-gt-tab) T-test output for estimates of property victimization rates between properties that are owned versus rented, NCVS 2021 +(ref:ncvs-vign-prop-stat-test-gt-tab) T-test output for estimates of property victimization rates between properties that are owned versus rented, NCVS 2021 ```{r} -#| label: ncvs-vgn-prop-stat-test-gt-tab +#| label: ncvs-vign-prop-stat-test-gt-tab #| echo: FALSE #| warning: FALSE @@ -895,11 +984,11 @@ prop_tenure_test %>% print_gt_book(knitr::opts_current$get()[["label"]]) ``` -The output of the statistical test shows the same difference of `r prop_tenure_test$estimate %>% round(1)` between the property victimization rates of renters and owners and the test is highly significant with the p-value of `r prettyunits::pretty_p_value(prop_tenure_test$p.value)`. +The output of the statistical test shown in Table \@ref(tab:ncvs-vign-prop-stat-test-gt-tab) indicates a difference of `r prop_tenure_test$estimate %>% round(1)` between the property victimization rates of renters and owners, and the test is highly significant with the p-value of `r prettyunits::pretty_p_value(prop_tenure_test$p.value)`. ## Exercises -1. What proportion of completed motor vehicle thefts are **not** reported to the police? Hint: Use the codebook to look at the definition of Type of Crime (V4529). +1. What proportion of completed motor vehicle thefts are **not** reported to the police? 
Hint: Use the codebook to look at the definition of Type of Crime (V4529.)

2. How many violent crimes occur in each region?

diff --git a/14-ambarom-vignette.Rmd b/14-ambarom-vignette.Rmd
index c8c5bfd0..1411c34e 100644
--- a/14-ambarom-vignette.Rmd
+++ b/14-ambarom-vignette.Rmd
@@ -27,7 +27,7 @@ library(gt)
 library(ggpattern)
 ```
 
-In this vignette, we use a subset of data from the 2021 AmericasBarometer survey. Download the raw files, available on the [LAPOP website](http://datasets.americasbarometer.org/database/index.php). We work with version 1.2 of the data, and there are separate files for each of the 22 countries. To read all files into R while ignoring the Stata labels, we recommend running code like this using `read_stata()` function from the {haven} package to import the data [@R-haven]:
+In this vignette, we use a subset of data from the 2021 AmericasBarometer survey. Download the raw files, available on the [LAPOP website.](http://datasets.americasbarometer.org/database/index.php) We work with version 1.2 of the data, and there are separate files for each of the 22 countries. To read all files into R while ignoring the Stata labels, we recommend running the following code using the `read_stata()` function from the {haven} package to import the data [@R-haven]:
 
```r
stata_files <- list.files(here("RawData", "LAPOP_2021"), "*.dta")
@@ -61,7 +61,7 @@ The survey includes a core set of questions for all countries, but not every que
 
 ## Data structure
 
-Each country and year has its own file available in Stata format (`.dta`). In this vignette, we download and combine all the data from the 22 participating countries in 2021. We subset the data to a smaller set of columns, as noted in the prerequisites box. Review the core questionnaire to understand the common variables across the countries [@lapop-svy].
+Each country and year has its own file available in Stata format (`.dta`.) In this vignette, we download and combine all the data from the 22 participating countries in 2021. We subset the data to a smaller set of columns, as noted in the prerequisites box. We recommend reviewing the core questionnaire to understand the common variables across the countries [@lapop-svy].
 
 ## Preparing files
 
@@ -150,7 +150,7 @@ ambarom %>%
 
 ## Survey design objects
 
-The technical report is the best reference for understanding how to specify the sampling design in R [@lapop-tech]. The data includes two weights: `wt` and `weight1500`. The first weight variable is specific to each country and sums to the sample size, but it is calibrated to reflect each country's demographics. The second weight variable sums to 1500 for each country and is recommended for multi-country analyses. Although not explicitly stated in the documentation, the Stata syntax example (`svyset upm [pw=weight1500], strata(strata)`) indicates the variable `upm` is a clustering variable and `strata` is the strata variable. Therefore, the design object is created in R as follows:
+The technical report is the best reference for understanding how to specify the sampling design in R [@lapop-tech]. The data include two weights: `wt` and `weight1500`. The first weight variable is specific to each country and sums to the sample size, but it is calibrated to reflect each country's demographics. The second weight variable sums to 1500 for each country and is recommended for multi-country analyses.
Although not explicitly stated in the documentation, the Stata syntax example (`svyset upm [pw=weight1500], strata(strata)`) indicates the variable `upm` is a clustering variable, and `strata` is the strata variable. Therefore, the design object for multi-country analysis is created in R as follows: ```{r} #| label: ambarom-design @@ -160,7 +160,7 @@ ambarom_des <- ambarom %>% weight = weight1500) ``` -One interesting thing to note is that these weight variables can provide estimates for comparing countries but not for multi-country estimates. The reason is that the weights do not account for the different sizes of countries. For example, Canada has about 10% of the population of the United States, but an estimate that uses records from both countries would weigh them equally. +One interesting thing to note is that these weight variables can provide estimates for comparing countries rather than for multi-country estimates. The reason is that the weights do not account for the different sizes of countries. For example, Canada has about 10% of the population of the United States, but an estimate that uses records from both countries would weigh them equally. ## Calculating estimates {#ambarom-estimates} @@ -168,7 +168,7 @@ When calculating estimates from the data, we use the survey design object `ambar ### Example: Worried about COVID -This survey was administered between March and August of 2021, with the specific timing varying by country^[See Table 2 in @lapop-tech for dates by country]. Given the state of the pandemic at that time, several questions about COVID were included. The first question about COVID asked: +This survey was administered between March and August of 2021, with the specific timing varying by country.^[See Table 2 in @lapop-tech for dates by country] Given the state of the pandemic at that time, several questions about COVID were included. The first question about COVID asked: > How worried are you about the possibility that you or someone in your household will get sick from coronavirus in the next 3 months? > @@ -177,7 +177,7 @@ This survey was administered between March and August of 2021, with the specific > - A little worried > - Not worried at all -If we are interested in those who are very worried or somewhat worried, we can create a new variable (`CovidWorry_bin`) that groups levels of the original question using the `fct_collapse()` function from the {forcats} package [@R-forcats]. We then use the `survey_count()` function to understand how responses are distributed across each category of the original variable (`CovidWorry`) and the new variable (`CovidWorry_bin`). +If we are interested in those who are very worried or somewhat worried, we can create a new variable (`CovidWorry_bin`) that groups levels of the original question using the `fct_collapse()` function from the {forcats} package [@R-forcats]. We then use the `survey_count()` function to understand how responses are distributed across each category of the original variable (`CovidWorry`) and the new variable (`CovidWorry_bin`.) ```{r} #| label: ambarom-worry-est1 @@ -357,12 +357,13 @@ In the countries that were asked this question, many households experienced a ch ## Mapping survey data {#ambarom-maps} -While the table effectively presents the data, a map could also be insightful. To generate maps of the countries, we can use the package {rnaturalearth} and subset North and South America with the `ne_countries()` function [@R-rnaturalearth]. 
The function returns an sf (simple features) object with many columns [@sf2023], but most importantly, `soverignt` (sovereignty), `geounit` (country or territory), and `geometry` (the shape). For an example of the difference between sovereignty and country/territory, the United States, Puerto Rico, and the US Virgin Islands are all separate units with the same sovereignty. A map without data is plotted in Figure \@ref(fig:ambarom-americas-map) using `geom_sf()` from the {ggplot2} package which plots sf objects [@ggplot22016]. +While the table effectively presents the data, a map could also be insightful. To generate maps of the countries, we can use the package {rnaturalearth} and subset North and South America with the `ne_countries()` function [@R-rnaturalearth]. The function returns an sf (simple features) object with many columns [@sf2023], but most importantly, `soverignt` (sovereignty), `geounit` (country or territory), and `geometry` (the shape.) For an example of the difference between sovereignty and country/territory, the United States, Puerto Rico, and the U.S. Virgin Islands are all separate units with the same sovereignty. A map without data is plotted in Figure \@ref(fig:ambarom-americas-map) using `geom_sf()` from the {ggplot2} package, which plots sf objects [@ggplot2wickham]. ```{r} #| label: ambarom-americas-map #| fig.cap: "Map of North and South America" -#| error: true +#| fig.alt: "A blank map of the world, showing only the outlines of the countries in Western Hemisphere." + country_shape <- ne_countries( scale = "medium", @@ -387,7 +388,7 @@ country_shape_crop <- country_shape %>% ymax = 90)) ``` -Now that we have the necessary shape files, our next step is to match our survey data to the map. Countries can be named differently (e.g., "U.S", "U.S.A", "United States"). To make sure we can visualize our survey data on the map, we need to match the country names in both the survey data and the map data. To do this, we can use the `anti_join()` function to identify the countries in the survey data that aren't in the map data. For example, as shown below, the United States is referred to as "United States" in the survey data but "United States of America" in the map data. Table \@ref(tab:ambarom-map-merge-check-1-tab) shows the countries in the survey data but not the map data and Table \@ref(tab:ambarom-map-merge-check-2-tab) shows the countries in the map data but not the survey data. +Now that we have the necessary shape files, our next step is to match our survey data to the map. Countries can be named differently (e.g., "U.S", "U.S.A", "United States".) To make sure we can visualize our survey data on the map, we need to match the country names in both the survey data and the map data. To do this, we can use the `anti_join()` function to identify the countries in the survey data that aren't in the map data. For example, as shown below, the United States is referred to as "United States" in the survey data but "United States of America" in the map data. Table \@ref(tab:ambarom-map-merge-check-1-tab) shows the countries in the survey data but not the map data, and Table \@ref(tab:ambarom-map-merge-check-2-tab) shows the countries in the map data but not the survey data. ```{r} #| label: ambarom-map-merge-check-1-gt @@ -450,7 +451,7 @@ country_shape_upd <- country_shape_crop %>% "United States", geounit)) ``` -Now that the country names match, we can merge the survey and map data and then plot the data. 
We begin with the map file and merge it with the survey estimates generated in Section \@ref(ambarom-estimates) (`covid_worry_country_ests` and `covid_educ_ests`). We use the {sf} function of `full_join()`, which joins the rows in the map data and the survey estimates based on the columns `geounit` and `Country`. A full join keeps all the rows from both datasets, matching rows when possible. For any rows without matches, the function fills in an `NA` for the missing value [@sf2023].
+Now that the country names match, we can merge the survey and map data and then plot the data. We begin with the map file and merge it with the survey estimates generated in Section \@ref(ambarom-estimates) (`covid_worry_country_ests` and `covid_educ_ests`.) We use the `full_join()` function from the {sf} package, which joins the rows in the map data and the survey estimates based on the columns `geounit` and `Country`. A full join keeps all the rows from both datasets, matching rows when possible. For any rows without matches, the function fills in an `NA` for the missing value [@sf2023].
 
```{r}
#| label: ambarom-join-maps-ests
@@ -461,12 +462,13 @@ covid_sf <- country_shape_upd %>%
 by = c("geounit" = "Country"))
```
 
-After the merge, we create two figures that display the population estimates for the percentage of people worried about COVID (Figure \@ref(fig:ambarom-make-maps-covid)) and the percentage of households with at least one child participating in virtual or hybrid learning (Figure \@ref(fig:ambarom-make-maps-covid-ed)). We also add a cross-hatching pattern to the countries without any data using the `geom_sf_pattern()` function from the {ggpattern} package [@R-ggpattern].
+After the merge, we create two figures that display the population estimates for the percentage of people worried about COVID (Figure \@ref(fig:ambarom-make-maps-covid)) and the percentage of households with at least one child participating in virtual or hybrid learning (Figure \@ref(fig:ambarom-make-maps-covid-ed).) We also add a cross-hatching pattern to the countries without any data using the `geom_sf_pattern()` function from the {ggpattern} package [@R-ggpattern].
 
```{r}
#| label: ambarom-make-maps-covid
#| fig.cap: "Percent of households worried someone in their household will get COVID-19 in the next 3 months by country"
-#| error: true
+#| fig.alt: "A choropleth map of the Western Hemisphere where the color scale filling in each country corresponds to the percent of households worried someone in their household will get COVID-19 in the next 3 months. The bottom of the range is 30% and the top of the range is 80%. Brazil and Chile look like the countries with the highest percentage of worry, with North America showing a lower percentage of worry. Countries without data, such as Venezuela, are displayed with a hash pattern."
+

ggplot() +
  geom_sf(data = covid_sf,
@@ -493,7 +495,8 @@ ggplot() +
 
```{r}
#| label: ambarom-make-maps-covid-ed
#| fig.cap: "Percent of households who had at least one child participate in virtual or hybrid learning"
-#| error: true
+#| fig.alt: "A choropleth map of the Western Hemisphere where the color scale filling in each country corresponds to the percent of households who had at least one child participate in virtual or hybrid learning. The bottom of the range is 20% and the top of the range is 100%. Most of North America is missing data and is filled in with a hash pattern. The countries with data show a high percentage of households who had at least one child participate in virtual or hybrid learning."
+

ggplot() +
  geom_sf(
    data = covid_sf,
@@ -523,7 +526,8 @@ In Figure \@ref(fig:ambarom-make-maps-covid-ed), we observe missing data (repres
 
```{r}
#| label: ambarom-make-maps-covid-ed-c-s
#| fig.cap: "Percent of households who had at least one child participate in virtual or hybrid learning, Central and South America"
-#| error: true
+#| fig.alt: "A choropleth map of Central and South America where the color scale filling in each country corresponds to the percent of households who had at least one child participate in virtual or hybrid learning. The bottom of the range is 20% and the top of the range is 100%. Most of North America is missing data and is filled in with a hash pattern. The countries with data show a high percentage of households who had at least one child participate in virtual or hybrid learning."
+

covid_c_s <- covid_sf %>%
  filter(region_wb == "Latin America & Caribbean")
@@ -552,10 +556,10 @@ ggplot() +
 theme_minimal()
```
 
-In Figure \@ref(fig:ambarom-make-maps-covid-ed-c-s), we can see that most countries with available data have similar percentages (reflected in their similar shades). However, Haiti stands out with a lighter shade, indicating a considerably lower percentage of households with at least one child participating in virtual or hybrid learning.
+In Figure \@ref(fig:ambarom-make-maps-covid-ed-c-s), we can see that most countries with available data have similar percentages (reflected in their similar shades.) However, Haiti stands out with a lighter shade, indicating a considerably lower percentage of households with at least one child participating in virtual or hybrid learning.
 
 ## Exercises
 
-1. Calculate the percentage of households with broadband internet in and those with any internet at home, including from a phone or tablet in Latin America and the Caribbean. Hint: if you come across countries with 0% internet usage, you may want to filter by something first.
+1. Calculate the percentage of households with broadband internet and those with any internet at home, including from a phone or tablet in Latin America and the Caribbean. Hint: if there are countries with 0% internet usage, try filtering by something first.
 
 2. Create a faceted map showing both broadband internet and any internet usage.
\ No newline at end of file
diff --git a/89-Appendix-DataImport.Rmd b/89-Appendix-DataImport.Rmd
index a9ec6bda..03658767 100644
--- a/89-Appendix-DataImport.Rmd
+++ b/89-Appendix-DataImport.Rmd
@@ -64,14 +64,14 @@ read_csv(
 The arguments are:
 
* `file`: the path to the CSV file to import
-* `col_names`: a value of `TRUE` will import the first row of the `file` as column names and not included in the data frame. A value of `FALSE` will create automated column names. Alternatively, we can provide a vector of column names.
-* `col_types`: by default, R will infer the column variable types. We can also provide a column specification using `list()` or `cols()`; for example, use `col_types = cols(.default = "c")` to read all the columns as characters. Alternatively, we can use a string to specify the variable types for each column.
+* `col_names`: a value of `TRUE` imports the first row of the `file` as column names, which are not included in the data frame. A value of `FALSE` creates automated column names. Alternatively, we can provide a vector of column names.
+* `col_types`: by default, R infers the column variable types.
We can also provide a column specification using `list()` or `cols()`; for example, use `col_types = cols(.default = "c")` to read all the columns as characters. Alternatively, we can use a string to specify the variable types for each column.
* `col_select`: the columns to include in the results
* `id`: a column for storing the file path. This is useful for keeping track of the input file when importing multiple CSVs at a time.
* `locale`: the location-specific defaults for the file
* `na`: a character vector of values to interpret as missing
* `comment`: a character vector of values to interpret as comments
-* `trim_ws`: a value of `TRUE` will trim leading and trailing white space
+* `trim_ws`: a value of `TRUE` trims leading and trailing white space
* `skip`: number of lines to skip before importing the data
* `n_max`: maximum number of lines to read
* `guess_max`: maximum number of lines used for guessing column types
@@ -79,8 +79,8 @@ The arguments are:
* `num_threads`: the number of processing threads to use for initial parsing and lazy reading of data
* `progress`: a value of `TRUE` displays a progress bar
* `show_col_types`: a value of `TRUE` displays the column types
-* `skip_empty_rows`: a value of `TRUE` will ignore blank rows
-* `lazy`: a value of `TRUE` will read values lazily
+* `skip_empty_rows`: a value of `TRUE` ignores blank rows
+* `lazy`: a value of `TRUE` reads values lazily
 
The other functions share a similar syntax to `read_csv()`. To find more details, run `??` followed by the function name. For example, run `??read_delim` in the Console for additional information.
 
@@ -97,7 +97,7 @@ anes_csv <-
 
Excel, a widely used spreadsheet software program created by Microsoft, is a common file format in survey research. We can load Excel spreadsheets into the R environment using the {readxl} package. The package supports both the legacy `.xls` files and the modern `.xlsx` format.
 
-To load Excel data into R, we can use the `read_excel()` function from the {readxl} package. This function offers a range of customizable options for the import process. Let's explore the syntax:
+To load Excel data into R, we can use the `read_excel()` function from the {readxl} package. This function offers a range of options for the import process.
Let's explore the syntax: ``` read_excel( @@ -160,7 +160,7 @@ read_dta( The arguments are: -* `file`: the path to the proprietary data file to import +* `file`: the path to the proprietary data file to import * `encoding`: specifies the character encoding of the data file * `col_select`: select specific columns for import * `skip` and `n_max`: control the number of rows skipped and the maximum number of rows imported @@ -182,10 +182,10 @@ read_sav( The arguments are: -* `file`: the path to the proprietary data file to import +* `file`: the path to the proprietary data file to import * `encoding`: specifies the character encoding of the data file * `col_select`: select specific columns for import -* `user_na`: a value of `TRUE` will read variables with user defined missing labels will be read into `labelled_spss()` objects +* `user_na`: a value of `TRUE` reads variables with user defined missing labels into `labelled_spss()` objects * `skip` and `n_max`: control the number of rows skipped and the maximum number of rows imported * `.name_repair`: determines how column names are repaired if they are not valid @@ -214,7 +214,7 @@ The arguments are: * `skip` and `n_max`: control the number of rows skipped and the maximum number of rows imported * `.name_repair`: determines how column names are repaired if they are not valid -In the code examples below, we demonstrate how to load Stata, SPSS, and SAS files into R using the respective {haven} functions. The resulting data is stored in `anes_dta`, `anes_sav`, and `anes_sas` objects as tibbles, ready for use in R. +In the code examples below, we demonstrate how to load Stata, SPSS, and SAS files into R using the respective {haven} functions. The resulting data are stored in `anes_dta`, `anes_sav`, and `anes_sas` objects as tibbles, ready for use in R. For the Stata example, we show you how to load in the data from the {srvyrexploR} package and will use this data in examples later in this Appendix. Stata: @@ -223,7 +223,9 @@ Stata: library(haven) anes_dta <- - read_dta(system.file("extdata", "anes_2020_stata_example.dta", package="srvyrexploR")) + read_dta(system.file("extdata", + "anes_2020_stata_example.dta", + package="srvyrexploR")) ``` SPSS: @@ -248,11 +250,11 @@ anes_sas <- Stata, SPSS, and SAS files often contain labeled variables and values. These labels provide descriptive information about categorical data, making it easier to understand and analyze. When importing data from Stata, SPSS, or SAS, preserving these labels is essential for maintaining data fidelity. -Consider a variable like 'Education Level' with coded values (e.g., 1, 2, 3). Without labels, these codes can be cryptic. However, with labels ('High School Graduate,' 'Bachelor's Degree,' 'Master's Degree'), the data becomes more informative and easier to work with. +Consider a variable like 'Education Level' with coded values (e.g., 1, 2, 3.) Without labels, these codes can be cryptic. However, with labels ('High School Graduate,' 'Bachelor's Degree,' 'Master's Degree'), the data become more informative and easier to work with. With the {haven} package, we have the capability to import and work with labeled data from Stata, SPSS, and SAS files. The package uses a special class of data called `haven_labelled` to store labeled variables. When a dataset label is defined in Stata, it is stored in the 'label' attribute of the tibble when imported, ensuring that the information is not lost. 
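One quick way to see what {haven} preserved on import is to inspect these attributes directly. The following is a minimal sketch, assuming the `anes_dta` object created with `read_dta()` above: the dataset-level label (when one is defined in Stata) is stored in the `label` attribute of the tibble, and labeled columns carry the `haven_labelled` class.

```r
# Sketch: check what {haven} stored on import (assumes anes_dta from above).
# The dataset-level Stata label, if defined, lives in the "label" attribute.
attr(anes_dta, "label")

# Individual labeled columns carry the haven_labelled class.
class(anes_dta$V200002)
```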
-We can use functions like `select()`, `glimpse()`, and `is.labelled()` to inspect the imported data and verify if variables are labeled. Take a look at the ANES Stata file. Notice that categorical variables are marked with a type of ``. This notation indicates that these variables are labeled. +We can use functions like `select()`, `glimpse()`, and `is.labelled()` to inspect the imported data and verify if the variables are labeled. Take a look at the ANES Stata file. Notice that categorical variables are marked with a type of ``. This notation indicates that these variables are labeled. ```{r} #| label: readr-glimpse @@ -271,7 +273,7 @@ We can confirm this label status using the `haven::is.labelled()` function. haven::is.labelled(anes_dta$V200002) ``` -To explore the labels further, we can use the `attributes()` function. This function provides insights into both the variable labels (`$label`) and the associated value labels (`$labels`). +To explore the labels further, we can use the `attributes()` function. This function provides insights into both the variable labels (`$label`) and the associated value labels (`$labels`.) ```{r} #| label: readr-attributes @@ -282,7 +284,7 @@ When we import a labeled dataset using {haven}, it results in a tibble containin #### Option 1: Convert the vector into a factor {-} -Factors are native R data types for working with categorical data. They consist of integer values that correspond to character values, known as levels. Below is a dummy example of factors. Printing `factors` shows the four different levels in the data: `strongly agree`, `agree`, `disagree`, and `strongly disagree`. +Factors are native R data types for working with categorical data. They consist of integer values that correspond to character values, known as levels. Below is a dummy example of factors. The `factors` show the four different levels in the data: `strongly agree`, `agree`, `disagree`, and `strongly disagree`. ```{r} #| label: readr-factor @@ -329,7 +331,7 @@ anes_dta_factor %>% The second option is to remove the labels altogether, converting the labeled data into a regular R data frame. To remove, or 'zap' the labels from our tibble, we can use the {haven} package's `zap_label()` and `zap_labels()` functions. This approach removes the labels but retains the data values in their original form. -The ANES Stata file columns contains variable labels. Using purrr's `map()`, we can review the labels using `attr`. In the example below, we list the first two variables and their labels. For instance, the label for `V200002` is "Mode of interview: pre-election interview". +The ANES Stata file columns contain variable labels. Using the function `map()` from {purrr}, we can review the labels using `attr`. In the example below, we list the first two variables and their labels. For instance, the label for `V200002` is "Mode of interview: pre-election interview". ```{r} #| label: readr-label-show @@ -355,7 +357,7 @@ zap_labels(anes_dta) %>% glimpse() ``` -While it is important to convert labeled datasets into regular R data frames for working in R, the labels themselves often contain valuable information that provide context and meaning to the survey variables. To aid with interpretability and documention, consider creating a data dictionary from the labeled dataset. A data dictionary is a reference document that provides detailed information about the variables and values of a survey. 
+While it is important to convert labeled datasets into regular R data frames for working in R, the labels themselves often contain valuable information that provides context and meaning to the survey variables. To aid with interpretability and documentation, consider creating a data dictionary from the labeled dataset. A data dictionary is a reference document that provides detailed information about the variables and values of a survey. The {labelled} package offers a convenient function, `generate_dictionary()`, that creates data dictionaries directly from a labeled dataset [@R-labelled]. This function extracts variable labels, value labels, and other metadata and organizes them into a structured document that we can browse and reference throughout our analysis. @@ -404,7 +406,7 @@ head(gss_dta$HEALTH) #> NA(n) NA ``` -In contrast, SPSS uses a different approach called 'user-defined values' to denote missing values. Each column in an SPSS dataset can have up to three distinct values designated as missing or a specified range of missing values. To model these additional user-defined missing values, {haven} provides the `labeled_spss()` subclass of `labeled()`. When you import SPSS data using {haven}, it ensures that user-defined missing values are correctly handled. You can work with this data in R while preserving the unique missing value conventions from SPSS. +In contrast, SPSS uses a different approach called 'user-defined values' to denote missing values. Each column in an SPSS dataset can have up to three distinct values designated as missing or a specified range of missing values. To model these additional user-defined missing values, {haven} provides the `labeled_spss()` subclass of `labeled()`. When importing SPSS data using {haven}, it ensures that user-defined missing values are correctly handled. We can work with these data in R while preserving the unique missing value conventions from SPSS. Here is what the GSS SPSS data looks like when loaded with {haven}. @@ -427,14 +429,14 @@ head(gss_sps$HEALTH) ## Importing data from APIs into R -In addition to working with data saved as files, we may also need to retrieve data through Application Programming Interfaces (APIs). APIs provide a structured way to access data hosted on external servers and import it directly into R for analysis. +In addition to working with data saved as files, we may also need to retrieve data through Application Programming Interfaces (APIs.) APIs provide a structured way to access data hosted on external servers and import it directly into R for analysis. -To access this data, you need to understand how to construct API requests. Each API has unique endpoints, parameters, and authentication requirements. Pay attention to: +To access these data, we need to understand how to construct API requests. Each API has unique endpoints, parameters, and authentication requirements. Pay attention to: -* Endpoints: These are URLs that point to specific data or services. -* Parameters: Information you pass to the API to customize your request (e.g., date ranges, filters). -* Authentication: APIs may require API keys or tokens for access. -* Rate Limits: APIs may have usage limits, so be aware of any rate limits or quotas. 
+* Endpoints: These are URLs that point to specific data or services +* Parameters: Information passed to the API to customize the request (e.g., date ranges, filters) +* Authentication: APIs may require API keys or tokens for access +* Rate Limits: APIs may have usage limits, so be aware of any rate limits or quotas Typically, we begin by making a GET request to an API endpoint. The {httr2} package allows us to generate and process HTTP requests [@R-httr2]. We can make the GET request by pointing to the URL that contains the data we would like. @@ -445,28 +447,40 @@ api_url <- "https://api.example.com/survey-data" response <- GET(api_url) ``` -Once we make the request, we will obtain the data as the `response`. The data often comes in JSON format. We can extract and parse the data using the {jsonlite} package, allowing us to work with it in R [@jsonlite2014]. The `fromJSON()` function, shown below, coverts JSON data to an R object. +Once we make the request, we obtain the data as the `response`. The data often come in JSON format. We can extract and parse the data using the {jsonlite} package, allowing us to work with it in R [@jsonliteooms]. The `fromJSON()` function, shown below, converts JSON data to an R object. ```r survey_data <- fromJSON(content(response, "text")) ``` -Note that these are dummy examples. Please review the documentation to understand how to make requests from your specific API. +Note that these are dummy examples. Please review the documentation to understand how to make requests from a specific API. -R offers several packages that simplify API access by providing ready-to-use functions for popular APIs. These packages are called "wrappers", as they "wrap" the API to make it easier to use. For example, the {tidycensus} package used in this book simplifies access to U.S. Census data, allowing us to retrieve data with R commands instead of writing complex API requests [@R-tidycensus]. For example, if we are interested in the population (`B01003_001`) of each census tract in North Carolina from the 2020 ACS, we would use the `get_acs()` function and the code below. Behind the scenes, `get_acs()` is making a GET request from the Census API and the {tidycensus} functions are converting the response into an R-friendly format. +R offers several packages that simplify API access by providing ready-to-use functions for popular APIs. These packages are called "wrappers", as they "wrap" the API to make it easier to use. For example, the {tidycensus} package used in this book simplifies access to U.S. Census data, allowing us to retrieve data with R commands instead of writing API requests from scratch [@R-tidycensus]. Behind the scenes, `get_pums()` is making a GET request from the Census API, and the {tidycensus} functions are converting the response into an R-friendly format. For example, if we are interested in the age, sex, race, and Hispanicity of those in the American Community Survey sample in Durham County, North Carolina^[The public use microdata areas (PUMA) for Durham County were identified using the 2020 PUMA Names File: https://www2.census.gov/geo/pdfs/reference/puma2020/2020_PUMA_Names.pdf], we can use the `get_pums()` function to extract this microdata as shown in the code below. We can then use the replicate weights to create a survey object and calculate estimates for Durham County. 
-```r +```{r} +#| label: readr-pumsin +#| results: false library(tidycensus) -census_data <- - get_acs( - geography = "tract", - variables = "B01003_001", - year = 2020, - state = "NC" - ) +durh_pums <- get_pums( + variables = c("PUMA", "SEX", "AGEP", "RAC1P", "HISP"), + state = "NC", + puma = c("01301", "01302"), + survey = "acs1", + year = 2022, + rep_weights = "person" +) + +``` + +```{r} +#| label: readr-pumsprint + +durh_pums ``` + + In Chapter \@ref(c04-getting-started), we used the {censusapi} package to get data from the Census data API for the Current Population Survey. To discover if there's an R package that directly interfaces with a specific survey or data source, search for "[survey] R wrapper" or "[data source] R package" online. ## Accessing databases in R @@ -476,18 +490,20 @@ Databases provide a secure and organized solution as the volume and complexity o There are various ways of working with databases in RStudio. We can connect to different databases through the Connections Pane in the top right of the IDE. We can also use packages like {DBI} and {odbc} to access database tables in R files. Here is an example script connecting to a database: ```r -con <- DBI::dbConnect(odbc::odbc(), - Driver = "[your driver's name]", - Server = "[your server's path]", - UID = rstudioapi::askForPassword("Database user"), - PWD = rstudioapi::askForPassword("Database password"), - Database = "[your database's name]", - Warehouse = "[your warehouse's name]", - Schema = "[your schema's name]" - ) -``` - -The {dbplyr} and {dplyr} packages allow us to make queries and run data analysis entirely using {dplyr} syntax. All of the code can be written in R so we do not have to switch between R and SQL to explore the data. Here is some sample code: +con <- + DBI::dbConnect( + odbc::odbc(), + Driver = "[driver name]", + Server = "[server path]", + UID = rstudioapi::askForPassword("Database user"), + PWD = rstudioapi::askForPassword("Database password"), + Database = "[database name]", + Warehouse = "[warehouse name]", + Schema = "[schema name]" + ) +``` + +The {dbplyr} and {dplyr} packages allow us to make queries and run data analysis entirely using {dplyr} syntax. All of the code can be written in R, so we do not have to switch between R and SQL to explore the data. Here is some sample code: ```r q1 <- tbl(con, "bank") %>% diff --git a/93-AppendixD.Rmd b/93-AppendixD.Rmd index 0921dd18..b019a57e 100644 --- a/93-AppendixD.Rmd +++ b/93-AppendixD.Rmd @@ -6,7 +6,7 @@ knitr::opts_chunk$set(tidy = 'styler') ``` -The chapter exercises use the survey design objects and packages provided in the Prerequisites box in the beginning of the chapter. Please ensure they are loaded in your environment before running the exercise solutions. Code chunks to load these are also included below. +The chapter exercises use the survey design objects and packages provided in the Prerequisites box in the beginning of the chapter. Please ensure they are loaded in the environment before running the exercise solutions. Code chunks to load these are also included below. ```r @@ -243,7 +243,7 @@ pers_des <- pers_vsum_slim %>% nest = TRUE ) ``` - +The chapter exercises use the survey design objects and packages provided in the Prerequisites box in the beginning of the chapter. Please ensure they are loaded in the environment before running the exercise solutions. ## 5 - Descriptive analysis {-} @@ -420,7 +420,7 @@ quant_baenergyexp %>% ## 6 - Statistical testing {-} -1. Using the RECS data, do more than 50% of U.S. 
households use AC (`ACUsed`)? +1. Using the RECS data, do more than 50% of U.S. households use A/C (`ACUsed`)? ```{r} #| label: stattest-ex-solution1 @@ -472,9 +472,7 @@ ttest_solution3 On average, those who voted for Joseph Biden in 2020 were `r ttest_solution3$estimate %>% round(1)` years younger than voters for other candidates and this is significantly different (p `r ttest_solution3$p.value %>% pretty_p_value()`). - - -4. If you wanted to determine if the political party affiliation differed for males and females, what test would you use? +4. If we wanted to determine if the political party affiliation differed for males and females, what test would we use? a. Goodness of fit test (`svygofchisq()`) b. Test of independence (`svychisq()`) @@ -546,7 +544,7 @@ tidy(exp_unit_out) Answer: The reference level should be `r expense_by_hut %>% slice(1) %>% pull(HousingUnitType) %>% as.character()`. All p-values are very small indicating there is a significant relationship between housing unit type and total energy expenditure. -2. Does temperature play a role in electricity expenditure (`DOLLAREL`)? Cooling degree days are a measure of how hot a place is. CDD65 for a given day indicates the number of degrees Fahrenheit warmer than 65°F (18.3°C) it is in a location. On a day that averages 65°F and below, CDD65=0. While a day that averages 85°F (29.4°C) would have CDD65=20 because it is 20 degrees Fahrenheit warmer. For each day in the year, this is summed to give an indicator of how hot the place is throughout the year. Similarly, HDD65 indicates the days colder than 65°F^[]. Can energy expenditure be predicted using these temperature indicators along with square footage? Is there a significant relationship? Include main effects and two-way interactions. +2. Does temperature play a role in electricity expenditure? Cooling degree days are a measure of how hot a place is. CDD65 for a given day indicates the number of degrees Fahrenheit warmer than 65°F (18.3°C) it is in a location. On a day that averages 65°F and below, CDD65=0. While a day that averages 85°F (29.4°C) would have CDD65=20 because it is 20 degrees Fahrenheit warmer [@eia-cdd]. For each day in the year, this is summed to give an indicator of how hot the place is throughout the year. Similarly, HDD65 indicates the days colder than 65°F. Can energy expenditure be predicted using these temperature indicators along with square footage? Is there a significant relationship? Include main effects and two-way interactions. ```{r} #| label: model-ex-solution2 @@ -601,7 +599,7 @@ temps_sqft_exp_fit %>% theme_minimal() ``` -4. Early voting expanded in 2020^[]. Build a logistic model predicting early voting in 2020 (`EarlyVote2020`) using age (`Age`), education (`Education`), and party identification (`PartyID`). Include two-way interactions. +4. Early voting expanded in 2020 [@npr-voting-trend]. Build a logistic model predicting early voting in 2020 (`EarlyVote2020`) using age (`Age`), education (`Education`), and party identification (`PartyID`.) Include two-way interactions. Answer: ```{r} @@ -644,7 +642,8 @@ Answer: We predict that the 28 year old with a graduate degree who identifies as ## 10 - Specifying sample designs and replicate weights in {srvyr} {-} -1. The National Health Interview Survey (NHIS) is an annual household survey conducted by the National Center for Health Statistics (NCHS). 
The NHIS includes a wide variety of health topics for adults including health status and conditions, functioning and disability, health care access and health service utilization, health-related behaviors, health promotion, mental health, barriers to care, and community engagement. Like many national in-person surveys, the sampling design is a stratified clustered design with details included in the Survey Description [@nhis-svy-des]. The Survey Description provides information on setting up syntax in SUDAAN, Stata, SPSS, SAS, and R ({survey} package implementation). You have imported the data and the variable containing the data is: `nhis_adult_data`. How would you specify the design using {srvyr} using either `as_survey_design()` or `as_survey_rep()`? + +1. The National Health Interview Survey (NHIS) is an annual household survey conducted by the National Center for Health Statistics (NCHS.) The NHIS includes a wide variety of health topics for adults including health status and conditions, functioning and disability, health care access and health service utilization, health-related behaviors, health promotion, mental health, barriers to care, and community engagement. Like many national in-person surveys, the sampling design is a stratified clustered design with details included in the Survey Description [@nhis-svy-des]. The Survey Description provides information on setting up syntax in SUDAAN, Stata, SPSS, SAS, and R ({survey} package implementation.) We have imported the data and the variable containing the data as: `nhis_adult_data`. How would we specify the design using either `as_survey_design()` or `as_survey_rep()`? Answer: @@ -660,7 +659,7 @@ nhis_adult_des <- nhis_adult_data %>% ) ``` -2. The General Social Survey is a survey that has been administered since 1972 on social, behavioral, and attitudinal topics. The 2016-2020 GSS Panel codebook provides examples of setting up syntax in SAS and Stata but not R [@gss-codebook]. You have imported the data and the variable containing the data is: `gss_data`. How would you specify the design in R using either `as_survey_design()` or `as_survey_rep()`? +2. The General Social Survey is a survey that has been administered since 1972 on social, behavioral, and attitudinal topics. The 2016-2020 GSS Panel codebook provides examples of setting up syntax in SAS and Stata but not R [@gss-codebook]. We have imported the data and the variable containing the data as: `gss_data`. How would we specify the design in R using either `as_survey_design()` or `as_survey_rep()`? Answer: @@ -675,7 +674,7 @@ gss_des <- gss_data %>% ## 13 - National Crime Victimization Survey Vignette {-} -1. What proportion of completed motor vehicle thefts are **not** reported to the police? Hint: Use the codebook to look at the definition of Type of Crime (V4529). +1. What proportion of completed motor vehicle thefts are **not** reported to the police? Hint: Use the codebook to look at the definition of Type of Crime (V4529.) ```{r} #| label: ncvs-vign-ex-solution1 @@ -750,8 +749,7 @@ Answer: The difference between male and female victimization rate is estimated a ## 14 - AmericasBarometer Vignette {-} -1. Calculate the percentage of households with broadband internet in and those with any internet at home, including from a phone or tablet in Latin America and the Caribbean. Hint: if you come across countries with 0% internet usage, you may want to filter by something first. - +1. 
Calculate the percentage of households with broadband internet and those with any internet at home, including from a phone or tablet in Latin America and the Caribbean. Hint: if there are countries with 0% internet usage, try filtering by something first. Answer: ```{r} diff --git a/99-references.Rmd b/99-references.Rmd index 11bf8eda..43dce1df 100644 --- a/99-references.Rmd +++ b/99-references.Rmd @@ -43,6 +43,10 @@ our_write_bib <- function (x = .packages(), file = "", tweak = TRUE, width = NUL "\\1", cite$title) cite$title = gsub(pkg, paste0("{", pkg, "}"), cite$title) cite$title = gsub("\\b(R)\\b", "{R}", cite$title) + cite$title = gsub("\\b(ggplot2)\\b", "{ggplot2}", cite$title) + cite$title = gsub("\\b(dplyr)\\b", "{dplyr}", cite$title) + cite$title = gsub("\\b(tidyverse)\\b", "{tidyverse}", cite$title) + cite$title = gsub("\\b(sf)\\b", "{sf}", cite$title) cite$title = gsub(" & ", " \\\\& ", cite$title) } entry = toBibtex(cite) @@ -58,8 +62,8 @@ our_write_bib <- function (x = .packages(), file = "", tweak = TRUE, width = NUL bib = lapply(bib, function(b) { b["author"] = sub("Duncan Temple Lang", "Duncan {Temple Lang}", b["author"]) - b["title"] = gsub("(^|\\W)'([^']+)'(\\W|$)", "\\1\\2\\3", - b["title"]) + # b["title"] = gsub("(^|\\W)'([^']+)'(\\W|$)", "\\1\\2\\3", + # b["title"]) if (!is.na(b["note"])) b["note"] = gsub("(^.*?https?://.*?),\\s+https?://.*?(},\\s*)$", "\\1\\2", b["note"]) diff --git a/book.bib b/book.bib index 28da04a3..a466ba30 100644 --- a/book.bib +++ b/book.bib @@ -296,7 +296,7 @@ @misc{recs-2020-meth } @misc{anes-2020-tech, title = {{Methodology Report for the ANES 2020 Time Series Study}}, - author = {{DeBell, Matthew and Amsbary, Michelle and Brader, Ted and Brock, Shelley and Good, Cindy and Kamens, Justin and Maisel, Natalya and Pinto, Sarah}}, + author = {DeBell, Matthew and Amsbary, Michelle and Brader, Ted and Brock, Shelley and Good, Cindy and Kamens, Justin and Maisel, Natalya and Pinto, Sarah}, year = 2022, howpublished = {\url{https://electionstudies.org/wp-content/uploads/2022/08/anes_timeseries_2020_methodology_report.pdf}} } @@ -494,4 +494,56 @@ @misc{gss-codebook editor = {NORC, Chicago}, year = 2021, howpublished = {\url{https://gss.norc.org/Documents/codebook/2016-2020%20GSS%20Panel%20Codebook%20-%20R1a.pdf}} -} \ No newline at end of file +} + +@Book{ggplot2wickham, + author = {Hadley Wickham}, + title = {{ggplot2}: Elegant Graphics for Data Analysis}, + publisher = {Springer-Verlag New York}, + year = {2016}, + isbn = {978-3-319-24277-4}, + url = {https://ggplot2.tidyverse.org}, +} + +@Article{gtsummarysjo, + author = {Daniel D. Sjoberg and Karissa Whiting and Michael Curry and Jessica A. 
Lavery and Joseph Larmarange}, + title = {Reproducible Summary Tables with the {gtsummary} Package}, + journal = {{The R Journal}}, + year = {2021}, + url = {https://doi.org/10.32614/RJ-2021-053}, + doi = {10.32614/RJ-2021-053}, + volume = {13}, + issue = {1}, + pages = {570-580}, +} + +@Article{targetslandau, + title = {The {targets} {R} package: a dynamic {Make}-like function-oriented pipeline toolkit for reproducibility and high-performance computing}, + author = {William Michael Landau}, + journal = {Journal of Open Source Software}, + year = {2021}, + volume = {6}, + number = {57}, + pages = {2959}, + url = {https://doi.org/10.21105/joss.02959}, +} + +@Article{jsonliteooms, + title = {The {jsonlite} Package: A Practical and Consistent Mapping Between JSON Data and {R} Objects}, + author = {Jeroen Ooms}, + journal = {arXiv:1403.2805 [stat.CO]}, + year = {2014}, + url = {https://arxiv.org/abs/1403.2805}, +} + +@Article{visdattierney, + title = {{visdat}: Visualising Whole Data Frames}, + author = {Nicholas Tierney}, + doi = {10.21105/joss.00355}, + url = {http://dx.doi.org/10.21105/joss.00355}, + year = {2017}, + journal = {Journal of Open Source Software}, + volume = {2}, + number = {16}, + pages = {355} +} diff --git a/index.Rmd b/index.Rmd index 5c2f29c0..a0712681 100644 --- a/index.Rmd +++ b/index.Rmd @@ -16,12 +16,9 @@ github-repo: tidy-survey-r/tidy-survey-book graphics: yes #cover-image: images/cover.jpg header-includes: - - \usepackage{draftwatermark} - \usepackage[titles]{tocloft} --- -\SetWatermarkText{DRAFT} - ```{r setup} #| include: false diff --git a/renv.lock b/renv.lock index 287e5eb6..0b75bd4f 100644 --- a/renv.lock +++ b/renv.lock @@ -2025,18 +2025,18 @@ }, "srvyrexploR": { "Package": "srvyrexploR", - "Version": "0.0.0.9000", + "Version": "1.0.0", "Source": "GitHub", "RemoteType": "github", - "RemoteHost": "api.github.com", - "RemoteRepo": "srvyrexploR", "RemoteUsername": "tidy-survey-r", + "RemoteRepo": "srvyrexploR", "RemoteRef": "HEAD", - "RemoteSha": "914fc0fd0b7812d7d7260e15da882561602b21d2", + "RemoteSha": "e03f36c51c34f7d0f1a036246a15d3ed67806b4f", + "RemoteHost": "api.github.com", "Requirements": [ "R" ], - "Hash": "3586abacf9e95b432824b9e9e60037d0" + "Hash": "30a1302b8eabd8d1a72228c799794665" }, "stringi": { "Package": "stringi",