diff --git a/404.html b/404.html index c3b92ef6..83ddc72c 100644 --- a/404.html +++ b/404.html @@ -23,7 +23,7 @@ - + @@ -229,7 +229,7 @@
Surveys are valuable tools for gathering information about a population, and are used by researchers, governments, and businesses alike to better understand public opinion and behaviors. For example, a non-profit group may analyze societal trends to measure their impact, government agencies may study behaviors to inform policy, or companies may seek to learn customer product preferences to refine business strategy. With survey data, we can explore the world around us.
-Surveys are often conducted with a sample of the population. Therefore, in order to use the survey data to understand the population, we use weights to adjust the survey results for unequal probabilities of selection, non-response, and post-stratification. These adjustments ensure the sample accurately represents the population of interest (Gard et al. 2023). To account for the intricate nature of the survey design, analysts rely on statistical software such as SAS, Stata, SUDAAN, and R.
-In this book, we focus on R to introduce survey analysis. Our goal is to provide a comprehensive guide for individuals new to survey analysis but with some familiarity with statistics and R programming. We use a combination of the {survey} and {srvyr} packages and present the code following best practices from the tidyverse and assume weights have already been calculated and are available (Freedman Ellis and Schneider 2023; Lumley 2010; Wickham et al. 2019).
+Surveys are valuable tools for gathering information about a population. Researchers, governments, and businesses use surveys to better understand public opinion and behaviors. For example, a non-profit group may analyze societal trends to measure their impact, government agencies may study behaviors to inform policy, or companies may seek to learn customer product preferences to refine business strategy. With survey data, we can explore the world around us.
+Surveys are often conducted with a sample of the population. Therefore, to use the survey data to understand the population, we use weights to adjust the survey results for unequal probabilities of selection, non-response, and post-stratification. These adjustments ensure the sample accurately represents the population of interest (Gard et al. 2023). To account for the intricate nature of the survey design, analysts rely on statistical software such as SAS, Stata, SUDAAN, and R.
+In this book, we focus on R to introduce survey analysis. Our goal is to provide a comprehensive guide for individuals new to survey analysis but with some familiarity with statistics and R programming. We use a combination of the {survey} and {srvyr} packages and present the code following best practices from the tidyverse (Freedman Ellis and Schneider 2023; Lumley 2010; Wickham et al. 2019).
The {survey} package was released on the Comprehensive R Archive Network (CRAN) in 2003 and has been continuously developed over time. This package, primarily authored by Thomas Lumley, offers an extensive array of features, including:
@@ -531,13 +531,13 @@%>%
), which seamlessly works with functions from the {srvyr} package. Moreover, several common functions from {dplyr}, such as filter()
, mutate()
, and summarize()
, can be applied to survey objects (Wickham et al. 2023). This enables users to streamline their analysis workflow and leverage the benefits of both the {srvyr} and {tidyverse} packages.
-While the {srvyr} package offers many advantages, there is one notable limitation: it doesn’t fully incorporate the modeling capabilities of the {survey} package into tidy wrappers. When discussing modeling and hypothesis testing, we primarily rely on the {survey} package. However, we guide you on how to apply the pipe operator to these functions to maintain clarity and consistency in your analyses.
+The {srvyr} package builds on the {survey} package by providing wrappers for functions that align with the tidyverse philosophy. This is our motivation for using and recommending the {srvyr} package. We find that it is user-friendly for those familiar with the tidyverse packages in R.
+For example, while many functions in the {survey} package access variables through formulas, the {srvyr} package uses tidy selection to pass variable names, a common feature in the tidyverse (Henry and Wickham 2022). Users of the tidyverse are also likely familiar with the magrittr pipe operator (%>%
), which seamlessly works with functions from the {srvyr} package. Moreover, several common functions from {dplyr}, such as filter()
, mutate()
, and summarize()
, can be applied to survey objects (Wickham et al. 2023). This enables users to streamline their analysis workflow and leverage the benefits of both the {srvyr} and {tidyverse} packages.
While the {srvyr} package offers many advantages, there is one notable limitation: it doesn’t fully incorporate the modeling capabilities of the {survey} package into tidy wrappers. When discussing modeling and hypothesis testing, we primarily rely on the {survey} package. However, we provide information on how to apply the pipe operator to these functions to maintain clarity and consistency in analyses.
This book covers many aspects of survey design and analysis, from understanding how to create design objects to conducting descriptive analysis, statistical tests, and models. We emphasize coding best practices and effective presentation techniques while using real-world data and practical examples to help you gain proficiency in survey analysis.
+This book covers many aspects of survey design and analysis, from understanding how to create design objects to conducting descriptive analysis, statistical tests, and models. We emphasize coding best practices and effective presentation techniques while using real-world data and practical examples to help readers gain proficiency in survey analysis.
Below is a summary of each chapter:
The majority of chapters contain code that you can follow. Each of these chapters starts with a “set-up” section, which includes the code needed to load the packages and datasets. We then provide the main idea of the chapter and examples of how to use the functions. Most chapters conclude with exercises to work through. We provide the solutions to the exercises in the online version of the book, available at tidy-survey-r.github.io.
-While we provide a brief overview of survey methodology and statistical theory, this book is not intended to be the sole resource for these topics. We reference other materials throughout the book and encourage readers to seek those out for more information.
+The majority of chapters contain code that readers can follow. Each of these chapters starts with a “Prerequisites” section, which includes the code needed to load the packages and datasets used in the chapter. We then provide the main idea of the chapter and examples of how to use the functions. Most chapters conclude with exercises to work through. We provide the solutions to the exercises in the online version of the book.
+While we provide a brief overview of survey methodology and statistical theory, this book is not intended to be the sole resource for these topics. We reference other materials and encourage readers to seek them out for more information.
To get the most of our this book, we assume that you have already conducted a survey and have the data or obtained a microdata file. Microdata, also known as respondent-level or row-level data, differs from summarized data typically found in tables. It contains individual survey responses, along with analysis weights and design variables such as strata or clusters.
-Additionally, the survey data should already include weights and design variables. These are required to accurately calculate unbiased estimates. The concepts and techniques discussed in this book will help you to extract meaningful insights from your survey data, but will not cover how to create weights in the first place as this is a separate complex topic. If you do not already have weights created for the survey data you are using, we recommend reviewing other resources focused on weight creation such as Valliant and Dever (2018).
-This book is tailored for analysts already familiar with R and the tidyverse but who may be new to complex survey analysis in R. We anticipate that readers of this book can:
+To get the most out of our this book, we assume a survey has already been conducted and readers have obtained a microdata file. Microdata, also known as respondent-level or row-level data, differs from summarized data typically found in tables. They contain individual survey responses, along with analysis weights and design variables such as strata or clusters.
+Additionally, the survey data should already include weights and design variables. These are required to accurately calculate unbiased estimates. The concepts and techniques discussed in this book help readers to extract meaningful insights from survey data, but do not cover how to create weights as this is a separate complex topic. If weights are not already created for the survey data, we recommend reviewing other resources focused on weight creation such as Valliant and Dever (2018).
+This book is tailored for analysts already familiar with R and the tidyverse, but who may be new to complex survey analysis in R. We anticipate that readers of this book can:
If these concepts or skills are new, we recommend starting with introductory resources to cover these topics before reading this book. R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund 2023) is a beginner-friendly guide for getting started in data science using R. It offers guidance on preliminary installation steps, basic R syntax, and tidyverse concepts and packages.
We work with two key datasets throughout the book: the Residential Energy Consumption Survey (RECS – U.S. Energy Information Administration 2023b) and the American National Election Studies (ANES – DeBell 2010). We introduce and demonstrate the loading and preparation of these datasets in Chapter 4.
+We work with two key datasets throughout the book: the Residential Energy Consumption Survey (RECS – U.S. Energy Information Administration 2023b) and the American National Election Studies (ANES – DeBell 2010). We introduce the loading and preparation of these datasets in Chapter 4.
Throughout the book, we use the following typographical conventions:
survey_mean()
anes_des
survey_mean()
anes_des
We recommend first trying to resolve errors and issues independently using the tips provided in Chapter 12.
-If you have questions or face issues while working through the book, please report them to its GitHub repository.
There are several community forums for asking questions, including:
Please report any bugs and issues to the book’s GitHub repository.
We would like to thank Holly Cast, Greg Freedman Ellis, Joe Murphy, and Sheila Saia for their reviews of the initial draft. Their detailed and honest feedback helped to make this book considerably better, and we are grateful for their input. Additionally, this book started from two short courses. The first at the Annual Conference for the American Association for Public Opinion Research (AAPOR) and the second as a series of webinars for the Midwest Association of Public Opinion Research (MAPOR). We would like to also thank those that assisted us by moderating breakout rooms and answering questions from attendees: Greg Freedman Ellis, Raphael Nishimura, and Benjamin Schneider.
+We would like to thank Holly Cast, Greg Freedman Ellis, Joe Murphy, and Sheila Saia for their reviews of the initial draft. Their detailed and honest feedback helped improve this book, and we are grateful for their input. Additionally, this book started with two short courses. The first was at the Annual Conference for the American Association for Public Opinion Research (AAPOR) and the second was a series of webinars for the Midwest Association of Public Opinion Research (MAPOR.) We would like to also thank those who assisted us by moderating breakout rooms and answering questions from attendees: Greg Freedman Ellis, Raphael Nishimura, and Benjamin Schneider.
This book was written in bookdown using RStudio. The complete source is available on GitHub: https://github.com/tidy-survey-r/tidy-survey-book.
+This book was written in bookdown using RStudio. The complete source is available on GitHub.
This version of the book was built with R version 4.3.1 (2023-06-16) and with the packages listed in Table 1.1.
-Developing surveys to gather accurate information about populations often involves a intricate and time-intensive process. Researchers can spend months, or even years, developing the study design, questions, and other methods for a single survey to ensure high-quality data is collected.
-Prior to analyzing survey data, we recommend understanding the entire survey life cycle. This understanding can provide a better insight into what types of analyses should be conducted on the data. The survey life cycle consists of the necessary stages to execute a survey project successfully. Each stage influences the survey’s timing, costs, and feasibility, consequently impacting the data collected and how we should analyze it. Figure 2.1 shows a high level view of the survey process and this chapter gives an overview of each step.
+Developing surveys to gather accurate information about populations involves an intricate and time-intensive process. Researchers can spend months, or even years, developing the study design, questions, and other methods for a single survey to ensure high-quality data is collected.
+Before analyzing survey data, we recommend understanding the entire survey life cycle. This understanding can provide better insight into what types of analyses should be conducted on the data. The survey life cycle consists of the necessary stages to execute a survey project successfully. Each stage influences the survey’s timing, costs, and feasibility, consequently impacting the data collected and how we should analyze it. Figure 2.1 shows a high-level overview of the survey process.
The survey life cycle starts with a research topic or question of interest (e.g., what impact does childhood trauma have on health outcomes later in life). Researchers typically review existing data sources to determine if data are already available that can address this question, as drawing from available resources can result in a reduced burden on respondents, cheaper research costs, and faster research outcomes. However, if existing data cannot answer the nuances of the research question, a survey can be used to capture the exact data that the researcher needs through a questionnaire, or a set of questions.
+The survey life cycle starts with a research topic or question of interest (e.g., what impact does childhood trauma have on health outcomes later in life.) Drawing from available resources can result in a reduced burden on respondents, cheaper research costs, and faster research outcomes. Therefore, we recommend reviewing existing data sources to determine if data that can address this question are already available. However, if existing data cannot answer the nuances of the research question, we can capture the exact data we need through a questionnaire, or a set of questions.
To gain a deeper understanding of survey design and implementation, we recommend reviewing several pieces of existing literature in detail (e.g., Biemer and Lyberg 2003; Bradburn, Sudman, and Wansink 2004; Dillman, Smyth, and Christian 2014; Groves et al. 2009; Tourangeau, Rips, and Rasinski 2000; Valliant, Dever, and Kreuter 2013).
Throughout this book, we use public-use datasets from different surveys, including the American National Election Survey (ANES), the Residential Energy Consumption Survey (RECS), the National Crime Victimization Survey (NCVS), and the AmericasBarometer surveys.
-As mentioned above, researchers should look for existing data that can provide insights into their research questions before embarking on a new survey. One of the greatest sources of data is the government. For example, in the U.S., we can get data directly from the various statistical agencies like with RECS and NCVS. Other countries often have data available through official statistics offices, such as the Office for National Statistics in the United Kingdom.
-In addition to government data, many researchers will make their data publicly available through repositories such as the Inter-university Consortium for Political and Social Research (ICPSR) variable search or the Odum Institute Data Archive. Searching these repositories or other compiled lists (e.g., Analyze Survey Data for Free) can be an efficient way to identify surveys with questions related to the researcher’s topic of interest.
+As mentioned above, we should look for existing data that can provide insights into our research questions before embarking on a new survey. One of the greatest sources of data is the government. For example, in the U.S., we can get data directly from the various statistical agencies such as the U.S. Energy Information Administration or Bureau of Justice Statistics. Other countries often have data available through official statistics offices, such as the Office for National Statistics in the United Kingdom.
+In addition to government data, many researchers make their data publicly available through repositories such as the Inter-university Consortium for Political and Social Research (ICPSR) or the Odum Institute Data Archive. Searching these repositories or other compiled lists (e.g., Analyze Survey Data for Free) can be an efficient way to identify surveys with questions related to our research topic.
Almost every survey has errors. Researchers attempt to conduct a survey that reduces the total survey error, or the accumulation of all errors that may arise throughout the survey life cycle. By assessing these different types of errors together, researchers can seek strategies to maximize the overall survey quality and improve the reliability and validity of results (Biemer 2010). However, attempts to reduce individual sources errors (and therefore total survey error) come at the price of time and money. For example:
@@ -563,13 +563,13 @@The set or group we want to survey is known as the population of interest or the target population. The population of interest could be broad, such as “all adults age 18+ living in the U.S.” or a specific population based on a particular characteristic or location. For example, we may want to know about “adults aged 18-24 who live in North Carolina” or “eligible voters living in Illinois.”
-However, a sampling frame with contact information is needed to survey individuals in these populations of interest. If researchers are looking at eligible voters, the sampling frame could be the voting registry for a given state or area. If researchers are looking at more board target populations like all adults in the United States, the sampling frame is likely imperfect. In these cases, a full list of individuals in the United States is not available for a sampling frame. Instead, researchers may choose to use a sampling frame of mailing addresses and send the survey to households, or they may choose to use random digit dialing (RDD) and call random phone numbers (that may or may not be assigned, connected, and working).
-These imperfect sampling frames can result in coverage error where there is a mismatch between the target population and the list of individuals researchers can select. For example, if a researcher is looking to obtain estimates for “all adults aged 18+ living in the U.S.”, a sampling frame of mailing addresses will miss specific types of individuals, such as the homeless, transient populations, and incarcerated individuals. Additionally, many households have more than one adult resident, so researchers would need to consider how to get a specific individual to fill out the survey (called within household selection) or adjust the target population to report on “U.S. households” instead of “individuals.”
-Once the researchers have selected the sampling frame, the next step is determining how to select individuals for the survey. In rare cases, researchers may conduct a census and survey everyone on the sampling frame. However, the ability to implement a questionnaire at that scale is something only some can do (e.g., government censuses). Instead, researchers typically choose to sample individuals and use weights to estimate numbers in the target population. They can use a variety of different sampling methods, and more information on these can be found in Chapter 10. This decision of which sampling method to use impacts sampling error and can be accounted for in weighting.
+However, a sampling frame with contact information is needed to survey individuals in these populations of interest. If we are looking at eligible voters, the sampling frame could be the voting registry for a given state or area. If we are looking at more board populations of interest, like all adults in the United States, the sampling frame is likely imperfect. In these cases, a full list of individuals in the United States is not available for a sampling frame. Instead, we may choose to use a sampling frame of mailing addresses and send the survey to households, or we may choose to use random digit dialing (RDD) and call random phone numbers (that may or may not be assigned, connected, and working.)
+These imperfect sampling frames can result in coverage error where there is a mismatch between the population of interest and the list of individuals we can select. For example, if we are looking to obtain estimates for “all adults aged 18+ living in the U.S.”, a sampling frame of mailing addresses will miss specific types of individuals, such as the homeless, transient populations, and incarcerated individuals. Additionally, many households have more than one adult resident, so we would need to consider how to get a specific individual to fill out the survey (called within household selection) or adjust the population of interest to report on “U.S. households” instead of “individuals.”
+Once we have selected the sampling frame, the next step is determining how to select individuals for the survey. In rare cases, we may conduct a census and survey everyone on the sampling frame. However, the ability to implement a questionnaire at that scale is something only few can do (e.g., government censuses.) Instead, we typically choose to sample individuals and use weights to estimate numbers in the population of interest. They can use a variety of different sampling methods, and more information on these can be found in Chapter 10. This decision of which sampling method to use impacts sampling error and can be accounted for in weighting.
Let’s use a simple example where a researcher is interested in the average number of pets in a household. Our researcher needs to consider the target population for this study. Specifically, are they interested in all households in a given country or households in a more local area (e.g., city or state)? Let’s assume our researcher is interested in the number of pets in a U.S. household with at least one adult (18 years old or older). In this case, a sampling frame of mailing addresses would introduce only a small amount of coverage error as the frame would closely match our target population. Specifically, our researcher would likely want to use the Computerized Delivery Sequence File (CDSF), which is a file of mailing addresses that the United States Postal Service (USPS) creates and covers nearly 100% of U.S. households (Harter et al. 2016). To sample these households, for simplicity, we use a stratified simple random sample design (see Chapter 10 for more information on sample designs), where we randomly sample households within each state (i.e., we stratify by state).
+Let’s use a simple example where we are interested in the average number of pets in a household. We need to consider the population of interest for this study. Specifically, are we interested in all households in a given country or households in a more local area (e.g., city or state)? Let’s assume we are interested in the number of pets in a U.S. household with at least one adult (18 years old or older.) In this case, a sampling frame of mailing addresses would introduce only a small amount of coverage error as the frame would closely match our population of interest. Specifically, we would likely want to use the Computerized Delivery Sequence File (CDSF), which is a file of mailing addresses that the United States Postal Service (USPS) creates and covers nearly 100% of U.S. households (Harter et al. 2016). To sample these households, for simplicity, we use a stratified simple random sample design (see Chapter 10 for more information on sample designs), where we randomly sample households within each state (i.e., we stratify by state.)
Throughout this chapter, we build on this example research question to plan a survey.
With the sampling design decided, researchers can then decide how to survey these individuals. Specifically, the modes used for contacting and surveying the sample, how frequently to send reminders and follow-ups, and the overall timeline of the study are four of the major data collection determinations. Traditionally, researchers have considered four main modes1:
+With the sampling design decided, researchers can then decide how to survey these individuals. Specifically, the modes used for contacting and surveying the sample, how frequently to send reminders and follow-ups, and the overall timeline of the study are four of the major data collection determinations. Traditionally, survey researchers have considered there to be four main modes1:
Researchers can use a single mode to collect data or multiple modes (also called mixed-modes). Using mixed-modes can allow for broader reach and increase response rates depending on the target population (Biemer et al. 2017; DeLeeuw 2005, 2018). For example, researchers could both call households to conduct a CATI survey and send mail with a PAPI survey to the household. Using both modes, researchers could gain participation through the mail from individuals who do not pick up the phone to unknown numbers or through the phone from individuals who do not open all of their mail. However, mode effects (where responses differ based on the mode of response) can be present in the data and may need to be considered during analysis.
-When selecting which mode, or modes, to use, understanding the unique aspects of the chosen target population and sampling frame provides insight into how they can best be reached and engaged. For example, if we plan to survey adults aged 18-24 who live in North Carolina, asking them to complete a survey using CATI (i.e., over the phone) would likely not be as successful as other modes like the web. This age group does not talk on the phone as much as other generations and often does not answer their phones for unknown numbers. Additionally, the mode for contacting respondents relies on what information is available in the sampling frame. For example, if our sampling frame includes an email address, we could email our selected sample members to convince them to complete a survey. Alternatively, if the sampling frame is a list of mailing addresses, we could contact sample members with a letter.
+We can use a single mode to collect data or multiple modes (also called mixed-modes.) Using mixed-modes can allow for broader reach and increase response rates depending on the population of interest (Biemer et al. 2017; DeLeeuw 2005, 2018). For example, we could both call households to conduct a CATI survey and send mail with a PAPI survey to the household. By using both modes, we could gain participation through the mail from individuals who do not pick up the phone to unknown numbers or through the phone from individuals who do not open all of their mail. However, mode effects (where responses differ based on the mode of response) can be present in the data and may need to be considered during analysis.
+When selecting which mode, or modes, to use, understanding the unique aspects of the chosen population of interest and sampling frame provides insight into how they can best be reached and engaged. For example, if we plan to survey adults aged 18-24 who live in North Carolina, asking them to complete a survey using CATI (i.e., over the phone) would likely not be as successful as other modes like the web. This age group does not talk on the phone as much as other generations and often does not answer their phones for unknown numbers. Additionally, the mode for contacting respondents relies on what information is available in the sampling frame. For example, if our sampling frame includes an email address, we could email our selected sample members to convince them to complete a survey. Alternatively, if the sampling frame is a list of mailing addresses, we could contact sample members with a letter.
It is important to note that there can be a difference between the contact and survey modes. For example, if we have a sampling frame with addresses, we can send a letter to our sample members and provide information on completing a web survey. Another option is using mixed-mode surveys by mailing sample members a paper and pencil survey but also including instructions to complete the survey online. Combining different contact modes and different survey modes can be helpful in reducing unit nonresponse error–where the entire unit (e.g., a household) does not respond to the survey at all–as different sample members may respond better to different contact and survey modes. However, when considering which modes to use, it is important to make access to the survey as easy as possible for sample members to reduce burden and unit nonresponse.
-Another way to reduce unit nonresponse error is by varying the language of the contact materials (Dillman, Smyth, and Christian 2014). People are motivated by different things, so constantly repeating the same message may not be helpful. Instead, mixing up the messaging and the type of contact material the sample member receives can increase response rates and reduce the unit nonresponse error. For example, instead of only sending standard letters, researchers could consider sending mailings that invoke “urgent” or “important” thoughts by sending priority letters or using other delivery services like FedEx, UPS, or DHL.
+Another way to reduce unit nonresponse error is by varying the language of the contact materials (Dillman, Smyth, and Christian 2014). People are motivated by different things, so constantly repeating the same message may not be helpful. Instead, mixing up the messaging and the type of contact material the sample member receives can increase response rates and reduce the unit nonresponse error. For example, instead of only sending standard letters, we could consider sending mailings that invoke “urgent” or “important” thoughts by sending priority letters or using other delivery services like FedEx, UPS, or DHL.
A study timeline may also determine the number and types of contacts. If the timeline is long, there is plentiful time for follow-ups and diversified messages in contact materials. If the timeline is short, then fewer follow-ups can be implemented. Many studies start with the tailored design method put forth by Dillman, Smyth, and Christian (2014) and implement five contacts:
Let’s return to our example of a researcher who wants to know the average number of pets in a household. We are using a sampling frame of mailing addresses, so we recommend starting our data collection with letters mailed to households, but later in data collection, we want to send interviewers to the house to conduct an in-person (or CAPI) interview to decrease unit nonresponse error. This means we have two contact modes (paper and in-person). As mentioned above, the survey mode does not have to be the same as the contact mode, so we recommend a mixed-mode study with both Web and CAPI modes. Let’s assume we have six months for data collection, so we may want to recommend the following protocol:
+Let’s return to our example of the average number of pets in a household. We are using a sampling frame of mailing addresses, so we recommend starting our data collection with letters mailed to households, but later in data collection, we want to send interviewers to the house to conduct an in-person (or CAPI) interview to decrease unit nonresponse error. This means we have two contact modes (paper and in-person.) As mentioned above, the survey mode does not have to be the same as the contact mode, so we recommend a mixed-mode study with both Web and CAPI modes. Let’s assume we have six months for data collection, so we could recommend the following protocol: