diff --git a/03-survey-data-documentation.Rmd b/03-survey-data-documentation.Rmd index 7c98540..8a1e3b9 100644 --- a/03-survey-data-documentation.Rmd +++ b/03-survey-data-documentation.Rmd @@ -13,7 +13,7 @@ library(tidyverse) Survey documentation helps us prepare before we look at the actual survey data. The documentation includes technical guides, questionnaires, codebooks, errata, and other useful resources. By taking the time to review these materials, we can gain a comprehensive understanding of the survey data (including research and design decisions discussed in Chapters \@ref(c02-overview-surveys) and \@ref(c10-sample-designs-replicate-weights)) and conduct our analysis more effectively. -Survey documentation can vary in organization, type, and ease of use. The information may be stored in any format - PDFs, Excel spreadsheets, Word documents, and so on. Some surveys bundle documentation together, such as providing the codebook and questionnaire in a single document. Others keep them in separate files. Despite these variations, we can gain a general understanding of the documentation types and what aspects to focus on in each. +Survey documentation can vary in organization, type, and ease of use. The information may be stored in any format---PDFs, Excel spreadsheets, Word documents, and so on. Some surveys bundle documentation together, such as providing the codebook and questionnaire in a single document. Others keep them in separate files. Despite these variations, we can gain a general understanding of the documentation types and what aspects to focus on in each. ## Types of survey documentation @@ -30,9 +30,9 @@ The technical documentation may include other helpful resources. For example, so ### Questionnaires -\index{Questionnaire|(}A questionnaire is a series of questions used to collect information from people in a survey. 
It can ask about opinions, behaviors, demographics, or even just numbers like the count of lightbulbs, square footage, or farm size. Questionnaires can employ different types of questions, such as closed-ended (e.g., select one or check all that apply), open-ended (e.g., numeric or text), Likert scales (e.g., a 5- or 7-point scale specifying a respondent's level of agreement to a statement), or ranking questions (e.g., a list of options that a respondent ranks by preference.) It may randomize the display order of responses or include instructions that help respondents understand the questions. A survey may have one questionnaire or multiple, depending on its scale and scope. +\index{Questionnaire|(}A questionnaire is a series of questions used to collect information from people in a survey. It can ask about opinions, behaviors, demographics, or even just numbers like the count of lightbulbs, square footage, or farm size. Questionnaires can employ different types of questions, such as closed-ended (e.g., select one or check all that apply), open-ended (e.g., numeric or text), Likert scales (e.g., a 5- or 7-point scale specifying a respondent's level of agreement to a statement), or ranking questions (e.g., a list of options that a respondent ranks by preference). It may randomize the display order of responses or include instructions that help respondents understand the questions. A survey may have one questionnaire or multiple, depending on its scale and scope. -The questionnaire is another important resource for understanding and interpreting the survey data (see Section \@ref(overview-design-questionnaire)), and we should use it alongside any analysis. It provides details about each of the questions asked in the survey, such as question name, question wording, response options, skip logic, randomizations, display specifications, mode differences, and the universe (the subset of respondents who were asked a question.) 
+The questionnaire is another important resource for understanding and interpreting the survey data (see Section \@ref(overview-design-questionnaire)), and we should use it alongside any analysis. It provides details about each of the questions asked in the survey, such as question name, question wording, response options, skip logic, randomizations, display specifications, mode differences, and the universe (the subset of respondents who were asked a question). \index{American National Election Studies (ANES)|(} In Figure \@ref(fig:understand-que-examp), we show an example from the American National Election Studies (ANES) 2020 questionnaire [@anes-svy]. The figure shows the question name (`POSTVOTE_RVOTE`), description (Did R Vote?), full wording of the question and responses, response order, universe, question logic (this question was only asked if `vote_pre` = 0), and other specifications. The section also includes the variable name, which we can link to the codebook. @@ -40,7 +40,7 @@ In Figure \@ref(fig:understand-que-examp), we show an example from the American ```{r} #| label: understand-que-examp #| echo: false -#| fig.cap: ANES 2020 Questionnaire Example +#| fig.cap: ANES 2020 questionnaire example #| fig.alt: Question information about the variable postvote_rvote from ANES 2020 questionnaire Survey question, Universe, Logic, Web Spec, Response Order, and Released Variable are included. knitr::include_graphics(path = "images/questionnaire-example.jpg") @@ -53,7 +53,7 @@ The content and structure of questionnaires vary depending on the specific surve ```{r} #| label: understand-que-examp-2 #| echo: false -#| fig.cap: BRFSS 2021 Questionnaire Example +#| fig.cap: BRFSS 2021 questionnaire example #| fig.alt: Question information about the variable BPHIGH6 from BRFSS 2021 questionnaire. Question number, question text, variable names, responses, skip info and CATI note, interviewer notes, and columns are included. 
knitr::include_graphics(path = "images/questionnaire-example-2.jpg") @@ -64,7 +64,7 @@ knitr::include_graphics(path = "images/questionnaire-example-2.jpg") ### Codebooks \index{Missing data|(} \index{Codebook|(} \index{Data dictionary|see {Codebook}} -While a questionnaire provides information about the questions posed to respondents, the codebook explains how the survey data were coded and recorded. It lists details such as variable names, variable labels, variable meanings, codes for missing data, value labels, and value types (whether categorical, continuous, etc.) The codebook helps us understand and use the variables appropriately in our analysis. In particular, the codebook (as opposed to the questionnaire) often includes information on missing data. Note that the term *data dictionary* is sometimes used interchangeably with codebook, but a data dictionary may include more details on the structure and elements of the data. +While a questionnaire provides information about the questions posed to respondents, the codebook explains how the survey data were coded and recorded. It lists details such as variable names, variable labels, variable meanings, codes for missing data, value labels, and value types (whether categorical, continuous, etc.). The codebook helps us understand and use the variables appropriately in our analysis. In particular, the codebook (as opposed to the questionnaire) often includes information on missing data. Note that the term *data dictionary* is sometimes used interchangeably with codebook, but a data dictionary may include more details on the structure and elements of the data. 
\index{Missing data|)} \index{American National Election Studies (ANES)|(} @@ -73,7 +73,7 @@ Figure \@ref(fig:understand-codebook-examp) is a question from the ANES 2020 cod ```{r} #| label: understand-codebook-examp #| echo: false -#| fig.cap: ANES 2020 Codebook Example +#| fig.cap: ANES 2020 codebook example #| fig.alt: Variable information about the variable V202066 from ANES 2020 questionnaire Variable meaning, Value labels, Universe, and Survey Question(s) are included. knitr::include_graphics(path="images/codebook-example.jpg") @@ -98,13 +98,13 @@ Survey documentation may include additional material, such as interviewer instru ## Missing data coding \index{Missing data|(} -Some observations in a dataset may have missing data. This can be due to design or nonresponse, and these concepts are detailed in Chapter \@ref(c11-missing-data). In that chapter, we also discuss how to analyze data with missing values. This chapter walks through how to understand documentation related to missing data. +Some observations in a dataset may have missing data. This can be due to design or non-response, and these concepts are detailed in Chapter \@ref(c11-missing-data). In that chapter, we also discuss how to analyze data with missing values. This chapter walks through how to understand documentation related to missing data. \index{Codebook|(} The survey documentation, often the codebook, represents the missing data with a code. The codebook may list different codes depending on why certain data points are missing. In the example of variable `V202066` from the ANES (Figure \@ref(fig:understand-codebook-examp)), `-9` represents "Refused," `-7` means that the response was deleted due to an incomplete interview, `-6` means that there is no response because there was no follow-up interview, and `-1` means "Inapplicable" (due to a designed skip pattern.) 
\index{National Crime Victimization Survey (NCVS)|(} -As another example, there may be a summary variable that describes the missingness of a set of variables - particularly with "select all that apply" or "multiple response" questions. In the National Crime Victimization Survey (NCVS), respondents who are victims of a crime and saw the offender are asked if the offender had a weapon and then asked what the type of weapon was. This part of the questionnaire from 2021 is shown in Figure \@ref(fig:understand-ncvs-weapon-q) [@ncvs_survey_2020]. +As another example, there may be a summary variable that describes the missingness of a set of variables---particularly with "select all that apply" or "multiple response" questions. In the National Crime Victimization Survey (NCVS), respondents who are victims of a crime and saw the offender are asked if the offender had a weapon and then asked what the type of weapon was. This part of the questionnaire from 2021 is shown in Figure \@ref(fig:understand-ncvs-weapon-q) [@ncvs_survey_2020]. ```{r} #| label: understand-ncvs-weapon-q @@ -115,7 +115,7 @@ As another example, there may be a summary variable that describes the missingne knitr::include_graphics(path="images/questionnaire-ncvs-weapon.jpg") ``` -The NCVS codebook includes coding for all multiple response variables of a "lead in" variable that summarizes the individual options. For question 23a on the weapon type, the lead-in variable is V4050, which is shown in Figure \@ref(fig:understand-ncvs-weapon-cb) [@ncvs_cb_2020]. This variable is then followed by a set of variables for each weapon type. An example of one of the individual variables from the codebook, the handgun, is shown in Figure \@ref(fig:understand-ncvs-weapon-cb-hg) [@ncvs_cb_2020]. We will dive into how to analyze this variable in Chapter \@ref(c11-missing-data). 
+For these multiple response variables (select all that apply), the NCVS codebook includes what they call a "lead-in" variable that summarizes the response. This lead-in variable provides metadata on how a respondent answered the question. For example, for question 23a on the weapon type, the lead-in variable V4050 (shown in Figure \@ref(fig:understand-ncvs-weapon-cb)) indicates the quality and type of response [@ncvs_cb_2020]. In the codebook, this variable is then followed by a set of variables for each weapon type. An example of one of the individual variables from the codebook, the handgun (V4051), is shown in Figure \@ref(fig:understand-ncvs-weapon-cb-hg) [@ncvs_cb_2020]. We will dive into how to analyze this variable in Chapter \@ref(c11-missing-data). ```{r} #| label: understand-ncvs-weapon-cb @@ -137,14 +137,14 @@ knitr::include_graphics(path="images/codebook-ncvs-weapon-handgun.jpg") When data are read into R, some values may be system missing, that is they are coded as `NA` even if that is not evident in a codebook. We discuss in Chapter \@ref(c11-missing-data) how to analyze data with `NA` values and review how R handles missing data in calculations. \index{National Crime Victimization Survey (NCVS)|)} \index{Missing data|)} \index{Codebook|)} -## Example: American National Election Studies (ANES) 2020 survey documentation +## Example: ANES 2020 survey documentation \index{American National Election Studies (ANES)|(} -Let's look at the survey documentation for the American National Election Studies (ANES) 2020 and the documentation from their [website](https://electionstudies.org/data-center/2020-time-series-study/). Navigating to "User Guide and Codebook" [@anes-cb], we can download the PDF that contains the survey documentation, titled "ANES 2020 Time Series Study Full Release: User Guide and Codebook". Do not be daunted by the 796-page PDF. Below, we focus on the most critical information. 
+Let's look at the survey documentation for the ANES 2020, available on their [website](https://electionstudies.org/data-center/2020-time-series-study/). Navigating to "User Guide and Codebook" [@anes-cb], we can download the PDF that contains the survey documentation, titled "ANES 2020 Time Series Study Full Release: User Guide and Codebook." Do not be daunted by the 796-page PDF. Below, we focus on the most critical information. #### Introduction {-} -The first section in the User Guide explains that the ANES 2020 Times Series Study continues a series of election surveys conducted since 1948. These surveys contain data on public opinion and voting behavior in the U.S. presidential elections. \index{Mode|(}The introduction also includes information about the modes used for data collection (web, live video interviewing, or CATI.)\index{Mode|)} Additionally, there is a summary of the number of pre-election interviews (8,280) and post-election re-interviews (7,449.) +The first section in the User Guide explains that the ANES 2020 Time Series Study continues a series of election surveys conducted since 1948. These surveys contain data on public opinion and voting behavior in the U.S. presidential elections. \index{Mode|(}The introduction also includes information about the modes used for data collection (web, live video interviewing, or CATI).\index{Mode|)} Additionally, there is a summary of the number of pre-election interviews (8,280) and post-election re-interviews (7,449). 
#### Sample design and respondent recruitment {-} @@ -164,14 +164,14 @@ The document provides more information about the design variables, summarized in Table: (\#tab:aneswgts) Weight and variance information for ANES -For weight | Use variance unit/PSU/cluster | and use variance stratum +For weight | Variance unit/cluster | Variance stratum :-----------:|:-----------:|:-----------: V200010a| V200010c| V200010d V200010b| V200010c| V200010d ### Methodology {-} -The user guide mentions a supplemental document called "How to Analyze ANES Survey Data" [@debell] as a 'how-to guide' for analyzing the data. In this document, we learn more about the weights, where we learn that they sum to the sample size and not the population. If our goal is to calculate estimates for the entire U.S. population instead of just the sample, we must adjust the weights to the U.S. population. To create accurate weights for the population, we need to determine the total population size at the time of the survey. Let's review the "Sample Design and Respondent Recruitment" section for more details: +The user guide mentions a supplemental document called "How to Analyze ANES Survey Data" [@debell] as a how-to guide for analyzing the data. In this document, we learn more about the weights, including that they sum to the sample size rather than the population size. If our goal is to calculate estimates for the entire U.S. population instead of just the sample, we must adjust the weights to the U.S. population. To create accurate weights for the population, we need to determine the total population size at the time of the survey. Let's review the "Sample Design and Respondent Recruitment" section for more details: > The target population for the fresh cross-section was the 231 million non-institutional U.S. citizens aged 18 or older living in the 50 U.S. states or the District of Columbia.
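
As a rough illustration of the adjustment described in the Methodology paragraph, the sketch below rescales weights that sum to the sample size so that they sum to a population total instead. The data frame `toy`, its column names, and the three weight values are made up for illustration and are not ANES data. The value of `target_pop` simply plugs in the 231 million figure from the quoted passage as a stand-in for the population total, which may not be the figure a real analysis should use.

```r
library(dplyr)

# Hypothetical toy data: three respondents whose weights sum to the
# sample size (n = 3), mimicking weights that are scaled to the sample.
toy <- tibble(
  id = 1:3,
  weight_sample = c(0.5, 1.0, 1.5)
)

# Assumed population total: the 231 million from the quoted passage,
# used here only as a placeholder.
target_pop <- 231e6

# Rescale each weight by its share of the total weight, then multiply
# by the population total so the new weights sum to target_pop.
toy_adj <- toy %>%
  mutate(weight_pop = weight_sample / sum(weight_sample) * target_pop)

# Check: the adjusted weights should sum to the population total
# (up to floating-point error).
all.equal(sum(toy_adj$weight_pop), target_pop)
```

Because every weight is multiplied by the same constant, each respondent's relative contribution is unchanged. Estimates of means and proportions are therefore unaffected, while estimated totals move to the population scale.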