Missing data chapter (#96)
* Update missing data in C03, PH in C11

* Beginning of missing data chapter

* Adding naniar to renv, bib, and in text in set-up chapter.

* Updating draft of missing data chapter.

* Remove old data package reference

* Missing chapter edits SZ

---------

Co-authored-by: Stephanie Zimmer <[email protected]>
rpowell22 and szimmer authored Mar 3, 2024
1 parent 282627d commit 6b86502
Showing 11 changed files with 660 additions and 55 deletions.
61 changes: 60 additions & 1 deletion 01-introduction.Rmd
@@ -40,4 +40,63 @@ This book will cover many aspects of survey design and analysis, from understand
- **Chapter \@ref(c13-ncvs-vignette)**: National Crime Victimization Survey Vignette. A vignette on how to analyze data from the NCVS, a survey in the U.S. that collects information on crimes and their characteristics. This illustrates an analysis that requires multiple files to calculate victimization rates.
- **Chapter \@ref(c14-ambarom-vignette)**: AmericasBarometer Vignette. A vignette on how to analyze data from the AmericasBarometer, a survey of attitudes, evaluations, experiences, and behavior in countries in the Western Hemisphere. This includes how to make choropleth maps with survey estimates.

In most chapters, you'll find code that you can follow. Each of these chapters starts with a "setup" section. The setup section includes the code needed to load the necessary packages and datasets in the chapter. We then provide the main idea of the chapter and examples on how to use the functions. Most chapters end with exercises to work through. Solutions to the exercises can be found in the Appendix.

## Datasets used in this book

We work with two key datasets throughout the book: the Residential Energy Consumption Survey [RECS -- @recs-2020-tech] and the American National Election Studies [ANES -- @debell]. To ensure that all readers can follow the examples, we have provided analytic datasets in an R package, {srvyrexploR}. Install the package from GitHub using the {remotes} package.

```r
remotes::install_github("tidy-survey-r/srvyrexploR")
```

To explore the provided datasets in the package, access the documentation using the `help()` command.

```r
help(package="srvyrexploR")
```
To load the RECS and ANES datasets, start by running `library(srvyrexploR)` to load the package. Then, use the `data()` command to load the datasets into the environment.

```{r}
#| label: intro-setup
#| error: FALSE
#| warning: FALSE
#| message: FALSE
library(tidyverse)
library(survey)
library(srvyr)
library(srvyrexploR)
```

```{r}
#| label: intro-setup-readin
#| error: FALSE
#| warning: FALSE
#| message: FALSE
data(recs_2020)
data(anes_2020)
```

RECS is a study that provides energy consumption and expenditure data on American households. The Energy Information Administration funds RECS, which has been fielded 15 times between 1950 and 2020. The survey has two components: the household survey and the energy supplier survey. In 2020, the household survey was collected by web and paper questionnaires and included questions about appliances, electronics, heating, air conditioning (A/C), temperatures, water heating, lighting, respondent demographics, and energy assistance. The energy supplier survey consists of components relating to energy consumption and energy expenditure. Below is an overview of the `recs_2020` data:

```{r}
#| label: intro-recs
recs_2020 %>% select(-starts_with("NWEIGHT"))
recs_2020 %>% select(starts_with("NWEIGHT"))
```

From this output, we can see that there are `r nrow(recs_2020) %>% formatC(big.mark = ",")` rows and `r ncol(recs_2020) %>% formatC(big.mark = ",")` variables. The dataset contains an ID variable (`DOEID`), geographic information (e.g., `Region`, `state_postal`, `Urbanicity`), and information about the house, including the type of house (`HousingUnitType`) and when the house was built (`YearMade`). Additionally, there is a long list of weighting variables that we will use in the analysis (e.g., `NWEIGHT`, `NWEIGHT1`, ..., `NWEIGHT60`). We will discuss using these weighting variables in Chapter \@ref(c10-specifying-sample-designs). For a more detailed codebook, see Appendix \@ref(recs-cb).
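
For instance, one could peek at a handful of these variables with a quick `select()` call; the sketch below uses a purely illustrative subset of columns.

```r
# Sketch only: a quick look at a few of the variables described above
recs_2020 %>%
  select(DOEID, Region, HousingUnitType, YearMade, NWEIGHT)
```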

The ANES is a series of studies that have collected data from election surveys since 1948. These surveys contain data on public opinion and voting behavior in U.S. presidential elections. The 2020 survey (the data we will be using) was fielded to individuals over the web, through live video interviewing, or with computer-assisted telephone interviewing (CATI). The survey includes questions on party affiliation, voting choice, and level of trust in the government. Here is an overview of the `anes_2020` data. First, we show the variables starting with "V" followed by a number; these are the original variables. Then, we show the remaining variables that we created based on the original data:

```{r}
#| label: intro-anes
anes_2020 %>% select(matches("^V\\d"))
anes_2020 %>% select(-matches("^V\\d"))
```

From this output, we can see that there are `r nrow(anes_2020) %>% formatC(big.mark = ",")` rows and `r ncol(anes_2020) %>% formatC(big.mark = ",")` variables. Most of the variables start with V20, so referencing the survey documentation will be crucial to avoid getting lost (see Chapter \@ref(c03-understanding-survey-data-documentation)). We have created some more descriptive variables for you to use throughout this book, such as the age (`Age`) and gender (`Gender`) of the respondent, along with variables that represent their party affiliation (`PartyID`). Additionally, we need the variables `Weight` and `Stratum` to analyze these data accurately. We will discuss how to use these weighting variables in Chapters \@ref(c10-specifying-sample-designs) and \@ref(c03-understanding-survey-data-documentation). For a more detailed codebook, see Appendix \@ref(anes-cb).
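
As a preview of how `Weight` and `Stratum` come into play, the sketch below shows one way a design object could be declared with {srvyr}. This is a minimal illustration only: the object name is arbitrary, the call omits design details such as cluster IDs, and the full specification is covered in Chapter \@ref(c10-specifying-sample-designs).

```r
# Minimal sketch only: the complete specification (e.g., cluster IDs, nesting)
# follows the design details discussed in Chapter 10
anes_des_sketch <- anes_2020 %>%
  as_survey_design(
    weights = Weight, # analytic weight described above
    strata = Stratum  # stratification variable described above
  )
```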

In most chapters, you'll find code that you can follow. Each of these chapters starts with a "setup" section. The setup section includes the code needed to load the necessary packages and datasets in the chapter. We then provide the main idea of the chapter and examples on how to use the functions. Most chapters end with exercises to work through. Solutions to the exercises can be found in the Appendix.
73 changes: 25 additions & 48 deletions 03-understanding-survey-data-documentation.Rmd
@@ -88,70 +88,47 @@ The 2004 ANES dataset released an erratum, notifying analysts to remove a specif

Survey documentation may include additional material, such as interviewer instructions or "show cards" provided to respondents during interviewer-administered surveys to help respondents answer questions. Explore the survey website to find out what resources were used and in what contexts.

## Working with Missing Data
## Missing data coding

Missing data in surveys refers to situations where participants do not provide complete responses to survey questions. Respondents may not have seen a question by design. Or, they may not respond to a question for various other reasons, such as not wanting to answer a particular question, not understanding the question, or simply forgetting to answer.
For some observations in a dataset, there may be missing data. This can be by design or from nonresponse, and these concepts are detailed in Chapter \@ref(c11-missing-data). In that chapter, we also discuss how to analyze data with missing values. In this section, we discuss how to understand documentation related to missing data.

Missing data can be a significant problem in survey analysis, as it can introduce bias and reduce the representativeness of the data. Missing data typically falls into two main categories: missing by design or unintentional missing data.
The survey documentation, often the codebook, represents the missing data with a code. The codebook may list different codes depending on why certain data are missing. In the example of variable `V202066` from the ANES (Figure \@ref(fig:understand-codebook-examp)), `-9` represents "Refused," `-7` means that the response was deleted due to an incomplete interview, `-6` means that there is no response because there was no follow-up interview, and `-1` means "Inapplicable" (due to the designed skip pattern).

1. **Missing by design/questionnaire skip logic**: This type of missingness occurs when certain respondents are intentionally directed to skip specific questions based on their previous responses or characteristics. For example, in a survey about employment, if a respondent indicates that they are not employed, they may be directed to skip questions related to their job responsibilities. Additionally, some surveys randomize questions or modules so that not all participants respond to all questions. In these instances, respondents would have missing data for the modules not randomly assigned to them.

2. **Unintentional missing data**: This type of missingness occurs when researchers do not intend for there to be missing data on a particular question, for example, if respondents did not finish the survey or refused to answer individual questions. There are three main types of unintentional missing data that each should be considered and handled differently [@mack; @Schafer2002]:

a. **Missing completely at random (MCAR)**: The missing data are unrelated to both observed and unobserved data, and the probability of being missing is the same across all cases. For example, if a respondent missed a question because they had to leave the survey early due to an emergency.

b. **Missing at random (MAR)**: The missing data are related to observed data but not unobserved data, and the probability of being missing is the same within groups. For example, if older respondents choose not to answer specific questions but younger respondents do answer them and we know the respondent's age.

c. **Missing not at random (MNAR)**: The missing data are related to unobserved data, and the probability of being missing varies for reasons we are not measuring. For example, if respondents with depression do not answer a question about depression severity.

The survey documentation, often the codebook, represents the missing data with a code. For example, a survey may have "Yes" responses coded to `1`, "No" responses coded to `2`, and missing responses coded to `-9`. Or, the codebook may list different codes depending on why certain data are missing. In the example of variable `V202066` from the ANES (Figure \@ref(fig:understand-codebook-examp)), `-9` represents "Refused," `-7` means that the response was deleted due to an incomplete interview, `-6` means that there is no response because there was no follow-up interview, and `-1` means "Inapplicable" (due to the designed skip pattern).

When running analysis in R, we must handle missing responses as missing data (i.e., `NA`) and not numeric data. If missing responses are treated as zeros or arbitrary values, they can artificially alter summary statistics or introduce spurious patterns in the analysis. Recoding these values to `NA` will allow us to handle missing data in different ways in R, such as using functions like `na.omit()`, `complete.cases()`, or specialized packages like {tidyimpute} or {mice}. These tools allow us to treat missing responses as missing data to conduct our analysis accurately and obtain valid results.
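
As a brief sketch of that recoding step, the example below converts the negative reserve codes from the `V202066` example above into `NA` values. The new variable name is illustrative, and the sketch assumes the codes are stored as numbers; the exact codes to recode should always come from the codebook.

```r
# Sketch only: recode codebook-defined missing codes (e.g., -9, -7, -6, -1)
# to NA so R treats them as missing rather than as valid numeric values
library(dplyr)

anes_recoded <- anes_2020 %>%
  mutate(
    V202066_clean = if_else(V202066 < 0, NA_real_, as.numeric(V202066))
  )
```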

Visualizing the missing data can also inform the types of missing data that are present. The {naniar} package provides many valuable missing data visualizations, such as using `gg_miss_var()` to see the count or percent of missing data points by variable or `gg_miss_fct()` to see relationships in missing data across levels of a factor variable. Investigating the relationships and nature of the missing data before running models can ensure that the missing data are accurately accounted for.
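
For instance, a first look at missingness with {naniar} might resemble the sketch below; the dataset and the factor variable are illustrative choices rather than a prescribed workflow.

```r
# Sketch only: visualize missingness with {naniar}
library(naniar)

gg_miss_var(anes_2020)                    # count of missing values per variable
gg_miss_fct(x = anes_2020, fct = PartyID) # missingness across levels of a factor
```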

### Accounting for Questionnaire Skip Patterns

Questionnaires may include skip patterns, in which specific questions are skipped based on the respondent's answers to earlier questions. For example, if a respondent answers "no" to a question on whether they voted in the last election, they may be instructed to skip a series of questions related to that election.

Skip patterns are used in surveys to streamline the data collection process and avoid asking irrelevant questions to certain respondents. However, they also result in missing data, as respondents cannot respond to questions they were instructed to skip. Analyzing the data missing by design requires understanding the underlying reasons for the skip patterns. Our survey analysis must properly account for skip patterns to ensure unbiased and accurate population parameters.

Dealing with missing data due to skip patterns requires careful consideration. We can treat skipped questions as missing data. Or, we can run an analysis that accounts for the conditional dependence between the skipped and answered questions. The appropriate method depends on the nature and extent of the skip patterns, the research questions, and the methodology. For example, if we wanted to know what proportion of eligible voters voted for a particular candidate, the denominator would be all eligible voters, while if we wanted to know what proportion voted for a specific candidate among those who voted, the denominator would be those who voted. We include or exclude missing values depending on our research question.
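
The toy example below, built on entirely hypothetical data, illustrates how the choice of denominator changes the estimate when skipped questions are coded as `NA`.

```r
# Hypothetical data: candidate is NA when the voting questions were skipped
library(dplyr)

votes <- tibble::tibble(
  voted     = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  candidate = c("A", "B", NA, "A", NA)
)

# Proportion choosing candidate A among all eligible respondents
sum(votes$candidate == "A", na.rm = TRUE) / nrow(votes)

# Proportion choosing candidate A among those who voted
votes %>%
  filter(voted) %>%
  summarize(prop_A = mean(candidate == "A"))
```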

### Accounting for Unintentional Missing Data

When dealing with missing data that is MCAR, MAR, or MNAR, we must consider the implications of how we handle these missing data and avoid introducing more sources of bias. For instance, we can analyze only the respondents who answered all questions by performing listwise deletion, which drops all rows from a data frame with a missing value in any column. For example, let's say we have a dataset `dat` with one complete case and two cases with some missing data. We can use the function `tidyr::drop_na()` for listwise deletion.
As another example, there may be a summary variable that describes the missingness of a set of variables, particularly with "select all that apply" or "multiple response" questions. In the National Crime Victimization Survey (NCVS), respondents who are victims of a crime and saw the offender are asked whether the offender had a weapon and, if so, what type of weapon it was. This part of the questionnaire from 2021 is shown in Figure \@ref(fig:understand-ncvs-weapon-q).

```{r}
#| label: understand-dropna-example1
dat <- tibble::tribble(~ col1, ~ col2, ~ col3,
                       "a", "d", "e",
                       "b", NA, NA,
                       "c", NA, "f")

dat
```

```{r}
#| label: understand-ncvs-weapon-q
#| echo: false
#| fig.cap: Excerpt from the NCVS 2020-2021 Crime Incident Report - Weapon Type
#| fig.alt: Questions 22 and 23a from the NCVS 2020-2021 Crime Incident Report, see https://bjs.ojp.gov/content/pub/pdf/ncvs20_cir.pdf
knitr::include_graphics(path="images/questionnaire-ncvs-weapon.jpg")
```

If we use the `tidyr::drop_na()` function, only the first case will remain, as the other two cases have at least one missing value.
<!-- https://bjs.ojp.gov/content/pub/pdf/ncvs20_cir.pdf -->

For multiple response questions, the NCVS codebook includes a "lead in" variable that summarizes the individual options. For question 23a on the weapon type, the lead in variable is V4050, which is shown in Figure \@ref(fig:understand-ncvs-weapon-cb). This variable is then followed by a set of variables, one for each weapon type. An example of one of the individual variables from the codebook, the handgun, is shown in Figure \@ref(fig:understand-ncvs-weapon-cb-hg). We will dive deeper into how to analyze this variable in Chapter \@ref(c11-missing-data).

```{r}
#| label: understand-dropna-example2
dat %>%
  tidyr::drop_na()
```

```{r}
#| label: understand-ncvs-weapon-cb
#| echo: false
#| fig.cap: Excerpt from the NCVS 2021 Codebook for V4050 - LI WHAT WAS WEAPON
#| fig.alt: Codebook includes location of variable (files and columns), variable type (numeric), question (What was the weapon? Anything else?), and the coding of this lead in variable
knitr::include_graphics(path="images/codebook-ncvs-weapon-li.jpg")
```

However, if we want to remove only the rows that have missing values in `col3`, we can specify this as an argument in `drop_na()` as follows:

```{r}
#| label: understand-dropna-example3
dat %>%
  tidyr::drop_na(col3)
```

```{r}
#| label: understand-ncvs-weapon-cb-hg
#| echo: false
#| fig.cap: "Excerpt from the NCVS 2021 Codebook for V4051 - C WEAPON: HAND GUN"
#| fig.alt: Codebook includes location of variable (files and columns), variable type (numeric), question (What was the weapon? Anything else?), and the coding of this categorical variable
knitr::include_graphics(path="images/codebook-ncvs-weapon-handgun.jpg")
```

The `drop_na()` function works on `tbl_svy` objects as well and should only be applied after creating the design object.
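
A minimal sketch of that order of operations is shown below. It assumes a design object such as the `recs_des` built in Chapter \@ref(c10-specifying-sample-designs), and `HousingUnitType` is used purely as an illustrative column.

```r
# Sketch only: build the design object first, then apply listwise deletion
recs_des %>%
  tidyr::drop_na(HousingUnitType)
```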

If the data are not missing completely at random (MCAR), then listwise deletion may produce biased estimates if there is a pattern of respondents who do not respond to specific questions. In these circumstances, we should explore other options, such as multiple imputation or weighted estimation. However, imputation is not always appropriate and can introduce its own sources of bias. See @allison for more details.
When data are read into R, some values may be system missing; that is, they are coded as `NA` even if that is not evident in a codebook. We will discuss in Chapter \@ref(c11-missing-data) how to analyze data with `NA` values and review how R handles missing data in calculations.
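
As a small preview of that default behavior, consider the following:

```r
# R propagates NA by default; na.rm = TRUE drops NA values before calculating
x <- c(2, NA, 6)

mean(x)               # returns NA
mean(x, na.rm = TRUE) # returns 4
```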

In summary, we need to deeply understand the types and reasons for missing data in our survey before running any analysis. The survey documentation is an important resource for understanding how to deal with missing data. Carefully review the documentation for guidance from the researchers.
<!-- https://stats.oarc.ucla.edu/r/faq/how-does-r-handle-missing-values/ -->

## Example: American National Election Studies (ANES) 2020 Survey Documentation

2 changes: 1 addition & 1 deletion 04-set-up.Rmd
@@ -72,7 +72,7 @@ Once this package is installed, load it using the `library()` function:
library(censusapi)
```

Additional packages are used in the vignettes. We list them in the Prerequisite boxes at the beginning of the chapters. When working through those chapters, please ensure you pay attention to the setup box at the beginning of the chapter and install all necessary packages.
Additional packages are used in the Real Life Data and Vignettes sections of the book. We list them in the Prerequisite boxes at the beginning of the chapters. When working through those chapters, please pay attention to the Prerequisite box at the beginning of each chapter and load all necessary packages and data.

## Data

20 changes: 19 additions & 1 deletion 10-specifying-sample-designs.Rmd
@@ -584,7 +584,6 @@ To specify this design, use the following syntax:
#| echo: FALSE
#| warning: FALSE
#| message: FALSE
#| cache: TRUE
data(recs_2015)
```

@@ -657,6 +656,7 @@ To specify this design, use the following syntax:

```{r}
#| label: samp-des-recs2020-des
recs_des <- recs_2020 %>%
as_survey_rep(
weights = NWEIGHT,
@@ -672,6 +672,24 @@ recs_des
summary(recs_des)
```

```{r}
#| label: samp-des-recs2020-des-save
#| echo: false
recs_des <- recs_2020 %>%
as_survey_rep(
weights = NWEIGHT,
repweights = NWEIGHT1:NWEIGHT60,
type = "JK1",
scale = 59/60,
mse = TRUE
)
```


When printing the design object or looking at the summary, the replicate weight type is re-iterated as `Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances`, and the variables are included. No weight or probability summary is included.

### Bootstrap Method
