Merge pull request #90 from tidy-survey-r/set-up-ch
Adding Set-up chapter
ivelasq authored Jan 14, 2024
2 parents 5e54d3b + 2506e75 commit 0fa4238
Showing 10 changed files with 286 additions and 162 deletions.
58 changes: 1 addition & 57 deletions 01-introduction.Rmd
@@ -40,60 +40,4 @@ This book will cover many aspects of survey design and analysis, from understand
- **Chapter \@ref(c13-ncvs-vignette)**: National Crime Victimization Survey Vignette. A vignette on how to analyze data from the NCVS, a survey in the U.S. that collects information on crimes and their characteristics. This illustrates an analysis that requires multiple files to calculate victimization rates.
- **Chapter \@ref(c14-ambarom-vignette)**: AmericasBarometer Vignette. A vignette on how to analyze data from the AmericasBarometer, a survey of attitudes, evaluations, experiences, and behavior in countries in the Western Hemisphere. This includes how to make choropleth maps with survey estimates.

In most chapters, you'll find code that you can follow. Each of these chapters starts with a "set-up" section. This section will include the code needed to load the necessary packages and datasets in the chapter. We then provide the main idea of the chapter and examples on how to use the functions. Most chapters end with exercises to work through. Solutions to the exercises can be found in the Appendix.

## Datasets used in this book

We work with two key datasets throughout the book: the Residential Energy Consumption Survey [RECS -- @recs-2020-tech] and the American National Election Studies [ANES -- @debell]. To ensure that all readers can follow the examples, we have provided analytic datasets in an R package, {srvyr.data}. Install the package from GitHub using the {remotes} package.

```r
remotes::install_github("https://github.com/tidy-survey-r/srvyr.data")
```

To explore the provided datasets in the package, access the documentation using the `help()` command.

```r
help(package="srvyr.data")
```
To load the RECS and ANES datasets, start by running `library(srvyr.data)` to load the package. Then, use the `data()` command to load the datasets into the environment.

```{r}
#| label: intro-setup
#| error: FALSE
#| warning: FALSE
#| message: FALSE
library(tidyverse)
library(survey)
library(srvyr)
library(srvyr.data)
```

```{r}
#| label: intro-setup-readin
#| error: FALSE
#| warning: FALSE
#| message: FALSE
#| cache: TRUE
data(recs_2020)
data(anes_2020)
```

RECS is a study that provides energy consumption and expenditures data in American households. The Energy Information Administration funds RECS, which has been fielded 15 times between 1950 and 2020. The survey has two components: the household survey and the energy supplier survey. In 2020, the household survey was collected by web and paper questionnaires and included questions about appliances, electronics, heating, air conditioning (A/C), temperatures, water heating, lighting, respondent demographics, and energy assistance. The energy supplier survey consists of components relating to energy consumption and energy expenditure. Below is an overview of the `recs_2020` data:

```{r}
#| label: intro-recs
recs_2020 %>% select(-starts_with("NWEIGHT"))
recs_2020 %>% select(starts_with("NWEIGHT"))
```

From this output, we can see that there are `r nrow(recs_2020) %>% formatC(big.mark = ",")` rows and `r ncol(recs_2020) %>% formatC(big.mark = ",")` variables. We can see that there are variables containing an ID (`DOEID`), geographic information (e.g., `Region`, `state_postal`, `Urbanicity`), along with information about the house, including the type of house (`HousingUnitType`) and when the house was built (`YearMade`). Additionally, there is a long list of weighting variables that we will use in the analysis (e.g., `NWEIGHT`, `NWEIGHT1`, ..., `NWEIGHT60`). We will discuss using these weighting variables in Chapter \@ref(c10-specifying-sample-designs). For a more detailed codebook, see Appendix \@ref(recs-cb).

The ANES is a series of studies that has collected data from election surveys since 1948. These surveys contain data on public opinion and voting behavior in U.S. presidential elections. The 2020 survey (the data we will be using) was fielded to individuals over the web, through live video interviewing, or through computer-assisted telephone interviewing (CATI). The survey includes questions on party affiliation, voting choice, and level of trust in the government. Here is an overview of the `anes_2020` data. First, we show the variables starting with "V" followed by a number; these are the original variables. Then, we show the remaining variables that we created based on the original data:

```{r}
#| label: intro-anes
anes_2020 %>% select(matches("^V\\d"))
anes_2020 %>% select(-matches("^V\\d"))
```
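
The pattern `"^V\\d"` used above keeps names that begin with a capital "V" immediately followed by a digit. As a minimal sketch of how that split behaves, the base R snippet below applies the same pattern to a handful of hypothetical column names (`toy_names` is made up for illustration and is not the real set of `anes_2020` columns):

```r
# Hypothetical column names, for illustration only
toy_names <- c("V200001", "V201006", "Age", "Gender", "VotedPres2020")

# TRUE where the name starts with "V" followed directly by a digit
is_original <- grepl("^V\\d", toy_names)

# Names treated as original survey variables
toy_names[is_original]

# Names treated as derived variables; "VotedPres2020" lands here
# because its second character is a letter, not a digit
toy_names[!is_original]
```

Anchoring the pattern with `^` matters: without it, any name containing a "V" followed by a digit anywhere would match.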

From this output, we can see that there are `r nrow(anes_2020) %>% formatC(big.mark = ",")` rows and `r ncol(anes_2020) %>% formatC(big.mark = ",")` variables. Most of the variables start with V20, so referencing the documentation for the survey will be crucial to not get lost (see Chapter \@ref(c03-understanding-survey-data-documentation)). We have created some more descriptive variables for you to use throughout this book, such as the age (`Age`) and gender (`Gender`) of the respondent, along with variables that represent their party affiliation (`PartyID`). Additionally, we need the variables `Weight` and `Stratum` to analyze this data accurately. We will discuss how to use these weighting variables in Chapters \@ref(c10-specifying-sample-designs) and \@ref(c03-understanding-survey-data-documentation). For a more detailed codebook, see Appendix \@ref(anes-cb).
In most chapters, you'll find code that you can follow. Each of these chapters starts with a "setup" section. The setup section includes the code needed to load the necessary packages and datasets in the chapter. We then provide the main idea of the chapter and examples on how to use the functions. Most chapters end with exercises to work through. Solutions to the exercises can be found in the Appendix.
90 changes: 3 additions & 87 deletions 03-understanding-survey-data-documentation.Rmd
@@ -1,32 +1,14 @@
# Understanding Survey Data Documentation {#c03-understanding-survey-data-documentation}

::: {.prereqbox-header}
`r if (knitr:::is_html_output()) '### Prerequisites {- #prereq4}'`
:::

::: {.prereqbox data-latex="{Prerequisites}"}
For this chapter, load the following packages and the helper function:
```{r}
#| label: understand-c04-setup
#| label: understand-pkgs
#| echo: FALSE
#| error: FALSE
#| warning: FALSE
#| message: FALSE
library(tidyverse)
library(survey)
library(srvyr)
library(srvyr.data)
library(censusapi)
```

We created multiple analytic datasets for use in this book in an R package as described in \@ref(datasets-used-in-this-book). For this chapter, we will be using data from ANES. Here is the code to read in the data:

```{r}
#| label: understand-anes-c04
#| eval: FALSE
data(anes_2020)
```
:::

## Introduction

Before diving into survey analysis, it's crucial to review the survey documentation thoroughly. The documentation includes technical guides, questionnaires, codebooks, errata, and other useful resources. By taking the time to review these materials, we can gain a comprehensive understanding of the survey data (including research and design decisions discussed in Chapters \@ref(c02-overview-surveys) and \@ref(c10-specifying-sample-designs)) and effectively conduct our analysis.
@@ -203,73 +185,7 @@ The user guide references a supplemental document called "How to Analyze ANES Su

> The target population for the fresh cross-section was the 231 million non-institutional U.S. citizens aged 18 or older living in the 50 U.S. states or the District of Columbia.
To create accurate weights for the population, we need to determine the total population size when the survey was conducted. We will use the Current Population Survey (CPS) to find the number of non-institutional U.S. citizens aged 18 or older living in the 50 U.S. states or D.C. in March of 2020, as this is what the ANES methodology mentions using.

The {censusapi} package allows us to run a reproducible analysis of the CPS data. Note that this package requires a census API key; more information can be found in the package documentation. Best practice is to include the census API key in our R environment and not directly in the code. We can use the {usethis} package's `edit_r_environ()` function to access the R environment (located in a file called `.Renviron`). Run `edit_r_environ()`, save the census API key as `CENSUS_KEY`, and restart RStudio. Once the census API key is saved in the R environment, we access it in our code with `Sys.getenv("CENSUS_KEY")`.
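
Before calling the API, it can help to confirm that the key is actually visible to the R session. The following is a minimal sketch using only base R; `census_key` and `key_is_set` are illustrative names, not part of {censusapi}:

```r
# Sys.getenv() returns an empty string when the variable is not set
census_key <- Sys.getenv("CENSUS_KEY")

# nzchar() is TRUE only for a non-empty string
key_is_set <- nzchar(census_key)

if (!key_is_set) {
  message("CENSUS_KEY is not set; add it to .Renviron and restart R.")
}
```

Checking the key this way avoids printing it to the console or storing it in the script.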

We extract several variables including month of interview (`HRMONTH`), year of interview (`HRYEAR4`), age (`PRTAGE`), citizenship status (`PRCITSHP`), and final person-level weight (`PWSSWGT`). Detailed information for these variables can be found in the data dictionary^[https://www2.census.gov/programs-surveys/cps/datasets/2020/basic/2020_Basic_CPS_Public_Use_Record_Layout_plus_IO_Code_list.txt].


```{r}
#| label: understand-get-cps
#| message: false
#| cache: TRUE
cps_state_in <- getCensus(name = "cps/basic/mar",
                          vintage = 2020,
                          region = "state",
                          vars = c("HRMONTH", "HRYEAR4",
                                   "PRTAGE", "PRCITSHP", "PWSSWGT"),
                          key = Sys.getenv("CENSUS_KEY"))
cps_state <- cps_state_in %>%
  as_tibble() %>%
  mutate(across(.cols = everything(),
                .fns = as.numeric))
```

We confirm that all the data is from March (`HRMONTH == 3`) of 2020 (`HRYEAR4 == 2020`).
```{r}
#| label: understand-cps-date
cps_state %>%
distinct(HRMONTH, HRYEAR4)
```

We then filter to only those who are 18 years or older (`PRTAGE >= 18`) and have U.S. citizenship (`PRCITSHP %in% (1:4)`) and calculate the sum of the weights to obtain the size of the target population.
```{r}
#| label: understand-cps-targetpop
targetpop <- cps_state %>%
  as_tibble() %>%
  filter(PRTAGE >= 18,
         PRCITSHP %in% (1:4)) %>%
  pull(PWSSWGT) %>%
  sum()
targetpop
```
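
To make the filter-then-sum logic concrete, here is a small sketch in base R on made-up records. The data frame `toy_cps` and its values are hypothetical; only the variable names and the filter conditions come from the text above, which treats `PRCITSHP` codes 1 through 4 as citizens:

```r
# Hypothetical person records, for illustration only
toy_cps <- data.frame(
  PRTAGE   = c(17, 18, 45, 70),
  PRCITSHP = c(1, 5, 2, 4),
  PWSSWGT  = c(1000, 1500, 2000, 2500)
)

# Keep adults (18 or older) whose citizenship code is 1 through 4
in_target <- toy_cps$PRTAGE >= 18 & toy_cps$PRCITSHP %in% 1:4

# The weighted population size is the sum of the retained weights
toy_targetpop <- sum(toy_cps$PWSSWGT[in_target])
toy_targetpop
```

Only the last two rows pass both conditions, so the total is the sum of their weights. Each weight stands for the number of people in the population that the record represents, which is why summing weights, rather than counting rows, gives a population size.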

The target population in 2020 is `r scales::comma(targetpop)`. This information gives us what we need to create the post-election survey object with {srvyr}. Using the raw ANES data we pulled in at the beginning of this chapter, we will adjust the weighting variable (`V200010b`) using the target population we just calculated (`targetpop`).

```{r}
#| label: understand-read-anes
anes_adjwgt <- anes_2020 %>%
mutate(Weight = V200010b / sum(V200010b) * targetpop)
```
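
The adjustment divides each weight by the sum of all weights and multiplies by the target population, so the new weights sum to the target while keeping their relative sizes. A minimal sketch with made-up numbers (`toy_weights` and `toy_targetpop` are hypothetical values, not ANES or CPS figures):

```r
# Hypothetical weights and population total, for illustration only
toy_weights <- c(0.5, 1.0, 1.5, 2.0)
toy_targetpop <- 1000

# Same formula as above: rescale so the weights sum to the target
toy_adjusted <- toy_weights / sum(toy_weights) * toy_targetpop

# The adjusted weights sum to the target (up to floating-point rounding)
sum(toy_adjusted)

# Every weight is multiplied by the same factor, so ratios between
# respondents are unchanged
toy_adjusted / toy_weights
```

Because the rescaling is a constant multiple, estimates of means and proportions are unaffected; only estimated totals change scale.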

Once we have adjusted the weights to the population, we can create the survey design using our new weight variable in the `weights` argument and use the strata and cluster variables identified in the user manual.

```{r}
#| label: understand-anes-des
anes_des <- anes_adjwgt %>%
  as_survey_design(weights = Weight,
                   strata = V200010d,
                   ids = V200010c,
                   nest = TRUE)
summary(anes_des)
```

Now that we have the survey design object, we can continue to reference the ANES documentation, including the questionnaire and the codebook, as we select variables for analysis and gain insights into the findings.
To create accurate weights for the population, we need to determine the total population size when the survey was conducted. This can be determined using the Current Population Survey (CPS) for March of 2020 as stated in the ANES documentation. Chapter \@ref(c04-set-up) goes into more detail about how to calculate this value and adjust the data.

## Searching for Public-Use Survey Data
Throughout this book, we use public-use datasets from different surveys. Above, we provided an example from the American National Election Studies (ANES), and we will continue to use this dataset throughout the book. Additionally, we use the Residential Energy Consumption Survey (RECS), the National Crime Victimization Survey (NCVS), and the AmericasBarometer surveys.