Commit

Merge pull request #85 from tidy-survey-r/use-data-package
Use data package in book
ivelasq authored Jan 13, 2024
2 parents 38279f4 + dfd32da commit b98154f
Showing 23 changed files with 137 additions and 4,004 deletions.
65 changes: 19 additions & 46 deletions 01-introduction.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -38,44 +38,18 @@ In most chapters, you'll find code that you can follow. Each of these chapters s

## Datasets used in this book {#book-datasets}

We work with two key datasets throughout the book: the Residential Energy Consumption Survey [RECS -- @recs-2020-tech] and the American National Election Studies [ANES -- @debell]. To ensure that all readers can follow the examples, we have provided analytic datasets available on OSF^[https://osf.io/gzbkn/?view_only=8ca80573293b4e12b7f934a0f742b957].

If a chapter contains data that is not part of existing packages, we have created a helper function, `read_osf()`, for you to load it easily. We recommend saving the script below in a folder called "helper-fun" and calling the file `helper-function.R` if you would like to follow along with the prerequisites listed in the chapters that contain code.
We work with two key datasets throughout the book: the Residential Energy Consumption Survey [RECS -- @recs-2020-tech] and the American National Election Studies [ANES -- @debell]. To ensure that all readers can follow the examples, we have provided analytic datasets in an R package, {srvyr.data}. Install the package from GitHub using the {remotes} package.

```r
read_osf <- function(filename){
#' Downloads file from OSF project
#' Reads in file
#' Deletes file from computer

osf_dl_del_later <- !dir.exists("osf_dl")

if (osf_dl_del_later) {
osf_dl_del_later <- TRUE
dir.create("osf_dl")
}

dat_det <-
osf_retrieve_node("https://osf.io/gzbkn/?view_only=8ca80573293b4e12b7f934a0f742b957") %>%
osf_ls_files() %>%
dplyr::filter(name == filename) %>%
osf_download(conflicts = "overwrite", path = "osf_dl")

out <- dat_det %>%
dplyr::pull(local_path) %>%
readr::read_rds()

if (osf_dl_del_later) {
unlink("osf_dl", recursive = TRUE)
} else{
unlink(dplyr::pull(dat_det, local_path))
}

return(out)
}
remotes::install_github("https://github.com/tidy-survey-r/srvyr.data")
```
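If you rerun the book's code often, you may want to check whether the package is already present before reinstalling it. The sketch below is one possible approach; `is_installed()` is a hypothetical helper name introduced here for illustration and is not part of {srvyr.data} or {remotes}.

```r
# Hypothetical helper: TRUE when a package is installed, without attaching it.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

if (!is_installed("srvyr.data")) {
  # The install itself is left as a comment so this sketch never touches the network:
  # remotes::install_github("https://github.com/tidy-survey-r/srvyr.data")
  message("srvyr.data is not installed yet; run the install command above.")
}
```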

Here's how to use the function to read in the RECS and ANES datasets:
To explore the datasets provided in the package, access the documentation using the `help()` command.

```r
help(package="srvyr.data")
```
To load the RECS and ANES datasets, start by running `library(srvyr.data)` to load the package. Then, use the `data()` command to load the datasets into the environment.

```{r}
#| label: intro-setup
Expand All @@ -85,8 +59,7 @@ Here's how to use the function to read in the RECS and ANES datasets:
library(tidyverse)
library(survey)
library(srvyr)
library(osfr)
source("helper-fun/helper-function.R")
library(srvyr.data)
```

```{r}
Expand All @@ -95,26 +68,26 @@ source("helper-fun/helper-function.R")
#| warning: FALSE
#| message: FALSE
#| cache: TRUE
recs_in <- read_osf("recs_2020.rds")
anes_in <- read_osf("anes_2020.rds")
data(recs_2020)
data(anes_2020)
```
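To see what `data()` does, the following sketch loads a dataset that ships with base R into a scratch environment. It uses `mtcars` from {datasets} only so that it runs without {srvyr.data}; `demo_env` is a name made up for this illustration. The same call pattern should apply to `data(recs_2020)` once {srvyr.data} is attached.

```r
# Illustration with a base R dataset; demo_env is a throwaway environment.
demo_env <- new.env()
loaded <- data(list = "mtcars", package = "datasets", envir = demo_env)

loaded        # name(s) of the dataset(s) that were requested
ls(demo_env)  # objects that data() created in demo_env
```

Called without `envir`, `data()` places the object in the calling environment (typically the global environment), which is how `recs_2020` and `anes_2020` become available in the chunks above.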

RECS is a study that provides energy consumption and expenditures data in American households. The Energy Information Administration funds RECS and has been fielded 15 times between 1950 and 2020. The survey has two components - the household survey and the energy supplier survey. In 2020, the household survey was collected by web and paper questionnaires and included questions about appliances, electronics, heating, air conditioning (A/C), temperatures, water heating, lighting, respondent demographics, and energy assistance. The energy supplier survey consists of components relating to energy consumption and energy expenditure. Below is an overview of the `recs_in` data:
RECS is a study that provides energy consumption and expenditures data in American households. The Energy Information Administration funds RECS, which has been fielded 15 times between 1950 and 2020. The survey has two components: the household survey and the energy supplier survey. In 2020, the household survey was collected by web and paper questionnaires and included questions about appliances, electronics, heating, air conditioning (A/C), temperatures, water heating, lighting, respondent demographics, and energy assistance. The energy supplier survey consists of components relating to energy consumption and energy expenditure. Below is an overview of the `recs_2020` data:

```{r}
#| label: intro-recs
recs_in %>% select(-starts_with("NWEIGHT"))
recs_in %>% select(starts_with("NWEIGHT"))
recs_2020 %>% select(-starts_with("NWEIGHT"))
recs_2020 %>% select(starts_with("NWEIGHT"))
```

From this output, we can see that there are `r nrow(recs_in) %>% formatC(big.mark = ",")` rows and `r ncol(recs_in) %>% formatC(big.mark = ",")` variables. We can see that there are variables containing an ID (`DOEID`), geographic information (e.g., `Region`, `state_postal`, `Urbanicity`), along with information about the house, including the type of house (`HousingUnitType`) and when the house was built (`YearMade`). Additionally, there is a long list of weighting variables that we will use in the analysis (e.g., `NWEIGHT`, `NWEIGHT1`, ..., `NWEIGHT60`). We will discuss using these weighting variables in Chapter \@ref(c03-specifying-sample-designs). For a more detailed codebook, see Appendix \@ref(recs-cb).
From this output, we can see that there are `r nrow(recs_2020) %>% formatC(big.mark = ",")` rows and `r ncol(recs_2020) %>% formatC(big.mark = ",")` variables. We can see that there are variables containing an ID (`DOEID`), geographic information (e.g., `Region`, `state_postal`, `Urbanicity`), along with information about the house, including the type of house (`HousingUnitType`) and when the house was built (`YearMade`). Additionally, there is a long list of weighting variables that we will use in the analysis (e.g., `NWEIGHT`, `NWEIGHT1`, ..., `NWEIGHT60`). We will discuss using these weighting variables in Chapter \@ref(c03-specifying-sample-designs). For a more detailed codebook, see Appendix \@ref(recs-cb).

The ANES is a series study that has collected data from election surveys since 1948. These surveys contain data on public opinion and voting behavior in U.S. presidential elections. The 2020 survey (the data we will be using) was fielded to individuals over the web, through live video interviewing, or over with computer-assisted telephone interviewing (CATI). The survey includes questions on party affiliation, voting choice, and level of trust with the government. Here is an overview of the `anes_in` data. First, we show the variables starting with "V" followed by a number; these are the original variables. Then, we show you the remaining variables that we created based on the original data:
The ANES is a series study that has collected data from election surveys since 1948. These surveys contain data on public opinion and voting behavior in U.S. presidential elections. The 2020 survey (the data we will be using) was fielded to individuals over the web, through live video interviewing, or with computer-assisted telephone interviewing (CATI). The survey includes questions on party affiliation, voting choice, and level of trust in the government. Here is an overview of the `anes_2020` data. First, we show the variables starting with "V" followed by a number; these are the original variables. Then, we show you the remaining variables that we created based on the original data:

```{r}
#| label: intro-anes
anes_in %>% select(matches("^V\\d"))
anes_in %>% select(-matches("^V\\d"))
anes_2020 %>% select(matches("^V\\d"))
anes_2020 %>% select(-matches("^V\\d"))
```

From this output we can see that there are `r nrow(anes_in) %>% formatC(big.mark = ",")` rows and `r ncol(anes_in) %>% formatC(big.mark = ",")` variables. Most of the variables start with V20, so referencing the documentation for survey will be crucial to not get lost (see Chapter \@ref(c04-understanding-survey-data-documentation)). We have created some more descriptive variables for you to use throughout this book, such as the age (`Age`) and gender (`Gender`) of the respondent, along with variables that represent their party affiliation (`PartyID`). Additionally, we need the variables `Weight` and `Stratum` to analyze this data accurately. We will discuss how to use these weighting variables in Chapters \@ref(c03-specifying-sample-designs) and \@ref(c04-understanding-survey-data-documentation). For a more detailed codebook, see Appendix \@ref(anes-cb).
From this output, we can see that there are `r nrow(anes_2020) %>% formatC(big.mark = ",")` rows and `r ncol(anes_2020) %>% formatC(big.mark = ",")` variables. Most of the variables start with V20, so referencing the survey documentation will be crucial to not get lost (see Chapter \@ref(c04-understanding-survey-data-documentation)). We have created some more descriptive variables for you to use throughout this book, such as the age (`Age`) and gender (`Gender`) of the respondent, along with variables that represent their party affiliation (`PartyID`). Additionally, we need the variables `Weight` and `Stratum` to analyze this data accurately. We will discuss how to use these weighting variables in Chapters \@ref(c03-specifying-sample-designs) and \@ref(c04-understanding-survey-data-documentation). For a more detailed codebook, see Appendix \@ref(anes-cb).
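The pattern `"^V\\d"` used with `matches()` above separates original from derived variables by requiring a "V" at the start of the name followed immediately by a digit. The base R sketch below shows the same pattern on a short vector of column names; the vector is illustrative only and is not the full ANES column list.

```r
# Illustrative column names in the style of the ANES file (not the full list).
cols <- c("V200001", "V200010b", "Age", "Gender", "PartyID",
          "Weight", "Stratum", "VotedPres2020")

# TRUE where the name starts with "V" followed by a digit.
is_original <- grepl("^V\\d", cols)

cols[is_original]   # names matched by "^V\\d"
cols[!is_original]  # everything else, including names that start with "V" plus a letter
```

Requiring the digit matters: a pattern of just `"^V"` would also catch derived names such as the one beginning with "Voted".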
23 changes: 12 additions & 11 deletions 03-specifying-sample-designs.Rmd
Expand Up @@ -14,8 +14,7 @@ For this chapter, load the following packages and the helper function:
library(tidyverse)
library(survey)
library(srvyr)
library(osfr)
source("helper-fun/helper-function.R")
library(srvyr.data)
```

To help explain the different types of sample designs, this chapter will use the `api` and `scd` data that comes in the {survey} package:
Expand All @@ -25,12 +24,13 @@ data(api)
data(scd)
```

Additionally, we have created multiple analytic datasets for use in this book on a directory on OSF^[https://osf.io/gzbkn/?view_only=8ca80573293b4e12b7f934a0f742b957]. To load any data used in the book that is not included in existing packages, we have created a helper function `read_osf()`. This chapter uses data from the Residential Energy Consumption Survey (RECS) - both 2015 and 2020, so we will use the following code to load the RECS data to use later in this chapter:
Additionally, we have provided multiple analytic datasets in the {srvyr.data} package, as described in Section \@ref(book-datasets). This chapter uses data from both the 2015 and 2020 Residential Energy Consumption Survey (RECS), so we will use the following code to load the RECS data for use later in this chapter:

```{r}
#| label: samp-setup-recs
#| eval: FALSE
recs_2015_in <- read_osf("recs_2015.rds")
recs_in <- read_osf("recs_2020.rds")
data(recs_2015)
data(recs_2020)
```
:::

Expand Down Expand Up @@ -573,7 +573,7 @@ fay_des <- dat %>%

#### Example {-}

The 2015 RECS [@recs-2015-micro] uses Fay's BRR weights with the final weight as NWEIGHT and replicate weights as BRRWT1 - BRRWT96 with $\rho=0.5$. On the file, DOEID is a unique identifier for each respondent, TOTALDOL is the total cost of energy, TOTSQFT_EN is the total square footage of the residence, and REGOINC is the Census region. We have already read in the RECS data and created a dataset called `recs_2015_in` above in the prerequisites.
The 2015 RECS [@recs-2015-micro] uses Fay's BRR weights with the final weight as NWEIGHT and replicate weights as BRRWT1 - BRRWT96 with $\rho=0.5$. On the file, DOEID is a unique identifier for each respondent, TOTALDOL is the total cost of energy, TOTSQFT_EN is the total square footage of the residence, and REGIONC is the Census region. We have already read in the RECS data and created a dataset called `recs_2015` above in the prerequisites.
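As a rough check on what $\rho=0.5$ implies, the standard Fay's method variance estimator multiplies the sum of squared deviations of the replicate estimates by $1/\{R(1-\rho)^2\}$. The sketch below computes that multiplier for 96 replicates; `fay_multiplier()` is a hypothetical helper written for this illustration, not a function from {survey} or {srvyr}.

```r
# Hypothetical helper: variance multiplier for Fay's BRR, 1 / (R * (1 - rho)^2).
fay_multiplier <- function(n_reps, rho) {
  stopifnot(n_reps > 0, rho >= 0, rho < 1)
  1 / (n_reps * (1 - rho)^2)
}

# 96 replicate weights (BRRWT1 - BRRWT96) with rho = 0.5
fay_multiplier(n_reps = 96, rho = 0.5)
```

With $\rho = 0$ the multiplier reduces to $1/R$, the ordinary BRR case.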

To specify this design, use the following syntax:

Expand All @@ -583,14 +583,14 @@ To specify this design, use the following syntax:
#| warning: FALSE
#| message: FALSE
#| cache: TRUE
recs_2015_in <- read_osf("recs_2015.rds")
data(recs_2015)
```


```{r}
#| label: samp-des-recs-des
#| eval: TRUE
recs_2015_des <- recs_2015_in %>%
recs_2015_des <- recs_2015 %>%
as_survey_rep(weights = NWEIGHT,
repweights = BRRWT1:BRRWT96,
type = "Fay",
Expand Down Expand Up @@ -649,12 +649,13 @@ jkn_des <- dat %>%

#### Example {-}

The 2020 RECS [@recs-2020-micro] uses jackknife weights with the final weight as NWEIGHT and replicate weights as NWEIGHT1 - NWEIGHT60 with a scale of $(R-1)/R=59/60$. On the file, DOEID is a unique identifier for each respondent, TOTALDOL is the total cost of energy, TOTSQFT_EN is the total square footage of the residence, and REGOINC is the Census region. We have already read in the RECS data and created a dataset called `recs_in` above in the prerequisites.
The 2020 RECS [@recs-2020-micro] uses jackknife weights with the final weight as NWEIGHT and replicate weights as NWEIGHT1 - NWEIGHT60 with a scale of $(R-1)/R=59/60$. On the file, DOEID is a unique identifier for each respondent, TOTALDOL is the total cost of energy, TOTSQFT_EN is the total square footage of the residence, and REGIONC is the Census region. We have already read in the RECS data and created a dataset called `recs_2020` above in the prerequisites.
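To make the role of the scale concrete, the sketch below applies a delete-one jackknife variance formula, $\text{scale} \times \sum_r (\hat\theta_r - \hat\theta)^2$, to a handful of made-up replicate estimates. Both `jk_variance()` and the numbers are assumptions for illustration; in practice the design object handles this calculation.

```r
# Hypothetical helper: jackknife variance from replicate estimates.
# The default scale is (R - 1) / R, where R is the number of replicates.
jk_variance <- function(full_est, rep_ests,
                        scale = (length(rep_ests) - 1) / length(rep_ests)) {
  scale * sum((rep_ests - full_est)^2)
}

# Toy example: 4 replicate estimates around a full-sample estimate of 10.
jk_variance(full_est = 10, rep_ests = c(9, 11, 10, 10))

# The scale used for the 2020 RECS, with R = 60 replicates:
(60 - 1) / 60
```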

To specify this design, use the following syntax:

```{r}
recs_des <- recs_in %>%
#| label: samp-des-recs2020-des
recs_des <- recs_2020 %>%
as_survey_rep(
weights = NWEIGHT,
repweights = NWEIGHT1:NWEIGHT60,
Expand All @@ -673,7 +674,7 @@ summary(recs_des)
#| label: samp-des-recs-des-full
#| echo: FALSE
# This is just for later use in book
recs_des <- recs_in %>%
recs_des <- recs_2020 %>%
as_survey_rep(
weights = NWEIGHT,
repweights = NWEIGHT1:NWEIGHT60,
Expand Down
7 changes: 3 additions & 4 deletions 04-understanding-survey-data-documentation.Rmd
Expand Up @@ -14,16 +14,15 @@ For this chapter, load the following packages and the helper function:
library(tidyverse)
library(survey)
library(srvyr)
library(osfr)
source("helper-fun/helper-function.R")
library(srvyr.data)
library(censusapi)
```

We will be using data from ANES. Here is the code to read in the data.
```{r}
#| label: understand-anes-c04
#| eval: FALSE
anes_in <- read_osf("anes_2020.rds")
data(anes_2020)
```
:::

Expand Down Expand Up @@ -250,7 +249,7 @@ The target population in 2020 is `r scales::comma(targetpop)`. This information

```{r}
#| label: understand-read-anes
anes_adjwgt <- anes_in %>%
anes_adjwgt <- anes_2020 %>%
mutate(Weight = V200010b / sum(V200010b) * targetpop)
```
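The rescaling in `mutate()` above keeps each respondent's share of the total weight and changes only the total. The sketch below shows this on four made-up weights; `raw_wt` is illustrative, and the `targetpop` value is the one used in the later chapters of this changeset.

```r
# Target population total used in the later chapters of this changeset.
targetpop <- 231592693

# Made-up raw weights for four respondents (illustrative only).
raw_wt <- c(0.5, 1.2, 0.8, 1.5)

# Same arithmetic as the mutate() call: rescale so the weights sum to targetpop.
adj_wt <- raw_wt / sum(raw_wt) * targetpop

sum(adj_wt)      # equals targetpop, up to floating-point error
adj_wt / raw_wt  # one constant factor, so relative weights are unchanged
```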

Expand Down
13 changes: 6 additions & 7 deletions 05-descriptive-analysis.Rmd
Expand Up @@ -14,8 +14,7 @@ For this chapter, load the following packages and the helper function:
library(tidyverse)
library(survey)
library(srvyr)
library(osfr)
source("helper-fun/helper-function.R")
library(srvyr.data)
library(broom)
```

Expand All @@ -40,10 +39,10 @@ We will be using data from ANES and RECS. Here is the code to create the design
```{r}
#| label: desc-anes-des
#| eval: FALSE
anes_in <- read_osf("anes_2020.rds")
targetpop <- 231592693
data(anes_2020)
anes_adjwgt <- anes_in %>%
anes_adjwgt <- anes_2020 %>%
mutate(Weight = Weight / sum(Weight) * targetpop)
anes_des <- anes_adjwgt %>%
Expand All @@ -60,9 +59,9 @@ For RECS, details are included in the RECS documentation and Chapter \@ref(c03-s
```{r}
#| label: desc-recs-des
#| eval: FALSE
recs_in <- read_osf("recs_2020.rds")
data(recs_2020)
recs_des <- recs_in %>%
recs_des <- recs_2020 %>%
as_survey_rep(
weights = NWEIGHT,
repweights = NWEIGHT1:NWEIGHT60,
Expand Down Expand Up @@ -978,7 +977,7 @@ It is estimated that American residential households spent an average of `r .elb

Briefly, we mentioned using `filter()` to subset a survey object for analysis. This operation should be done after creating the design object. In rare circumstances, subsetting data before creating the object can lead to incorrect variability estimates. This can occur if subsetting removes an entire PSU.
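The order of operations can be sketched with the `api` data from {survey} rather than RECS, so that it runs without {srvyr.data}. The design variables (`dnum`, `pw`, `fpc`) follow the usual specification for `apiclus1`; the object names are made up for this illustration. The point estimates agree, while the standard errors may differ because the second approach no longer knows about sampled clusters that contain no high schools.

```r
library(dplyr)
library(survey)
library(srvyr)

data(api, package = "survey")

# Recommended: build the design on the full sample, then subset.
des_full <- apiclus1 %>%
  as_survey_design(ids = dnum, weights = pw, fpc = fpc)

after_design <- des_full %>%
  filter(stype == "H") %>%
  summarise(avg = survey_mean(api00))

# Not recommended: subset the data first, then build the design.
before_design <- apiclus1 %>%
  filter(stype == "H") %>%
  as_survey_design(ids = dnum, weights = pw, fpc = fpc) %>%
  summarise(avg = survey_mean(api00))

# Compare the estimate and its standard error under each approach.
bind_rows(after = after_design, before = before_design, .id = "approach")
```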

Suppose we wanted estimates of the average amount spent on natural gas among housing units that use natural gas using the variable `BTUNG`^[`BTUNG` is derived from the supplier side component of the survey where `BTUNG` represents the natural gas consumption in British thermal units (BTUs) in a year]. This could be obtained by first filtering records to only include records where `BTUNG > 0` and then finding the average amount of money spent.
Suppose we wanted estimates of the average amount spent on natural gas among housing units that use natural gas using the variable `BTUNG`^[`BTUNG` is derived from the supplier side component of the survey where `BTUNG` represents the natural gas consumption in British thermal units (Btus) in a year]. This could be obtained by first filtering records to only include records where `BTUNG > 0` and then finding the average amount of money spent.

```{r}
#| label: desc-subpop
Expand Down
11 changes: 5 additions & 6 deletions 06-statistical-testing.Rmd
Expand Up @@ -14,8 +14,7 @@ For this chapter, load the following packages and the helper function:
library(tidyverse)
library(survey)
library(srvyr)
library(osfr)
source("helper-fun/helper-function.R")
library(srvyr.data)
library(broom)
library(gt)
```
Expand All @@ -24,10 +23,10 @@ We will be using data from ANES and RECS. Here is the code to create the design
```{r}
#| label: stattest-anes-des
#| eval: FALSE
anes_in <- read_osf("anes_2020.rds")
targetpop <- 231592693
data(anes_2020)
anes_adjwgt <- anes_in %>%
anes_adjwgt <- anes_2020 %>%
mutate(Weight = Weight / sum(Weight) * targetpop)
anes_des <- anes_adjwgt %>%
Expand All @@ -44,9 +43,9 @@ For RECS, details are included in the RECS documentation and Chapter \@ref(c03-s
```{r}
#| label: stattest-recs-des
#| eval: FALSE
recs_in <- read_osf("recs_2020.rds")
data(recs_2020)
recs_des <- recs_in %>%
recs_des <- recs_2020 %>%
as_survey_rep(
weights = NWEIGHT,
repweights = NWEIGHT1:NWEIGHT60,
Expand Down
17 changes: 7 additions & 10 deletions 07-modeling.Rmd
Expand Up @@ -14,19 +14,18 @@ For this chapter, load the following packages and the helper function:
library(tidyverse)
library(survey)
library(srvyr)
library(osfr)
source("helper-fun/helper-function.R")
library(srvyr.data)
library(broom)
```

We will be using data from ANES and RECS. Here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter \@ref(c04-understanding-survey-data-documentation) for more information).
```{r}
#| label: model-anes-des
#| eval: FALSE
anes_in <- read_osf("anes_2020.rds")
targetpop <- 231592693
data(anes_2020)
anes_adjwgt <- anes_in %>%
anes_adjwgt <- anes_2020 %>%
mutate(Weight = Weight / sum(Weight) * targetpop)
anes_des <- anes_adjwgt %>%
Expand All @@ -41,9 +40,7 @@ For RECS, details are included in the RECS documentation and Chapter \@ref(c03-s
```{r}
#| label: model-recs-des
#| eval: FALSE
recs_in <- read_osf("recs_2020.rds")
recs_des <- recs_in %>%
recs_des <- recs_2020 %>%
as_survey_rep(
weights = NWEIGHT,
repweights = NWEIGHT1:NWEIGHT60,
Expand Down Expand Up @@ -215,7 +212,7 @@ On RECS, we can obtain information on the square footage of homes and the electr
#| fig.alt: Hex chart where each hexagon represents a number of housing units at a point. x-axis is 'Total square footage' ranging from 0 to 7,500 and y-axis is 'Amount spent on electricity' ranging from $0 to 8,000. The trend is relatively linear and positive. A high concentration of points have square footage between 0 and 2,500 square feet as well as electricity expenditure between $0 and 2,000
#| echo: FALSE
#| warning: FALSE
recs_in %>%
recs_2020 %>%
ggplot(aes(
x = TOTSQFT_EN,
y = DOLLAREL,
Expand Down Expand Up @@ -311,7 +308,7 @@ Additionally, `augment()` can be used to predict outcomes for data not used in m
```{r}
#| label: model-predict-new-dat
add_data <-
recs_in %>% select(DOEID,
recs_2020 %>% select(DOEID,
Region,
Urbanicity,
TOTSQFT_EN,
Expand Down Expand Up @@ -649,7 +646,7 @@ tidy(earlyvote_mod) %>% arrange(p.value)

```{r}
#| label: model-ex-logistic-2
add_vote_dat <- anes_in %>%
add_vote_dat <- anes_2020 %>%
select(EarlyVote2020, Age, Education, PartyID) %>%
rbind(tibble(
EarlyVote2020 = NA,
Expand Down