Merge pull request #85 from tidy-survey-r/use-data-package

Use data package in book
tidy-survey-r · Jan 13, 2024 · b98154f
2 parents 38279f4 + dfd32da

Showing 23 changed files with 137 additions and 4,004 deletions.
diff --git a/01-introduction.Rmd b/01-introduction.Rmd
@@ -38,44 +38,18 @@ In most chapters, you'll find code that you can follow. Each of these chapters s
 
 ## Datasets used in this book {#book-datasets}
 
-We work with two key datasets throughout the book: the Residential Energy Consumption Survey [RECS -- @recs-2020-tech] and the American National Election Studies [ANES -- @debell].  To ensure that all readers can follow the examples, we have provided analytic datasets available on OSF^[https://osf.io/gzbkn/?view_only=8ca80573293b4e12b7f934a0f742b957]. 
-
-If a chapter contains data that is not part of existing packages, we have created a helper function, `read_osf()`,  for you to load it easily. We recommend saving the script below in a folder called "helper-fun" and calling the file `helper-function.R` if you would like to follow along with the prerequisites listed in the chapters that contain code. 
+We work with two key datasets throughout the book: the Residential Energy Consumption Survey [RECS -- @recs-2020-tech] and the American National Election Studies [ANES -- @debell].  To ensure that all readers can follow the examples, we have provided analytic datasets in an R package, {srvyr.data}. Install the package from GitHub using the {remotes} package.
 
 ```r
-read_osf <- function(filename){
-  #' Downloads file from OSF project
-  #' Reads in file
-  #' Deletes file from computer
-
-  osf_dl_del_later <- !dir.exists("osf_dl")
-
-  if (osf_dl_del_later) {
-    osf_dl_del_later <- TRUE
-    dir.create("osf_dl")
-  }
-
-  dat_det <-
-    osf_retrieve_node("https://osf.io/gzbkn/?view_only=8ca80573293b4e12b7f934a0f742b957") %>%
-    osf_ls_files() %>%
-    dplyr::filter(name == filename) %>%
-    osf_download(conflicts = "overwrite", path = "osf_dl")
-
-  out <- dat_det %>%
-    dplyr::pull(local_path) %>%
-    readr::read_rds()
-
-  if (osf_dl_del_later) {
-    unlink("osf_dl", recursive = TRUE)
-  } else{
-    unlink(dplyr::pull(dat_det, local_path))
-  }
-
-  return(out)
-}
+remotes::install_github("https://github.com/tidy-survey-r/srvyr.data")
 ```
 
-Here's how to use the function to read in the RECS and ANES datasets:
+To explore the provided datasets in the package, access the documentation using the `help()` command.
+
+```r
+help(package="srvyr.data")
+```
+To load the RECS and ANES datasets, start by running `library(srvyr.data)` to load the package. Then, use the `data()` command to load the datasets into the environment.
 
 ```{r}
 #| label: intro-setup
@@ -85,8 +59,7 @@ Here's how to use the function to read in the RECS and ANES datasets:
 library(tidyverse)
 library(survey)
 library(srvyr)
-library(osfr)
-source("helper-fun/helper-function.R")
+library(srvyr.data)
 ```
 
 ```{r}
@@ -95,26 +68,26 @@ source("helper-fun/helper-function.R")
 #| warning: FALSE
 #| message: FALSE
 #| cache: TRUE
-recs_in <- read_osf("recs_2020.rds")
-anes_in <- read_osf("anes_2020.rds")
+data(recs_2020)
+data(anes_2020)
 ```
 
-RECS is a study that provides energy consumption and expenditures data in American households. The Energy Information Administration funds RECS, which has been fielded 15 times between 1950 and 2020. The survey has two components - the household survey and the energy supplier survey. In 2020, the household survey was collected by web and paper questionnaires and included questions about appliances, electronics, heating, air conditioning (A/C), temperatures, water heating, lighting, respondent demographics, and energy assistance. The energy supplier survey consists of components relating to energy consumption and energy expenditure. Below is an overview of the `recs_in` data:
+RECS is a study that provides energy consumption and expenditures data in American households. The Energy Information Administration funds RECS, which has been fielded 15 times between 1950 and 2020. The survey has two components - the household survey and the energy supplier survey. In 2020, the household survey was collected by web and paper questionnaires and included questions about appliances, electronics, heating, air conditioning (A/C), temperatures, water heating, lighting, respondent demographics, and energy assistance. The energy supplier survey consists of components relating to energy consumption and energy expenditure. Below is an overview of the `recs_2020` data:
 
 ```{r}
 #| label: intro-recs
-recs_in %>% select(-starts_with("NWEIGHT"))
-recs_in %>% select(starts_with("NWEIGHT"))
+recs_2020 %>% select(-starts_with("NWEIGHT"))
+recs_2020 %>% select(starts_with("NWEIGHT"))
 ```
 
-From this output, we can see that there are `r nrow(recs_in) %>% formatC(big.mark = ",")` rows and `r ncol(recs_in) %>% formatC(big.mark = ",")` variables.  We can see that there are variables containing an ID (`DOEID`), geographic information (e.g., `Region`, `state_postal`, `Urbanicity`), along with information about the house, including the type of house (`HousingUnitType`) and when the house was built (`YearMade`). Additionally, there is a long list of weighting variables that we will use in the analysis (e.g., `NWEIGHT`, `NWEIGHT1`, ..., `NWEIGHT60`). We will discuss using these weighting variables in Chapter \@ref(c03-specifying-sample-designs). For a more detailed codebook, see Appendix \@ref(recs-cb).
+From this output, we can see that there are `r nrow(recs_2020) %>% formatC(big.mark = ",")` rows and `r ncol(recs_2020) %>% formatC(big.mark = ",")` variables.  We can see that there are variables containing an ID (`DOEID`), geographic information (e.g., `Region`, `state_postal`, `Urbanicity`), along with information about the house, including the type of house (`HousingUnitType`) and when the house was built (`YearMade`). Additionally, there is a long list of weighting variables that we will use in the analysis (e.g., `NWEIGHT`, `NWEIGHT1`, ..., `NWEIGHT60`). We will discuss using these weighting variables in Chapter \@ref(c03-specifying-sample-designs). For a more detailed codebook, see Appendix \@ref(recs-cb).
 
-The ANES is a series study that has collected data from election surveys since 1948. These surveys contain data on public opinion and voting behavior in U.S. presidential elections. The 2020 survey (the data we will be using) was fielded to individuals over the web, through live video interviewing, or with computer-assisted telephone interviewing (CATI). The survey includes questions on party affiliation, voting choice, and level of trust with the government. Here is an overview of the `anes_in` data. First, we show the variables starting with "V" followed by a number; these are the original variables. Then, we show you the remaining variables that we created based on the original data:
+The ANES is a series study that has collected data from election surveys since 1948. These surveys contain data on public opinion and voting behavior in U.S. presidential elections. The 2020 survey (the data we will be using) was fielded to individuals over the web, through live video interviewing, or with computer-assisted telephone interviewing (CATI). The survey includes questions on party affiliation, voting choice, and level of trust with the government. Here is an overview of the `anes_2020` data. First, we show the variables starting with "V" followed by a number; these are the original variables. Then, we show you the remaining variables that we created based on the original data:
 
 ```{r}
 #| label: intro-anes
-anes_in %>% select(matches("^V\\d"))
-anes_in %>% select(-matches("^V\\d"))
+anes_2020 %>% select(matches("^V\\d"))
+anes_2020 %>% select(-matches("^V\\d"))
 ```
 
-From this output we can see that there are `r nrow(anes_in) %>% formatC(big.mark = ",")` rows and `r ncol(anes_in) %>% formatC(big.mark = ",")` variables.  Most of the variables start with V20, so referencing the documentation for survey will be crucial to not get lost (see Chapter \@ref(c04-understanding-survey-data-documentation)).  We have created some more descriptive variables for you to use throughout this book, such as the age (`Age`) and gender (`Gender`) of the respondent, along with variables that represent their party affiliation (`PartyID`). Additionally, we need the variables  `Weight` and `Stratum` to analyze this data accurately.  We will discuss how to use these weighting variables in Chapters \@ref(c03-specifying-sample-designs) and \@ref(c04-understanding-survey-data-documentation). For a more detailed codebook, see Appendix \@ref(anes-cb).
+From this output we can see that there are `r nrow(anes_2020) %>% formatC(big.mark = ",")` rows and `r ncol(anes_2020) %>% formatC(big.mark = ",")` variables.  Most of the variables start with V20, so referencing the documentation for survey will be crucial to not get lost (see Chapter \@ref(c04-understanding-survey-data-documentation)).  We have created some more descriptive variables for you to use throughout this book, such as the age (`Age`) and gender (`Gender`) of the respondent, along with variables that represent their party affiliation (`PartyID`). Additionally, we need the variables  `Weight` and `Stratum` to analyze this data accurately.  We will discuss how to use these weighting variables in Chapters \@ref(c03-specifying-sample-designs) and \@ref(c04-understanding-survey-data-documentation). For a more detailed codebook, see Appendix \@ref(anes-cb).
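
Reviewer sketch (not part of the commit): since the chapter now asks readers to install {srvyr.data} from GitHub, a guarded install avoids reinstalling on every render. The helper name `ensure_installed()` is hypothetical and is shown here with a base package so the example runs without network access.

```r
# Hypothetical helper (an assumption, not from the book): run an installer
# only when the package cannot already be loaded.
ensure_installed <- function(pkg, install_fun) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install_fun()
  }
  # Report whether the package is available after the attempt.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so the installer is never called here.
already_there <- ensure_installed(
  "stats",
  function() stop("installer should not run")
)
```

For the book's data package, the same helper could presumably be called as `ensure_installed("srvyr.data", function() remotes::install_github("https://github.com/tidy-survey-r/srvyr.data"))`, reusing the install call from the diff above.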
diff --git a/03-specifying-sample-designs.Rmd b/03-specifying-sample-designs.Rmd
@@ -14,8 +14,7 @@ For this chapter, load the following packages and the helper function:
 library(tidyverse)
 library(survey)
 library(srvyr)
-library(osfr)
-source("helper-fun/helper-function.R")
+library(srvyr.data)
 ```
 
 To help explain the different types of sample designs, this chapter will use the `api` and `scd` data that comes in the {survey} package:
@@ -25,12 +24,13 @@ data(api)
 data(scd)
 ```
 
-Additionally, we have created multiple analytic datasets for use in this book on a directory on OSF^[https://osf.io/gzbkn/?view_only=8ca80573293b4e12b7f934a0f742b957]. To load any data used in the book that is not included in existing packages, we have created a helper function `read_osf()`. This chapter uses data from the Residential Energy Consumption Survey (RECS) - both 2015 and 2020, so we will use the following code to load the RECS data to use later in this chapter:
+Additionally, we have created multiple analytic datasets for use in the {srvyr.data} package, as described in Section \@ref(book-datasets). This chapter uses data from the Residential Energy Consumption Survey (RECS) - both 2015 and 2020, so we will use the following code to load the RECS data to use later in this chapter:
+
 ```{r}
 #| label: samp-setup-recs 
 #| eval: FALSE
-recs_2015_in <- read_osf("recs_2015.rds")
-recs_in <- read_osf("recs_2020.rds")
+data(recs_2015)
+data(recs_2020)
 ```
 :::
 
@@ -573,7 +573,7 @@ fay_des <- dat %>%
 
 #### Example {-} 
 
-The 2015 RECS [@recs-2015-micro] uses Fay's BRR weights with the final weight as NWEIGHT and replicate weights as BRRWT1 - BRRWT96 with $\rho=0.5$. On the file, DOEID is a unique identifier for each respondent, TOTALDOL is the total cost of energy, TOTSQFT_EN is the total square footage of the residence, and REGOINC is the Census region. We have already read in the RECS data and created a dataset called `recs_2015_in` above in the prerequisites.
+The 2015 RECS [@recs-2015-micro] uses Fay's BRR weights with the final weight as NWEIGHT and replicate weights as BRRWT1 - BRRWT96 with $\rho=0.5$. On the file, DOEID is a unique identifier for each respondent, TOTALDOL is the total cost of energy, TOTSQFT_EN is the total square footage of the residence, and REGOINC is the Census region. We have already read in the RECS data and created a dataset called `recs_2015` above in the prerequisites.
 
 To specify this design, use the following syntax:
 
@@ -583,14 +583,14 @@ To specify this design, use the following syntax:
 #| warning: FALSE
 #| message: FALSE
 #| cache: TRUE
-recs_2015_in <- read_osf("recs_2015.rds")
+data(recs_2015)
 ```
 
 
 ```{r}
 #| label: samp-des-recs-des
 #| eval: TRUE
-recs_2015_des <- recs_2015_in %>%
+recs_2015_des <- recs_2015 %>%
   as_survey_rep(weights = NWEIGHT,
                 repweights = BRRWT1:BRRWT96,
                 type = "Fay",
@@ -649,12 +649,13 @@ jkn_des <- dat %>%
 
 #### Example {-}
 
-The 2020 RECS [@recs-2020-micro] uses jackknife weights with the final weight as NWEIGHT and replicate weights as NWEIGHT1 - NWEIGHT60 with a scale of $(R-1)/R=59/60$. On the file, DOEID is a unique identifier for each respondent, TOTALDOL is the total cost of energy, TOTSQFT_EN is the total square footage of the residence, and REGOINC is the Census region. We have already read in the RECS data and created a dataset called `recs_in` above in the prerequisites.
+The 2020 RECS [@recs-2020-micro] uses jackknife weights with the final weight as NWEIGHT and replicate weights as NWEIGHT1 - NWEIGHT60 with a scale of $(R-1)/R=59/60$. On the file, DOEID is a unique identifier for each respondent, TOTALDOL is the total cost of energy, TOTSQFT_EN is the total square footage of the residence, and REGOINC is the Census region. We have already read in the RECS data and created a dataset called `recs_2020` above in the prerequisites.
 
 To specify this design, use the following syntax:
 
 ```{r}
-recs_des <- recs_in %>%
+#| label: samp-des-recs2020-des
+recs_des <- recs_2020 %>%
   as_survey_rep(
     weights = NWEIGHT,
     repweights = NWEIGHT1:NWEIGHT60,
@@ -673,7 +674,7 @@ summary(recs_des)
 #| label: samp-des-recs-des-full
 #| echo: FALSE
 # This is just for later use in book
-recs_des <- recs_in %>%
+recs_des <- recs_2020 %>%
   as_survey_rep(
     weights = NWEIGHT,
     repweights = NWEIGHT1:NWEIGHT60,

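
Reviewer sketch (not part of the commit): the jackknife scale $(R-1)/R$ mentioned in the 2020 RECS example can be illustrated in base R. The data and replicate weights below are made up, and the code shows the general replicate-variance form with deviations taken around the full-sample estimate; it is not a claim about how {survey} or {srvyr} compute it internally.

```r
# Illustrative data only: 4 units with equal full-sample weights.
y <- c(10, 20, 30, 40)
w <- c(1, 1, 1, 1)

# Delete-one jackknife replicate weights: each replicate (column) drops one
# unit and rescales the remaining units by n / (n - 1) = 4 / 3.
rep_w <- matrix(4 / 3, nrow = 4, ncol = 4)
diag(rep_w) <- 0

# Full-sample weighted mean and one weighted mean per replicate.
theta_hat <- sum(w * y) / sum(w)
theta_rep <- colSums(rep_w * y) / colSums(rep_w)

# Jackknife variance: scale (R - 1) / R times the sum of squared deviations
# of the replicate estimates from the full-sample estimate.
R <- ncol(rep_w)
jk_scale <- (R - 1) / R
var_jk <- jk_scale * sum((theta_rep - theta_hat)^2)
se_jk <- sqrt(var_jk)
```

With 60 replicates, as in the 2020 RECS, the same formula gives the scale $59/60$ quoted in the text. In this toy case the result equals the usual simple-random-sample variance of the mean, `var(y) / length(y)`.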
diff --git a/04-understanding-survey-data-documentation.Rmd b/04-understanding-survey-data-documentation.Rmd
@@ -14,16 +14,15 @@ For this chapter, load the following packages and the helper function:
 library(tidyverse)
 library(survey)
 library(srvyr)
-library(osfr)
-source("helper-fun/helper-function.R")
+library(srvyr.data)
 library(censusapi)
 ```
 
 We will be using data from ANES. Here is the code to read in the data.
 ```{r}
 #| label: understand-anes-c04
 #| eval: FALSE
-anes_in <- read_osf("anes_2020.rds")
+data(anes_2020)
 ```
 :::
 
@@ -250,7 +249,7 @@ The target population in 2020 is `r scales::comma(targetpop)`. This information
 
 ```{r}
 #| label: understand-read-anes
-anes_adjwgt <- anes_in %>%
+anes_adjwgt <- anes_2020 %>%
   mutate(Weight = V200010b / sum(V200010b) * targetpop) 
 ```
 

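
Reviewer sketch (not part of the commit): the `mutate()` call above rescales the weights so that they sum to the target population while keeping each respondent's relative share. A base R version with made-up weights (the vector `w` is purely illustrative; `targetpop` is the value used elsewhere in this diff):

```r
# Target population total used in the book's later chapters.
targetpop <- 231592693

# Made-up weights for four respondents (illustration only).
w <- c(0.5, 1.2, 0.8, 1.5)

# Divide by the current total, then scale up to the population total.
adj_w <- w / sum(w) * targetpop

# The adjusted weights now sum to the target population.
total_after <- sum(adj_w)
```

Because every weight is multiplied by the same constant, estimated proportions and means are unchanged; only estimated totals are affected.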
diff --git a/05-descriptive-analysis.Rmd b/05-descriptive-analysis.Rmd
@@ -14,8 +14,7 @@ For this chapter, load the following packages and the helper function:
 library(tidyverse)
 library(survey)
 library(srvyr)
-library(osfr)
-source("helper-fun/helper-function.R")
+library(srvyr.data)
 library(broom)
 ```
 
@@ -40,10 +39,10 @@ We will be using data from ANES and RECS. Here is the code to create the design
 ```{r}
 #| label: desc-anes-des
 #| eval: FALSE
-anes_in <- read_osf("anes_2020.rds")
 targetpop <- 231592693
+data(anes_2020)
 
-anes_adjwgt <- anes_in %>%
+anes_adjwgt <- anes_2020 %>%
   mutate(Weight = Weight / sum(Weight) * targetpop)
 
 anes_des <- anes_adjwgt %>%
@@ -60,9 +59,9 @@ For RECS, details are included in the RECS documentation and Chapter \@ref(c03-s
 ```{r}
 #| label: desc-recs-des
 #| eval: FALSE
-recs_in <- read_osf("recs_2020.rds")
+data(recs_2020)
 
-recs_des <- recs_in %>%
+recs_des <- recs_2020 %>%
   as_survey_rep(
     weights = NWEIGHT,
     repweights = NWEIGHT1:NWEIGHT60,
@@ -978,7 +977,7 @@ It is estimated that American residential households spent an average of `r .elb
 
 Briefly, we mentioned using `filter()` to subset a survey object for analysis. This operation should be done after creating the design object. In rare circumstances, subsetting data before creating the object can lead to incorrect variability estimates. This can occur if subsetting removes an entire PSU.
 
-Suppose we wanted estimates of the average amount spent on natural gas among housing units that use natural gas using the variable `BTUNG`^[`BTUNG` is derived from the supplier side component of the survey where `BTUNG` represents the natural gas consumption in British thermal units (BTUs) in a year]. This could be obtained by first filtering records to only include records where `BTUNG > 0` and then finding the average amount of money spent.
+Suppose we wanted estimates of the average amount spent on natural gas among housing units that use natural gas using the variable `BTUNG`^[`BTUNG` is derived from the supplier side component of the survey where `BTUNG` represents the natural gas consumption in British thermal units (Btus) in a year]. This could be obtained by first filtering records to only include records where `BTUNG > 0` and then finding the average amount of money spent.
 
 ```{r}
 #| label: desc-subpop

diff --git a/06-statistical-testing.Rmd b/06-statistical-testing.Rmd
@@ -14,8 +14,7 @@ For this chapter, load the following packages and the helper function:
 library(tidyverse)
 library(survey) 
 library(srvyr) 
-library(osfr)
-source("helper-fun/helper-function.R")
+library(srvyr.data)
 library(broom)
 library(gt)
 ```
@@ -24,10 +23,10 @@ We will be using data from ANES and RECS. Here is the code to create the design
 ```{r}
 #| label: stattest-anes-des
 #| eval: FALSE
-anes_in <- read_osf("anes_2020.rds")
 targetpop <- 231592693
+data(anes_2020)
 
-anes_adjwgt <- anes_in %>%
+anes_adjwgt <- anes_2020 %>%
   mutate(Weight = Weight / sum(Weight) * targetpop)
 
 anes_des <- anes_adjwgt %>%
@@ -44,9 +43,9 @@ For RECS, details are included in the RECS documentation and Chapter \@ref(c03-s
 ```{r}
 #| label: stattest-recs-des
 #| eval: FALSE
-recs_in <- read_osf("recs_2020.rds")
+data(recs_2020)
 
-recs_des <- recs_in %>%
+recs_des <- recs_2020 %>%
   as_survey_rep(
     weights = NWEIGHT,
     repweights = NWEIGHT1:NWEIGHT60,

diff --git a/07-modeling.Rmd b/07-modeling.Rmd
@@ -14,19 +14,18 @@ For this chapter, load the following packages and the helper function:
 library(tidyverse)
 library(survey) 
 library(srvyr) 
-library(osfr)
-source("helper-fun/helper-function.R")
+library(srvyr.data)
 library(broom)
 ```
 
 We will be using data from ANES and RECS. Here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter \@ref(c04-understanding-survey-data-documentation) for more information).
 ```{r}
 #| label: model-anes-des
 #| eval: FALSE
-anes_in <- read_osf("anes_2020.rds") 
 targetpop <- 231592693
+data(anes_2020)
 
-anes_adjwgt <- anes_in %>%
+anes_adjwgt <- anes_2020 %>%
   mutate(Weight = Weight / sum(Weight) * targetpop)
 
 anes_des <- anes_adjwgt %>%
@@ -41,9 +40,7 @@ For RECS, details are included in the RECS documentation and Chapter \@ref(c03-s
 ```{r}
 #| label: model-recs-des
 #| eval: FALSE
-recs_in <- read_osf("recs_2020.rds")
-
-recs_des <- recs_in %>%
+recs_des <- recs_2020 %>%
   as_survey_rep(
     weights = NWEIGHT,
     repweights = NWEIGHT1:NWEIGHT60,
@@ -215,7 +212,7 @@ On RECS, we can obtain information on the square footage of homes and the electr
 #| fig.alt: Hex chart where each hexagon represents a number of housing units at a point. x-axis is 'Total square footage' ranging from 0 to 7,500 and y-axis is 'Amount spent on electricity' ranging from $0 to 8,000. The trend is relatively linear and positive. A high concentration of points have square footage between 0 and 2,500 square feet as well as electricity expenditure between $0 and 2,000
 #| echo: FALSE
 #| warning: FALSE
-recs_in %>%
+recs_2020 %>%
   ggplot(aes(
     x = TOTSQFT_EN,
     y = DOLLAREL,
@@ -311,7 +308,7 @@ Additionally, `augment()` can be used to predict outcomes for data not used in m
 ```{r}
 #| label: model-predict-new-dat
 add_data <-
-  recs_in %>% select(DOEID,
+  recs_2020 %>% select(DOEID,
                      Region,
                      Urbanicity,
                      TOTSQFT_EN,
@@ -649,7 +646,7 @@ tidy(earlyvote_mod) %>% arrange(p.value)
 
 ```{r}
 #| label: model-ex-logistic-2
-add_vote_dat <- anes_in %>%
+add_vote_dat <- anes_2020 %>%
   select(EarlyVote2020, Age, Education, PartyID) %>%
   rbind(tibble(
     EarlyVote2020 = NA,