Commit

Incorporate pub edits c04 - get started (#146)
* Incorporate pub edits c04

* Revert subpopulations

---------

Co-authored-by: Isabella Velasquez <[email protected]>
szimmer and ivelasq authored Aug 4, 2024
1 parent 777e620 commit 6e90c7b
Showing 1 changed file with 15 additions and 15 deletions.
30 changes: 15 additions & 15 deletions 04-set-up.Rmd
@@ -53,7 +53,7 @@ library(srvyrexploR)
```

\index{gtsummary|(} \index{gt package|(}
The packages {broom}, {gt}, and {gtsummary} play a role in displaying output and creating formatted tables [@R-gt, @R-broom; @gtsummarysjo]. Install them with the provided code^[Note: {broom} is already included in the tidyverse, so no separate installation is required.]:
The packages {broom}, {gt}, and {gtsummary} play a role in displaying output and creating formatted tables [@R-gt; @R-broom; @gtsummarysjo]. Install them with the provided code^[Note: {broom} is already included in the tidyverse, so no separate installation is required.]:

```{r}
#| label: setup-install-extra
@@ -93,7 +93,7 @@ After installing this package, load it using the `library()` function:
library(censusapi)
```

Note that the {censusapi} package requires a Census API key, available for free from the [U.S. Census Bureau website](https://api.census.gov/data/key_signup.html) (refer to the package documentation for more information.) We recommend storing the Census API key in the R environment instead of directly in the code. To do this, run the `Sys.setenv()` script below, substituting the API key where it says `YOUR_API_KEY_HERE`.
Note that the {censusapi} package requires a Census API key, available for free from the [U.S. Census Bureau website](https://api.census.gov/data/key_signup.html) (refer to the package documentation for more information). We recommend storing the Census API key in the R environment instead of directly in the code. To do this, run the `Sys.setenv()` script below, substituting the API key where it says `YOUR_API_KEY_HERE`.
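
As a minimal sketch of this pattern, the key can be set for the current session and read back, so that it never appears in analysis code. The environment variable name `CENSUS_KEY` follows the {censusapi} documentation, and the key string is a placeholder rather than a real key:

```r
# Store the key for the current R session (placeholder value, not a real key)
Sys.setenv(CENSUS_KEY = "YOUR_API_KEY_HERE")

# Read it back to confirm it is available to later calls
census_key <- Sys.getenv("CENSUS_KEY")
nchar(census_key) > 0
```

A value set this way lasts only for the current session; adding the same name-value pair to the `.Renviron` file is one way to keep it across sessions.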

```{r}
#| label: setup-census-api-setup
@@ -118,10 +118,10 @@ help(package = "srvyrexploR")

This book uses two main datasets: the American National Election Studies [ANES -- @debell] and the Residential Energy Consumption Survey [RECS -- @recs-2020-tech], which are included as `anes_2020` and `recs_2020` in the {srvyrexploR} package, respectively.

#### American National Election Studies (ANES) Data {-}
#### American National Election Studies Data {-}

\index{American National Election Studies (ANES)|(}
ANES is a study that collects data from election surveys dating back to 1948. These surveys contain information on public opinion and voting behavior in U.S. presidential elections and some midterm elections^[In the United States, presidential elections are held in years divisible by four. In other even years, there are elections at the federal level for Congress, which are referred to as midterm elections as they occur at the middle of the term of a president.]. They cover topics such as party affiliation, voting choice, and level of trust in the government. The 2020 survey (data used in this book) was fielded online, through live video interviews, or via computer-assisted telephone interviews (CATI).
American National Election Studies (ANES) collect data from election surveys dating back to 1948. These surveys contain information on public opinion and voting behavior in U.S. presidential elections and some midterm elections^[In the United States, presidential elections are held in years divisible by four. In other even years, there are elections at the federal level for Congress, which are referred to as midterm elections as they occur at the middle of the term of a president.]. They cover topics such as party affiliation, voting choice, and level of trust in the government. The 2020 survey (data used in this book) was fielded online, through live video interviews, or via computer-assisted telephone interviews (CATI).

When working with new survey data, we should review the survey documentation (see Chapter \@ref(c03-survey-data-documentation)) to understand the data collection methods. The original ANES data contains variables starting with `V20` [@debell], so to assist with our analysis throughout the book, we created descriptive variable names. For example, the respondent's age is now in a variable called `Age`, and gender is in a variable called `Gender`. These descriptive variables are included in the {srvyrexploR} package. A complete overview of all variables can be found in `r if (!knitr:::is_html_output()) 'the online Appendix ('`Appendix \@ref(anes-cb)`r if (!knitr:::is_html_output()) ')'`.

@@ -137,10 +137,10 @@ anes_2020 %>%
From the output, we can see there are `r nrow(anes_2020 %>% select(-matches("^V\\d"))) %>% formatC(big.mark = ",")` rows and `r ncol(anes_2020 %>% select(-matches("^V\\d"))) %>% formatC(big.mark = ",")` variables in the ANES data. This output also indicates that most of the variables are factors (e.g., `InterviewMode`), while a few variables are in double (numeric) format (e.g., `Age`).
\index{American National Election Studies (ANES)|)}

#### Residential Energy Consumption Survey (RECS) Data {-}
#### Residential Energy Consumption Survey Data {-}

\index{Residential Energy Consumption Survey (RECS)|(}
RECS is a study that measures energy consumption and expenditure in American households. Funded by the Energy Information Administration, RECS data are collected through interviews with household members and energy suppliers. These interviews take place in person, over the phone, via mail, and on the web, with modes changing over time. The survey has been fielded 14 times between 1950 and 2020. It includes questions about appliances, electronics, heating, air conditioning (A/C), temperatures, water heating, lighting, energy bills, respondent demographics, and energy assistance.
Residential Energy Consumption Survey (RECS) is a study that measures energy consumption and expenditure in American households. Funded by the Energy Information Administration, RECS data are collected through interviews with household members and energy suppliers. These interviews take place in person, over the phone, via mail, and on the web, with modes changing over time. The survey has been fielded 14 times between 1950 and 2020. It includes questions about appliances, electronics, heating, air conditioning (A/C), temperatures, water heating, lighting, energy bills, respondent demographics, and energy assistance.

We should read the survey documentation (see Chapter \@ref(c03-survey-data-documentation)) to understand how the data were collected and implemented. An overview of all variables can be found in `r if (!knitr:::is_html_output()) 'the online Appendix ('`Appendix \@ref(recs-cb)`r if (!knitr:::is_html_output()) ')'`.

@@ -161,9 +161,9 @@ From the output, we can see that the RECS data has `r nrow(recs_2020 %>% select(

In this section, we provide details on how to code the design object for the ANES and RECS data used in the book. However, we only provide a high-level overview to get readers started. For a deeper understanding of creating design objects for a variety of sampling designs, see Chapter \@ref(c10-sample-designs-replicate-weights).

While we recommend conducting exploratory data analysis on the original data before diving into complex survey analysis (see Chapter \@ref(c12-recommendations)), the actual survey survey analysis and inference should be performed with the survey design objects instead of the original survey data. For example, the ANES data is called `anes_2020`. If we create a survey design object called `anes_des`, our survey survey analyses should begin with `anes_des` and not `anes_2020`. Using the survey design object ensures that our calculations appropriately account for the details of the survey design.
While we recommend conducting exploratory data analysis on the original data before diving into complex survey analysis (see Chapter \@ref(c12-recommendations)), the actual survey analysis and inference should be performed with the survey design objects instead of the original survey data. For example, the ANES data is called `anes_2020`. If we create a survey design object called `anes_des`, our survey analyses should begin with `anes_des` and not `anes_2020`. Using the survey design object ensures that our calculations appropriately account for the details of the survey design.

#### American National Election Studies (ANES) Design Object {-}
#### American National Election Studies Design Object {-}

\index{American National Election Studies (ANES)|(} \index{Current Population Survey (CPS)|(}
The ANES documentation [@debell] details the sampling and weighting implications for analyzing the survey data. From this documentation and as noted in Chapter \@ref(c03-survey-data-documentation), the 2020 ANES data are weighted to the sample, not the population. To make generalizations about the population, we need to weigh the data against the full population count. The ANES methodology recommends using the Current Population Survey (CPS) to determine the number of non-institutional U.S. citizens aged 18 or older living in the 50 U.S. states or D.C. in March 2020.
@@ -234,7 +234,7 @@ anes_adjwgt <- anes_2020 %>%
mutate(Weight = V200010b / sum(V200010b) * targetpop)
```
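
To see what the weight adjustment above does, here is a small sketch using base R only. The weights and population total are hypothetical toy values standing in for `V200010b` and `targetpop`; they are not taken from the ANES data:

```r
# Toy sampling weights standing in for V200010b (hypothetical values)
toy_wgt <- c(0.5, 1.2, 0.8, 1.5)

# Hypothetical population total standing in for targetpop
toy_targetpop <- 1000

# Rescale so each weight keeps its relative size
# while the weights sum to the population total
toy_adj <- toy_wgt / sum(toy_wgt) * toy_targetpop

sum(toy_adj)
```

The ratio between any two adjusted weights is the same as between the original weights; only the total changes.
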
\index{Stratified sampling|(} \index{Functions in srvyr!as\_survey\_design|(} \index{as\_survey\_design|see {Functions in srvyr}} \index{Clustered sampling|(} \index{Primary sampling unit|(} \index{PSU|see {Primary sampling unit}} \index{Cluster|see {Primary sampling unit}}
Once we have the adjusted weights, we can refer to the rest of the documentation to create the survey design. The documentation indicates that the study uses a stratified cluster sampling design. Therefore, we need to specify variables for `strata` and `ids` (cluster) and fill in the `nest` argument. The documentation provides guidance on which strata and cluster variables to use depending on whether we are analyzing pre- or post-election data. In this book, we analyze post-election data, so we need to use the post-election weight `V200010b`, strata variable `V200010d`, and PSU/cluster variable `V200010c`. Additionally, we set `nest=TRUE` to ensure the clusters are nested within the strata. \index{Weighting|)}
Once we have the adjusted weights, we can refer to the rest of the documentation to create the survey design. The documentation indicates that the study uses a stratified cluster sampling design. Therefore, we need to specify variables for `strata` and `ids` (cluster) and fill in the `nest` argument. The documentation provides guidance on which strata and cluster variables to use depending on whether we are analyzing pre- or post-election data. In this book, we analyze post-election data, so we need to use the post-election weight `V200010b`, strata variable `V200010d`, and Primary Sampling Unit (PSU)/cluster variable `V200010c`. Additionally, we set `nest=TRUE` to ensure the clusters are nested within the strata. \index{Weighting|)}

```{r}
#| label: setup-anes-des
@@ -247,14 +247,14 @@ anes_des <- anes_adjwgt %>%
anes_des
```

We can examine this new object to learn more about the survey design, such that the ANES is a "Stratified 1 - level Cluster Sampling design (with replacement) With (101) clusters". Additionally, the output displays the sampling variables and then lists the remaining variables in the dataset. This design object is used throughout this book to conduct survey analysis. \index{Stratified sampling|)} \index{Functions in srvyr!as\_survey\_design|)} \index{American National Election Studies (ANES)|)} \index{Clustered sampling|)} \index{Primary sampling unit|)}
We can examine this new object to learn more about the survey design, such that the ANES is a "Stratified 1 - level Cluster Sampling design (with replacement) With (101) clusters." Additionally, the output displays the sampling variables and then lists the remaining variables in the dataset. This design object is used throughout this book to conduct survey analysis. \index{Stratified sampling|)} \index{Functions in srvyr!as\_survey\_design|)} \index{American National Election Studies (ANES)|)} \index{Clustered sampling|)} \index{Primary sampling unit|)}
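
The role of `nest = TRUE` can be illustrated with a small sketch. The data frame `toy` and its column names are hypothetical and far simpler than the ANES data; the point is only that the PSU labels repeat across strata:

```r
library(srvyr)

# Hypothetical toy data: two strata, each with PSUs labeled 1 and 2
toy <- data.frame(
  stratum = c(1, 1, 1, 1, 2, 2, 2, 2),
  psu     = c(1, 1, 2, 2, 1, 1, 2, 2),
  wt      = c(10, 10, 12, 12, 8, 8, 9, 9),
  y       = c(3, 5, 4, 6, 7, 9, 8, 10)
)

# PSU labels repeat across strata, so nest = TRUE tells the design to treat
# PSU 1 in stratum 1 and PSU 1 in stratum 2 as different clusters
toy_des <- toy %>%
  as_survey_design(
    weights = wt,
    strata = stratum,
    ids = psu,
    nest = TRUE
  )

# Weighted mean of y with a design-based standard error
toy_est <- toy_des %>%
  summarize(y_mean = survey_mean(y))

toy_est
```

With labels reused like this, we would expect the design to be rejected without `nest = TRUE`, because the clusters would not appear to be nested within strata.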

#### Residential Energy Consumption Survey (RECS) Design Object {-}
#### Residential Energy Consumption Survey Design Object {-}

\index{Replicate weights|(} \index{Replicate weights!Jackknife} \index{Jackknife|see {Replicate weights}} \index{Residential Energy Consumption Survey (RECS)|(}
The RECS documentation [@recs-2020-tech] provides information on the survey's sampling and weighting implications for analysis. The documentation shows the 2020 RECS uses Jackknife weights, where the main analytic weight is `NWEIGHT`, and the Jackknife weights are `NWEIGHT1`-`NWEIGHT60`. We can specify these in the ``weights`` and ``repweights`` arguments in the survey design object code, respectively.

With Jackknife weights, additional information is required: `type`, `scale`, and `mse`. Chapter \@ref(c10-sample-designs-replicate-weights) goes into depth about each of these arguments, but to quickly get started, the RECS documentation lets us know that `type=JK1`, `scale=59/60`, and `mse = TRUE`. \index{Functions in srvyr!as\_survey\_rep|(}We can use the following code to create the survey design object: \index{as\_survey\_rep|see {Functions in srvyr}} \index{Replicate weights!Jackknife}
With Jackknife weights, additional information is required: `type`, `scale`, and `mse`. Chapter \@ref(c10-sample-designs-replicate-weights) discusses in depth each of these arguments; but to quickly get started, the RECS documentation lets us know that `type=JK1`, `scale=59/60`, and `mse = TRUE`. \index{Functions in srvyr!as\_survey\_rep|(}We can use the following code to create the survey design object: \index{as\_survey\_rep|see {Functions in srvyr}} \index{Replicate weights!Jackknife}

```{r}
#| label: setup-recs-des
@@ -271,7 +271,7 @@ recs_des <- recs_2020 %>%
recs_des
```

Viewing this new object provides information about the survey design, such that RECS is an "Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances". Additionally, the output shows the sampling variables (`NWEIGHT1`-`NWEIGHT60`) and then lists the remaining variables in the dataset. This design object is used throughout this book to conduct survey analysis. \index{Functions in srvyr!as\_survey\_rep|)} \index{Replicate weights|)} \index{Residential Energy Consumption Survey (RECS)|)}
Viewing this new object provides information about the survey design, such that RECS is an "Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances." Additionally, the output shows the sampling variables (`NWEIGHT1`-`NWEIGHT60`) and then lists the remaining variables in the dataset. This design object is used throughout this book to conduct survey analysis. \index{Functions in srvyr!as\_survey\_rep|)} \index{Replicate weights|)} \index{Residential Energy Consumption Survey (RECS)|)}
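
To give some intuition for what `type`, `scale`, and `mse` control, the following base R sketch computes a JK1 variance by hand on hypothetical toy data. It uses five units and five replicates rather than the 60 RECS replicates, and it is an illustration of the general JK1 calculation, not the code the {survey} package runs internally:

```r
# Hypothetical toy sample with equal full-sample weights
y <- c(2, 4, 6, 8, 10)
n <- length(y)
full_wgt <- rep(1, n)

# JK1 replicate weights: replicate r drops unit r
# and scales the remaining weights by n / (n - 1)
rep_wgts <- sapply(seq_len(n), function(r) {
  w <- full_wgt * n / (n - 1)
  w[r] <- 0
  w
})

# Full-sample estimate and one estimate per replicate
theta_full <- weighted.mean(y, full_wgt)
theta_reps <- apply(rep_wgts, 2, function(w) weighted.mean(y, w))

# JK1 scale is (R - 1) / R for R replicates (59/60 for RECS; 4/5 here)
jk_scale <- (n - 1) / n

# mse = TRUE centers the squared deviations on the full-sample estimate
jk_var <- jk_scale * sum((theta_reps - theta_full)^2)
sqrt(jk_var)
```

For this equal-weight toy case, the result matches the familiar variance of a sample mean, `var(y) / n`.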

## Survey analysis process {#survey-analysis-process}

@@ -287,15 +287,15 @@ There is a general process for analyzing data to create estimates with {srvyr} p

4. Within `summarize()`, specify variables to calculate, including means, totals, proportions, quantiles, and more

In Section \@ref(setup-des-obj), we follow Step #1 to create the survey design objects for the ANES and RECS data featured in this book. Additional details on how to create design objects can be found in Chapter \@ref(c10-sample-designs-replicate-weights). Then, once we have the design object, we can filter the data to any subpopulation of interest (if needed). It is important to filter the data **after** creating the design object. This ensures that we are accurately accounting for the survey design in our calculations. Finally, we can use `group_by()`, `summarize()`, and other functions from the {survey} and {srvyr} packages to analyze the survey data by estimating means, totals, and so on.
In Section \@ref(setup-des-obj), we follow Step 1 to create the survey design objects for the ANES and RECS data featured in this book. Additional details on how to create design objects can be found in Chapter \@ref(c10-sample-designs-replicate-weights). Then, once we have the design object, we can filter the data to any subpopulation of interest (if needed). It is important to filter the data **after** creating the design object. This ensures that we are accurately accounting for the survey design in our calculations. Finally, we can use `group_by()`, `summarize()`, and other functions from the {survey} and {srvyr} packages to analyze the survey data by estimating means, totals, and so on.
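
The order described above (design object first, then filter, then group and summarize) can be sketched with the `apistrat` data from the {survey} package. The object names `apistrat_des` and `api_est` and the choice of filter are illustrative assumptions, not code from the book:

```r
library(dplyr)
library(srvyr)

# Load the example API datasets that ship with {survey}
data(api, package = "survey")

# Create the design object before any subsetting
apistrat_des <- apistrat %>%
  as_survey_design(strata = stype, weights = pw)

# Filter to a subpopulation, group, and summarize on the design object
api_est <- apistrat_des %>%
  filter(sch.wide == "Yes") %>%
  group_by(stype) %>%
  summarize(api00_mean = survey_mean(api00))

api_est
```

Filtering `apistrat` before calling `as_survey_design()` would discard information about the full sample that the variance calculation relies on, which is why the filter is applied to `apistrat_des` instead.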

\index{Survey analysis process|)}\index{Design object|)}

## Similarities between {dplyr} and {srvyr} functions {#similarities-dplyr-srvyr}

The {dplyr} package from the tidyverse offers flexible and intuitive functions for data wrangling [@R-dplyr]. One of the major advantages of using {srvyr} is that it applies {dplyr}-like syntax to the {survey} package [@R-srvyr]. We can use pipes, such as `%>%` from the {magrittr} package, to specify a survey design object, apply a function, and then feed that output into the next function's first argument [@R-magrittr]. Functions follow the 'tidy' convention of snake_case function names.

To help explain the similarities between {dplyr} functions and {srvyr} functions, we use the `towny` dataset from the {gt} package and `apistrat` data that comes in the {survey} package. The `towny` dataset provides population data for municipalities in Ontario, Canada on Census years between 1996 and 2021. Taking a look at `towny` with `dplyr::glimpse()`, we can see the dataset has `r ncol(towny)` columns with a mix of character and numeric data.
To help explain the similarities between {dplyr} functions and {srvyr} functions, we use the `towny` dataset from the {gt} package and `apistrat` data that comes in the {survey} package. The `towny` dataset provides population data for municipalities in Ontario, Canada on census years between 1996 and 2021. Taking a look at `towny` with `dplyr::glimpse()`, we can see the dataset has `r ncol(towny)` columns with a mix of character and numeric data.

```{r}
#| label: setup-towny-surveydata