Commit

Merge pull request #87 from tidy-survey-r/reorg-jan24
Re-organize chapters and create sections
ivelasq authored Jan 13, 2024
2 parents 1630606 + 1c68a94 commit b76871a
Showing 19 changed files with 86 additions and 78 deletions.
20 changes: 13 additions & 7 deletions 01-introduction.Rmd
@@ -1,5 +1,7 @@
\mainmatter

# (PART) Intro {-}

# Introduction {#c01-intro}

Surveys are used to gather information about a population. They are frequently used by researchers, governments, and businesses to better understand public opinion and behavior. For example, a non-profit group might be interested in public opinion on a given topic, government agencies may be interested in behaviors to inform policy, or companies may survey potential consumers about what they want from their products. Developing and fielding a survey is a method to gather information about topics that interest us.
@@ -25,18 +27,22 @@ There is one limitation to the {srvyr} package: it doesn't fully incorporate the
This book will cover many aspects of survey design and analysis, from understanding how to create design effects to conducting descriptive analysis, statistical tests, and models. Additionally, we emphasize best practices in coding and presenting results. Throughout this book, we use real-world data and present practical examples to help you gain proficiency in survey analysis. While we provide a brief overview of survey methodology and statistical theory, this book is not intended to be the sole resource for these topics. We reference other materials throughout the book and encourage readers to seek those out for more information. Below is a summary of each chapter:

- **Chapter \@ref(c02-overview-surveys)**: An overview of surveys and the process of designing surveys. This is only an overview, and we include many references for more in-depth knowledge.
- **Chapter \@ref(c03-specifying-sample-designs)**: Specifying sampling designs. Descriptions of common sampling designs, when they are used, the math behind the mean and standard error estimates, how to specify the designs in R, and examples using real data.
- **Chapter \@ref(c04-understanding-survey-data-documentation)**: Understanding survey documentation. How to read the various components of survey documentation, working with missing data, and finding the documentation.
- **Chapter \@ref(c03-understanding-survey-data-documentation)**: Understanding survey documentation. How to read the various components of survey documentation, working with missing data, and finding the documentation.
- **Chapter \@ref(c04-set-up)**: TO-DO
- **Chapter \@ref(c05-descriptive-analysis)**: Descriptive analyses. Calculating point estimates along with their standard errors, confidence intervals, and design effects.
- **Chapter \@ref(c06-statistical-testing)**: Statistical testing. Testing for differences between groups, including comparisons of means and proportions as well as goodness of fit tests, tests of independence, and tests of homogeneity.
- **Chapter \@ref(c07-modeling)**: Modeling. Linear regression, ANOVA, and logistic regression.
- **Chapter \@ref(c08-communicating-results)**: Communicating results. Describing results, reproducibility, making publishable tables and graphs, and helpful functions.
- **Chapter \@ref(c09-ncvs-vignette)**: National Crime Victimization Survey Vignette. A vignette on how to analyze data from the NCVS, a survey in the U.S. that collects information on crimes and their characteristics. This illustrates an analysis that requires multiple files to calculate victimization rates.
- **Chapter \@ref(c10-ambarom-vignette)**: AmericasBarometer Vignette. A vignette on how to analyze data from the AmericasBarometer, a survey of attitudes, evaluations, experiences, and behavior in countries in the Western Hemisphere. This includes how to make choropleth maps with survey estimates.
- **Chapter \@ref(c09-reprex-data)**: TO-DO
- **Chapter \@ref(c10-specifying-sample-designs)**: Specifying sampling designs. Descriptions of common sampling designs, when they are used, the math behind the mean and standard error estimates, how to specify the designs in R, and examples using real data.
- **Chapter \@ref(c11-missing-data)**: TO-DO
- **Chapter \@ref(c12-pitfalls)**: TO-DO
- **Chapter \@ref(c13-ncvs-vignette)**: National Crime Victimization Survey Vignette. A vignette on how to analyze data from the NCVS, a survey in the U.S. that collects information on crimes and their characteristics. This illustrates an analysis that requires multiple files to calculate victimization rates.
- **Chapter \@ref(c14-ambarom-vignette)**: AmericasBarometer Vignette. A vignette on how to analyze data from the AmericasBarometer, a survey of attitudes, evaluations, experiences, and behavior in countries in the Western Hemisphere. This includes how to make choropleth maps with survey estimates.

In most chapters, you'll find code that you can follow. Each of these chapters starts with a "set-up" section. This section will include the code needed to load the necessary packages and datasets in the chapter. We then provide the main idea of the chapter and examples on how to use the functions. Most chapters end with exercises to work through. Solutions to the exercises can be found in the Appendix.

## Datasets used in this book {#book-datasets}
## Datasets used in this book

We work with two key datasets throughout the book: the Residential Energy Consumption Survey [RECS -- @recs-2020-tech] and the American National Election Studies [ANES -- @debell]. To ensure that all readers can follow the examples, we have provided analytic datasets in an R package, {srvyr.data}. Install the package from GitHub using the {remotes} package.

@@ -80,7 +86,7 @@ recs_2020 %>% select(-starts_with("NWEIGHT"))
recs_2020 %>% select(starts_with("NWEIGHT"))
```

From this output, we can see that there are `r nrow(recs_2020) %>% formatC(big.mark = ",")` rows and `r ncol(recs_2020) %>% formatC(big.mark = ",")` variables. We can see that there are variables containing an ID (`DOEID`), geographic information (e.g., `Region`, `state_postal`, `Urbanicity`), along with information about the house, including the type of house (`HousingUnitType`) and when the house was built (`YearMade`). Additionally, there is a long list of weighting variables that we will use in the analysis (e.g., `NWEIGHT`, `NWEIGHT1`, ..., `NWEIGHT60`). We will discuss using these weighting variables in Chapter \@ref(c03-specifying-sample-designs). For a more detailed codebook, see Appendix \@ref(recs-cb).
From this output, we can see that there are `r nrow(recs_2020) %>% formatC(big.mark = ",")` rows and `r ncol(recs_2020) %>% formatC(big.mark = ",")` variables. We can see that there are variables containing an ID (`DOEID`), geographic information (e.g., `Region`, `state_postal`, `Urbanicity`), along with information about the house, including the type of house (`HousingUnitType`) and when the house was built (`YearMade`). Additionally, there is a long list of weighting variables that we will use in the analysis (e.g., `NWEIGHT`, `NWEIGHT1`, ..., `NWEIGHT60`). We will discuss using these weighting variables in Chapter \@ref(c10-specifying-sample-designs). For a more detailed codebook, see Appendix \@ref(recs-cb).

The ANES is a series study that has collected data from election surveys since 1948. These surveys contain data on public opinion and voting behavior in U.S. presidential elections. The 2020 survey (the data we will be using) was fielded to individuals over the web, through live video interviewing, or with computer-assisted telephone interviewing (CATI). The survey includes questions on party affiliation, voting choice, and level of trust with the government. Here is an overview of the `anes_2020` data. First, we show the variables starting with "V" followed by a number; these are the original variables. Then, we show you the remaining variables that we created based on the original data:

@@ -90,4 +96,4 @@ anes_2020 %>% select(matches("^V\\d"))
anes_2020 %>% select(-matches("^V\\d"))
```

From this output we can see that there are `r nrow(anes_2020) %>% formatC(big.mark = ",")` rows and `r ncol(anes_2020) %>% formatC(big.mark = ",")` variables. Most of the variables start with V20, so referencing the survey documentation will be crucial to avoid getting lost (see Chapter \@ref(c04-understanding-survey-data-documentation)). We have created some more descriptive variables for you to use throughout this book, such as the age (`Age`) and gender (`Gender`) of the respondent, along with variables that represent their party affiliation (`PartyID`). Additionally, we need the variables `Weight` and `Stratum` to analyze this data accurately. We will discuss how to use these weighting variables in Chapters \@ref(c03-specifying-sample-designs) and \@ref(c04-understanding-survey-data-documentation). For a more detailed codebook, see Appendix \@ref(anes-cb).
From this output we can see that there are `r nrow(anes_2020) %>% formatC(big.mark = ",")` rows and `r ncol(anes_2020) %>% formatC(big.mark = ",")` variables. Most of the variables start with V20, so referencing the survey documentation will be crucial to avoid getting lost (see Chapter \@ref(c03-understanding-survey-data-documentation)). We have created some more descriptive variables for you to use throughout this book, such as the age (`Age`) and gender (`Gender`) of the respondent, along with variables that represent their party affiliation (`PartyID`). Additionally, we need the variables `Weight` and `Stratum` to analyze this data accurately. We will discuss how to use these weighting variables in Chapters \@ref(c10-specifying-sample-designs) and \@ref(c03-understanding-survey-data-documentation). For a more detailed codebook, see Appendix \@ref(anes-cb).
10 changes: 5 additions & 5 deletions 02-overview-surveys.Rmd
@@ -19,7 +19,7 @@ Generally, survey researchers consider there to be seven main sources of error t

- Representation
- **Coverage Error**: A mismatch between the *population of interest* (also known as the target population or study population) and the sampling frame.
- **Sampling Error**: Error produced when selecting a *sample*, the subset of the population, from the *sampling frame*, the list from which the sample is drawn (there is no sampling error if conducting a census). This error is due to randomization, and we discuss how to quantify this error in Chapter \@ref(c03-specifying-sample-designs).
- **Sampling Error**: Error produced when selecting a *sample*, the subset of the population, from the *sampling frame*, the list from which the sample is drawn (there is no sampling error if conducting a census). This error is due to randomization, and we discuss how to quantify this error in Chapter \@ref(c10-specifying-sample-designs).
- **Nonresponse Error**: Differences between those who responded and did not respond to the survey (unit nonresponse) or a given question (item nonresponse).
- **Adjustment Error**: Error introduced during post-survey statistical adjustments.
- Measurement
@@ -49,7 +49,7 @@ From formulating methodologies to choosing an appropriate sampling frame, the st

The set or group we want to survey is known as the *population of interest*. The population of interest could be broad, such as “all adults age 18+ living in the U.S.” or a specific population based on a particular characteristic or location. For example, we may want to know about "adults aged 18-24 who live in North Carolina" or "eligible voters living in Illinois." However, a *sampling frame* with contact information is needed to survey individuals in these populations of interest. If researchers are looking at eligible voters, the sampling frame could be the voting registry for a given state or area. The sampling frame is likely imperfect for more broad target populations like all adults in the United States. In these cases, researchers may choose to use a sampling frame of mailing addresses and send the survey to households, or they may choose to use random digit dialing (RDD) and call random phone numbers (that may or may not be assigned, connected, and working). These imperfect sampling frames can result in *coverage error* where there is a mismatch between the target population and the list of individuals researchers can select. For example, if a researcher is looking to obtain estimates for "all adults aged 18+ living in the U.S.", a sampling frame of mailing addresses will miss specific types of individuals, such as the homeless, transient populations, and incarcerated individuals. Additionally, many households have more than one adult living there, so researchers would need to consider how to get a specific individual to fill out the survey (called within household selection) or adjust the target population to report on "U.S. households" instead of "individuals."

Once the researchers have selected the sampling frame, the next step is determining how to select individuals for the survey. In rare cases, researchers may conduct a *census* and survey everyone on the sampling frame. However, the ability to implement a questionnaire at that scale is something only some can do (e.g., government censuses). Instead, researchers typically choose to sample individuals and use weights to estimate numbers in the target population. They can use a variety of different sampling methods, and more information on these can be found in Chapter \@ref(c03-specifying-sample-designs). This decision of which sampling method to use impacts *sampling error* and can be accounted for in weighting.
Once the researchers have selected the sampling frame, the next step is determining how to select individuals for the survey. In rare cases, researchers may conduct a *census* and survey everyone on the sampling frame. However, the ability to implement a questionnaire at that scale is something only some can do (e.g., government censuses). Instead, researchers typically choose to sample individuals and use weights to estimate numbers in the target population. They can use a variety of different sampling methods, and more information on these can be found in Chapter \@ref(c10-specifying-sample-designs). This decision of which sampling method to use impacts *sampling error* and can be accounted for in weighting.

#### Example: Number of Pets in a Household {.unnumbered #overview-design-sampdesign-ex}

@@ -146,7 +146,7 @@ knitr::include_graphics(path="images/PetExample2.png")

Researchers can then code the responses from the open-ended box and get a better understanding of the respondent's choice of preferred pet. Interpreting this question becomes easier as researchers no longer need to qualify the results with the choices provided.

This is a simple example of how the presentation of the question and options can impact the findings. For more complex topics and questions, researchers must thoroughly consider how to mitigate any impacts from the presentation, formatting, wording, and other aspects. As survey analysts, reviewing not only the data but also the wording of the questions is crucial to ensure the results are presented in a manner consistent with the question asked. Chapter \@ref(c04-understanding-survey-data-documentation) provides further details on how to review existing survey documentation to inform our analyses.
This is a simple example of how the presentation of the question and options can impact the findings. For more complex topics and questions, researchers must thoroughly consider how to mitigate any impacts from the presentation, formatting, wording, and other aspects. As survey analysts, reviewing not only the data but also the wording of the questions is crucial to ensure the results are presented in a manner consistent with the question asked. Chapter \@ref(c03-understanding-survey-data-documentation) provides further details on how to review existing survey documentation to inform our analyses.

## Data Collection {#overview-datacollection}

@@ -170,7 +170,7 @@ Let's return to the question we created to ask about [animal preference](#overvi

### Weighting {#overview-post-weighting}

We can address some of the error sources identified in the previous sections using *weighting*. For example, weights can address coverage, sampling, and nonresponse errors. Many published surveys include an "analysis weight" variable that combines these adjustments. However, weighting itself can also introduce *adjustment error*, so researchers need to balance which types of errors should be corrected with weighting. The construction of weights is outside the scope of this book, and researchers should reference other materials if interested in constructing their own [@Valliant2018weights]. Instead, this book assumes the survey has been completed, weights are constructed, and data is available to users. We walk users through how to read the documentation (Chapter \@ref(c04-understanding-survey-data-documentation)) and work with the data and analysis weights provided to analyze and interpret survey results correctly.
We can address some of the error sources identified in the previous sections using *weighting*. For example, weights can address coverage, sampling, and nonresponse errors. Many published surveys include an "analysis weight" variable that combines these adjustments. However, weighting itself can also introduce *adjustment error*, so researchers need to balance which types of errors should be corrected with weighting. The construction of weights is outside the scope of this book, and researchers should reference other materials if interested in constructing their own [@Valliant2018weights]. Instead, this book assumes the survey has been completed, weights are constructed, and data is available to users. We walk users through how to read the documentation (Chapter \@ref(c03-understanding-survey-data-documentation)) and work with the data and analysis weights provided to analyze and interpret survey results correctly.
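
As a minimal sketch of why analysis weights matter, assume a hypothetical sample of five households. The `pets` data frame, its column names, and all values are made up for illustration and do not come from the book's datasets. Each weight is read as the number of population households a respondent represents, so the weighted mean can differ noticeably from the unweighted one:

```r
# Hypothetical data: number of pets and an analysis weight per household.
pets <- data.frame(
  n_pets = c(0, 1, 2, 0, 3),
  weight = c(100, 50, 50, 200, 100)
)

# Unweighted mean treats every respondent as representing one household.
unweighted_mean <- mean(pets$n_pets)

# Weighted mean scales each response by the households it represents.
weighted_mean <- weighted.mean(pets$n_pets, w = pets$weight)

# Estimated total number of pets in the (hypothetical) population.
est_total_pets <- sum(pets$n_pets * pets$weight)
```

This base R sketch only produces point estimates. It says nothing about standard errors, which depend on the sample design and are the reason to use a design-aware package such as {srvyr}.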

#### Example: Number of Pets in a Household {.unnumbered #overview-post-weighting-ex}

@@ -185,7 +185,7 @@ Before data is released publicly, researchers need to ensure that individual res

Documentation is a critical step of the survey life cycle. Researchers systematically record all the details, decisions, procedures, and methodologies to ensure transparency, reproducibility, and the overall quality of survey research.

Proper documentation allows analysts to understand, reproduce, and evaluate the study's methods and findings. Chapter \@ref(c04-understanding-survey-data-documentation) dives into how analysts should use survey data documentation.
Proper documentation allows analysts to understand, reproduce, and evaluate the study's methods and findings. Chapter \@ref(c03-understanding-survey-data-documentation) dives into how analysts should use survey data documentation.

## Post-survey data analysis and reporting
