diff --git a/01-introduction.Rmd b/01-introduction.Rmd index c522ecd0..d1f0b03d 100644 --- a/01-introduction.Rmd +++ b/01-introduction.Rmd @@ -1,5 +1,7 @@ \mainmatter +# (PART) Intro {-} + # Introduction {#c01-intro} Surveys are used to gather information about a population. They are frequently used by researchers, governments, and businesses to better understand public opinion and behavior. For example, a non-profit group might be interested in public opinion on a given topic, government agencies may be interested in behaviors to inform policy, or companies may survey potential consumers about what they want from their products. Developing and fielding a survey is a method to gather information about topics that interest us. @@ -25,18 +27,22 @@ There is one limitation to the {srvyr} package: it doesn't fully incorporate the This book will cover many aspects of survey design and analysis, from understanding how to create design effects to conducting descriptive analysis, statistical tests, and models. Additionally, we emphasize best practices in coding and presenting results. Throughout this book, we use real-world data and present practical examples to help you gain proficiency in survey analysis. While we provide a brief overview of survey methodology and statistical theory, this book is not intended to be the sole resource for these topics. We reference other materials throughout the book and encourage readers to seek those out for more information. Below is a summary of each chapter: - **Chapter \@ref(c02-overview-surveys)**: An overview of surveys and the process of designing surveys. This is only an overview, and we include many references for more in-depth knowledge. -- **Chapter \@ref(c03-specifying-sample-designs)**: Specifying sampling designs. Descriptions of common sampling designs, when they are used, the math behind the mean and standard error estimates, how to specify the designs in R, and examples using real data. 
-- **Chapter \@ref(c04-understanding-survey-data-documentation)**: Understanding survey documentation. How to read the various components of survey documentation, working with missing data, and finding the documentation. +- **Chapter \@ref(c03-understanding-survey-data-documentation)**: Understanding survey documentation. How to read the various components of survey documentation, working with missing data, and finding the documentation. +- **Chapter \@ref(c04-set-up)**: TO-DO - **Chapter \@ref(c05-descriptive-analysis)**: Descriptive analyses. Calculating point estimates along with their standard errors, confidence intervals, and design effects. - **Chapter \@ref(c06-statistical-testing)**: Statistical testing. Testing for differences between groups, including comparisons of means and proportions as well as goodness of fit tests, tests of independence, and tests of homogeneity. - **Chapter \@ref(c07-modeling)**: Modeling. Linear regression, ANOVA, and logistic regression. - **Chapter \@ref(c08-communicating-results)**: Communicating results. Describing results, reproducibility, making publishable tables and graphs, and helpful functions. -- **Chapter \@ref(c09-ncvs-vignette)**: National Crime Victimization Survey Vignette. A vignette on how to analyze data from the NCVS, a survey in the U.S. that collects information on crimes and their characteristics. This illustrates an analysis that requires multiple files to calculate victimization rates. -- **Chapter \@ref(c10-ambarom-vignette)**: AmericasBarometer Vignette. A vignette on how to analyze data from the AmericasBarometer, a survey of attitudes, evaluations, experiences, and behavior in countries in the Western Hemisphere. This includes how to make choropleth maps with survey estimates. +- **Chapter \@ref(c09-reprex-data)**: TO-DO +- **Chapter \@ref(c10-specifying-sample-designs)**: Specifying sampling designs. 
Descriptions of common sampling designs, when they are used, the math behind the mean and standard error estimates, how to specify the designs in R, and examples using real data. +- **Chapter \@ref(c11-missing-data)**: TO-DO +- **Chapter \@ref(c12-pitfalls)**: TO-DO +- **Chapter \@ref(c13-ncvs-vignette)**: National Crime Victimization Survey Vignette. A vignette on how to analyze data from the NCVS, a survey in the U.S. that collects information on crimes and their characteristics. This illustrates an analysis that requires multiple files to calculate victimization rates. +- **Chapter \@ref(c14-ambarom-vignette)**: AmericasBarometer Vignette. A vignette on how to analyze data from the AmericasBarometer, a survey of attitudes, evaluations, experiences, and behavior in countries in the Western Hemisphere. This includes how to make choropleth maps with survey estimates. In most chapters, you'll find code that you can follow. Each of these chapters starts with a "set-up" section. This section will include the code needed to load the necessary packages and datasets in the chapter. We then provide the main idea of the chapter and examples on how to use the functions. Most chapters end with exercises to work through. Solutions to the exercises can be found in the Appendix. -## Datasets used in this book {#book-datasets} +## Datasets used in this book We work with two key datasets throughout the book: the Residential Energy Consumption Survey [RECS -- @recs-2020-tech] and the American National Election Studies [ANES -- @debell]. To ensure that all readers can follow the examples, we have provided analytic datasets in an R package, {srvyr.data}. Install the package from GitHub using the {remotes} package. @@ -80,7 +86,7 @@ recs_2020 %>% select(-starts_with("NWEIGHT")) recs_2020 %>% select(starts_with("NWEIGHT")) ``` -From this output, we can see that there are `r nrow(recs_2020) %>% formatC(big.mark = ",")` rows and `r ncol(recs_2020) %>% formatC(big.mark = ",")` variables. 
We can see that there are variables containing an ID (`DOEID`), geographic information (e.g., `Region`, `state_postal`, `Urbanicity`), along with information about the house, including the type of house (`HousingUnitType`) and when the house was built (`YearMade`). Additionally, there is a long list of weighting variables that we will use in the analysis (e.g., `NWEIGHT`, `NWEIGHT1`, ..., `NWEIGHT60`). We will discuss using these weighting variables in Chapter \@ref(c03-specifying-sample-designs). For a more detailed codebook, see Appendix \@ref(recs-cb). +From this output, we can see that there are `r nrow(recs_2020) %>% formatC(big.mark = ",")` rows and `r ncol(recs_2020) %>% formatC(big.mark = ",")` variables. We can see that there are variables containing an ID (`DOEID`), geographic information (e.g., `Region`, `state_postal`, `Urbanicity`), along with information about the house, including the type of house (`HousingUnitType`) and when the house was built (`YearMade`). Additionally, there is a long list of weighting variables that we will use in the analysis (e.g., `NWEIGHT`, `NWEIGHT1`, ..., `NWEIGHT60`). We will discuss using these weighting variables in Chapter \@ref(c10-specifying-sample-designs). For a more detailed codebook, see Appendix \@ref(recs-cb). The ANES is a series study that has collected data from election surveys since 1948. These surveys contain data on public opinion and voting behavior in U.S. presidential elections. The 2020 survey (the data we will be using) was fielded to individuals over the web, through live video interviewing, or over with computer-assisted telephone interviewing (CATI). The survey includes questions on party affiliation, voting choice, and level of trust with the government. Here is an overview of the `anes_2020` data. First, we show the variables starting with "V" followed by a number; these are the original variables. 
Then, we show you the remaining variables that we created based on the original data: @@ -90,4 +96,4 @@ anes_2020 %>% select(matches("^V\\d")) anes_2020 %>% select(-matches("^V\\d")) ``` -From this output we can see that there are `r nrow(anes_2020) %>% formatC(big.mark = ",")` rows and `r ncol(anes_2020) %>% formatC(big.mark = ",")` variables. Most of the variables start with V20, so referencing the documentation for survey will be crucial to not get lost (see Chapter \@ref(c04-understanding-survey-data-documentation)). We have created some more descriptive variables for you to use throughout this book, such as the age (`Age`) and gender (`Gender`) of the respondent, along with variables that represent their party affiliation (`PartyID`). Additionally, we need the variables `Weight` and `Stratum` to analyze this data accurately. We will discuss how to use these weighting variables in Chapters \@ref(c03-specifying-sample-designs) and \@ref(c04-understanding-survey-data-documentation). For a more detailed codebook, see Appendix \@ref(anes-cb). +From this output, we can see that there are `r nrow(anes_2020) %>% formatC(big.mark = ",")` rows and `r ncol(anes_2020) %>% formatC(big.mark = ",")` variables. Most of the variables start with V20, so referencing the survey documentation will be crucial to not get lost (see Chapter \@ref(c03-understanding-survey-data-documentation)). We have created some more descriptive variables for you to use throughout this book, such as the age (`Age`) and gender (`Gender`) of the respondent, along with variables that represent their party affiliation (`PartyID`). Additionally, we need the variables `Weight` and `Stratum` to analyze this data accurately. We will discuss how to use these weighting variables in Chapters \@ref(c03-understanding-survey-data-documentation) and \@ref(c10-specifying-sample-designs). For a more detailed codebook, see Appendix \@ref(anes-cb). 
diff --git a/02-overview-surveys.Rmd b/02-overview-surveys.Rmd index 85649c59..533bf1c5 100644 --- a/02-overview-surveys.Rmd +++ b/02-overview-surveys.Rmd @@ -19,7 +19,7 @@ Generally, survey researchers consider there to be seven main sources of error t - Representation - **Coverage Error**: A mismatch between the *population of interest* (also known as the target population or study population) and the sampling frame. - - **Sampling Error**: Error produced when selecting a *sample*, the subset of the population, from the *sampling frame*, the list from which the sample is drawn (there is no sampling error if conducting a census). This error is due to randomization, and we discuss how to quantify this error in Chapter \@ref(c03-specifying-sample-designs). + - **Sampling Error**: Error produced when selecting a *sample*, the subset of the population, from the *sampling frame*, the list from which the sample is drawn (there is no sampling error if conducting a census). This error is due to randomization, and we discuss how to quantify this error in Chapter \@ref(c10-specifying-sample-designs). - **Nonresponse Error**: Differences between those who responded and did not respond to the survey (unit nonresponse) or a given question (item nonresponse). - **Adjustment Error**: Error introduced during post-survey statistical adjustments. - Measurement @@ -49,7 +49,7 @@ From formulating methodologies to choosing an appropriate sampling frame, the st The set or group we want to survey is known as the *population of interest*. The population of interest could be broad, such as “all adults age 18+ living in the U.S.” or a specific population based on a particular characteristic or location. For example, we may want to know about "adults aged 18-24 who live in North Carolina" or "eligible voters living in Illinois." However, a *sampling frame* with contact information is needed to survey individuals in these populations of interest. 
If researchers are looking at eligible voters, the sampling frame could be the voting registry for a given state or area. The sampling frame is likely imperfect for more broad target populations like all adults in the United States. In these cases, researchers may choose to use a sampling frame of mailing addresses and send the survey to households, or they may choose to use random digit dialing (RDD) and call random phone numbers (that may or may not be assigned, connected, and working). These imperfect sampling frames can result in *coverage error* where there is a mismatch between the target population and the list of individuals researchers can select. For example, if a researcher is looking to obtain estimates for "all adults aged 18+ living in the U.S.", a sampling frame of mailing addresses will miss specific types of individuals, such as the homeless, transient populations, and incarcerated individuals. Additionally, many households have more than one adult living there, so researchers would need to consider how to get a specific individual to fill out the survey (called within household selection) or adjust the target population to report on "U.S. households" instead of "individuals." -Once the researchers have selected the sampling frame, the next step is determining how to select individuals for the survey. In rare cases, researchers may conduct a *census* and survey everyone on the sampling frame. However, the ability to implement a questionnaire at that scale is something only some can do (e.g., government censuses). Instead, researchers typically choose to sample individuals and use weights to estimate numbers in the target population. They can use a variety of different sampling methods, and more information on these can be found in Chapter \@ref(c03-specifying-sample-designs). This decision of which sampling method to use impacts *sampling error* and can be accounted for in weighting. 
+Once the researchers have selected the sampling frame, the next step is determining how to select individuals for the survey. In rare cases, researchers may conduct a *census* and survey everyone on the sampling frame. However, the ability to implement a questionnaire at that scale is something only some can do (e.g., government censuses). Instead, researchers typically choose to sample individuals and use weights to estimate numbers in the target population. They can use a variety of different sampling methods, and more information on these can be found in Chapter \@ref(c10-specifying-sample-designs). This decision of which sampling method to use impacts *sampling error* and can be accounted for in weighting. #### Example: Number of Pets in a Household {.unnumbered #overview-design-sampdesign-ex} @@ -146,7 +146,7 @@ knitr::include_graphics(path="images/PetExample2.png") Researchers can then code the responses from the open-ended box and get a better understanding of the respondent's choice of preferred pet. Interpreting this question becomes easier as researchers no longer need to qualify the results with the choices provided. -This is a simple example of how the presentation of the question and options can impact the findings. For more complex topics and questions, researchers must thoroughly consider how to mitigate any impacts from the presentation, formatting, wording, and other aspects. As survey analysts, reviewing not only the data but also the wording of the questions is crucial to ensure the results are presented in a manner consistent with the question asked. Chapter \@ref(c04-understanding-survey-data-documentation) provides further details on how to review existing survey documentation to inform our analyses. +This is a simple example of how the presentation of the question and options can impact the findings. 
For more complex topics and questions, researchers must thoroughly consider how to mitigate any impacts from the presentation, formatting, wording, and other aspects. As survey analysts, reviewing not only the data but also the wording of the questions is crucial to ensure the results are presented in a manner consistent with the question asked. Chapter \@ref(c03-understanding-survey-data-documentation) provides further details on how to review existing survey documentation to inform our analyses. ## Data Collection {#overview-datacollection} @@ -170,7 +170,7 @@ Let's return to the question we created to ask about [animal preference](#overvi ### Weighting {#overview-post-weighting} -We can address some of the error sources identified in the previous sections using *weighting*. For example, weights can address coverage, sampling, and nonresponse errors. Many published surveys include an "analysis weight" variable that combines these adjustments. However, weighting itself can also introduce *adjustment error*, so researchers need to balance which types of errors should be corrected with weighting. The construction of weights is outside the scope of this book, and researchers should reference other materials if interested in constructing their own [@Valliant2018weights]. Instead, this book assumes the survey has been completed, weights are constructed, and data is available to users. We walk users through how to read the documentation (Chapter \@ref(c04-understanding-survey-data-documentation)) and work with the data and analysis weights provided to analyze and interpret survey results correctly. +We can address some of the error sources identified in the previous sections using *weighting*. For example, weights can address coverage, sampling, and nonresponse errors. Many published surveys include an "analysis weight" variable that combines these adjustments. 
However, weighting itself can also introduce *adjustment error*, so researchers need to balance which types of errors should be corrected with weighting. The construction of weights is outside the scope of this book, and researchers should reference other materials if interested in constructing their own [@Valliant2018weights]. Instead, this book assumes the survey has been completed, weights are constructed, and data is available to users. We walk users through how to read the documentation (Chapter \@ref(c03-understanding-survey-data-documentation)) and work with the data and analysis weights provided to analyze and interpret survey results correctly. #### Example: Number of Pets in a Household {.unnumbered #overview-post-weighting-ex} @@ -185,7 +185,7 @@ Before data is released publicly, researchers need to ensure that individual res Documentation is a critical step of the survey life cycle. Researchers systematically record all the details, decisions, procedures, and methodologies to ensure transparency, reproducibility, and the overall quality of survey research. -Proper documentation allows analysts to understand, reproduce, and evaluate the study's methods and findings. Chapter \@ref(c04-understanding-survey-data-documentation) dives into how analysts should use survey data documentation. +Proper documentation allows analysts to understand, reproduce, and evaluate the study's methods and findings. Chapter \@ref(c03-understanding-survey-data-documentation) dives into how analysts should use survey data documentation. 
## Post-survey data analysis and reporting diff --git a/04-understanding-survey-data-documentation.Rmd b/03-understanding-survey-data-documentation.Rmd similarity index 98% rename from 04-understanding-survey-data-documentation.Rmd rename to 03-understanding-survey-data-documentation.Rmd index b2821541..3c692e53 100644 --- a/04-understanding-survey-data-documentation.Rmd +++ b/03-understanding-survey-data-documentation.Rmd @@ -1,4 +1,4 @@ -# Understanding Survey Data Documentation {#c04-understanding-survey-data-documentation} +# Understanding Survey Data Documentation {#c03-understanding-survey-data-documentation} ::: {.prereqbox-header} `r if (knitr:::is_html_output()) '### Prerequisites {- #prereq4}'` @@ -18,7 +18,8 @@ library(srvyr.data) library(censusapi) ``` -We will be using data from ANES. Here is the code to read in the data. +We created multiple analytic datasets for use in this book in an R package, as described in Section \@ref(datasets-used-in-this-book). For this chapter, we will be using data from ANES. Here is the code to read in the data: + ```{r} #| label: understand-anes-c04 #| eval: FALSE @@ -28,7 +29,7 @@ data(anes_2020) ## Introduction -Before diving into survey analysis, it's crucial to review the survey documentation thoroughly. The documentation includes technical guides, questionnaires, codebooks, errata, and other useful resources. By taking the time to review these materials, we can gain a comprehensive understanding of the survey data (including research and design decisions discussed in Chapters \@ref(c02-overview-surveys) and \@ref(c03-specifying-sample-designs)) and effectively conduct our analysis. +Before diving into survey analysis, it's crucial to review the survey documentation thoroughly. The documentation includes technical guides, questionnaires, codebooks, errata, and other useful resources. 
By taking the time to review these materials, we can gain a comprehensive understanding of the survey data (including research and design decisions discussed in Chapters \@ref(c02-overview-surveys) and \@ref(c10-specifying-sample-designs)) and effectively conduct our analysis. Survey documentation can vary in organization, type, and ease of use. The information may be stored in any format - PDFs, Excel spreadsheets, Word documents, etc. Some surveys save different documentation together, such as providing a single document containing both the codebook and the questionnaire. Others keep them in separate files. Despite these differences, it is important to know what kind of information is available in each documentation type and what to focus on in each one. @@ -39,7 +40,7 @@ The technical documentation, also known as user guides or methodology/analysis g * **Introduction:** The introduction orients us to the survey. This section provides the project's background, the study's purpose, and the main research questions. * **Study design:** The study design section describes how researchers prepared and administered the survey. - * **Sample:** The sample section describes how researchers selected cases, any sampling error that occurred, and the limitations of the sample. This section can contain recommendations on how to use sampling weights. Look for weight information, whether the survey design contains strata, clusters/PSUs, or replicate weights. Also look for population sizes, finite population correction, or replicate weight scaling information. The sample documentation is critical in successfully running our analysis, and more detail on sample designs is available in Chapter \@ref(c03-specifying-sample-designs). + * **Sample:** The sample section describes how researchers selected cases, any sampling error that occurred, and the limitations of the sample. This section can contain recommendations on how to use sampling weights. 
Look for weight information, whether the survey design contains strata, clusters/PSUs, or replicate weights. Also look for population sizes, finite population correction, or replicate weight scaling information. The sample documentation is critical in successfully running our analysis, and more detail on sample designs is available in Chapter \@ref(c10-specifying-sample-designs). The technical documentation may include other helpful information. Some technical documentation includes syntax for SAS, SUDAAN, Stata, and/or R, meaning we do not have to create this code from scratch. diff --git a/04-set-up.Rmd b/04-set-up.Rmd new file mode 100644 index 00000000..acd1a902 --- /dev/null +++ b/04-set-up.Rmd @@ -0,0 +1,17 @@ +# (PART) Analysis {-} + +# Set-up {#c04-set-up} + +```{r} +data(recs_2020) + +recs_des <- recs_2020 %>% + as_survey_rep( + weights = NWEIGHT, + repweights = NWEIGHT1:NWEIGHT60, + type = "JK1", + scale = 59/60, + mse = TRUE + ) +``` + diff --git a/05-descriptive-analysis.Rmd b/05-descriptive-analysis.Rmd index 0c862ae9..de907f1a 100644 --- a/05-descriptive-analysis.Rmd +++ b/05-descriptive-analysis.Rmd @@ -34,7 +34,7 @@ dstrata <- apistrat %>% as_survey_design(strata = stype, weights = pw) ``` -We will be using data from ANES and RECS. Here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter \@ref(c04-understanding-survey-data-documentation) for more information). +We will be using data from ANES and RECS. Here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter \@ref(c03-understanding-survey-data-documentation) for more information). 
```{r} #| label: desc-anes-des @@ -54,7 +54,7 @@ anes_des <- anes_adjwgt %>% ) ``` -For RECS, details are included in the RECS documentation and Chapter \@ref(c03-specifying-sample-designs). +For RECS, details are included in the RECS documentation and Chapter \@ref(c10-specifying-sample-designs). ```{r} #| label: desc-recs-des @@ -83,7 +83,7 @@ We will discuss many different types of descriptive analyses in this chapter, bu * Discrete data: variables that are counted or measured, such as number of children * Continuous data, variables that are measured and whose values can lie anywhere on an interval, such as weight -When we pull the data from surveys into R, the data will be listed as character, factor, numeric, or logical/Boolean. They will not clearly indicate the type of survey data (e.g., ordinal). When working with survey data, researchers need to properly use the questionnaire and codebook along with the data (see Chapter \@ref(c04-understanding-survey-data-documentation)) to understand what the values for each variable represent. For example, our survey data may represent categorical variables (e.g., the North, South, East, and West regions of the United States) using numeric codes (e.g., 1, 2, 3, and 4). Though this is a categorical variable from the survey, this variable might be automatically read as numeric values when we import our data into R. This can lead to the common mistake of applying a mean function to categorical values instead of a proportion function. Choosing appropriate measures is crucial to reach valid conclusions. Different variable types have distinct properties and levels of measurement, and we cannot apply all measures to all variables. +When we pull the data from surveys into R, the data will be listed as character, factor, numeric, or logical/Boolean. They will not clearly indicate the type of survey data (e.g., ordinal). 
When working with survey data, researchers need to properly use the questionnaire and codebook along with the data (see Chapter \@ref(c03-understanding-survey-data-documentation)) to understand what the values for each variable represent. For example, our survey data may represent categorical variables (e.g., the North, South, East, and West regions of the United States) using numeric codes (e.g., 1, 2, 3, and 4). Though this is a categorical variable from the survey, this variable might be automatically read as numeric values when we import our data into R. This can lead to the common mistake of applying a mean function to categorical values instead of a proportion function. Choosing appropriate measures is crucial to reach valid conclusions. Different variable types have distinct properties and levels of measurement, and we cannot apply all measures to all variables. This chapter will discuss how to analyze *measures of distribution* (e.g., cross-tabulations), *central tendency* (e.g., means), *relationship* (e.g., ratios), and *dispersion* (e.g., standard). Measures of distribution describe how often an event or response occurs. These measures include counts and totals. Measures of central tendency find the central (or average) responses. These measures include means and medians. Measures of relationship describe how variables relate to each other. These measures include correlations and ratios. Measures of dispersion describe how data spreads around the central tendency for continuous variables. These measures include standard deviations and variances. 
Specifically, we will cover the following functions from the {srvyr} package: @@ -95,14 +95,14 @@ This chapter will discuss how to analyze *measures of distribution* (e.g., cross * Ratios (`survey_ratio()`) * Variances and standard deviations (`survey_var()` and `survey_sd()`) -To incorporate each of these survey functions, recall the general process for survey estimation from Chapter \@ref(c03-specifying-sample-designs): +To incorporate each of these survey functions, follow the general process for survey estimation described in Chapter \@ref(c10-specifying-sample-designs): 1. Create a `tbl_svy` object using `srvyr::as_survey_design()` or `srvyr::as_survey_rep()`. 2. Subset the data for subpopulations using `srvyr::filter()`, if needed. 3. Specify domains of analysis using `srvyr::group_by()`, if needed. 4. Analyze the data with survey-specific functions. -We have already discussed how to create the survey design objects in Chapter \@ref(c03-specifying-sample-designs), and the code for creating these for the two datasets used in this chapter is provided in the Prerequisites box at the beginning of this chapter. We will apply the survey functions covered in this chapter in Step 4. To look at the data by different subgroups, we can choose to filter and/or group the data. It is very important that we filter and group the data only *after* creating the design object. This is necessary to ensure that the results accurately account for the survey design. Removing any data before creating the survey design object means that the data for those cases is not included in the survey design information and estimations of the variance. +Chapter \@ref(c10-specifying-sample-designs) discusses how to create the survey design objects, and the code for creating these for the two datasets used in this chapter is provided in the Prerequisites box at the beginning of this chapter. We will apply the survey functions covered in this chapter in Step 4. 
To look at the data by different subgroups, we can choose to filter and/or group the data. It is very important that we filter and group the data only *after* creating the design object. This is necessary to ensure that the results accurately account for the survey design. Removing any data before creating the survey design object means that the data for those cases is not included in the survey design information and estimations of the variance. ## Similarities Between {dplyr} and {srvyr} Functions diff --git a/06-statistical-testing.Rmd b/06-statistical-testing.Rmd index 2f49d5e5..b151e032 100644 --- a/06-statistical-testing.Rmd +++ b/06-statistical-testing.Rmd @@ -19,7 +19,7 @@ library(broom) library(gt) ``` -We will be using data from ANES and RECS. Here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter \@ref(c04-understanding-survey-data-documentation) for more information). +We will be using data from ANES and RECS. Here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter \@ref(c03-understanding-survey-data-documentation) for more information). ```{r} #| label: stattest-anes-des #| eval: FALSE @@ -38,7 +38,7 @@ anes_des <- anes_adjwgt %>% ) ``` -For RECS, details are included in the RECS documentation and Chapter \@ref(c03-specifying-sample-designs). +For RECS, details are included in the RECS documentation and Chapter \@ref(c10-specifying-sample-designs). ```{r} #| label: stattest-recs-des @@ -113,7 +113,7 @@ For comparing two estimates, this is called a *two-sample t-test* and we can set Two sample t-tests can also be *paired* or *unpaired*. 
If the data come from two different populations (e.g., North versus South), the t-test run will be an *unpaired* or *independent samples* t-test. *Paired* t-tests occur when the data come from the same population. This is commonly seen with data from the same population in two different time periods (e.g., before and after an intervention). -The difference between t-tests with non-survey data and survey data is based on the underlying variance estimation difference. Chapter \@ref(c03-specifying-sample-designs) provides a detailed overview of the math behind the mean and sampling error calculations for various sample designs. The functions in the {survey} package will account for these nuances, provided the design object is correctly defined. +The difference between t-tests with non-survey data and survey data is based on the underlying variance estimation difference. Chapter \@ref(c10-specifying-sample-designs) provides a detailed overview of the math behind the mean and sampling error calculations for various sample designs. The functions in the {survey} package will account for these nuances, provided the design object is correctly defined. ### Syntax {#stattest-ttest-syntax} @@ -150,7 +150,7 @@ The `formula` argument can take several different forms depending on what we are - **3+ level grouping variable:** `var ~ groupVar == level`, where `var` is the measure of interest, `groupVar` is the categorical variable, and `level` is the category level to isolate. For example, we could test if the test scores in one classroom differed from all other classrooms. b. **Paired:** `var_1 - var_2 ~ 0`, where `var_1` is the first variable of interest and `var_2` is the second variable of interest. For example, we could test if test scores on a subject differed between the start and the end of a course. -The `na.rm` argument defaults to `FALSE`, which means if any data is missing, the t-test will not compute. 
Throughout this chapter, we will always set `na.rm = TRUE`, but before analyzing the survey data, review the notes provided in Chapter \@ref(c04-understanding-survey-data-documentation) to better understand how to handle missing data. +The `na.rm` argument defaults to `FALSE`, which means if any data is missing, the t-test will not compute. Throughout this chapter, we will always set `na.rm = TRUE`, but before analyzing the survey data, review the notes provided in Chapter \@ref(c03-understanding-survey-data-documentation) to better understand how to handle missing data. Let's walk through a few examples using the ANES and RECS data. @@ -320,7 +320,7 @@ Third, **tests of homogeneity** are used to compare two distributions to see if - $H_0: p_{1a} = p_{1b}, ~ p_{2a} = p_{2b}, ~ ..., ~ p_{ka} = p_{kb}$ where $p_{ia}$ is the observed proportion of category $i$ for subgroup $a$, $p_{ib}$ is the observed proportion of category $i$ for subgroup $b$ and $k$ is the number of categories - $H_A:$ at least one category of $p_{ia}$ does not match $p_{ib}$ -As with t-tests, the difference between using $\chi^2$ tests with non-survey data and survey data is based on the underlying variance estimation. The functions in the {survey} package will account for these nuances, provided the design object is correctly defined. For basic variance estimation formulas for different survey design types, refer to Chapter \@ref(c03-specifying-sample-designs). +As with t-tests, the difference between using $\chi^2$ tests with non-survey data and survey data is based on the underlying variance estimation. The functions in the {survey} package will account for these nuances, provided the design object is correctly defined. For basic variance estimation formulas for different survey design types, refer to Chapter \@ref(c10-specifying-sample-designs).
### Syntax {#stattest-chi-syntax} @@ -375,7 +375,7 @@ For tests of independence, the `Wald` and `adjWald` are recommended as they prov The formula argument will always be one-sided, unlike the `svyttest()` function. The two variables of interest should be included with a plus sign: `formula = ~ var_1 + var_2`. As with the `svygofchisq()` function, the variables entered into the formula should be formatted as either a factor or a character. -Additionally, as with the t-test function, both `svygofchisq()` and `svychisq()` have the `na.rm` argument. If any data is missing, the $\chi^2$ tests will assume that `NA` is a category and include it in the calculation. Throughout this chapter, we will always set `na.rm = TRUE`, but before analyzing the survey data, review the notes provided in Chapter \@ref(c04-understanding-survey-data-documentation) to better understand how to handle missing data. +Additionally, as with the t-test function, both `svygofchisq()` and `svychisq()` have the `na.rm` argument. If any data is missing, the $\chi^2$ tests will assume that `NA` is a category and include it in the calculation. Throughout this chapter, we will always set `na.rm = TRUE`, but before analyzing the survey data, review the notes provided in Chapter \@ref(c03-understanding-survey-data-documentation) to better understand how to handle missing data. ### Examples {#stattest-chi-examples} diff --git a/07-modeling.Rmd b/07-modeling.Rmd index a1e19a18..a9be1dc8 100644 --- a/07-modeling.Rmd +++ b/07-modeling.Rmd @@ -18,7 +18,7 @@ library(srvyr.data) library(broom) ``` -We will be using data from ANES and RECS. Here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter \@ref(c04-understanding-survey-data-documentation) for more information). +We will be using data from ANES and RECS. 
Here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter \@ref(c03-understanding-survey-data-documentation) for more information). ```{r} #| label: model-anes-des #| eval: FALSE @@ -36,7 +36,7 @@ anes_des <- anes_adjwgt %>% nest = TRUE) ``` -For RECS, details are included in the RECS documentation and Chapter \@ref(c03-specifying-sample-designs). +For RECS, details are included in the RECS documentation and Chapter \@ref(c10-specifying-sample-designs). ```{r} #| label: model-recs-des #| eval: FALSE @@ -132,7 +132,7 @@ The arguments are: * `na.action`: handling of missing data * `df.resid`: degrees of freedom for Wald tests (optional) - defaults to using `degf(design)-(g-1)` where $g$ is the number of groups -The function `svyglm()` does not have the design as the first argument so the dot (`.`) notation is used to pass it with a pipe (see Chapter \@ref(c06-statistical-testing) for more details). The default for missing data is `na.omit`, this means that we are removing all records with any missing data in either predictors or outcomes from analyses. There are other options for handling missing data and we recommend looking at the help documentation for `na.omit` (run `help(na.omit)` or `?na.omit`) for more information on options to use for `na.action`. For a discussion of how to handle missing data see Chapter \@ref(c04-understanding-survey-data-documentation). +The function `svyglm()` does not have the design as the first argument so the dot (`.`) notation is used to pass it with a pipe (see Chapter \@ref(c06-statistical-testing) for more details). The default for missing data is `na.omit`, this means that we are removing all records with any missing data in either predictors or outcomes from analyses. 
There are other options for handling missing data and we recommend looking at the help documentation for `na.omit` (run `help(na.omit)` or `?na.omit`) for more information on options to use for `na.action`. For a discussion of how to handle missing data see Chapter \@ref(c03-understanding-survey-data-documentation). ### Example diff --git a/08-communicating-results.Rmd b/08-communicating-results.Rmd index 546cddb5..84ab6009 100644 --- a/08-communicating-results.Rmd +++ b/08-communicating-results.Rmd @@ -1,3 +1,5 @@ +# (PART) Reporting {-} + # Communicating Results {#c08-communicating-results} ::: {.prereqbox-header} @@ -19,7 +21,7 @@ library(gt) library(gtsummary) ``` -We will be using data from ANES. Here is the code to create the ANES design object that will be used throughout the chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter \@ref(c04-understanding-survey-data-documentation) for more information). +We will be using data from ANES. Here is the code to create the ANES design object that will be used throughout the chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter \@ref(c03-understanding-survey-data-documentation) for more information). ```{r} #| label: results-anes-des @@ -53,7 +55,7 @@ Before beginning any dissemination of results, it is important to understand the ## Describing Results through Text -As researchers, we often focus on the data itself; communicating the results effectively can be a forgotten step. However, all of the steps that we as researchers need to consider when conducting analyses must also be communicated to our audience. The first few chapters of this book (Chapters \@ref(c02-overview-surveys) through \@ref(c04-understanding-survey-data-documentation)) provided insights into what we need to consider when conducting analyses. 
Each of these topics should also be considered when presenting results to others. +As researchers, we often focus on the data itself; communicating the results effectively can be a forgotten step. However, all of the steps that we as researchers need to consider when conducting analyses must also be communicated to our audience. The first few chapters of this book (Chapters \@ref(c02-overview-surveys) through \@ref(c03-understanding-survey-data-documentation)) provided insights into what we need to consider when conducting analyses. Each of these topics should also be considered when presenting results to others. ### Methodology diff --git a/09-reproducible-data.Rmd b/09-reproducible-data.Rmd new file mode 100644 index 00000000..d330259f --- /dev/null +++ b/09-reproducible-data.Rmd @@ -0,0 +1 @@ +# Reproducible data {#c09-reprex-data} diff --git a/03-specifying-sample-designs.Rmd b/10-specifying-sample-designs.Rmd similarity index 98% rename from 03-specifying-sample-designs.Rmd rename to 10-specifying-sample-designs.Rmd index 7a1738a3..d309d390 100644 --- a/03-specifying-sample-designs.Rmd +++ b/10-specifying-sample-designs.Rmd @@ -1,4 +1,6 @@ -# Specifying sample designs and replicate weights in {srvyr} {#c03-specifying-sample-designs} +# (PART) Real life data {-} + +# Specifying sample designs and replicate weights in {srvyr} {#c10-specifying-sample-designs} ::: {.prereqbox-header} `r if (knitr:::is_html_output()) '### Prerequisites {- #prereq3}'` @@ -24,7 +26,7 @@ data(api) data(scd) ``` -Additionally, we have created multiple analytic datasets for use in the {srvyr.data} package, as described in \@ref{book-datasets}. 
This chapter uses data from the Residential Energy Consumption Survey (RECS) - both 2015 and 2020, so we will use the following code to load the RECS data to use later in this chapter: +This chapter uses data from the Residential Energy Consumption Survey (RECS) - both 2015 and 2020, so we will use the following code to load the RECS data: ```{r} #| label: samp-setup-recs @@ -503,7 +505,7 @@ brr_des <- dat %>% mse = TRUE) ``` -Typically, the replicate weights sum to a value similar to the main weight, as they are both supposed to provide population estimates. Rarely, an alternative method will be used where the replicate weights have values of 0 or 2 in the case of BRR weights. This would be indicated in the documentation (see Section \@ref(und-surv-doc) and Chapter \@ref(c04-understanding-survey-data-documentation) for more information on how to understand the provided documentation). In this case, the replicate weights are not combined, and the option `combined_weights = FALSE` should be indicated, as the default value for this argument is TRUE. This specific syntax is shown below: +Typically, the replicate weights sum to a value similar to the main weight, as they are both supposed to provide population estimates. Rarely, an alternative method will be used where the replicate weights have values of 0 or 2 in the case of BRR weights. This would be indicated in the documentation (see Section \@ref(und-surv-doc) and Chapter \@ref(c03-understanding-survey-data-documentation) for more information on how to understand the provided documentation). In this case, the replicate weights are not combined, and the option `combined_weights = FALSE` should be indicated, as the default value for this argument is TRUE. 
This specific syntax is shown below: ```r brr_des <- dat %>% @@ -550,7 +552,7 @@ Note that `combined_weights` was specified as `FALSE` because these weights are ### Fay's BRR Method -Fay's BRR method for replicate weights is similar to the BRR method in that it uses a Hadamard matrix to construct replicate weights. However, rather than deleting PSUs for each replicate, with Fay's BRR half of the PSUs have a replicate weight which is the main weight multiplied by $\rho$, and the other half have the main weight multiplied by $(2-\rho)$ where $0 \le \rho < 1$. Note that when $\rho=0$, this is equivalent to the standard BRR weights, and as $\rho$ becomes closer to 1, this method is more similar to jackknife discussed in the next section. To obtain the value of $\rho$, it is necessary to read the documentation (see Section \@ref(und-surv-doc) and Chapter \@ref(c04-understanding-survey-data-documentation)). +Fay's BRR method for replicate weights is similar to the BRR method in that it uses a Hadamard matrix to construct replicate weights. However, rather than deleting PSUs for each replicate, with Fay's BRR half of the PSUs have a replicate weight which is the main weight multiplied by $\rho$, and the other half have the main weight multiplied by $(2-\rho)$ where $0 \le \rho < 1$. Note that when $\rho=0$, this is equivalent to the standard BRR weights, and as $\rho$ becomes closer to 1, this method is more similar to jackknife discussed in the next section. To obtain the value of $\rho$, it is necessary to read the documentation (see Section \@ref(und-surv-doc) and Chapter \@ref(c03-understanding-survey-data-documentation)). 
#### The math {-} @@ -670,21 +672,6 @@ recs_des summary(recs_des) ``` -```{r} -#| label: samp-des-recs-des-full -#| echo: FALSE -# This is just for later use in book -recs_des <- recs_2020 %>% - as_survey_rep( - weights = NWEIGHT, - repweights = NWEIGHT1:NWEIGHT60, - type = "JK1", - scale = 59/60, - mse = TRUE - ) -``` - - When printing the design object or looking at the summary, the replicate weight type is re-iterated as `Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances`, and the variables are included. No weight or probability summary is included. ### Bootstrap Method @@ -773,7 +760,7 @@ As with other replicate design objects, when printing the object or looking at t SRS, stratified, and clustered designs are the backbone of sampling designs, and the features are often combined in one design. Additionally, rather than using SRS for selection, other sampling mechanisms are commonly used, such as probability proportional to size (PPS), systematic sampling, or selection with unequal probabilities, which are briefly described here. In PPS sampling, a size measure is constructed for each unit (e.g., the population of the PSU or the number of occupied housing units) and then units with larger size measures are more likely to be sampled. Systematic sampling is commonly used to ensure representation across a population. Units are sorted by a feature and then every $k$ units are selected from a random start point so the sample is spread across the population. In addition to PPS, other unequal probabilities of selection may be used. For example, in a study of establishments (e.g., businesses or public institutions) that conducts a survey every year, an establishment that recently participated (e.g., participated last year) may have a reduced chance of selection in a subsequent round to reduce the burden on the establishment. 
To learn more about sampling designs, refer to @valliant2013practical, @cox2011business, @cochran1977sampling, and @deming1991sample. -A common method of sampling is to stratify PSUs, select PSUs within the stratum using PPS selection, and then select units within the PSUs either with SRS or PPS. Reading survey documentation is an important first step in survey analysis to understand the design of the survey we are using and variables necessary to specify the design. Good documentation will highlight the variables necessary to specify the design. This is often found in User's Guides, methodology, analysis guides, or technical documentation (see Chapter \@ref(c04-understanding-survey-data-documentation) for more details). +A common method of sampling is to stratify PSUs, select PSUs within the stratum using PPS selection, and then select units within the PSUs either with SRS or PPS. Reading survey documentation is an important first step in survey analysis to understand the design of the survey we are using and variables necessary to specify the design. Good documentation will highlight the variables necessary to specify the design. This is often found in User's Guides, methodology, analysis guides, or technical documentation (see Chapter \@ref(c03-understanding-survey-data-documentation) for more details). 
#### Example {-} diff --git a/11-missing-data.Rmd b/11-missing-data.Rmd new file mode 100644 index 00000000..17a2fb02 --- /dev/null +++ b/11-missing-data.Rmd @@ -0,0 +1 @@ +# Missing data {#c11-missing-data} diff --git a/12-pitfalls.Rmd b/12-pitfalls.Rmd new file mode 100644 index 00000000..b5aabfdf --- /dev/null +++ b/12-pitfalls.Rmd @@ -0,0 +1 @@ +# Common pitfalls {#c12-pitfalls} diff --git a/09-ncvs-vignette.Rmd b/13-ncvs-vignette.Rmd similarity index 99% rename from 09-ncvs-vignette.Rmd rename to 13-ncvs-vignette.Rmd index bb558a0b..055c2afe 100644 --- a/09-ncvs-vignette.Rmd +++ b/13-ncvs-vignette.Rmd @@ -1,4 +1,6 @@ -# National Crime Victimization Survey Vignette {#c09-ncvs-vignette} +# (PART) Vignettes {-} + +# National Crime Victimization Survey Vignette {#c13-ncvs-vignette} ::: {.prereqbox-header} `r if (knitr:::is_html_output()) '### Prerequisites {- #prereq9}'` diff --git a/10-ambarom-vignette.Rmd b/14-ambarom-vignette.Rmd similarity index 99% rename from 10-ambarom-vignette.Rmd rename to 14-ambarom-vignette.Rmd index aa9bac30..447e6a8d 100644 --- a/10-ambarom-vignette.Rmd +++ b/14-ambarom-vignette.Rmd @@ -1,4 +1,4 @@ -# AmericasBarometer Vignette {#c10-ambarom-vignette} +# AmericasBarometer Vignette {#c14-ambarom-vignette} ::: {.prereqbox-header} `r if (knitr:::is_html_output()) '### Prerequisites {- #prereq10}'` diff --git a/91-AppendixB.Rmd b/91-AppendixB.Rmd index 35a6c2f5..ddd4a1d6 100644 --- a/91-AppendixB.Rmd +++ b/91-AppendixB.Rmd @@ -11,7 +11,7 @@ library(janitor) library(kableExtra) library(knitr) -recs <- recs_2020 +data(recs_2020) ``` The full codebook with the original variables is available at [https://www.eia.gov/consumption/residential/data/2020/index.php?view=microdata](https://www.eia.gov/consumption/residential/data/2020/index.php?view=microdata) - "Variable and response codebook". This codebook includes the variables on the dataset included for download along with this book. 
@@ -21,23 +21,14 @@ The full codebook with the original variables is available at [https://www.eia.g #| label: recs-cb-prep #| echo: FALSE -attrlist <- map(recs, attributes) - -NULL_to_NA <- function(x){ - if (is.null(x)){ - NA - }else{ - x - } -} - +attrlist <- map(recs_2020, attributes) recs_var_info <- tibble( Vars=names(attrlist), Section=map_chr(attrlist, "Section") %>% unname(), Question=map(attrlist, "Question") %>% map(NULL_to_NA) %>% unlist(use.names = FALSE), Description=map_chr(attrlist, "label") %>% unname(), - VarType=map(recs, class) , + VarType=map(recs_2020, class) , ) %>% mutate( VarType=if_else(Vars=="DOEID", list("ID"), VarType) @@ -90,10 +81,10 @@ make_section <- function(sec){ vt <- vi %>% pull(VarType) %>% unlist() if (any(c("factor", "character", "logical") %in% vt)){ - recs %>% cb_count(var) + recs_2020 %>% cb_count(var) cat("\n") } else if ("numeric" %in% vt){ - recs %>% cb_continuous(var) + recs_2020 %>% cb_continuous(var) cat("\n") } diff --git a/92-AppendixC.Rmd b/92-AppendixC.Rmd new file mode 100644 index 00000000..6ed1779e --- /dev/null +++ b/92-AppendixC.Rmd @@ -0,0 +1 @@ +# Importing survey data into R {#import-data} \ No newline at end of file diff --git a/_output.yml b/_output.yml index 4121d15d..dc1ea379 100644 --- a/_output.yml +++ b/_output.yml @@ -2,13 +2,13 @@ bookdown::gitbook: css: css/style.css config: toc: - collapse: none + collapse: section before: |
A Book Example after: | Published with bookdown - download: [pdf, epub] - edit: https://github.com/yihui/bookdown-crc/edit/master/%s + download: null + edit: https://github.com/tidy-survey-r/tidy-survey-book/%s sharing: github: true facebook: false @@ -27,7 +27,4 @@ bookdown::pdf_book: toc_unnumbered: false toc_appendix: true quote_footer: ["\\VA{", "}{}"] - highlight_bw: true -bookdown::epub_book: - stylesheet: css/style.css - pandoc_args: "--mathml" \ No newline at end of file + highlight_bw: true \ No newline at end of file diff --git a/index.Rmd b/index.Rmd index 73ed5112..6ddd018d 100644 --- a/index.Rmd +++ b/index.Rmd @@ -16,6 +16,7 @@ graphics: yes #cover-image: images/cover.jpg header-includes: - \usepackage{draftwatermark} + - \usepackage[titles]{tocloft} --- \SetWatermarkText{DRAFT} diff --git a/renv.lock b/renv.lock index 9a47b908..d2cab0b6 100644 --- a/renv.lock +++ b/renv.lock @@ -1826,9 +1826,9 @@ "Source": "GitHub", "RemoteType": "github", "RemoteHost": "api.github.com", - "RemoteRepo": "srvyr", "RemoteUsername": "gergness", - "RemoteRef": "HEAD", + "RemoteRepo": "srvyr", + "RemoteRef": "main", "RemoteSha": "1917f75487fa40f2ea6fd4e33323cd9278afb356", "Requirements": [ "R", @@ -1842,7 +1842,7 @@ "tidyselect", "vctrs" ], - "Hash": "c77ebba142d814788bab0092bf102f6d" + "Hash": "932c30103619651286c6eba783e9a248" }, "srvyr.data": { "Package": "srvyr.data",