From da7a5acb415928a123a00177ca7efd8b9ce71c12 Mon Sep 17 00:00:00 2001
From: Isabella Velasquez
Date: Sun, 14 Jan 2024 10:57:22 -0500
Subject: [PATCH 1/9] Continue pitfalls chapter

---
 12-pitfalls.Rmd | 93 ++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 92 insertions(+), 1 deletion(-)

diff --git a/12-pitfalls.Rmd b/12-pitfalls.Rmd
index b5aabfdf..1dcb3c52 100644
--- a/12-pitfalls.Rmd
+++ b/12-pitfalls.Rmd
@@ -1 +1,92 @@
-# Common pitfalls {#c12-pitfalls}
+# Recommendations for Successful Survey Data Analysis {#c12-pitfalls}
+
+This book pertains to the analysis of complex surveys, which require extra considerations compared to non-random ones. Running analysis carefully and accurately, while avoiding pitfalls, is crucial for success. We've included our recommendations for successful survey data analysis throughout the book; however, this chapter summarizes key considerations while taking the steps shown in previous chapters.
+
+```{r}
+#| include: false
+anscombe_tidy <- anscombe %>%
+  mutate(observation = seq_len(n())) %>%
+  gather(key, value,-observation) %>%
+  separate(key, c("variable", "set"), 1, convert = TRUE) %>%
+  mutate(set = c("I", "II", "III", "IV")[set]) %>%
+  spread(variable, value)
+```
+
+## Create a foundation for insightful analysis
+
+We would be remiss to not emphasis again to review the survey documentation (Chapter \@ref(c03-understanding-survey-data-documentation)) and the survey design (Chapter \@ref(c10-specifying-sample-designs)) before beginning the analysis.
+
+## Ensure that analysis is appropriate and accurate
+
+### Begin your analysis with descriptive data analysis
+
+When receiving a fresh batch of data, it's tempting to jump right into developing models to find significant results. However, a successful data analyst begins by exploring the dataset. This involves running descriptive analysis on the dataset as a whole, as well as individual variables and combinations of variables. As described in Chapter \@ref(c05-descriptive-analysis), descriptive analyses should always precede statistical analysis to prevent avoidable (and embarrassing) mistakes.
+
+Even before applying weights, consider running cross-tabulations on the raw data. Do any results jump out?
+
+Let’s say that we run `svy_des %>% group_by(group) %>% summarize(p = mean())` and see the data shows that males make up 10% of the sample.
+
+```r
+## # A tibble: 2 × 2
+## group p
+##
+## 1 female 0.9
+## 2 male 0.1
+```
+
+We would generally assume around a 50/50 split between male and female respondents in a population. The large female proportion could indicate either a unique sample or a potential error in the data. If we review the survey documentation and see this was an intentional part of the design, we can continue our analysis using the appropriate methods. If this was not an intentional choice by the researchers, the results alert us that something may be incorrect in the data or our code, and we can verify if there's an issue by comparing the results with the weighted means.
+
+Tables provide a quick check of our assumptions, but there is no substitute for graphs and plots to visualize the distribution of data. We might miss outliers or nuances if we scan only summary statistics. Anscombe's Quartet demonstrates the importance of visualization in analysis. Let's say we have a dataset with x and y variables. Let's take a look at how the dataset is structured:
+
+```{r}
+head(anscombe_tidy)
+```
+
+We can begin by checking one set of variables. For Set I, the x-variables have an average of 9 with a standard deviation of 3.3; for y, we have an average of 7.5 with a standard deviation of 2.03. The two variables have a correlation of 0.81.
+
+```{r}
+anscombe_tidy %>%
+  filter(set == "I") %>%
+  summarize(
+    x_mean = mean(x),
+    x_sd = sd(x),
+    y_mean = mean(y),
+    y_sd = sd(y),
+    correlation = cor(x, y)
+  )
+```
+
+These are useful statistics - we can note that the data doesn't have high variability; the two variables are strongly correlated.
+
+Now, let's check all of our variables. Notice anything interesting?
+
+```{r}
+anscombe_tidy %>%
+  group_by(set) %>%
+  summarize(
+    x_mean = mean(x),
+    x_sd = sd(x, na.rm = TRUE),
+    y_mean = mean(y),
+    y_sd = sd(y, na.rm = TRUE),
+    correlation = cor(x, y)
+  )
+```
+
+The summary results for these four variables are nearly identical! We might assume that the distribution for each of them is similar. A data visualization can help confirm our assumptions.
+
+```{r}
+ggplot(anscombe_tidy, aes(x, y)) +
+  geom_point() +
+  facet_wrap( ~ set) +
+  geom_smooth(method = "lm", se = FALSE) +
+  theme_minimal()
+```
+
+When creating the plots, we can clearly see that is not the case. Each set of points results in different shapes and distributions. Imagine sharing each individual plot with a shareholder and how you would describe the data, and how different the interpretations will be.
+
+With survey data, we might not always have continuous data that we can plot like Anscombe's Quartet. However, if the dataset does contain continuous data, or other types of data which would benefit from a visual representation, we recommend taking the time to graph distributions and correlations.
+
+## Improve your debugging skills
+
+
+

From 1d8364695e50df168906f9d79982deec16fe58d1 Mon Sep 17 00:00:00 2001
From: Isabella Velasquez
Date: Sun, 14 Jan 2024 13:21:37 -0500
Subject: [PATCH 2/9] Continue chapter

---
 12-pitfalls.Rmd | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/12-pitfalls.Rmd b/12-pitfalls.Rmd
index 1dcb3c52..ba2c86e6 100644
--- a/12-pitfalls.Rmd
+++ b/12-pitfalls.Rmd
@@ -1,6 +1,6 @@
 # Recommendations for Successful Survey Data Analysis {#c12-pitfalls}
 
-This book pertains to the analysis of complex surveys, which require extra considerations compared to non-random ones. Running analysis carefully and accurately, while avoiding pitfalls, is crucial for success. We've included our recommendations for successful survey data analysis throughout the book; however, this chapter summarizes key considerations while taking the steps shown in previous chapters.
+This book pertains to the analysis of complex surveys, which require extra considerations compared to non-random ones. Running analysis carefully and accurately, while avoiding pitfalls, is crucial for success. We've included our recommendations for successful survey data analysis throughout the book; however, this chapter summarizes a few key considerations while taking the steps shown in previous chapters. Note, this is not meant to be a comprehensive list of best practices for survey analysis but instead a curated list for survey analysts.
 
 ```{r}
 #| include: false
@@ -12,13 +12,7 @@ anscombe_tidy <- anscombe %>%
   spread(variable, value)
 ```
 
-## Create a foundation for insightful analysis
-
-We would be remiss to not emphasis again to review the survey documentation (Chapter \@ref(c03-understanding-survey-data-documentation)) and the survey design (Chapter \@ref(c10-specifying-sample-designs)) before beginning the analysis.
-
-## Ensure that analysis is appropriate and accurate
-
-### Begin your analysis with descriptive data analysis
+## Begin your analysis with descriptive data analysis
 
 When receiving a fresh batch of data, it's tempting to jump right into developing models to find significant results. However, a successful data analyst begins by exploring the dataset. This involves running descriptive analysis on the dataset as a whole, as well as individual variables and combinations of variables. As described in Chapter \@ref(c05-descriptive-analysis), descriptive analyses should always precede statistical analysis to prevent avoidable (and embarrassing) mistakes.
 
@@ -82,7 +76,7 @@ ggplot(anscombe_tidy, aes(x, y)) +
   theme_minimal()
 ```
 
-When creating the plots, we can clearly see that is not the case. Each set of points results in different shapes and distributions. Imagine sharing each individual plot with a shareholder and how you would describe the data, and how different the interpretations will be.
+When creating the plots, we clearly see that the distributions are very dissimilar. Each set of points results in different shapes and distributions. Imagine sharing each individual plot with a shareholder and how you would describe the data, and how different the interpretations will be.
 
 With survey data, we might not always have continuous data that we can plot like Anscombe's Quartet. However, if the dataset does contain continuous data, or other types of data which would benefit from a visual representation, we recommend taking the time to graph distributions and correlations.
 
From c427aaeacdc5b3857216788da7fd8bbfc0f6d58a Mon Sep 17 00:00:00 2001
From: Isabella Velasquez
Date: Sat, 27 Jan 2024 11:24:21 -0800
Subject: [PATCH 3/9] Continue chapters

---
 12-pitfalls.Rmd | 90 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 85 insertions(+), 5 deletions(-)

diff --git a/12-pitfalls.Rmd b/12-pitfalls.Rmd
index ba2c86e6..f62e0b5f 100644
--- a/12-pitfalls.Rmd
+++ b/12-pitfalls.Rmd
@@ -1,17 +1,67 @@
 # Recommendations for Successful Survey Data Analysis {#c12-pitfalls}
 
-This book pertains to the analysis of complex surveys, which require extra considerations compared to non-random ones. Running analysis carefully and accurately, while avoiding pitfalls, is crucial for success. We've included our recommendations for successful survey data analysis throughout the book; however, this chapter summarizes a few key considerations while taking the steps shown in previous chapters. Note, this is not meant to be a comprehensive list of best practices for survey analysis but instead a curated list for survey analysts.
+The previous chapters in this book aimed to provide the technical skills and knowledge required to run survey analyses. This chapter builds upon the best practices previously mentioned to present a curated set of recommendations aimed at running a *successful* survey analysis. We hope this list equips you with practical insights that assist in producing meaningful and reliable results.
 
 ```{r}
 #| include: false
 anscombe_tidy <- anscombe %>%
-  mutate(observation = seq_len(n())) %>%
-  gather(key, value,-observation) %>%
+  mutate(observation = row_number()) %>%
+  pivot_longer(-observation, names_to = "key", values_to = "value") %>%
   separate(key, c("variable", "set"), 1, convert = TRUE) %>%
   mutate(set = c("I", "II", "III", "IV")[set]) %>%
-  spread(variable, value)
+  pivot_wider(names_from = variable, values_from = value)
 ```
 
+## Grasp the survey design
+
+Understanding complex design factors such as clustering, stratification, and weighting is the foundation of complex survey analysis. Each of these techniques impact standard errors and variance, but we cannot treat complex surveys as unweighted simple random samples if we want to produce accurate estimates.
+
+As mentioned in Chapter \@ref(c05-descriptive-analysis), the design effect measures the impact of using a complex survey design compared to a simple random samples in the calculation of the estimates. The output below shows the estimated average cost of electricity in the U.S. by each region. The variable `elec_bill` calculates the point estimate for each region while `elec_bill_se` is the standard error when incorporating the design.
+
+```{r}
+#| warning: false
+library(survey)
+library(srvyr)
+
+data(recs_2020)
+
+recs_des <- recs_2020 %>%
+  as_survey_rep(
+    weights = NWEIGHT,
+    repweights = NWEIGHT1:NWEIGHT60,
+    type = "JK1",
+    scale = 59 / 60,
+    mse = TRUE
+  )
+
+recs_des %>%
+  group_by(Region) %>%
+  summarize(wtn_mean = survey_mean(WinterTempNight,
+                                   vartype = c("se", "var"),
+                                   na.rm = TRUE,
+                                   deff = TRUE),
+            n = unweighted(n()))
+```
+
+The design effects are all over 1, indicating that the design is less statistically efficient than a SRS design. This might lead us to believe that we should not apply weights due to a large loss of precision.
+
+```{r}
+unweight_mod <-
+  glm(formula = WinterTempNight ~ Region, data = recs_2020)
+
+weight_mod <- recs_des %>%
+  svyglm(design = ., formula = WinterTempNight ~ Region, na.action = na.omit)
+
+recs_2020 %>%
+  modelr::spread_residuals(unweight_mod, weight_mod) %>%
+  group_by(Region) %>%
+  summarize(unweight_mse = mean(unweight_mod^2, na.rm = TRUE),
+            weight_mse = mean(weight_mod^2, na.rm = TRUE))
+```
+
+
+
+
 ## Begin your analysis with descriptive data analysis
 
 When receiving a fresh batch of data, it's tempting to jump right into developing models to find significant results. However, a successful data analyst begins by exploring the dataset. This involves running descriptive analysis on the dataset as a whole, as well as individual variables and combinations of variables. As described in Chapter \@ref(c05-descriptive-analysis), descriptive analyses should always precede statistical analysis to prevent avoidable (and embarrassing) mistakes.
@@ -76,11 +126,41 @@ ggplot(anscombe_tidy, aes(x, y)) +
   theme_minimal()
 ```
 
-When creating the plots, we clearly see that the distributions are very dissimilar. Each set of points results in different shapes and distributions. Imagine sharing each individual plot with a shareholder and how you would describe the data, and how different the interpretations will be.
+When creating the plots, it is apparent that the distributions are very dissimilar. Each set of points results in different shapes and distributions. Imagine sharing each individual plot with a shareholder and how you would describe the data, and how different the interpretations will be.
 
 With survey data, we might not always have continuous data that we can plot like Anscombe's Quartet. However, if the dataset does contain continuous data, or other types of data which would benefit from a visual representation, we recommend taking the time to graph distributions and correlations.
 
+## Use the appropriate variable types
+
+When we pull the data from surveys into R, the data may be listed as character, factor, numeric, or logical/Boolean. For example, here we take a `glimpse()` of ANES data:
+
+```{r}
+data(anes_2020)
+
+anes_2020 %>%
+  select(-matches("^V\\d")) %>%
+  glimpse()
+```
+
+While the output shows that `CampaignInterest` is a factor, R will not clearly indicate whether it is an ordinal variable. When working with survey data, analysts need to properly use the questionnaire and codebook along with the data (see Chapter \@ref(c03-understanding-survey-data-documentation)) to understand what the values for each variable represent.
+
+For example, our survey data may represent categorical variables (e.g., the North, South, East, and West regions of the United States) using numbers (e.g., 1, 2, 3, and 4). When importing the file, R will automatically read the column as numeric values. Without carefully reviewing the data frame, we may calculate the mean across all numeric variables:
+
+```{r}
+#| eval: false
+svy_dat %>%
+  summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
+```
+
+R will calculate the mean even if it is not appropriate, leading to the common mistake of applying an average to categorical values instead of a proportion function. If the variable name is difficult to interpret, we might accidentally report an average region of 2.2 to our stakeholders. Rather, ensuring your variables are of the appropriate type will avoid this pitfall and ensure the measures and models are appropriate for the type of variable.
+
 ## Improve your debugging skills
 
 
+Whether `NA` is a level or a value impacts whether `is.na()` works
+If the variable is a factor and has an `NA` as a level, and you want to remove those `NA`s, you might try using dplyr's `filter()`:
+```{r}
+svy_dat %>%
+  filter(!is.na(variable))
+```
 

From 29617a2b3e7c759afb7e4f57b8885440e3097a46 Mon Sep 17 00:00:00 2001
From: Isabella Velasquez
Date: Mon, 29 Jan 2024 17:45:54 -0800
Subject: [PATCH 4/9] Update chapter

---
 ... => 12-successful-survey-data-analysis.Rmd | 106 +++++++++---------
 1 file changed, 51 insertions(+), 55 deletions(-)
 rename 12-pitfalls.Rmd => 12-successful-survey-data-analysis.Rmd (56%)

diff --git a/12-pitfalls.Rmd b/12-successful-survey-data-analysis.Rmd
similarity index 56%
rename from 12-pitfalls.Rmd
rename to 12-successful-survey-data-analysis.Rmd
index f62e0b5f..acfb14a5 100644
--- a/12-pitfalls.Rmd
+++ b/12-successful-survey-data-analysis.Rmd
@@ -4,71 +4,44 @@ The previous chapters in this book aimed to provide the technical skills and kno
 
 ```{r}
 #| include: false
+library(dplyr)
+library(tidyr)
+library(ggplot2)
+
 anscombe_tidy <- anscombe %>%
   mutate(observation = row_number()) %>%
   pivot_longer(-observation, names_to = "key", values_to = "value") %>%
   separate(key, c("variable", "set"), 1, convert = TRUE) %>%
   mutate(set = c("I", "II", "III", "IV")[set]) %>%
   pivot_wider(names_from = variable, values_from = value)
-```
-
-## Grasp the survey design
 
-Understanding complex design factors such as clustering, stratification, and weighting is the foundation of complex survey analysis. Each of these techniques impact standard errors and variance, but we cannot treat complex surveys as unweighted simple random samples if we want to produce accurate estimates.
-
-As mentioned in Chapter \@ref(c05-descriptive-analysis), the design effect measures the impact of using a complex survey design compared to a simple random samples in the calculation of the estimates. The output below shows the estimated average cost of electricity in the U.S. by each region. The variable `elec_bill` calculates the point estimate for each region while `elec_bill_se` is the standard error when incorporating the design.
-
-```{r}
-#| warning: false
-library(survey)
-library(srvyr)
-
-data(recs_2020)
-
-recs_des <- recs_2020 %>%
-  as_survey_rep(
-    weights = NWEIGHT,
-    repweights = NWEIGHT1:NWEIGHT60,
-    type = "JK1",
-    scale = 59 / 60,
-    mse = TRUE
+example_srvy <- tibble::tribble(
+    ~id, ~region, ~q_d1, ~q_d2_1, ~weight,
+    1L, 1L, 1L, 0L, 1740,
+    2L, 1L, 1L, 0L, 1428,
+    3L, 2L, 1L, 2L, 496,
+    4L, 2L, 1L, 2L, 550,
+    5L, 3L, 1L, 1L, 1762,
+    6L, 4L, 1L, 0L, 1004,
+    7L, 4L, 1L, 0L, 522,
+    8L, 3L, 2L, 0L, 1099,
+    9L, 4L, 2L, 2L, 1295
   )
-
-recs_des %>%
-  group_by(Region) %>%
-  summarize(wtn_mean = survey_mean(WinterTempNight,
-                                   vartype = c("se", "var"),
-                                   na.rm = TRUE,
-                                   deff = TRUE),
-            n = unweighted(n()))
 ```
 
-The design effects are all over 1, indicating that the design is less statistically efficient than a SRS design. This might lead us to believe that we should not apply weights due to a large loss of precision.
-
-```{r}
-unweight_mod <-
-  glm(formula = WinterTempNight ~ Region, data = recs_2020)
-
-weight_mod <- recs_des %>%
-  svyglm(design = ., formula = WinterTempNight ~ Region, na.action = na.omit)
-
-recs_2020 %>%
-  modelr::spread_residuals(unweight_mod, weight_mod) %>%
-  group_by(Region) %>%
-  summarize(unweight_mse = mean(unweight_mod^2, na.rm = TRUE),
-            weight_mse = mean(weight_mod^2, na.rm = TRUE))
-```
+## Grasp the survey design
 
+Understanding complex design factors such as clustering, stratification, and weighting is the foundation of complex survey analysis. Each of these techniques impact standard errors and variance, and we cannot treat complex surveys as unweighted simple random samples if we want to produce accurate estimates.
 
-
+Let's
 
 ## Begin your analysis with descriptive data analysis
 
-When receiving a fresh batch of data, it's tempting to jump right into developing models to find significant results. However, a successful data analyst begins by exploring the dataset. This involves running descriptive analysis on the dataset as a whole, as well as individual variables and combinations of variables. As described in Chapter \@ref(c05-descriptive-analysis), descriptive analyses should always precede statistical analysis to prevent avoidable (and embarrassing) mistakes.
+When receiving a fresh batch of data, it's tempting to jump right into running models to find significant results. However, a successful data analyst begins by exploring the dataset. This involves running descriptive analysis on the dataset as a whole, as well as individual variables and combinations of variables. As described in Chapter \@ref(c05-descriptive-analysis), descriptive analyses should always precede statistical analysis to avoid avoidable (and potentially embarrassing) mistakes.
 
 Even before applying weights, consider running cross-tabulations on the raw data. Do any results jump out?
 
-Let’s say that we run `svy_des %>% group_by(group) %>% summarize(p = mean())` and see the data shows that males make up 10% of the sample.
+Let’s go back to our earlier example. We run `svy_des %>% group_by(group) %>% summarize(p = mean())` and see the data shows that males make up 10% of the sample.
 
 ```r
 ## # A tibble: 2 × 2
@@ -80,7 +53,7 @@ Let’s say that we run `svy_des %>% group_by(group) %>% summarize(p = mean())`
 
 We would generally assume around a 50/50 split between male and female respondents in a population. The large female proportion could indicate either a unique sample or a potential error in the data. If we review the survey documentation and see this was an intentional part of the design, we can continue our analysis using the appropriate methods. If this was not an intentional choice by the researchers, the results alert us that something may be incorrect in the data or our code, and we can verify if there's an issue by comparing the results with the weighted means.
 
-Tables provide a quick check of our assumptions, but there is no substitute for graphs and plots to visualize the distribution of data. We might miss outliers or nuances if we scan only summary statistics. Anscombe's Quartet demonstrates the importance of visualization in analysis. Let's say we have a dataset with x and y variables. Let's take a look at how the dataset is structured:
+Tables provide a quick check of our assumptions, but there is no substitute for graphs and plots to visualize the distribution of data. We might miss outliers or nuances if we scan only summary statistics. Anscombe's Quartet demonstrates the importance of visualization in analysis. Let's say we have a dataset with x- and y- variables. Let's take a look at how the dataset is structured:
 
 ```{r}
 head(anscombe_tidy)
@@ -100,7 +73,7 @@ anscombe_tidy %>%
   )
 ```
 
-These are useful statistics - we can note that the data doesn't have high variability; the two variables are strongly correlated.
+These are useful statistics. We can note that the data doesn't have high variability and the two variables are strongly correlated.
 
 Now, let's check all of our variables. Notice anything interesting?
 
@@ -126,9 +99,9 @@ ggplot(anscombe_tidy, aes(x, y)) +
   theme_minimal()
 ```
 
-When creating the plots, it is apparent that the distributions are very dissimilar. Each set of points results in different shapes and distributions. Imagine sharing each individual plot with a shareholder and how you would describe the data, and how different the interpretations will be.
+When reviewing the plots, it becomes apparent that the distributions are very dissimilar. Each set of points results in different shapes and distributions. Imagine sharing each individual plot with a shareholder and how you would describe the data, and how different the interpretations will be.
 
-With survey data, we might not always have continuous data that we can plot like Anscombe's Quartet. However, if the dataset does contain continuous data, or other types of data which would benefit from a visual representation, we recommend taking the time to graph distributions and correlations.
+With survey data, we may not always have continuous data that we can plot like Anscombe's Quartet. However, if the dataset does contain continuous data, or other types of data which would benefit from a visual representation, we recommend taking the time to graph distributions and correlations.
 
 ## Use the appropriate variable types
 
@@ -144,15 +117,21 @@ anes_2020 %>%
 
 While the output shows that `CampaignInterest` is a factor, R will not clearly indicate whether it is an ordinal variable. When working with survey data, analysts need to properly use the questionnaire and codebook along with the data (see Chapter \@ref(c03-understanding-survey-data-documentation)) to understand what the values for each variable represent.
 
-For example, our survey data may represent categorical variables (e.g., the North, South, East, and West regions of the United States) using numbers (e.g., 1, 2, 3, and 4). When importing the file, R will automatically read the column as numeric values. Without carefully reviewing the data frame, we may calculate the mean across all numeric variables:
+Here is another example. We have a dataset `example_srvy` that contains information about the respondents' region in the column `region`. Taking a `glimpse()` of the data:
+
+```{r}
+example_srvy %>%
+  glimpse()
+```
+
+The categorical variables (e.g., the North, South, East, and West regions of the United States) are represented using numbers (e.g., 1, 2, 3, and 4). When importing the file, R will automatically read the column as numeric values. Without carefully reviewing the data frame, we may calculate the mean across all numeric variables:
 
 ```{r}
-#| eval: false
-svy_dat %>%
+example_srvy %>%
   summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
 ```
 
-R will calculate the mean even if it is not appropriate, leading to the common mistake of applying an average to categorical values instead of a proportion function. If the variable name is difficult to interpret, we might accidentally report an average region of 2.2 to our stakeholders. Rather, ensuring your variables are of the appropriate type will avoid this pitfall and ensure the measures and models are appropriate for the type of variable.
+R will calculate the mean even if it is not appropriate, leading to the common mistake of applying an average to categorical values instead of a proportion function. If the variable name is difficult to interpret, we might accidentally report an average region of 2.67 to our stakeholders. Checking that your variables are of the appropriate type will avoid this pitfall and ensure the measures and models are appropriate for the type of variable.
 
 ## Improve your debugging skills
 
@@ -164,3 +143,20 @@ svy_dat %>%
   filter(!is.na(variable))
 ```
 
+## Draw significance conclusions appropriately
+
+When we say something is "statistically significant", we mean that our result can be attributed to an effect, a relationship between variables, or a difference between groups, rather than purely to chance. As mentioned in Chapter \@ref(c02-overview-survey), determining the study design is a lengthy and intensive process. Careful consideration is taken to reduce the sampling error, in hopes that our results are not solely due to how the sample was chosen.
+
+For instance,
+
+```{r}
+anova_out <- recs_des %>%
+  svyglm(design = ., formula = SummerTempNight ~ Region, na.action = na.omit)
+
+tidy(anova_out)
+```
+
+
+
+
+

From 029579ecd191adffb02f6d3b67e8454e592ce0c3 Mon Sep 17 00:00:00 2001
From: Isabella Velasquez
Date: Mon, 11 Mar 2024 22:26:04 -0700
Subject: [PATCH 5/9] Pushing up chapter

---
 12-successful-survey-data-analysis.Rmd | 167 +++++++++++++++----------
 1 file changed, 104 insertions(+), 63 deletions(-)

diff --git a/12-successful-survey-data-analysis.Rmd b/12-successful-survey-data-analysis.Rmd
index acfb14a5..1d5b10c6 100644
--- a/12-successful-survey-data-analysis.Rmd
+++ b/12-successful-survey-data-analysis.Rmd
@@ -1,67 +1,102 @@
-# Recommendations for Successful Survey Data Analysis {#c12-pitfalls}
-
-The previous chapters in this book aimed to provide the technical skills and knowledge required to run survey analyses. This chapter builds upon the best practices previously mentioned to present a curated set of recommendations aimed at running a *successful* survey analysis. We hope this list equips you with practical insights that assist in producing meaningful and reliable results.
+# Recommendations for successful survey data analysis {#c12-recommendations}
 
 ```{r}
+#| label: recommendations-styler
 #| include: false
-library(dplyr)
-library(tidyr)
-library(ggplot2)
+knitr::opts_chunk$set(tidy = 'styler')
+```
+
+::: {.prereqbox-header}
+`r if (knitr:::is_html_output()) '### Prerequisites {- #prereq12}'`
+:::
+
+::: {.prereqbox data-latex="{Prerequisites}"}
+For this chapter, load the following packages:
+```{r}
+#| label: recommendations-setup
+#| error: FALSE
+#| warning: FALSE
+#| message: FALSE
+library(tidyverse)
+library(survey)
+library(srvyr)
+library(srvyrexploR)
+```
+
+To illustrate the importance of data visualization, we will discuss Anscombe's Quartet. The dataset can be replicated by running the code below:
 
+```{r}
+#| label: recommendations-anscombe-setup
 anscombe_tidy <- anscombe %>%
   mutate(observation = row_number()) %>%
   pivot_longer(-observation, names_to = "key", values_to = "value") %>%
   separate(key, c("variable", "set"), 1, convert = TRUE) %>%
   mutate(set = c("I", "II", "III", "IV")[set]) %>%
   pivot_wider(names_from = variable, values_from = value)
+```
 
-example_srvy <- tibble::tribble(
-    ~id, ~region, ~q_d1, ~q_d2_1, ~weight,
-    1L, 1L, 1L, 0L, 1740,
-    2L, 1L, 1L, 0L, 1428,
-    3L, 2L, 1L, 2L, 496,
-    4L, 2L, 1L, 2L, 550,
-    5L, 3L, 1L, 1L, 1762,
-    6L, 4L, 1L, 0L, 1004,
-    7L, 4L, 1L, 0L, 522,
-    8L, 3L, 2L, 0L, 1099,
-    9L, 4L, 2L, 2L, 1295
-  )
+We create an example survey dataset to explain potential pitfalls and how to overcome them in survey analysis. To recreate the dataset, run the code below:
+
+```{r}
+#| label: recommendations-example-dat
+example_srvy <- tribble(
+  ~id, ~region, ~q_d1, ~q_d2_1, ~gender, ~weight,
+  1L, 1L, 1L, "Somewhat interested", "female", 1740,
+  2L, 1L, 1L, "Not much interested", "female", 1428,
+  3L, 2L, NA, "Somewhat interested", "female", 496,
+  4L, 2L, 1L, "Not much interested", "female", 550,
+  5L, 3L, 1L, "Somewhat interested", "female", 1762,
+  6L, 4L, NA, "Very much interested", "female", 1004,
+  7L, 4L, NA, "Somewhat interested", "female", 522,
+  8L, 3L, 2L, "Not much interested", "female", 1099,
+  9L, 4L, 2L, "Somewhat interested", "female", 1295,
+  10L, 2L, 2L, "Somewhat interested", "male", 983
+)
+
+example_des <-
+  example_srvy %>%
+  as_survey_design(weights = weight)
 ```
+:::
+
+## Introduction
+
+The previous chapters in this book aimed to provide the technical skills and knowledge required to run survey analyses. This chapter builds upon the best practices previously mentioned to present a curated set of recommendations aimed at running a *successful* survey analysis. We hope this list equips you with practical insights that assist in producing meaningful and reliable results.
 
-## Grasp the survey design
+## Applying the survey design appropriately
 
-Understanding complex design factors such as clustering, stratification, and weighting is the foundation of complex survey analysis. Each of these techniques impact standard errors and variance, and we cannot treat complex surveys as unweighted simple random samples if we want to produce accurate estimates.
+Understanding complex design factors such as clustering, stratification, and weighting is foundational to complex survey analysis. Each of these techniques impacts standard errors and variance, and we cannot treat complex surveys as unweighted simple random samples if we want to produce unbiased estimates.
 
-Let's
+Throughout the book, we highlight the importance of running functions like `filter()` after creating the survey design. This is another way to ensure we appropriately apply the survey design to our data.
 
-## Begin your analysis with descriptive data analysis
+## Beginning analysis with descriptive data analysis
 
 When receiving a fresh batch of data, it's tempting to jump right into running models to find significant results. However, a successful data analyst begins by exploring the dataset. This involves running descriptive analysis on the dataset as a whole, as well as individual variables and combinations of variables. As described in Chapter \@ref(c05-descriptive-analysis), descriptive analyses should always precede statistical analysis to avoid avoidable (and potentially embarrassing) mistakes.
 
 Even before applying weights, consider running cross-tabulations on the raw data. Do any results jump out?
 
-Let’s go back to our earlier example. We run `svy_des %>% group_by(group) %>% summarize(p = mean())` and see the data shows that males make up 10% of the sample.
+Let’s explore the example survey dataset introduced in the Prerequisites box, `example_srvy`. We run the code below on the unweighted data to inspect the `gender` variable:
 
-```r
-## # A tibble: 2 × 2
-## group p
-##
-## 1 female 0.9
-## 2 male 0.1
+```{r}
+#| label: recommendations-example-desc
+example_srvy %>%
+  group_by(gender) %>%
+  summarise(n = n())
 ```
 
-We would generally assume around a 50/50 split between male and female respondents in a population. The large female proportion could indicate either a unique sample or a potential error in the data. If we review the survey documentation and see this was an intentional part of the design, we can continue our analysis using the appropriate methods. If this was not an intentional choice by the researchers, the results alert us that something may be incorrect in the data or our code, and we can verify if there's an issue by comparing the results with the weighted means.
+The data shows that males make up 1 out of 10, or 10%, of the sample. Generally, we assume around a 50/50 split between male and female respondents in a population. The large female proportion could indicate either a unique sample or a potential error in the data. If we review the survey documentation and see this was a deliberate part of the design, we can continue our analysis using the appropriate methods. If this was not an intentional choice by the researchers, the results alert us that something may be incorrect in the data or our code, and we can verify if there's an issue by comparing the results with the weighted means.
 
-Tables provide a quick check of our assumptions, but there is no substitute for graphs and plots to visualize the distribution of data. We might miss outliers or nuances if we scan only summary statistics. Anscombe's Quartet demonstrates the importance of visualization in analysis. Let's say we have a dataset with x- and y- variables. Let's take a look at how the dataset is structured:
+Tables provide a quick check of our assumptions, but there is no substitute for graphs and plots to visualize the distribution of data. We might miss outliers or nuances if we scan only summary statistics. Anscombe's Quartet demonstrates the importance of visualization in analysis. Let's say we have a dataset with x- and y- variables in an object called `anscombe_tidy`. Let's take a look at how the dataset is structured:
 
 ```{r}
+#| label: recommendations-anscombe-head
 head(anscombe_tidy)
 ```
 
 We can begin by checking one set of variables. For Set I, the x-variables have an average of 9 with a standard deviation of 3.3; for y, we have an average of 7.5 with a standard deviation of 2.03. The two variables have a correlation of 0.81.
 
 ```{r}
+#| label: recommendations-anscombe-calc
 anscombe_tidy %>%
   filter(set == "I") %>%
   summarize(
@@ -78,6 +113,7 @@ These are useful statistics. We can note that the data doesn't have high variabi
 Now, let's check all of our variables. Notice anything interesting?
 
 ```{r}
+#| label: recommendations-anscombe-calc-2
 anscombe_tidy %>%
   group_by(set) %>%
   summarize(
@@ -92,6 +128,7 @@ anscombe_tidy %>%
 The summary results for these four variables are nearly identical! We might assume that the distribution for each of them is similar. A data visualization can help confirm our assumptions.
 
 ```{r}
+#| label: recommendations-anscombe-plot
 ggplot(anscombe_tidy, aes(x, y)) +
   geom_point() +
   facet_wrap( ~ set) +
@@ -99,64 +136,68 @@ ggplot(anscombe_tidy, aes(x, y)) +
   theme_minimal()
 ```
 
-When reviewing the plots, it becomes apparent that the distributions are very dissimilar. Each set of points results in different shapes and distributions. Imagine sharing each individual plot with a shareholder and how you would describe the data, and how different the interpretations will be.
+When reviewing the plots, it becomes apparent that the distributions are not the same at all.
Each set of points results in different shapes and distributions. Imagine sharing each plot with a shareholder, how one would describe the data, and how different the interpretations will be. -With survey data, we may not always have continuous data that we can plot like Anscombe's Quartet. However, if the dataset does contain continuous data, or other types of data which would benefit from a visual representation, we recommend taking the time to graph distributions and correlations. +With survey data, we may not always have continuous data that we can plot like Anscombe's Quartet. However, if the dataset does contain continuous data or other types of data which would benefit from a visual representation, we recommend taking the time to graph distributions and correlations. -## Use the appropriate variable types +## Using the appropriate variable types -When we pull the data from surveys into R, the data may be listed as character, factor, numeric, or logical/Boolean. For example, here we take a `glimpse()` of ANES data: +When we pull the data from surveys into R, the data may be listed as character, factor, numeric, or logical/Boolean. Let's revisit the `example_srvy` data. Taking a `glimpse()` of the data gives us insight into what it contains: ```{r} -data(anes_2020) - -anes_2020 %>% - select(-matches("^V\\d")) %>% +#| label: recommendations-example-dat-glimpse +example_srvy %>% glimpse() ``` -While the output shows that `CampaignInterest` is a factor, R will not clearly indicate whether it is an ordinal variable. When working with survey data, analysts need to properly use the questionnaire and codebook along with the data (see Chapter \@ref(c03-understanding-survey-data-documentation)) to understand what the values for each variable represent. +While the output shows that `q_d2_1` is a character, R does not clearly indicate that it is an ordinal variable (Very interested / somewhat interested / Not interested). 
We will watch to keep an eye out on any ordinal variables to make sure we can make meaningful comparisons in our analyses. -Here is another example. We have a dataset `example_srvy` that contains information about the respondents' region in the column `region`. Taking a `glimpse()` of the data: +We may also notice that there is a column called `region`, which is imported as a number. This is a good hint to use the questionnaire and codebook along with the data to find out if the values actually reflect a number or are perhaps a coded categorical variable (see Chapter \@ref(c03-understanding-survey-data-documentation) for more details). Otherwise, without carefully reviewing the documentation, we may accidentally calculate the mean across all numeric variables: ```{r} -example_srvy %>% - glimpse() +#| label: recommendations-example-dat-num-calc +example_des %>% + select(-weight) %>% + summarize(across(where(is.numeric), ~ survey_mean(.x, na.rm = TRUE))) ``` -The categorical variables (e.g., the North, South, East, and West regions of the United States) are represented using numbers (e.g., 1, 2, 3, and 4). When importing the file, R will automatically read the column as numeric values. Without carefully reviewing the data frame, we may calculate the mean across all numeric variables: +R will calculate the mean even if it is not appropriate, leading to the common mistake of applying an average to categorical values instead of a proportion function. If the variable name is difficult to interpret, we might accidentally report an average region of `r round(example_des %>% summarize(across(where(is.numeric), ~ survey_mean(.x, na.rm = TRUE))) %>% pull(region), 2)` to our stakeholders. Checking that our variables are of the appropriate type will avoid this pitfall and ensure the measures and models are suitable for the type of variable. 
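To see why the variable type matters, the sketch below contrasts the two calculations in base R. It is an illustration only: the `region` and `weight` values are copied from `example_srvy`, the object names `mean_of_codes`, `region_weight`, and `region_share` are our own, and the result is a set of point estimates without design-based standard errors (for those, a function such as `survey_prop()` on the design object is the appropriate tool).

```r
# Values copied from the region and weight columns of example_srvy.
region <- c(1, 1, 2, 2, 3, 4, 4, 3, 4, 2)
weight <- c(1740, 1428, 496, 550, 1762, 1004, 522, 1099, 1295, 983)

# Treating the region codes as numbers gives a value with no real meaning.
mean_of_codes <- weighted.mean(region, w = weight)

# Treating region as categorical: total the weights within each region code,
# then divide by the overall total to get weighted proportions.
region_weight <- sapply(split(weight, factor(region)), sum)
region_share <- region_weight / sum(region_weight)

round(region_share, 3)
```

The proportions sum to 1 and can be read directly as "the share of the weighted sample in each region," which is the quantity a reader of the findings would expect.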
-```{r} -example_srvy %>% - summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE))) -``` +## Improving your debugging skills -R will calculate the mean even if it is not appropriate, leading to the common mistake of applying an average to categorical values instead of a proportion function. If the variable name is difficult to interpret, we might accidentally report an average region of 2.67 to our stakeholders. Checking that your variables are of the appropriate type will avoid this pitfall and ensure the measures and models are appropriate for the type of variable. +It is common for analysts working in R to come across warning or error messages. It's important to improve our debugging skills - our ability to find and fix issues - to ensure we can proceed with our work and avoid mistakes. -## Improve your debugging skills - -Whether `NA` is a level or a value impacts whether `is.na()` works -If the variable is a factor and has an `NA` as a level, and you want to remove those `NA`s, you might try using dplyr's `filter()`: +We've discussed a few examples in this book. For example, if we calculate an average with `survey_mean()` and we get `NA` instead of a number, it may be because there are missing values in our column. ```{r} -svy_dat %>% - filter(!is.na(variable)) +#| label: recommendations-missing-dat +example_des %>% + summarize(mean = survey_mean(q_d1)) ``` -## Draw significance conclusions appropriately - -When we say something is "statistically significant", we mean that our result can be attributed to an effect, a relationship between variables, or a difference between groups, rather than purely to chance. As mentioned in Chapter \@ref(c02-overview-survey), determining the study design is a lengthy and intensive process. Careful consideration is taken to reduce the sampling error, in hopes that our results are not solely due to how the sample was chosen. 
- -For instance, +Including the `na.rm = TRUE` would resolve the issue: ```{r} -anova_out <- recs_des %>% - svyglm(design = ., formula = SummerTempNight ~ Region, na.action = na.omit) +#| label: recommendations--missing-dat-fix +example_des %>% + summarize(mean = survey_mean(q_d1, na.rm = TRUE)) +``` -tidy(anova_out) +Often, debugging involves interpreting the message from R. For example, if our code results in this error: + +``` +Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : + contrasts can be applied only to factors with 2 or more levels ``` +We can see that the error has to do with a function requiring a factor with two or more levels, and that it has been applied to something else. This ties back to our section on using appropriate variable types. We can check the variable of interest to examine whether it's the correct type. + +The internet also offers many resources for debugging. Searching for a specific error message can often lead us to a solution. In addition, we can post on community forums like Posit Community, [https://community.rstudio.com/](https://community.rstudio.com/), for direct help from others. +## Drawing significant conclusions appropriately +As mentioned in Chapter \@ref(c02-overview-surveys), determining the study design is a lengthy and intensive process. Careful consideration is taken to reduce the sampling error in hopes that our results are not solely due to how the sample was chosen. This allows us to say something is "statistically significant," meaning that our result can be attributed to an effect, a relationship between variables, or a difference between groups rather than purely to chance. +As mentioned above, we should account for the survey design before we draw significance from our results. We also want to carefully consider how we manage missing data, as described in Chapter \@ref(c11-missing-data). Finally, we want to avoid model overfitting by using too many variables in our formulas. 
Our model may be too tailored to generalize to new data. +It's important to note that even significant results do not mean that it is meaningful or important. A large enough sample can produce statistically significant results. Therefore, we want to look at our results in context, such as comparing them with results from other studies or analyzing them in conjunction with confidence intervals and other measures. As analysts, we also need to curate what we report, including the many significant results that we find. \ No newline at end of file From 538cddcb3eebc5cbfde01e946f4b07de216625b8 Mon Sep 17 00:00:00 2001 From: rpowell22 Date: Sun, 17 Mar 2024 15:15:49 -0400 Subject: [PATCH 6/9] Edits to recommendations chapter. --- 12-successful-survey-data-analysis.Rmd | 109 ++++++++++++++++++------- 1 file changed, 78 insertions(+), 31 deletions(-) diff --git a/12-successful-survey-data-analysis.Rmd b/12-successful-survey-data-analysis.Rmd index 1d5b10c6..183bd40a 100644 --- a/12-successful-survey-data-analysis.Rmd +++ b/12-successful-survey-data-analysis.Rmd @@ -1,4 +1,4 @@ -# Recommendations for successful survey data analysis {#c12-recommendations} +# Successful survey analysis recommendations {#c12-recommendations} ```{r} #| label: recommendations-styler @@ -42,13 +42,13 @@ We create an example survey dataset to explain potential pitfalls and how to ove example_srvy <- tribble( ~id, ~region, ~q_d1, ~q_d2_1, ~gender, ~weight, 1L, 1L, 1L, "Somewhat interested", "female", 1740, - 2L, 1L, 1L, "Not much interested", "female", 1428, + 2L, 1L, 1L, "Not at all interested", "female", 1428, 3L, 2L, NA, "Somewhat interested", "female", 496, - 4L, 2L, 1L, "Not much interested", "female", 550, + 4L, 2L, 1L, "Not at all interested", "female", 550, 5L, 3L, 1L, "Somewhat interested", "female", 1762, - 6L, 4L, NA, "Very much interested", "female", 1004, + 6L, 4L, NA, "Very interested", "female", 1004, 7L, 4L, NA, "Somewhat interested", "female", 522, - 8L, 3L, 2L, "Not much 
interested", "female", 1099, + 8L, 3L, 2L, "Not at all interested", "female", 1099, 9L, 4L, 2L, "Somewhat interested", "female", 1295, 10L, 2L, 2L, "Somewhat interested", "male", 983 ) @@ -63,19 +63,33 @@ example_des <- The previous chapters in this book aimed to provide the technical skills and knowledge required to run survey analyses. This chapter builds upon the best practices previously mentioned to present a curated set of recommendations aimed at running a *successful* survey analysis. We hope this list equips you with practical insights that assist in producing meaningful and reliable results. -## Applying the survey design appropriately +## Follow survey analysis process {#recs-survey-process} -Understanding complex design factors such as clustering, stratification, and weighting is foundational to complex survey analysis. Each of these techniques impacts standard errors and variance, and we cannot treat complex surveys as unweighted simple random samples if we want to produce unbiased estimates. +As we first introduced in Chapter \@ref(c04-getting-started) (Section \@ref(survey-analysis-process)), there are four main steps to successfully analyze survey data: -Throughout the book, we highlight the importance of running functions like `filter()` after creating the survey design. This is another way to ensure we appropriately apply the survey design to our data. +1. Create a `tbl_svy` object (a survey object) using: `as_survey_design()` or `as_survey_rep()` -## Beginning analysis with descriptive data analysis +2. Subset data (if needed) using `filter()` (to create subpopulations) -When receiving a fresh batch of data, it's tempting to jump right into running models to find significant results. However, a successful data analyst begins by exploring the dataset. This involves running descriptive analysis on the dataset as a whole, as well as individual variables and combinations of variables. 
As described in Chapter \@ref(c05-descriptive-analysis), descriptive analyses should always precede statistical analysis to avoid avoidable (and potentially embarrassing) mistakes. +3. Specify domains of analysis using `group_by()` -Even before applying weights, consider running cross-tabulations on the raw data. Do any results jump out? +4. Within `summarize()`, specify variables to calculate, including means, totals, proportions, quantiles, and more -Let’s explore the example survey dataset introduced in the Prerequisites box, `example_srvy`. We run the code below on the unweighted data to inspect the `gender` variable: +The order of these steps matters in survey analysis. For example, if we need to subset the data, we must use `filter()` on our data **after** creating the survey design. If we do this before the survey design is created, then we may not be correctly accounting for the study design and it may result in incorrect findings. + +Additionally, correctly identifying the survey design is one of the most important steps in survey analysis. Knowing the type of sample design (e.g., clustered, stratified) will help ensure the underlying error structure is correctly calculated and weights are correctly used. Reviewing the documentation (see Chapter \@ref(c03-understanding-survey-data-documentation)) will help us understand what variables to use from the data. Learning about complex design factors such as clustering, stratification, and weighting is foundational to complex survey analysis, and we recommend that all analysts review Chapter \@ref(c10-specifying-sample-designs) before creating their first design object. + +Making sure to use the survey analysis functions from the {srvyr} and {survey} packages is also important in survey analysis. For example, using `mean()` and `survey_mean()` on the same data will result in different findings and output. 
Each of the survey functions from {srvyr} and {survey} impacts standard errors and variance, and we cannot treat complex surveys as unweighted simple random samples if we want to produce unbiased estimates. + +## Begin with descriptive analysis + +When receiving a fresh batch of data, it's tempting to jump right into running models to find significant results. However, a successful data analyst begins by exploring the dataset. This involves running descriptive analysis on the dataset as a whole, as well as individual variables and combinations of variables. As described in Chapter \@ref(c05-descriptive-analysis), descriptive analyses should always precede statistical analysis to prevent avoidable (and potentially embarrassing) mistakes. + +### Table review + +Even before applying weights, consider running cross-tabulations on the raw data. This can help us see if any patterns stand out that may be alarming, or something worth further investigating. + +For example, let’s explore the example survey dataset introduced in the Prerequisites box, `example_srvy`. We run the code below on the unweighted data to inspect the `gender` variable: ```{r} #| label: recommendations-example-desc @@ -86,7 +100,11 @@ example_srvy %>% The data shows that males make up 1 out of 10, or 10%, of the sample. Generally, we assume around a 50/50 split between male and female respondents in a population. The large female proportion could indicate either a unique sample or a potential error in the data. If we review the survey documentation and see this was a deliberate part of the design, we can continue our analysis using the appropriate methods. If this was not an intentional choice by the researchers, the results alert us that something may be incorrect in the data or our code, and we can verify if there's an issue by comparing the results with the weighted means. 
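As a rough sketch of that comparison, we can compute the unweighted and weighted share of male respondents side by side. This uses base R's `weighted.mean()` rather than the {srvyr} functions, so it gives point estimates only and no design-based standard errors; the values are copied from the `gender` and `weight` columns of `example_srvy`, and the object names are our own.

```r
# Values copied from the gender and weight columns of example_srvy.
gender <- c(rep("female", 9), "male")
weight <- c(1740, 1428, 496, 550, 1762, 1004, 522, 1099, 1295, 983)

# 1 for male respondents, 0 otherwise.
is_male <- as.numeric(gender == "male")

# Unweighted share of males: every respondent counts equally.
unweighted_male <- mean(is_male)

# Weighted share of males: each respondent counts in proportion to their weight.
weighted_male <- weighted.mean(is_male, w = weight)

round(c(unweighted = unweighted_male, weighted = weighted_male), 3)
```

If the weighted share moved much closer to what we expect in the population, the imbalance would likely reflect the sample design; if it stays far off, that is a prompt to recheck the data and our code.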
-Tables provide a quick check of our assumptions, but there is no substitute for graphs and plots to visualize the distribution of data. We might miss outliers or nuances if we scan only summary statistics. Anscombe's Quartet demonstrates the importance of visualization in analysis. Let's say we have a dataset with x- and y- variables in an object called `anscombe_tidy`. Let's take a look at how the dataset is structured: +### Graphical review + +Tables provide a quick check of our assumptions, but there is no substitute for graphs and plots to visualize the distribution of data. We might miss outliers or nuances if we scan only summary statistics. + +For example, Anscombe's Quartet demonstrates the importance of visualization in analysis. Let's say we have a dataset with x- and y- variables in an object called `anscombe_tidy`. Let's take a look at how the dataset is structured: ```{r} #| label: recommendations-anscombe-head @@ -108,9 +126,7 @@ anscombe_tidy %>% ) ``` -These are useful statistics. We can note that the data doesn't have high variability and the two variables are strongly correlated. - -Now, let's check all of our variables. Notice anything interesting? +These are useful statistics. We can note that the data doesn't have high variability and the two variables are strongly correlated. Now, let's check all of the sets (I-IV) in the Anscombe data. Notice anything interesting? ```{r} #| label: recommendations-anscombe-calc-2 @@ -125,24 +141,24 @@ anscombe_tidy %>% ) ``` -The summary results for these four variables are nearly identical! We might assume that the distribution for each of them is similar. A data visualization can help confirm our assumptions. +The summary results for these four sets are nearly identical! Based on this, we might assume that the distribution for each of them is similar. Let's look at a data visualization to see if our assumption is correct. 
```{r} #| label: recommendations-anscombe-plot ggplot(anscombe_tidy, aes(x, y)) + geom_point() + facet_wrap( ~ set) + - geom_smooth(method = "lm", se = FALSE) + + geom_smooth(method = "lm", se = FALSE, alpha = 0.5) + theme_minimal() ``` -When reviewing the plots, it becomes apparent that the distributions are not the same at all. Each set of points results in different shapes and distributions. Imagine sharing each plot with a shareholder, how one would describe the data, and how different the interpretations will be. +Although each of the four sets has the same summary statistics and regression line, when reviewing the plots, it becomes apparent that the distributions of the data are not the same at all. Each set of points results in different shapes and distributions. Imagine sharing each set (I-IV) and corresponding plot with a different colleague. The interpretations and descriptions of the data would be very different even though the statistics are similar. Plotting data can also ensure that we are using the correct analysis method on the data, so understanding the underlying distributions is an important first step. With survey data, we may not always have continuous data that we can plot like Anscombe's Quartet. However, if the dataset does contain continuous data or other types of data which would benefit from a visual representation, we recommend taking the time to graph distributions and correlations. -## Using the appropriate variable types +## Check variable types -When we pull the data from surveys into R, the data may be listed as character, factor, numeric, or logical/Boolean. Let's revisit the `example_srvy` data. Taking a `glimpse()` of the data gives us insight into what it contains: +When we pull the data from surveys into R, the data may be listed as character, factor, numeric, or logical/Boolean. The tidyverse functions that read in data (e.g., `read_csv()`, `read_excel()`) load all strings as character variables by default.
This is important when dealing with survey data as many strings may be better suited for factors than character variables. For example, let's revisit the `example_srvy` data. Taking a `glimpse()` of the data gives us insight into what it contains: ```{r} #| label: recommendations-example-dat-glimpse @@ -150,9 +166,24 @@ example_srvy %>% glimpse() ``` -While the output shows that `q_d2_1` is a character, R does not clearly indicate that it is an ordinal variable (Very interested / somewhat interested / Not interested). We will watch to keep an eye out on any ordinal variables to make sure we can make meaningful comparisons in our analyses. +The output shows that `q_d2_1` is a character variable, but the values of that variable show three options (Very interested / Somewhat interested / Not at all interested). In this case, we will most likely want to change `q_d2_1` to be a factor variable and order the factor levels to indicate that this is an ordinal variable. Here is one way we might approach this task using the base R `factor()` function: + +```{r} +#| label: recommendations-example-dat-fct +example_srvy_fct<-example_srvy %>% + mutate(q_d2_1_fct=factor(q_d2_1, + levels=c("Very interested", + "Somewhat interested", + "Not at all interested"))) + +example_srvy_fct %>% + glimpse() + +example_srvy_fct %>% + count(q_d2_1_fct,q_d2_1) +``` -We may also notice that there is a column called `region`, which is imported as a number. This is a good hint to use the questionnaire and codebook along with the data to find out if the values actually reflect a number or are perhaps a coded categorical variable (see Chapter \@ref(c03-understanding-survey-data-documentation) for more details). Otherwise, without carefully reviewing the documentation, we may accidentally calculate the mean across all numeric variables: +This example data also includes a column called `region`, which is imported as a number (`<int>`).
This is a good hint to use the questionnaire and codebook along with the data to find out if the values actually reflect a number or are perhaps a coded categorical variable (see Chapter \@ref(c03-understanding-survey-data-documentation) for more details). R will calculate the mean even if it is not appropriate, leading to the common mistake of applying an average to categorical values instead of a proportion function. For example, for ease of coding, we may use the `across()` function to calculate the mean across all numeric variables: ```{r} #| label: recommendations-example-dat-num-calc @@ -161,11 +192,11 @@ example_des %>% summarize(across(where(is.numeric), ~ survey_mean(.x, na.rm = TRUE))) ``` -R will calculate the mean even if it is not appropriate, leading to the common mistake of applying an average to categorical values instead of a proportion function. If the variable name is difficult to interpret, we might accidentally report an average region of `r round(example_des %>% summarize(across(where(is.numeric), ~ survey_mean(.x, na.rm = TRUE))) %>% pull(region), 2)` to our stakeholders. Checking that our variables are of the appropriate type will avoid this pitfall and ensure the measures and models are suitable for the type of variable. +In this example, if we do not adjust `region` to be a factor variable type, we might accidentally report an average region of `r round(example_des %>% summarize(across(where(is.numeric), ~ survey_mean(.x, na.rm = TRUE))) %>% pull(region), 2)` in our findings. Checking that our variables are of the appropriate type will avoid this pitfall and ensure the measures and models are suitable for the type of variable. -## Improving your debugging skills +## Improve debugging skills -It is common for analysts working in R to come across warning or error messages. It's important to improve our debugging skills - our ability to find and fix issues - to ensure we can proceed with our work and avoid mistakes. 
+It is common for analysts working in R to come across warning or error messages, and learning how to debug these messages (i.e., find and fix issues) ensures we can proceed with our work and avoid potential mistakes. We've discussed a few examples in this book. For example, if we calculate an average with `survey_mean()` and we get `NA` instead of a number, it may be because there are missing values in our column. @@ -178,11 +209,27 @@ example_des %>% Including the `na.rm = TRUE` would resolve the issue: ```{r} -#| label: recommendations--missing-dat-fix +#| label: recommendations-missing-dat-fix example_des %>% summarize(mean = survey_mean(q_d1, na.rm = TRUE)) ``` +Another common error message that you may see with survey analysis may look something like the following: +```{r} +#| label: recommendations-desobj-loc +example_des %>% + svyttest(q_d1~gender) +``` + +In this case, we need to remember that with functions from the {survey} package like `svyttest()`, the design object is not the first argument and we have to use the dot (`.`) notation (see Chapter \@ref(c06-statistical-testing)). Adding in the named argument of `design=.` will fix this error. + +```{r} +#| label: recommendations-desobj-locfix +example_des %>% + svyttest(q_d1~gender, + design=.) +``` + Often, debugging involves interpreting the message from R. For example, if our code results in this error: ``` @@ -194,10 +241,10 @@ We can see that the error has to do with a function requiring a factor with two The internet also offers many resources for debugging. Searching for a specific error message can often lead us to a solution. In addition, we can post on community forums like Posit Community, [https://community.rstudio.com/](https://community.rstudio.com/), for direct help from others. -## Drawing significant conclusions appropriately +## Think critically about conclusions -As mentioned in Chapter \@ref(c02-overview-surveys), determining the study design is a lengthy and intensive process.
Careful consideration is taken to reduce the sampling error in hopes that our results are not solely due to how the sample was chosen. This allows us to say something is "statistically significant," meaning that our result can be attributed to an effect, a relationship between variables, or a difference between groups rather than purely to chance. +Once we have our findings, we need to think critically about them. As mentioned in Chapter \@ref(c02-overview-surveys), there are many aspects to the study design that can impact our interpretation of the results. For example, the number and types of response options provided to the respondent, or who was asked the question (considering both the full sample and any skip patterns), can change what a result means. Knowing the overall study design can help us accurately think through what the findings may mean and identify any issues with our analyses. Additionally, we should make sure that our survey design object is correctly defined (see Chapter \@ref(c10-specifying-sample-designs)), carefully consider how we are managing missing data (see Chapter \@ref(c11-missing-data)), and follow sound statistical analysis procedures, such as avoiding the model overfitting that comes from using too many variables in our formulas. -As mentioned above, we should account for the survey design before we draw significance from our results. We also want to carefully consider how we manage missing data, as described in Chapter \@ref(c11-missing-data). Finally, we want to avoid model overfitting by using too many variables in our formulas. Our model may be too tailored to generalize to new data. +With these considerations, we can conduct our analyses and review findings for statistically significant results. It's important to note that even statistically significant results are not necessarily meaningful or important. A large enough sample can produce statistically significant results even when the underlying effect is small.
Therefore, we want to look at our results in context, such as comparing them with results from other studies or analyzing them in conjunction with confidence intervals and other measures. -It's important to note that even significant results do not mean that it is meaningful or important. A large enough sample can produce statistically significant results. Therefore, we want to look at our results in context, such as comparing them with results from other studies or analyzing them in conjunction with confidence intervals and other measures. As analysts, we also need to curate what we report, including the many significant results that we find. \ No newline at end of file +Communicating the results (see Chapter \@ref(c08-communicating-results)) in an unbiased manner is also a critical step in any analysis project. If we present results without error measures, or only present results that support our initial hypotheses, we are not thinking critically and may incorrectly represent the data. As survey data analysts, we are often the interpreters of the survey data to the public. We must ensure that we are the best stewards of the data and work to bring light to important and interesting findings that the public will want to and need to know about.
From 455cd5a8e7a649eb77393ec927d57d52cc5f0970 Mon Sep 17 00:00:00 2001 From: rpowell22 Date: Sun, 17 Mar 2024 15:25:49 -0400 Subject: [PATCH 7/9] Fix code to ensure error message displays and doesn't quit the book build --- 12-successful-survey-data-analysis.Rmd | 1 + 1 file changed, 1 insertion(+) diff --git a/12-successful-survey-data-analysis.Rmd b/12-successful-survey-data-analysis.Rmd index 183bd40a..ab5a476d 100644 --- a/12-successful-survey-data-analysis.Rmd +++ b/12-successful-survey-data-analysis.Rmd @@ -217,6 +217,7 @@ example_des %>% Another common error message that you may see with survey analysis may look something like the following: ```{r} #| label: recommendations-desobj-loc +#| error: true example_des %>% svyttest(q_d1~gender) ``` From ac741e7d74550ec3cf8f1f3376f6c3f69e0c6c43 Mon Sep 17 00:00:00 2001 From: Stephanie Zimmer Date: Sun, 17 Mar 2024 21:45:58 -0400 Subject: [PATCH 8/9] SZ successful recs updates --- 01-introduction.Rmd | 2 +- 04-set-up.Rmd | 2 +- 11-missing-data.Rmd | 2 +- 12-successful-survey-data-analysis.Rmd | 4 ++-- 4 files changed, 5 insertions(+), 5 deletions(-) diff --git a/01-introduction.Rmd b/01-introduction.Rmd index 614649e5..920749cd 100644 --- a/01-introduction.Rmd +++ b/01-introduction.Rmd @@ -36,7 +36,7 @@ This book will cover many aspects of survey design and analysis, from understand - **Chapter \@ref(c09-reprex-data)**: TO-DO - **Chapter \@ref(c10-specifying-sample-designs)**: Specifying sampling designs. Descriptions of common sampling designs, when they are used, the math behind the mean and standard error estimates, how to specify the designs in R, and examples using real data. - **Chapter \@ref(c11-missing-data)**: TO-DO -- **Chapter \@ref(c12-pitfalls)**: TO-DO +- **Chapter \@ref(c12-recommendations)**: TO-DO - **Chapter \@ref(c13-ncvs-vignette)**: National Crime Victimization Survey Vignette. A vignette on how to analyze data from the NCVS, a survey in the U.S. 
that collects information on crimes and their characteristics. This illustrates an analysis that requires multiple files to calculate victimization rates. - **Chapter \@ref(c14-ambarom-vignette)**: AmericasBarometer Vignette. A vignette on how to analyze data from the AmericasBarometer, a survey of attitudes, evaluations, experiences, and behavior in countries in the Western Hemisphere. This includes how to make choropleth maps with survey estimates. diff --git a/04-set-up.Rmd b/04-set-up.Rmd index 09544faf..e17c1ff6 100644 --- a/04-set-up.Rmd +++ b/04-set-up.Rmd @@ -207,7 +207,7 @@ The design object is the backbone for survey analysis. It is where we specify th In this chapter, we provide details on how to code the design object for the ANES and RECS data used in the book. However, we only provide a high-level overview to get readers started. For a deeper understanding of creating these design objects for a variety of sampling designs, see Chapter \@ref(c10-specifying-sample-designs). -While we recommend conducting exploratory data analysis on the original data before diving into complex survey analysis (see Chapter \@ref(c12-pitfalls)), the actual analysis and inference should be performed with the survey design objects instead of the original survey data. For example, the ANES data is called `anes_2020`. If we create a survey design object called `anes_des`, our analyses should begin with `anes_des` and not `anes_2020`. Using the survey design object ensures that our calculations are appropriately accounting for the details of the survey design. +While we recommend conducting exploratory data analysis on the original data before diving into complex survey analysis (see Chapter \@ref(c12-recommendations)), the actual analysis and inference should be performed with the survey design objects instead of the original survey data. For example, the ANES data is called `anes_2020`. 
If we create a survey design object called `anes_des`, our analyses should begin with `anes_des` and not `anes_2020`. Using the survey design object ensures that our calculations are appropriately accounting for the details of the survey design. #### American National Election Studies (ANES) Design Object {-} diff --git a/11-missing-data.Rmd b/11-missing-data.Rmd index 0a7045bc..d6bad00f 100644 --- a/11-missing-data.Rmd +++ b/11-missing-data.Rmd @@ -84,7 +84,7 @@ There are two main categories that missing data typically fall into: missing by ## Assessing missing data -Before beginning analysis, we should explore the data to determine if there is missing data and what types of missing data are present. Conducting this descriptive analysis can help with analysis and reporting of survey data (see Section \@ref(c12-pitfalls)), and can inform the survey design in future studies. For example, large amounts of unexpected missing data may indicate the questions were unclear or difficult to recall. There are several ways to explore missing data which we walk through below. When assessing the missing data, we recommend using a data.frame object and not the survey object as most of the analysis is about patterns of records and weights are not necessary. +Before beginning analysis, we should explore the data to determine if there is missing data and what types of missing data are present. Conducting this descriptive analysis can help with analysis and reporting of survey data (see Section \@ref(c12-recommendations)), and can inform the survey design in future studies. For example, large amounts of unexpected missing data may indicate the questions were unclear or difficult to recall. There are several ways to explore missing data which we walk through below. When assessing the missing data, we recommend using a data.frame object and not the survey object as most of the analysis is about patterns of records and weights are not necessary. 
### Summarize data diff --git a/12-successful-survey-data-analysis.Rmd b/12-successful-survey-data-analysis.Rmd index ab5a476d..d748d491 100644 --- a/12-successful-survey-data-analysis.Rmd +++ b/12-successful-survey-data-analysis.Rmd @@ -192,7 +192,7 @@ example_des %>% summarize(across(where(is.numeric), ~ survey_mean(.x, na.rm = TRUE))) ``` -In this example, if we do not adjust `region` to be a factor variable type, we might accidentally report an average region of `r round(example_des %>% summarize(across(where(is.numeric), ~ survey_mean(.x, na.rm = TRUE))) %>% pull(region), 2)` in our findings. Checking that our variables are of the appropriate type will avoid this pitfall and ensure the measures and models are suitable for the type of variable. +In this example, if we do not adjust `region` to be a factor variable type, we might accidentally report an average region of `r round(example_des %>% summarize(across(where(is.numeric), ~ survey_mean(.x, na.rm = TRUE))) %>% pull(region), 2)` in our findings which is meaningless. Checking that our variables are of the appropriate type will avoid this pitfall and ensure the measures and models are suitable for the type of variable. ## Improve debugging skills @@ -240,7 +240,7 @@ Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : We can see that the error has to do with a function requiring a factor with two or more levels, and that it has been applied to something else. This ties back to our section on using appropriate variable types. We can check the variable of interest to examine whether it's the correct type. -The internet also offers many resources for debugging. Searching for a specific error message can often lead us to a solution. In addition, we can post on community forums like Posit Community, [https://community.rstudio.com/](https://community.rstudio.com/), for direct help from others. +The internet also offers many resources for debugging. 
Searching for a specific error message can often lead us to a solution. In addition, we can post on community forums like [Posit Community](https://community.rstudio.com/), for direct help from others. ## Think critically about conclusions From 4931d94878b32ec5149f6b9833f1dcf71c5a2d92 Mon Sep 17 00:00:00 2001 From: Isabella Velasquez Date: Wed, 20 Mar 2024 23:06:52 -0500 Subject: [PATCH 9/9] Grammar edits --- 12-successful-survey-data-analysis.Rmd | 69 ++++++++++++++------------ 1 file changed, 36 insertions(+), 33 deletions(-) diff --git a/12-successful-survey-data-analysis.Rmd b/12-successful-survey-data-analysis.Rmd index d748d491..b7119883 100644 --- a/12-successful-survey-data-analysis.Rmd +++ b/12-successful-survey-data-analysis.Rmd @@ -61,7 +61,7 @@ example_des <- ## Introduction -The previous chapters in this book aimed to provide the technical skills and knowledge required to run survey analyses. This chapter builds upon the best practices previously mentioned to present a curated set of recommendations aimed at running a *successful* survey analysis. We hope this list equips you with practical insights that assist in producing meaningful and reliable results. +The previous chapters in this book aimed to provide the technical skills and knowledge required for running survey analyses. This chapter builds upon the previously mentioned best practices to present a curated set of recommendations for running a *successful* survey analysis. We hope this list equips you with practical insights that assist in producing meaningful and reliable results. ## Follow survey analysis process {#recs-survey-process} @@ -75,11 +75,11 @@ As we first introduced in Chapter \@ref(c04-getting-started) (Section \@ref(surv 4. Within `summarize()`, specify variables to calculate, including means, totals, proportions, quantiles, and more -The order of these steps matters in survey analysis. 
For example, if we need to subset the data, we must use `filter()` on our data **after** creating the survey design. If we do this before the survey design is created, then we may not be correctly accounting for the study design and it may result in incorrect findings. +The order of these steps matters in survey analysis. For example, if we need to subset the data, we must use `filter()` on our data **after** creating the survey design. If we do this before the survey design is created, we may not be correctly accounting for the study design, resulting in incorrect findings. Additionally, correctly identifying the survey design is one of the most important steps in survey analysis. Knowing the type of sample design (e.g., clustered, stratified) will help ensure the underlying error structure is correctly calculated and weights are correctly used. Reviewing the documentation (see Chapter \@ref(c03-understanding-survey-data-documentation)) will help us understand what variables to use from the data. Learning about complex design factors such as clustering, stratification, and weighting is foundational to complex survey analysis, and we recommend that all analysts review Chapter \@ref(c10-specifying-sample-designs) before creating their first design object. -Making sure to use the survey analysis functions from the {srvyr} and {survey} packages is also important in survey analysis. For example, using `mean()` and `survey_mean()` on the same data will result in different findings and output. Each of the survey functions from {srvyr} and {survey} impacts standard errors and variance, and we cannot treat complex surveys as unweighted simple random samples if we want to produce unbiased estimates. +Making sure to use the survey analysis functions from the {srvyr} and {survey} packages is also important in survey analysis. For example, using `mean()` and `survey_mean()` on the same data will result in different findings and outputs. 
Each of the survey functions from {srvyr} and {survey} impacts standard errors and variance, and we cannot treat complex surveys as unweighted simple random samples if we want to produce unbiased estimates. ## Begin with descriptive analysis @@ -87,7 +87,7 @@ When receiving a fresh batch of data, it's tempting to jump right into running m ### Table review -Even before applying weights, consider running cross-tabulations on the raw data. This can help us see if any patterns stand out that may be alarming, or something worth further investigating. +Even before applying weights, consider running cross-tabulations on the raw data. Crosstabs can help us see if any patterns stand out that may be alarming or something worth further investigating. For example, let’s explore the example survey dataset introduced in the Prerequisites box, `example_srvy`. We run the code below on the unweighted data to inspect the `gender` variable: @@ -98,13 +98,13 @@ example_srvy %>% summarise(n = n()) ``` -The data shows that males make up 1 out of 10, or 10%, of the sample. Generally, we assume around a 50/50 split between male and female respondents in a population. The large female proportion could indicate either a unique sample or a potential error in the data. If we review the survey documentation and see this was a deliberate part of the design, we can continue our analysis using the appropriate methods. If this was not an intentional choice by the researchers, the results alert us that something may be incorrect in the data or our code, and we can verify if there's an issue by comparing the results with the weighted means. +The data shows that males comprise 1 out of 10, or 10%, of the sample. Generally, we assume something close to a 50/50 split between male and female respondents in a population. The sizeable female proportion could indicate either a unique sample or a potential error in the data. 
If we review the survey documentation and see this was a deliberate part of the design, we can continue our analysis using the appropriate methods. If this was not an intentional choice by the researchers, the results alert us that something may be incorrect in the data or our code, and we can verify if there’s an issue by comparing the results with the weighted means. ### Graphical review Tables provide a quick check of our assumptions, but there is no substitute for graphs and plots to visualize the distribution of data. We might miss outliers or nuances if we scan only summary statistics. -For example, Anscombe's Quartet demonstrates the importance of visualization in analysis. Let's say we have a dataset with x- and y- variables in an object called `anscombe_tidy`. Let's take a look at how the dataset is structured: +For example, Anscombe's Quartet demonstrates the importance of visualization in analysis. Let's say we have a dataset with x- and y- variables in an object called `anscombe_tidy`. Let's take a look at how the dataset is structured: ```{r} #| label: recommendations-anscombe-head @@ -126,7 +126,7 @@ anscombe_tidy %>% ) ``` -These are useful statistics. We can note that the data doesn't have high variability and the two variables are strongly correlated. Now, let's check all of the sets (I-IV) in the Anscombe data. Notice anything interesting? +These are useful statistics. We can note that the data doesn’t have high variability, and the two variables are strongly correlated. Now, let’s check all the sets (I-IV) in the Anscombe data. Notice anything interesting? ```{r} #| label: recommendations-anscombe-calc-2 @@ -141,7 +141,7 @@ anscombe_tidy %>% ) ``` -The summary results for these four sets are nearly identical!
Based on this, we might assume that each distribution is similar. Let's look at a data visualization to see if our assumption is correct. ```{r} #| label: recommendations-anscombe-plot @@ -152,13 +152,13 @@ ggplot(anscombe_tidy, aes(x, y)) + theme_minimal() ``` -Although each of the four sets has the same summary statistics and regression line, when reviewing the plots, it becomes apparent that the distributions of the data are are not the same at all. Each set of points results in different shapes and distributions. Imagine sharing each set (I-IV) and corresponding plot with a different colleague. The interpretations and descriptions of the data would be very different even though the statistics are similar. Plotting data can also ensure that we are using the correct analysis method on the data, so understanding the underlying distributions is an important first step. +Although each of the four sets has the same summary statistics and regression line, when reviewing the plots, it becomes apparent that the distributions of the data are not the same at all. Each set of points results in different shapes and distributions. Imagine sharing each set (I-IV) and the corresponding plot with a different colleague. The interpretations and descriptions of the data would be very different even though the statistics are similar. Plotting data can also ensure that we are using the correct analysis method on the data, so understanding the underlying distributions is an important first step. -With survey data, we may not always have continuous data that we can plot like Anscombe's Quartet. However, if the dataset does contain continuous data or other types of data which would benefit from a visual representation, we recommend taking the time to graph distributions and correlations. +With survey data, we may not always have continuous data that we can plot like Anscombe's Quartet. 
However, if the dataset does contain continuous data or other types of data that would benefit from a visual representation, we recommend taking the time to graph distributions and correlations. ## Check variable types -When we pull the data from surveys into R, the data may be listed as character, factor, numeric, or logical/Boolean. The tidyverse functions that read in data (e.g., `read_csv()`, `read_excel()`) default to have all strings load as character variables. This is important when dealing with survey data as many strings may be better suited for factors than character variables. For example, let's revisit the `example_srvy` data. Taking a `glimpse()` of the data gives us insight into what it contains: +When we pull the data from surveys into R, the data may be listed as character, factor, numeric, or logical/Boolean. The tidyverse functions that read in data (e.g., `read_csv()`, `read_excel()`) default to have all strings load as character variables. This is important when dealing with survey data, as many strings may be better suited for factors than character variables. For example, let's revisit the `example_srvy` data. 
Taking a `glimpse()` of the data gives us insight into what it contains: ```{r} #| label: recommendations-example-dat-glimpse @@ -170,17 +170,19 @@ The output shows that `q_d2_1` is a character variable, but the values of that v ```{r} #| label: recommendations-example-dat-fct -example_srvy_fct<-example_srvy %>% - mutate(q_d2_1_fct=factor(q_d2_1, - levels=c("Very interested", - "Somewhat interested", - "Not at all interested"))) - -example_srvy_fct %>% +example_srvy_fct <- example_srvy %>% + mutate(q_d2_1_fct = factor( + q_d2_1, + levels = c("Very interested", + "Somewhat interested", + "Not at all interested") + )) + +example_srvy_fct %>% glimpse() -example_srvy_fct %>% - count(q_d2_1_fct,q_d2_1) +example_srvy_fct %>% + count(q_d2_1_fct, q_d2_1) ``` This example data also includes a column called `region`, which is imported as a number (``). This is a good hint to use the questionnaire and codebook along with the data to find out if the values actually reflect a number or are perhaps a coded categorical variable (see Chapter \@ref(c03-understanding-survey-data-documentation) for more details). R will calculate the mean even if it is not appropriate, leading to the common mistake of applying an average to categorical values instead of a proportion function. For example, for ease of coding, we may use the `across()` function to calculate the mean across all numeric variables: @@ -192,13 +194,13 @@ example_des %>% summarize(across(where(is.numeric), ~ survey_mean(.x, na.rm = TRUE))) ``` -In this example, if we do not adjust `region` to be a factor variable type, we might accidentally report an average region of `r round(example_des %>% summarize(across(where(is.numeric), ~ survey_mean(.x, na.rm = TRUE))) %>% pull(region), 2)` in our findings which is meaningless. Checking that our variables are of the appropriate type will avoid this pitfall and ensure the measures and models are suitable for the type of variable. 
+In this example, if we do not adjust `region` to be a factor variable type, we might accidentally report an average region of `r round(example_des %>% summarize(across(where(is.numeric), ~ survey_mean(.x, na.rm = TRUE))) %>% pull(region), 2)` in our findings which is meaningless. Checking that our variables are appropriate will avoid this pitfall and ensure the measures and models are suitable for the variable type. ## Improve debugging skills -It is common for analysts working in R to come across warning or error messages and learning how to debug these messages (i.e., find and fix issues), ensures we can proceed with our work and avoid potential mistakes. +It is common for analysts working in R to come across warning or error messages, and learning how to debug these messages (i.e., find and fix issues) ensures we can proceed with our work and avoid potential mistakes. -We've discussed a few examples in this book. For example, if we calculate an average with `survey_mean()` and we get `NA` instead of a number, it may be because there are missing values in our column. +We've discussed a few examples in this book. For example, if we calculate an average with `survey_mean()` and get `NA` instead of a number, it may be because our column has missing values. ```{r} #| label: recommendations-missing-dat @@ -214,7 +216,8 @@ example_des %>% summarize(mean = survey_mean(q_d1, na.rm = TRUE)) ``` -Another common error message that you may see with survey analysis may look something like the following: +Another common error message that we may see with survey analysis may look something like the following: + ```{r} #| label: recommendations-desobj-loc #| error: true @@ -222,13 +225,13 @@ example_des %>% svyttest(q_d1~gender) ``` -In this case, we need to remember that with functions from the {survey} packages like `svyttest()`, the design object is not the first argument and we have to use the dot (`.`) notation (see Chapter \@ref(c06-statistical-testing)). 
Adding in the named argument of `design=.` will fix this error. +In this case, we need to remember that with functions from the {survey} package like `svyttest()`, the design object is not the first argument, and we have to use the dot (`.`) notation (see Chapter \@ref(c06-statistical-testing)). Adding in the named argument of `design=.` will fix this error. ```{r} #| label: recommendations-desobj-locfix -example_des %>% - svyttest(q_d1~gender, - design=.) +example_des %>% + svyttest(q_d1 ~ gender, + design = .) ``` Often, debugging involves interpreting the message from R. For example, if our code results in this error: @@ -238,14 +241,14 @@ Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels ``` -We can see that the error has to do with a function requiring a factor with two or more levels, and that it has been applied to something else. This ties back to our section on using appropriate variable types. We can check the variable of interest to examine whether it's the correct type. -The internet also offers many resources for debugging. Searching for a specific error message can often lead us to a solution. In addition, we can post on community forums like [Posit Community](https://community.rstudio.com/), for direct help from others. +We can see that the error has to do with a function requiring a factor with two or more levels and that it has been applied to something else. This ties back to our section on using appropriate variable types. We can check the variable of interest to examine whether it's the correct type. +The internet also offers many resources for debugging. Searching for a specific error message can often lead to a solution. In addition, we can post on community forums like [Posit Community](https://forum.posit.co/) for direct help from others. ## Think critically about conclusions -Once we have our findings, we need to learn to think critically about our findings.
As mentioned in Chapter \@ref(c02-overview-surveys), there are many aspects to the study design that can impact our interpretation of the results. For example, the number and types of response options provided to the respondent or who was asked the question (both thinking about the full sample and any skip patterns). Knowing the overall study design can help us accurately think through what the findings may mean and identify any issues with our analyses. Additionally, we should make sure that our survey design object is correctly defined (see Chapter \@ref(c10-specifying-sample-designs)), carefully consider how we are managing missing data (see Chapter \@ref(c11-missing-data)), and follow best statistical analysis procedures such as avoiding model overfitting by using too many variables in our formulas. +Once we have our findings, we need to think critically about them. As mentioned in Chapter \@ref(c02-overview-surveys), many aspects of the study design can impact our interpretation of the results; for example, the number and types of response options provided to the respondent or who was asked the question (both thinking about the full sample and any skip patterns). Knowing the overall study design can help us accurately think through what the findings may mean and identify any issues with our analyses. Additionally, we should make sure that our survey design object is correctly defined (see Chapter \@ref(c10-specifying-sample-designs)), carefully consider how we are managing missing data (see Chapter \@ref(c11-missing-data)), and follow sound statistical analysis procedures, such as avoiding the model overfitting that comes from using too many variables in our formulas. -With these considerations, we can conduct our analyses and review findings for statistically significant results. It's important to note that even significant results do not mean that it is meaningful or important. A large enough sample can produce statistically significant results.
Therefore, we want to look at our results in context, such as comparing them with results from other studies or analyzing them in conjunction with confidence intervals and other measures. +These considerations allow us to conduct our analyses and review findings for statistically significant results. It's important to note that a statistically significant result is not necessarily meaningful or important. A large enough sample can produce statistically significant results even for very small differences. Therefore, we want to look at our results in context, such as comparing them with results from other studies or analyzing them in conjunction with confidence intervals and other measures. -Communicating the results (see Chapter \@ref(c08-communicating-results)) in an unbiased manner in also a critical step to any analysis project. If we present results with out error measures, or only present results that support our initial hypotheses, we are not thinking critically and may incorrectly represent the data. As survey data analysts we are often the interpreter of the survey data to the public. We must ensure that we are the best stewards of the data and work to bring light to important and interesting findings that the public will want to and need to know about. +Communicating the results (see Chapter \@ref(c08-communicating-results)) in an unbiased manner is also a critical step in any analysis project. If we present results without error measures or only present results that support our initial hypotheses, we are not thinking critically and may incorrectly represent the data. As survey data analysts, we often interpret the survey data for the public. We must ensure that we are the best stewards of the data and work to bring to light meaningful and interesting findings that the public will want and need to know about. \ No newline at end of file