diff --git a/03-understanding-survey-data-documentation.Rmd b/03-understanding-survey-data-documentation.Rmd index 3c692e53..60dd8b27 100644 --- a/03-understanding-survey-data-documentation.Rmd +++ b/03-understanding-survey-data-documentation.Rmd @@ -190,7 +190,9 @@ The section "Data Analysis, Weights, and Variance Estimation" includes informati > For analysis of the complete set of cases using pre-election data only, including all cases and representative of the 2020 electorate, use the full sample pre-election weight, **V200010a**. For analysis including post-election data for the complete set of participants (i.e., analysis of post-election data only or a combination of pre- and post-election data), use the full sample post-election weight, **V200010b**. Additional weights are provided for analysis of subsets of the data... -The document provides more information about the variables, summarized below: +The document provides more information about the variables, summarized in Table \@ref(tab:aneswgts). 
+ +Table: (\#tab:aneswgts) Weight and variance information for ANES For weight | Use variance unit/PSU/cluster | and use variance stratum :-----------:|:-----------:|:-----------: diff --git a/06-statistical-testing.Rmd b/06-statistical-testing.Rmd index b151e032..9cbc524c 100644 --- a/06-statistical-testing.Rmd +++ b/06-statistical-testing.Rmd @@ -509,23 +509,39 @@ However, as researchers, we often want to know about the proportions and not jus ```{r} #| label: stattest-chi-ex2-prop1 #| warning: false -chi_ex2$observed %>% as_tibble() %>% +chi_ex2_table<-chi_ex2$observed %>% + as_tibble() %>% group_by(TrustPeople) %>% mutate(prop = round(n / sum(n), 3)) %>% select(-n) %>% - pivot_wider(names_from = TrustPeople, values_from = prop) %>% + pivot_wider(names_from = TrustPeople, values_from = prop) %>% gt(rowname_col = "TrustGovernment") %>% tab_stubhead(label = "Trust in Government") %>% tab_spanner(label = "Trust in People", - columns = everything()) %>% + columns = everything()) %>% cols_label(`Most of the time` = md("Most of
the time"), `About half the time` = md("About half
the time"), - `Some of the time` = md("Some of
the time")) %>% - tab_caption("Estimates of proportion of people - by levels of trust in people and government, - ANES 2020") + `Some of the time` = md("Some of
the time")) +``` + +```{r} +#| label: stattest-chi-ex2-prop1-noeval +#| eval: false +chi_ex2_table +``` + +(ref:stattest-chi-ex2-prop1-tab) Estimates of proportion of people by levels of trust in people and government, ANES 2020 + +```{r} +#| label: stattest-chi-ex2-prop1-out +#| echo: FALSE +#| warning: FALSE + +chi_ex2_table %>% + print_gt_book("stattest-chi-ex2-prop1-tab") ``` + The second option is to use `group_by()` and `survey_mean()` functions to calculate the proportions from the ANES design object. A reminder that with more than one variable listed in the `group_by()` statement, the proportions are within the first variable listed. As mentioned above, we are looking at the distribution of `TrustGovernment` for each level of `TrustPeople`. ```{r} @@ -536,7 +552,7 @@ chi_ex2_obs <- anes_des %>% summarize(Observed = round(survey_mean(vartype = "ci"), 3), .groups="drop") -chi_ex2_obs %>% +chi_ex2_obs_table<-chi_ex2_obs %>% mutate(prop = paste0(Observed, " (", Observed_low, ", ", Observed_upp, ")")) %>% select(TrustGovernment, TrustPeople, prop) %>% @@ -545,11 +561,24 @@ chi_ex2_obs %>% tab_stubhead(label = "Trust in Government") %>% tab_spanner(label = "Trust in People", columns = everything()) %>% - tab_options(page.orientation = "landscape") %>% - tab_caption("Estimates of proportion of people - by levels of trust in people and government - with confidence intervals, - ANES 2020") + tab_options(page.orientation = "landscape") +``` + +```{r} +#| label: stattest-chi-ex2-prop2-noeval +#| eval: false +chi_ex2_obs_table +``` + +(ref:stattest-chi-ex2-prop2-tab) Estimates of proportion of people by levels of trust in people and government with confidence intervals, ANES 2020 + +```{r} +#| label: stattest-chi-ex2-prop2-out +#| echo: FALSE +#| warning: FALSE + +chi_ex2_obs_table %>% + print_gt_book("stattest-chi-ex2-prop2-tab") ``` Both methods produce the same output as the `svychisq()` function does account for the survey design. 
However, calculating the proportions directly from the design object means we can also obtain the variance information. In this case, the table output displays the survey estimate followed by the confidence intervals. Based on the output, we can see that of those who never trust people, 50.3% also never trust the government, while the proportions of never trusting the government are much lower for each of the other levels of trusting people. @@ -621,17 +650,31 @@ chi_ex3_obs <- anes_des %>% group_by(VotedPres2020_selection, AgeGroup) %>% summarize(Observed = round(survey_mean(vartype = "ci"), 3)) -chi_ex3_obs %>% +chi_ex3_obs_table<-chi_ex3_obs %>% mutate(prop = paste0(Observed, " (", Observed_low, ", ", Observed_upp, ")")) %>% select(AgeGroup, VotedPres2020_selection, prop) %>% pivot_wider(names_from = VotedPres2020_selection, values_from = prop) %>% gt(rowname_col = "AgeGroup") %>% - tab_stubhead(label = "Age Group") %>% - tab_caption("Distribution of age group - by presidential candidate selection - with confidence intervals") + tab_stubhead(label = "Age Group") +``` + +```{r} +#| label: stattest-chi-ex3-table-noeval +#| eval: false +chi_ex3_obs_table +``` + +(ref:stattest-chi-ex3-tab) Distribution of age group by presidential candidate selection with confidence intervals + +```{r} +#| label: stattest-chi-ex3-table-out +#| echo: FALSE +#| warning: FALSE + +chi_ex3_obs_table %>% + print_gt_book("stattest-chi-ex3-tab") ``` We can see that the age group distribution was younger for Biden and other candidates and older for Trump. For example, of those who voted for Biden, 20.4% were in the 18-29 age group, compared to only 11.4% of those who voted for Trump were in that age group. On the other side, 23.4% of those who voted for Trump were in the 50-59 age group compared to only 15.4% of those who voted for Biden. 
diff --git a/08-communicating-results.Rmd b/08-communicating-results.Rmd index 84ab6009..bed538f1 100644 --- a/08-communicating-results.Rmd +++ b/08-communicating-results.Rmd @@ -120,40 +120,69 @@ The native output that R produces may work for initial viewing inside RStudio or Looking at the output from `trust_gov`, there are a couple of items that are probably obvious to fix: (1) use percentages instead of proportions and (2) the variable names as the column headers. The {gt} package is a good tool for implementing better labeling and creating publishable tables. Let's walk through some code as we implement a few changes to improve the table's usefulness. -We begin with the `gt()` function to initiate the table and use the argument `rowname_col` to make the `TrustGovernment` column the labels for each row (called the table "stub"). The `cols_label()` function is used to create informative column labels instead of variable names, and the `tab_spanner()` function is applied to add a label across multiple columns. In this case, we apply the label "Trust in Government, 2020" across all the columns except the stub. Finally, the `fmt_percent()` function is used to format the proportions into percentages and reduce the number of decimals shown. Note, the `tab_caption()` function is used to add a table title for the book and allows for cross-referencing in R Markdown Quarto and bookdown, as well as adding it to the list of tables in the book. +We begin with the `gt()` function to initiate the table and use the argument `rowname_col` to make the `TrustGovernment` column the labels for each row (called the table "stub"). The `cols_label()` function is used to create informative column labels instead of variable names, and the `tab_spanner()` function is applied to add a label across multiple columns. In this case, we apply the label "Trust in Government, 2020" across all the columns except the stub. 
Finally, the `fmt_percent()` function is used to format the proportions into percentages and reduce the number of decimals shown. Note, the `tab_caption()` function is used to add a table title for the book in HTML and allows for cross-referencing in R Markdown, Quarto, and bookdown, as well as adding it to the list of tables in the book. ```{r} #| label: results-table-gt1 -trust_gov_gt <- - trust_gov %>% +trust_gov_gt <- trust_gov %>% gt(rowname_col = "TrustGovernment") %>% cols_label(trust_gov_p = "%", trust_gov_p_se = "s.e. (%)") %>% tab_spanner(label = "Trust in Government, 2020", columns = c(trust_gov_p, trust_gov_p_se)) %>% - fmt_percent(decimals = 1) %>% - tab_caption("Example of gt table with trust in government estimates.") + fmt_percent(decimals = 1) +``` + +```{r} +#| label: results-table-gt1-noeval +#| eval: false +trust_gov_gt %>% + tab_caption("Example of gt table with trust in government estimate") +``` + +(ref:results-table-gt1-tab) Example of gt table with trust in government estimate -trust_gov_gt +```{r} +#| label: results-table-gt1-out +#| echo: FALSE +#| warning: FALSE + +trust_gov_gt %>% + print_gt_book("results-table-gt1-tab") ``` A few more things we can add are a title, a data source note, and a footnote with the question information using the functions `tab_header()`, `tab_source_note()`, and `tab_footnote()`. ```{r} #| label: results-table-gt2 -trust_gov_gt %>% +trust_gov_gt2<-trust_gov_gt %>% tab_header("American voter's trust in the federal government, 2020") %>% tab_source_note("American National Election Studies, 2020") %>% tab_footnote( "Question text: How often can you trust the federal government in Washington to do what is right?" 
- ) %>% - tab_caption("Example of gt table with trust - in government estimates with additional context.") + ) +``` + +```{r} +#| label: results-table-gt2-noeval +#| eval: false +trust_gov_gt2 +``` + +(ref:results-table-gt2-tab) Example of gt table with trust in government estimates with additional context + +```{r} +#| label: results-table-gt2-out +#| echo: FALSE +#| warning: FALSE + +trust_gov_gt2 %>% + print_gt_book("results-table-gt2-tab") ``` -#### Expanding Tables using {gtsummary} +#### Expanding Tables using {gtsummary} {-} The {gtsummary} package simultaneously summarizes data and creates publication-ready tables. Its origins are in clinical trial data but it has been extended to include survey analysis in some limited ways. At this time, it only works with survey objects using Taylor's Series Linearization and not replicate methods. A limited set of summary statistics are available. For categorical variables, the following summary statistics are available: @@ -179,42 +208,68 @@ For continuous variables, the following summary statistics are available: - `{p##}` any integer percentile, where `##` is an integer from 0 to 100 - `{sum}` sum -In the following example, we will build up a table using {gtsummary}, which will be similar to the table in the {gt} example. The main function used is `tbl_svysummary()`. In this function, the variables we want to analyze are included in the `include` argument, and the statistics we want to display are in the `statistic` argument. To specify statistics, the syntax from the {glue} package is used where variables you want to insert are included inside curly brackets. To specify that we want, the proportion followed by the standard error of the proportion in parentheses, we use "{p} ({p.std.error})". We must specify the statistics we want using the names of the statistics in the two lists above. To print this table in all format types, we use the `as_gt()` function to maintain the {gtsummary} formatting. 
+In the following example, we will build up a table using {gtsummary}, which will be similar to the table in the {gt} example. The main function used is `tbl_svysummary()`. In this function, the variables we want to analyze are included in the `include` argument, and the statistics we want to display are in the `statistic` argument. To specify statistics, the syntax from the {glue} package is used where variables you want to insert are included inside curly brackets. To specify that we want, the proportion followed by the standard error of the proportion in parentheses, we use "{p} ({p.std.error})". We must specify the statistics we want using the names of the statistics in the two lists above. ```{r} #| label: results-gts-ex-1 -anes_des %>% +anes_des_gtsum<-anes_des %>% tbl_svysummary(include = TrustGovernment, - statistic = list(all_categorical() ~ "{p} ({p.std.error})")) %>% - as_gt() %>% - tab_caption("Example of gtsummary table with trust - in government estimates.") + statistic = list(all_categorical() ~ "{p} ({p.std.error})")) +``` + +```{r} +#| label: results-table-gt3-noeval +#| eval: false +anes_des_gtsum +``` + +(ref:results-gts-ex-1-tab) Example of gtsummary table with trust in government estimates + +```{r} +#| label: results-gts-ex-1-out +#| echo: FALSE +#| warning: FALSE + +anes_des_gtsum %>% + print_gt_book("results-gts-ex-1-tab") ``` In this default table, the weighted number of missing (or Unknown) records is included. Additionally, the standard error is reported as a proportion while the proportion is styled as a percentage. In the next step, we remove the Unknown category by setting the missing argument to "no" and format the standard error as a percentage within the digits argument. Finally, we label the "TrustGovernment" variable to something more publication-ready using the label argument. 
```{r} #| label: results-gts-ex-2 -anes_des %>% +anes_des_gtsum2<-anes_des %>% tbl_svysummary( include = TrustGovernment, statistic = list(all_categorical() ~ "{p} ({p.std.error})"), missing = "no", digits = list(TrustGovernment ~ style_percent), label = list(TrustGovernment ~ "Trust in Government, 2020") - ) %>% - as_gt() %>% - tab_caption( - "Example of gtsummary table with trust - in government estimates with labeling and digits options." ) ``` -To remove the phrase "Characteristic" and the estimated population size, we can modify the header using the function `modify_header()` to update the label and stat_0. To add footnotes and a title, we do this after converting the object to a gt table using `as_gt()` and can use the same functions we did in Section \@ref(results-gt) +```{r} +#| label: results-gts-ex-2-noeval +#| eval: false +anes_des_gtsum2 +``` + +(ref:results-gts-ex-2-tab) Example of gtsummary table with trust in government estimates with labeling and digits options + +```{r} +#| label: results-gts-ex-2-out +#| echo: FALSE +#| warning: FALSE + +anes_des_gtsum2 %>% + print_gt_book("results-gts-ex-2-tab") +``` + +To remove the phrase "Characteristic" and the estimated population size, we can modify the header using the function `modify_header()` to update the label and stat_0. To add footnotes and a title, we do this after converting the object to a gt table using `as_gt()` and can use the same functions we did in Section \@ref(results-gt). ```{r} #| label: results-gts-ex-3 -anes_des %>% +anes_des_gtsum3<-anes_des %>% tbl_svysummary( include = TrustGovernment, statistic = list(all_categorical() ~ "{p} ({p.std.error})"), @@ -233,20 +288,32 @@ anes_des %>% "Question text: How often can you trust the federal government in Washington to do what is right?" - ) %>% - tab_caption( - "Example of gtsummary table with trust - in government estimates with more - labeling options and context." 
) ``` +```{r} +#| label: results-gts-ex-3-noeval +#| eval: false +anes_des_gtsum3 +``` + +(ref:results-gts-ex-3-tab) Example of gtsummary table with trust in government estimates with more labeling options and context + +```{r} +#| label: results-gts-ex-3-out +#| echo: FALSE +#| warning: FALSE + +anes_des_gtsum3 %>% + print_gt_book("results-gts-ex-3-tab") +``` + Continuous variables can also be added, and we add a summary of the age variable to the table below by updating the include, statistic, and digits argument. Adding on additional variables is a large benefit to the {gtsummary} package. ```{r} #| label: results-gts-ex-4 #| tidy: FALSE -anes_des %>% +anes_des_gtsum4<-anes_des %>% tbl_svysummary( include = c(TrustGovernment, Age), statistic = list( @@ -267,17 +334,33 @@ anes_des %>% "Question text: How often can you trust the federal government in Washington to do what is right?" ) %>% - tab_caption("Example of gtsummary table with trust in government estimates and average age.") + tab_caption("Example of gtsummary table with trust in government + estimates and average age") ``` -The { -gtsummary -} also allows calculating statistics by different groups easily. Let's adapt the prior example to perform analysis by whether the person voted for president in 2020. The argument for by is updated and the header names are updated. Finally, we update the header. +```{r} +#| label: results-gts-ex-4-noeval +#| eval: false +anes_des_gtsum4 +``` + +(ref:results-gts-ex-4-tab) Example of gtsummary table with trust in government estimates and average age + +```{r} +#| label: results-gts-ex-4-out +#| echo: FALSE +#| warning: FALSE + +anes_des_gtsum4 %>% + print_gt_book("results-gts-ex-4-tab") +``` + +The {gtsummary} also allows calculating statistics by different groups easily. Let's adapt the prior example to perform analysis by whether the person voted for president in 2020. The argument for by is updated and the header names are updated. Finally, we update the header. 
```{r} #| label: results-gts-ex-5 #| messages: FALSE -anes_des %>% +anes_des_gtsum5<-anes_des %>% drop_na(VotedPres2020) %>% tbl_svysummary( include=TrustGovernment, @@ -298,11 +381,27 @@ anes_des %>% in the 2020 presidential election") %>% tab_source_note("American National Election Studies, 2020") %>% tab_footnote("Question text: How often can you trust the federal government - in Washington to do what is right?") %>% - tab_caption("Example of gtsummary table with trust - in government estimates by voting status.") + in Washington to do what is right?") ``` +```{r} +#| label: results-gts-ex-5-noeval +#| eval: false +anes_des_gtsum5 +``` + +(ref:results-gts-ex-5-tab) Example of gtsummary table with trust in government estimates by voting status + +```{r} +#| label: results-gts-ex-5-out +#| echo: FALSE +#| warning: FALSE + +anes_des_gtsum5 %>% + print_gt_book("results-gts-ex-5-tab") +``` + + ### Charts and Plots Survey analysis can result in an abundance of printed summary statistics and models. Even with the best analysis, the results can be overwhelming and difficult to comprehend. This is where charts and plots play a key role in our work. By transforming complex data into a visual representation, we can recognize patterns, relationships, and trends with greater ease. 
diff --git a/13-ncvs-vignette.Rmd b/13-ncvs-vignette.Rmd index 055c2afe..a37e6f4f 100644 --- a/13-ncvs-vignette.Rmd +++ b/13-ncvs-vignette.Rmd @@ -724,7 +724,7 @@ pers_est_df <- c("Sex", "RaceHispOrigin", "AgeGroup", "MaritalStatus", "Income") %>% map_df(pers_est_by) -pers_est_df %>% +vr_gt<-pers_est_df %>% mutate( Variable = case_when( Variable == "RaceHispOrigin" ~ "Race/Hispanic origin", @@ -760,9 +760,9 @@ pers_est_df %>% ) %>% tab_footnote( footnote = "Excludes persons of Hispanic origin", - locations = - cells_stub(rows = Level %in% - c("White", "Black", "Asian", NHOPI, "Other"))) %>% + locations = + cells_stub(rows = Level %in% + c("White", "Black", "Asian", NHOPI, "Other"))) %>% tab_footnote( footnote = "Includes persons who identified as Native Hawaiian or Other Pacific Islander only.", @@ -782,6 +782,23 @@ pers_est_df %>% by type of crime and demographic characteristics, 2021") ``` +```{r} +#| label: ncvs-vign-rates-demo-noeval +#| eval: false +vr_gt +``` + +(ref:ncvs-vign-rates-demo-tab) Rate and standard error of violent victimization, by type of crime and demographic characteristics, 2021 + +```{r} +#| label: ncvs-vign-rates-demo-out +#| echo: FALSE +#| warning: FALSE + +vr_gt %>% + print_gt_book("ncvs-vign-rates-demo-tab") +``` + ### Estimation 4: Prevalence Rates {#prev-rate} Prevalence rates differ from victimization rates as the numerator is the number of people or households victimized rather than the number of victimizations. To calculate the prevalence rates, we must run another summary of the data by calculating an indicator for whether a person or household is a victim of a particular crime at any point in the year. Below is an example of calculating first the indicator and then the prevalence rate of violent crime and aggravated assault. 
diff --git a/14-ambarom-vignette.Rmd b/14-ambarom-vignette.Rmd index 447e6a8d..01b7ee94 100644 --- a/14-ambarom-vignette.Rmd +++ b/14-ambarom-vignette.Rmd @@ -172,17 +172,31 @@ covid_worry_country_ests <- summarize(p = survey_mean(CovidWorry_bin == "WorriedHi", na.rm = TRUE) * 100) -covid_worry_country_ests %>% +covid_worry_country_ests_gt<-covid_worry_country_ests %>% gt(rowname_col = "Country") %>% cols_label(p = "Percent", p_se = "SE") %>% - tab_caption("Proportion worried about the possibility that - they or someone in their household will get sick from - coronavirus in the next 3 months") %>% fmt_number(decimals = 1) %>% tab_source_note("AmericasBarometer Surveys, 2021") ``` +```{r} +#| label: ambarom-est1-noeval +#| eval: false +covid_worry_country_ests_gt +``` + +(ref:ambarom-est1-tab) Proportion worried about the possibility that they or someone in their household will get sick from coronavirus in the next 3 months + +```{r} +#| label: ambarom-est1-out +#| echo: FALSE +#| warning: FALSE + +covid_worry_country_ests_gt %>% + print_gt_book("ambarom-est1-tab") +``` + Another question asked how education was affected by the pandemic. This question was asked among households with children under the age of 13, and respondents could select more than one option as follows: > Did any of these children have their school education affected due to the pandemic? 
@@ -232,7 +246,7 @@ covid_educ_ests <- p_noschool = survey_mean(Educ_NoSchool, na.rm = TRUE) * 100, ) -covid_educ_ests %>% +covid_educ_ests_gt<-covid_educ_ests %>% gt(rowname_col = "Country") %>% cols_label(p_onlynormal = "%", p_onlynormal_se = "SE", @@ -247,12 +261,26 @@ covid_educ_ests %>% tab_spanner(label = "Cut ties with school", columns = c("p_noschool", "p_noschool_se")) %>% fmt_number(decimals = 1) %>% - tab_caption("Impact on education in households with children under - the age of 13 who had children that would generally - attend school.") %>% tab_source_note("AmericasBarometer Surveys, 2021") ``` +```{r} +#| label: ambarom-covid-ed-der-noeval +#| eval: false +covid_educ_ests_gt +``` + +(ref:ambarom-covid-ed-der-tab) Impact on education in households with children under the age of 13 who had children that would generally attend school + +```{r} +#| label: ambarom-covid-ed-der-out +#| echo: FALSE +#| warning: FALSE + +covid_educ_ests_gt %>% + print_gt_book("ambarom-covid-ed-der-tab") +``` + Of the countries that used this question, many had households where their children had an education medium change, except Haiti, where only `r covid_educ_ests %>% filter(Country=="Haiti") %>% pull(p_mediumchange) %>% signif(.,2)`% of households with students changed to virtual or hybrid learning. 
## Mapping survey data @@ -420,6 +448,7 @@ int_ests %>% ```{r} #| label: ambarom-facet-map #| error: true +#| fig.cap: "Percent of broadband internet and any internet usage, Central and South America" internet_sf <- country_shape_upd %>% full_join(select(int_ests, p = p_internet, geounit = Country), by = "geounit") %>% mutate(Type = "Internet") diff --git a/index.Rmd b/index.Rmd index 6ddd018d..2bb98c42 100644 --- a/index.Rmd +++ b/index.Rmd @@ -36,6 +36,34 @@ library(formatR) book_colors <- c("#0b3954", "#087e8b", "#bfd7ea", "#ff8484", "#8d6b94") +as_latex_with_caption <- function(gtobj, chunk_label) { + gt_l <- gt::as_latex(gtobj) + caption <- paste0( + "\\caption{\\label{tab:", chunk_label, "}(ref:", chunk_label, ")}\\\\") + latex <- strsplit(gt_l[1], split = "\n")[[1]] + idxtable <- which(stringr::str_detect(latex, "begin") & stringr::str_detect(latex, "table")) + latex2 <- c(latex[1:idxtable], caption, latex[-c(1:idxtable)]) + latex3 <- paste(latex2, collapse = "\n") + gt_l[1] <- latex3 + return(gt_l) +} + +print_gt_book <- function(gtobj, ref){ + if ("gtsummary" %in% class(gtobj)){ + gtobj <- as_gt(gtobj) + } + + if (knitr::is_latex_output()){ + gtobj %>% + as_latex_with_caption(ref) + } else { + gtobj %>% + tab_caption(glue::glue("(ref:{ref})")) + } + + +} + ``` # Preface {-}
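Reviewer's sketch (not part of the patch): the caption-splicing step in `as_latex_with_caption()` can be tried in isolation on a plain character string. The example below is a minimal base-R illustration under stated assumptions: `insert_caption()` and `toy_latex` are hypothetical names, and the toy string only stands in for the LaTeX text that the patch takes from `gt::as_latex()`. Unlike the patched helper, this version keeps only the first line that mentions both "begin" and "table", which is one possible way to keep the index a single number if several lines happen to match.

```r
# Hypothetical helper (not in the patch): splice a bookdown-style caption line
# into LaTeX table code right after the first line that opens a table.
insert_caption <- function(latex_text, chunk_label) {
  lines <- strsplit(latex_text, split = "\n", fixed = TRUE)[[1]]
  # Same caption string that the patch builds from the chunk label
  caption <- paste0(
    "\\caption{\\label{tab:", chunk_label, "}(ref:", chunk_label, ")}\\\\"
  )
  # Lines mentioning both "begin" and "table", e.g. \begin{longtable}{lrr}
  idx <- which(grepl("begin", lines, fixed = TRUE) &
                 grepl("table", lines, fixed = TRUE))
  if (length(idx) == 0) {
    stop("no table environment found in the LaTeX input")
  }
  first <- idx[1]  # assumption: the first match is the table opener
  out <- append(lines, caption, after = first)
  paste(out, collapse = "\n")
}

# Made-up LaTeX input; real input would come from a gt table
toy_latex <- paste(
  "\\begin{longtable}{lrr}",
  "\\toprule",
  "Level & \\% & s.e. \\\\",
  "\\bottomrule",
  "\\end{longtable}",
  sep = "\n"
)

result <- insert_caption(toy_latex, "results-table-gt1-tab")
cat(result, "\n")
```

The caption line carries both the `tab:` label and the `(ref:...)` text reference, which appears to be why each table in this patch is paired with a `(ref:<label>)` line in the prose.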