Chapter 1 Introduction

1.1 Survey analysis in R

The {survey} package was released on the Comprehensive R Archive Network (CRAN) in 2003 and has been continuously developed over time. This package, primarily authored by Thomas Lumley, offers an extensive array of features.
1.3 Prerequisites

If these concepts or skills are unfamiliar, we recommend starting with introductory resources to cover these topics before reading this book. R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund 2023) is a beginner-friendly guide for getting started in data science using R. It offers guidance on preliminary installation steps, basic R syntax, and tidyverse workflows and packages.
This version of the book was built with R version 4.3.1 (2023-06-16) and with the packages listed in Table 1.1.
Chapter 2 Overview of surveys

Developing surveys to gather accurate information about populations involves an intricate and time-intensive process. Researchers can spend months, or even years, developing the study design, questions, and other methods for a single survey to ensure high-quality data are collected.
Before analyzing survey data, we recommend understanding the entire survey life cycle. This understanding can provide better insight into what types of analyses should be conducted on the data. The survey life cycle consists of the necessary stages to execute a survey project successfully. Each stage influences the survey’s timing, costs, and feasibility, consequently impacting the data collected and how we should analyze it. Figure 2.1 shows a high-level overview of the survey process.
After completing the survey life cycle, the data are ready for analysts. Chapter 4 continues from this point. For more information on the survey life cycle, please explore the references cited throughout this chapter.
This chapter provides an overview of the packages, data, and design objects we use frequently throughout this book. As mentioned in Chapter 2, understanding how a survey was conducted helps us make sense of the results and interpret findings. Therefore, we provide background on the datasets used in examples and exercises. Next, we walk through how to create the survey design objects necessary to begin an analysis. Finally, we provide an overview of the {srvyr} package and the steps needed for analysis. Please report any bugs and issues encountered while going through the book to the book’s GitHub repository.
install.packages("censusapi")

After installing this package, load it using the library() function:

library(censusapi)
Note that the {censusapi} package requires a Census API key, available for free from the U.S. Census Bureau website (refer to the package documentation for more information). We recommend storing the Census API key in the R environment instead of directly in the code. To do this, run the Sys.setenv() script below, substituting the API key where it says YOUR_API_KEY_HERE. Then, restart the R session. Once the Census API key is stored, we can retrieve it in our R code with Sys.getenv("CENSUS_KEY").
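The script referenced above is not reproduced in this excerpt; a minimal sketch of what it might look like, assuming the environment variable is named CENSUS_KEY as in the Sys.getenv() call:

```r
# Store the Census API key in the R environment.
# Substitute the actual key where it says YOUR_API_KEY_HERE.
Sys.setenv(CENSUS_KEY = "YOUR_API_KEY_HERE")

# After restarting the R session, retrieve the stored key:
Sys.getenv("CENSUS_KEY")
```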
There are a few other packages used less frequently in the book. We list them in the Prerequisite boxes at the beginning of each chapter. As we work through the book, make sure to check the Prerequisite box and install any missing packages before proceeding.
American National Election Studies (ANES) Data

ANES is a study that collects data from election surveys dating back to 1948. These surveys contain information on public opinion and voting behavior in U.S. presidential elections and some midterm elections. They cover topics such as party affiliation, voting choice, and level of trust in the government. The 2020 survey (data used in this book) was fielded online, through live video interviews, or via computer-assisted telephone interviews (CATI).
When working with new survey data, we should review the survey documentation (see Chapter 3) to understand the data collection methods. The original ANES data contains variables starting with V20 (DeBell 2010), so to assist with our analysis throughout the book, we created descriptive variable names. For example, the respondent’s age is now in a variable called Age, and gender is in a variable called Gender. These descriptive variables are included in the {srvyrexploR} package. A complete overview of all variables can be found in Appendix B.
Before beginning an analysis, it is useful to view the data to understand the available variables. The dplyr::glimpse() function produces a list of all variables, their types (e.g., factor, double), and a few example values. Below, we remove variables containing a “V” followed by numbers with select(-matches("^V\\d")) before using glimpse() to get a quick overview of the data with descriptive variable names:
anes_2020 %>%
  select(-matches("^V\\d")) %>%
  glimpse()
Residential Energy Consumption Survey (RECS) Data
RECS is a study that measures energy consumption and expenditure in American households. Funded by the Energy Information Administration, RECS data are collected through interviews with household members and energy suppliers. These interviews take place in person, over the phone, via mail, and on the web, with modes changing over time. The survey has been fielded 14 times between 1950 and 2020. It includes questions about appliances, electronics, heating, air conditioning (A/C), temperatures, water heating, lighting, energy bills, respondent demographics, and energy assistance.
We should read the survey documentation (see Chapter 3) to understand how the data were collected and implemented. An overview of all variables can be found in Appendix C.
Before starting an analysis, we recommend viewing the data to understand the types of data and variables that are included. The dplyr::glimpse() function produces a list of all variables, the type of each variable (e.g., factor, double), and a few example values. Below, we remove the weight variables with select(-matches("^NWEIGHT")) before using glimpse() to get a quick overview of the data:
recs_2020 %>%
  select(-matches("^NWEIGHT")) %>%
  glimpse()
Note: {broom} is already included in the tidyverse, so no separate installation is required.

In the United States, presidential elections are held in years divisible by four. In other even years, there are elections at the federal level for Congress, which are referred to as midterm elections, as they occur at the middle of the term of a president.
5.1 Introduction
Descriptive analyses, such as basic counts, cross-tabulations, or means, are among the first steps in making sense of our survey results. During descriptive analyses, we calculate point estimates of unknown population parameters, such as the population mean, and uncertainty estimates, such as confidence intervals. By reviewing the findings, we can glean insight into the data, the underlying population, and any unique aspects of the data or population. For example, if only 10% of survey respondents are male, it could indicate a unique population, a potential error or bias, an intentional survey sampling method, or other factors. Additionally, descriptive analyses allow us to provide summaries like means, proportions, or other measures to make estimates about the population. These analyses lay the groundwork for the next steps of running statistical tests or developing models.
We discuss many different types of descriptive analyses in this chapter. However, it is important to know what type of data we are working with and which statistics are appropriate. In survey data, we typically consider data as one of four main types:
- Categorical/nominal data: variables with levels or descriptions that cannot be ordered, such as the region of the country (North, South, East, and West)
6.1 Introduction
When analyzing survey results, the point estimates described in Chapter 5 help us understand the data at a high level. Still, we often want to make comparisons between different groups. These comparisons are calculated through statistical testing.
The general idea of statistical testing is the same for data obtained through surveys and data obtained through other methods, where we compare the point estimates and uncertainty estimates of each statistic to see if statistically significant differences exist. However, statistical testing for complex surveys involves additional considerations due to the need to account for the sampling design in order to obtain accurate uncertainty estimates.

Statistical testing, also called hypothesis testing, involves declaring a null and alternative hypothesis. A null hypothesis is denoted as \(H_0\) and the alternative hypothesis is denoted as \(H_A\). The null hypothesis is the default assumption in that there are no differences in the data, or that the data are operating under “standard” behaviors. On the other hand, the alternative hypothesis is the break from the “standard,” and we are trying to determine if the data support this alternative hypothesis.

Let’s review an example outside of survey data. If we are flipping a coin, a null hypothesis would be that the coin is fair and that each side has an equal chance of being flipped. In other words, the probability of the coin landing on each side is 1/2, whereas an alternative hypothesis could be that the coin is unfair and that one side has a higher probability of being flipped (e.g., a probability of 1/4 to get heads but a probability of 3/4 to get tails). We write this set of hypotheses as:
- \(H_0: \rho_{heads} = \rho_{tails}\), where \(\rho_{x}\) is the probability of flipping the coin and having it land on heads (\(\rho_{heads}\)) or tails (\(\rho_{tails}\))
- \(H_A: \rho_{heads} \neq \rho_{tails}\)
When we conduct hypothesis testing, the statistical models calculate a p-value, which shows how likely we are to observe the data if the null hypothesis is true. If the p-value (a probability between 0 and 1) is small, we have strong evidence to reject the null hypothesis, as it is unlikely to see the data we observe if the null hypothesis is true. However, if the p-value is large, we say we do not have evidence to reject the null hypothesis. The cut-off for the p-value is determined by the Type 1 error rate, known as \(\alpha\); a common choice for statistical testing is \(\alpha = 0.05\). It is common for explanations of statistical testing to refer to the confidence level. The confidence level is the inverse of the Type 1 error. Thus, if \(\alpha = 0.05\), the confidence level would be 95%.
The functions in the {survey} package allow for the correct estimation of the uncertainty estimates (e.g., standard deviations and confidence intervals). This chapter covers the following statistical tests with survey data and the following functions from the {survey} package (Lumley 2010):
- Comparison of proportions (svyttest())
- Comparison of means (svyttest())
6.3 Comparison of proportions and means
We use t-tests to compare two proportions or means. T-tests allow us to determine if one proportion or mean is statistically different from another. They are commonly used to determine if a single estimate differs from a known value (e.g., 0 or 50%) or to compare two group means (e.g., North versus South). Comparing a single estimate to a known value is called a one-sample t-test, and we can set up the hypothesis test as follows:
- \(H_0: \mu = 0\) where \(\mu\) is the mean outcome and \(0\) is the value we are comparing it to
- \(H_A: \mu \neq 0\)
- \(H_0: \mu_1 = \mu_2\) where \(\mu_i\) is the mean outcome for group \(i\)
- \(H_A: \mu_1 \neq \mu_2\)
Two-sample t-tests can also be paired or unpaired. If the data come from two different populations (e.g., North versus South), the t-test run is an unpaired or independent samples t-test. Paired t-tests occur when the data come from the same population. This is commonly seen with data from the same population in two different time periods (e.g., before and after an intervention).
The difference between t-tests with non-survey data and survey data is based on the underlying variance estimation difference. Chapter 10 provides a detailed overview of the math behind the mean and sampling error calculations for various sample designs. The functions in the {survey} package account for these nuances, provided the design object is correctly defined.
6.3.1 Syntax
When we do not have survey data, we can use the t.test() function from the {stats} package to run t-tests. This function does not allow for weights or the variance structure that need to be accounted for with survey data. Therefore, we need to use the svyttest() function from {survey} when using survey data. Many of the arguments are the same between the two functions, but there are a few key differences:
- We need to use the survey design object instead of the original data frame
- We can only use a formula and not separate x and y data
Example 1: One-sample t-test for mean

Looking at the output from svyttest(), the t-statistic is 84.8, and the p-value is \(<0.0001\), indicating that the average is statistically different from 68\(^\circ\)F at an \(\alpha\) level of \(0.05\).
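The svyttest() call producing this output is not shown in this excerpt; a sketch of what it might look like, assuming the RECS design object is named recs_des (the variable SummerTempNight and the comparison value 68 come from the surrounding text):

```r
library(survey)
library(srvyr)

# One-sample t-test: is the average summer nighttime temperature
# statistically different from 68 degrees F?
# recs_des is assumed to be the RECS survey design object.
ttest_out <- recs_des %>%
  svyttest(
    formula = I(SummerTempNight - 68) ~ 0,
    design = .,
    na.rm = TRUE
  )
ttest_out
```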
If we want an 80% confidence interval for the test statistic, we can use the confint() function to change the confidence level. Below, we print the default confidence interval (95%), the confidence interval explicitly specifying the level as 95%, and the 80% confidence interval. When the confidence level is 95%, either by default or explicitly, R returns a vector with both row and column names. However, when we specify any other confidence level, an unnamed vector is returned, with the first element being the lower bound and the second element being the upper bound of the confidence interval.
##                                  2.5 % 97.5 %
## as.numeric(SummerTempNight - 68) 3.288  3.447
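The confint() calls described above might look like the following, assuming ttest_out holds the svyttest() result:

```r
# Default confidence level (95%): named output with row/column names
confint(ttest_out)

# Explicitly specifying 95% gives the same named output
confint(ttest_out, level = 0.95)

# Any other level (e.g., 80%) returns an unnamed vector:
# first element = lower bound, second element = upper bound
confint(ttest_out, level = 0.8)
```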
Example 2: One-sample t-test for proportion

The ‘tidied’ output can also be piped into the {gt} package to create a table ready for publication. We go over the {gt} package in Chapter 8. The function pretty_p_value() comes from the {prettyunits} package and converts numeric p-values to characters and, by default, prints four decimal places and displays any p-value less than 0.0001 as "<0.0001", though another minimum display p-value can be specified (Csardi 2023).
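A sketch of that pipeline, assuming a test result object ttest_out; the column name p.value follows the conventions of {broom}'s tidy() output:

```r
library(broom)
library(dplyr)
library(gt)
library(prettyunits)

# Tidy the test result, format the p-value, and build a table
tidy(ttest_out) %>%
  mutate(p.value = pretty_p_value(p.value)) %>%
  gt() %>%
  fmt_number()
```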
Example 3: Unpaired two-sample t-test
- \(H_0: \mu_{AC} = \mu_{noAC}\) where \(\mu_{AC}\) is the electrical bill cost for U.S. households that used A/C and \(\mu_{noAC}\) is the electrical bill cost for U.S. households that did not use A/C
- \(H_A: \mu_{AC} \neq \mu_{noAC}\)
Let’s take a quick look at the data to see how they are formatted:
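A quick look might use glimpse() on the two variables involved; ACUsed and DOLLAREL are the A/C indicator and electricity cost variables in the RECS data, and the svyttest() call that follows is a sketch of the unpaired test set up above:

```r
# Peek at the two variables involved in the comparison
recs_2020 %>%
  select(ACUsed, DOLLAREL) %>%
  glimpse()

# Unpaired two-sample t-test: electricity cost by A/C usage
recs_des %>%
  svyttest(
    formula = DOLLAREL ~ ACUsed,
    design = .,
    na.rm = TRUE
  )
```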
Example 4: Paired two-sample t-test
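The paired-test code itself is elided in this excerpt. Because a paired comparison reduces to a one-sample test on the difference, a sketch might look like the following; the pairing of DOLLAREL (electricity cost) and DOLLARNG (natural gas cost) is chosen here purely for illustration:

```r
# Paired comparison: test whether the mean difference between two
# measures taken on the same households is zero.
recs_des %>%
  svyttest(
    formula = I(DOLLAREL - DOLLARNG) ~ 0,
    design = .,
    na.rm = TRUE
  )
```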
6.4 Chi-squared tests

Chi-squared tests (\(\chi^2\)) allow us to examine multiple proportions using a goodness-of-fit test, a test of independence, or a test of homogeneity. These three tests have the same \(\chi^2\) distributions but with slightly different underlying assumptions.
First, goodness-of-fit tests are used when comparing observed data to expected data. For example, this could be used to determine if respondent demographics (the observed data in the sample) match known population information (the expected data.) In this case, we can set up the hypothesis test as follows:
- \(H_0: p_1 = \pi_1, ~ p_2 = \pi_2, ~ ..., ~ p_k = \pi_k\) where \(p_i\) is the observed proportion for category \(i\), \(\pi_i\) is the expected proportion for category \(i\), and \(k\) is the number of categories
- \(H_A:\) at least one \(p_i\) is not equal to \(\pi_i\)
The difference between \(\chi^2\) tests with non-survey data and survey data is based on the underlying variance estimation. The functions in the {survey} package account for these nuances, provided the design object is correctly defined. For basic variance estimation formulas for different survey design types, refer to Chapter 10.
6.4.1 Syntax
When we do not have survey data, we may be able to use the chisq.test() function from the {stats} package in base R to run chi-squared tests (R Core Team 2023). However, this function does not allow for weights or the variance structure to be accounted for with survey data. Therefore, when using survey data, we need to use one of two functions:
- svygofchisq(): For goodness-of-fit tests
- svychisq(): For tests of independence and homogeneity
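Both functions follow the same design-object pattern as svyttest(). A sketch, where anes_des is the ANES design object and the expected proportions passed to p are a placeholder:

```r
library(survey)

# Goodness-of-fit test: compare observed proportions of a categorical
# variable to expected proportions.
# expected_props is a placeholder vector (e.g., proportions from the
# ACS); it must have one entry per category of the variable.
svygofchisq(
  formula = ~Education,
  p = expected_props,
  design = anes_des,
  na.rm = TRUE
)

# Test of independence between two categorical variables
# (the pairing of Education and Gender is illustrative)
svychisq(
  formula = ~Education + Gender,
  design = anes_des
)
```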
Example 1: Goodness-of-fit test

Based on this output, we can see that we have different levels from the ACS data. Specifically, the education data from ANES include two levels for Bachelor’s Degree or Higher (Bachelor’s and Graduate), so these two categories need to be collapsed into a single category to match the ACS data. For this, among other methods, we can use the {forcats} package from the tidyverse (Wickham 2023a). The package’s fct_collapse() function helps us create a new variable by collapsing categories into a single one. Then, we use the svygofchisq() function to compare the ANES data to the ACS data, where we specify the updated design object, the formula using the collapsed education variable, the ACS estimates for education levels as p, and removing NA values.
anes_des_educ <- anes_des %>%
  mutate(Education2 =
           # collapse the two highest education levels into one category
           fct_collapse(Education,
                        "Bachelor or Higher" = c("Bachelor's", "Graduate")))
Example 2: Test of independence
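The underlying test (before any table formatting with {gt}) might be run as follows; the pairing of the ANES variables TrustGovernment and TrustPeople is illustrative:

```r
# Test of independence: is trust in government associated with
# trust in people? (anes_des is the ANES design object)
anes_des %>%
  svychisq(
    formula = ~TrustGovernment + TrustPeople,
    design = .,
    statistic = "Wald",
    na.rm = TRUE
  )
```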
7.2.1 Syntax
The arguments are:
- formula: formula in the form of outcome~group. The group variable must be a factor or character.
- design: a tbl_svy object created by as_survey
- na.action: handling of missing data
- df.resid: degrees of freedom for Wald tests (optional); defaults to using degf(design)-(g-1), where \(g\) is the number of groups
7.2.2 Example

If we wanted to change the reference value, we would reorder the factor before modeling using the relevel() function from {stats} or one of the many factor ordering functions in {forcats}, such as fct_relevel() or fct_infreq(). For example, if we wanted the reference level to be the Midwest region, we could use the following code. Note the usage of the gt() function on top of tidy() to print a nice-looking output table (Iannone et al. 2023; Robinson, Hayes, and Couch 2023) (see Chapter 8 for more information on the {gt} package).
anova_out_relevel <- recs_des %>%
  mutate(Region = fct_relevel(Region, "Midwest", after = 0)) %>%
  svyglm(design = .,
         formula = outcome ~ Region)  # outcome: placeholder; the original formula is elided here

anova_out_relevel %>%
  tidy() %>%
  gt() %>%
  fmt_number()
The arguments are:
- formula: formula in the form of y~x
- design: a tbl_svy object created by as_survey
- na.action: handling of missing data
- df.resid: degrees of freedom for Wald tests (optional); defaults to using degf(design)-p, where \(p\) is the rank of the design matrix
-As discussed in Section 7.1, the formula on the right-hand side can be specified in many ways, whether interactions are desired or not, for example.
+As discussed in Section 7.1, the formula on the right-hand side can be specified in many ways, for example, denoting whether interactions are desired or not.
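For instance, with hypothetical predictors x1 and x2, the right-hand side of the formula can be written in several equivalent or related ways:

```r
# R formula shorthand (y, x1, x2 are hypothetical variables):
y ~ x1 + x2          # main effects only
y ~ x1 + x2 + x1:x2  # main effects plus their interaction
y ~ x1 * x2          # shorthand for the previous line
y ~ (x1 + x2)^2      # all main effects and all two-way interactions
```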
7.3.2 Examples
@@ -1279,23 +1279,23 @@ Example 1: Linear regression with a single variable gt() %>%
fmt_number()
-
-
@@ -1748,7 +1748,7 @@ Example 1: Linear regression with a single variable
Example 2: Linear regression with multiple variables and interactions
@@ -1765,23 +1765,23 @@ Example 2: Linear regression with multiple variables and interactions gt() %>%
fmt_number()
-
-As shown above, there are many terms in this model. To test whether coefficients for a term are different from zero, the function regTermTest() can be used. For example, in the above regression, we can test whether the interaction of region and urbanicity is significant as follows:
+As shown above, there are many terms in this model. To test whether coefficients for a term are different from zero, the regTermTest() function can be used. For example, in the above regression, we can test whether the interaction of region and urbanicity is significant as follows:
## Wald test for Urbanicity:Region
@@ -2356,7 +2356,7 @@ Example 2: Linear regression with multiple variables and interactions
This output indicates there is a significant interaction between urbanicity and region (p-value < 0.0001).
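A call along the following lines produces that Wald test; this is a sketch, with the model object name m_electric_multi taken from the augment() example below:

```r
# Wald test that all Urbanicity:Region interaction coefficients are zero
regTermTest(m_electric_multi, ~ Urbanicity:Region)
```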
-To examine the predictions, residuals, and more from the model, the function augment() from {broom} can be used. The augment() function returns a tibble with the independent and dependent variables and other fit statistics. The augment() function has not been specifically written for objects of class svyglm, and as such, a warning is displayed indicating this at this time. As it was not written exactly for this class of objects, a little tweaking needs to be done after using augment(). To obtain the standard error of the predicted values (.se.fit), we need to use the attr() function on the predicted values (.fitted) created by augment(). Additionally, the predicted values created are outputted with a type of svrep. If we want to plot the predicted values, we need to use as.numeric() to get the predicted values into a numeric format to work with. However, it is important to note that this adjustment must be completed after the standard error adjustment.
+To examine the predictions, residuals, and more from the model, the augment() function from {broom} can be used. The augment() function returns a tibble with the independent and dependent variables and other fit statistics. The augment() function has not been specifically written for objects of class svyglm, so it currently displays a warning indicating this. Because it was not written for this class of objects, a little tweaking is needed after using augment(). To obtain the standard error of the predicted values (.se.fit), we need to use the attr() function on the predicted values (.fitted) created by augment(). Additionally, the predicted values are output with a type of svrep. If we want to plot the predicted values, we need to use as.numeric() to convert them into a numeric format. However, this conversion must be completed after the standard error adjustment.
fitstats <-
augment(m_electric_multi) %>%
mutate(.se.fit = sqrt(attr(.fitted, "var")),
@@ -2455,7 +2455,7 @@ 7.4.1 Syntax )
The arguments are:
-formula: Formula in the form of y~x
+formula: formula in the form of y~x
design: a tbl_svy object created by as_survey
na.action: handling of missing data
df.resid: degrees of freedom for Wald tests (optional) - defaults to using degf(design)-p where \(p\) is the rank of the design matrix
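Combined with a family argument, a minimal logistic regression call could be sketched as follows. The design object anes_des appears in the example below; the outcome and predictor names here are hypothetical. Using quasibinomial() rather than binomial() avoids spurious warnings about non-integer successes that can arise with survey weights.

```r
# Sketch: logistic regression on a survey design object.
# anes_des is the design object used in the example below;
# TrustBinary and VotedFor are hypothetical variable names.
logistic_out <- svyglm(design = anes_des,
                       formula = TrustBinary ~ VotedFor,
                       family = quasibinomial(),
                       na.action = na.omit)
tidy(logistic_out)
```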
@@ -2472,7 +2472,7 @@ 7.4.1 Syntax7.4.2 Examples
Example 1: Logistic regression with a single variable
-In the following example, the ANES data are used, and we are modeling whether someone usually has trust in the government25 by who someone voted for president in 2020. As a reminder, the leading candidates were Biden and Trump, though people could vote for someone else not in the Democratic or Republican parties. Those votes are all grouped into an “Other” category. We first create a binary outcome for trusting in the government by collapsing “Always” and “Most of the time” into a single-factor level, and the other response options (“About half the time”, “Some of the time”, and “Never”) into a second factor level. Next, a scatter plot of the raw data is not useful as it is all 0 and 1 outcomes, so instead, we plot a summary of the data.
+In the following example, we use the ANES data to model whether someone usually has trust in the government25 by whom they voted for president in 2020. As a reminder, the leading candidates were Biden and Trump, though people could vote for someone else not in the Democratic or Republican parties. Those votes are all grouped into an “Other” category. We first create a binary outcome for trusting in the government by collapsing “Always” and “Most of the time” into a single factor level, and the other response options (“About half the time”, “Some of the time”, and “Never”) into a second factor level. Next, a scatter plot of the raw data is not useful as it is all 0 and 1 outcomes, so instead, we plot a summary of the data.
anes_des_der <- anes_des %>%
mutate(TrustGovernmentUsually = case_when(
is.na(TrustGovernment) ~ NA,
@@ -2514,23 +2514,23 @@ Example 1: Logistic regression with a single variable gt() %>%
fmt_number()
-
-
@@ -2994,23 +2994,23 @@ Example 1: Logistic regression with a single variable gt() %>%
fmt_number()
-
-
-
-
diff --git a/c08-communicating-results.html b/c08-communicating-results.html
index cffbb0e0..894f16d5 100644
--- a/c08-communicating-results.html
+++ b/c08-communicating-results.html
@@ -173,7 +173,7 @@
- Dedication
-- I Intro
+- I Introduction
- 1 Introduction
- 1.1 Survey analysis in R
@@ -185,7 +185,7 @@
- 1.7 Acknowledgements
- 1.8 Colophon
-- 2 Overview of Surveys
+- 2 Overview of surveys
- 2.1 Introduction
- 2.2 Searching for public-use survey data
@@ -295,7 +295,7 @@
- 6.3.1 Syntax
- 6.3.2 Examples
-- 6.4 Chi-square tests
+- 6.4 Chi-squared tests
- 6.4.1 Syntax
- 6.4.2 Examples
@@ -447,8 +447,8 @@
- 14.4 Survey design objects
- 14.5 Calculating estimates
- 14.6 Mapping survey data
- 14.7 Exercises
@@ -457,7 +457,7 @@
- A Importing survey data into R
- A.1 Importing delimiter-separated files into R
-- A.2 Loading Excel files into R
+- A.2 Importing Excel files into R
- A.3 Importing Stata, SAS, and SPSS files into R
- A.3.1 Syntax
@@ -465,7 +465,7 @@
- A.3.3 Labeled missing data values
- A.4 Importing data from APIs into R
-- A.5 Accessing databases in R
+- A.5 Importing data from databases in R
- A.6 Importing data from other formats
- B ANES derived variable codebook
@@ -546,7 +546,7 @@
Prerequisites
8.1 Introduction
-After finishing the analysis and modeling, we proceed to the important task of communicating the survey results. Our audience may range from seasoned researchers familiar with our survey data to newcomers encountering the information for the first time. We should aim to explain the methodology and analysis while presenting findings in an accessible way, and it is our responsibility to report information with care.
+After finishing the analysis and modeling, we proceed to the task of communicating the survey results. Our audience may range from seasoned researchers familiar with our survey data to newcomers encountering the information for the first time. We should aim to explain the methodology and analysis while presenting findings in an accessible way, and it is our responsibility to report information with care.
Before beginning any dissemination of results, consider questions such as:
+ tab_caption("Example of {gt} table with trust in government estimate")
-
-