diff --git a/02-overview-surveys.Rmd b/02-overview-surveys.Rmd
index 5ab7ff66..b641be78 100644
--- a/02-overview-surveys.Rmd
+++ b/02-overview-surveys.Rmd
@@ -2,11 +2,41 @@
## Introduction

-Developing surveys to gather accurate information about populations involves a more intricate and time-intensive process compared to surveys that use non-random criteria for selecting samples. Researchers can spend months, or even years, developing the study design, questions, and other methods for a single survey to ensure high-quality data is collected.
+Developing surveys to gather accurate information about populations often involves an intricate and time-intensive process. Researchers can spend months, or even years, developing the study design, questions, and other methods for a single survey to ensure high-quality data is collected.

-While this book focuses on the analysis methods of complex surveys, understanding the entire survey life cycle can provide a better insight into what types of analyses should be conducted on the data. The *survey life cycle* consists of the necessary stages to execute a survey project successfully. Each stage influences the survey's timing, costs, and feasibility, consequently impacting the data collected and how we should analyze it.
+Prior to analyzing survey data, we recommend understanding the entire survey life cycle. This understanding can provide better insight into what types of analyses should be conducted on the data. The *survey life cycle* consists of the necessary stages to execute a survey project successfully. Each stage influences the survey's timing, costs, and feasibility, consequently impacting the data collected and how we should analyze it. Figure \@ref(fig:overview-diag) shows a high-level view of the survey process, and this chapter gives an overview of each step.

-The survey life cycle starts with a *research topic or question of interest* (e.g., what impact does childhood trauma have on health outcomes later in life). Researchers typically review existing data sources to determine if data are already available that can answer this question, as drawing from available resources can result in a reduced burden on respondents, cheaper research costs, and faster research outcomes. However, if existing data cannot answer the nuances of the research question, a survey can be used to capture the exact data that the researcher needs through a questionnaire, or a set of questions.
+```{r}
+#| label: overview-diag
+#| echo: false
+#| fig.cap: "Overview of the survey process"
+#| fig.alt: "Diagram of survey process beginning with survey concept (first level), then sampling design, questionnaire design, and data collection planning (second level), data collection (third level), post-survey processing (fourth level), analysis (fifth level), and reporting (sixth level)"
+library(DiagrammeR)
+mermaid("
+graph TD
+  A[Survey Concept]-->B[Sampling Design]
+  A-->C[Questionnaire Design]
+  A-->D[Data Collection Planning]
+  B-->E[Data Collection]
+  C-->E
+  D-->E
+  E-->F[Post-Survey Processing]
+  F-->G[Analysis]
+  G-->H[Reporting]
+
+  style A fill: #bfd7ea, stroke: #0b3954
+  style B fill: #bfd7ea, stroke: #0b3954
+  style C fill: #bfd7ea, stroke: #0b3954
+  style D fill: #bfd7ea, stroke: #0b3954
+  style E fill: #bfd7ea, stroke: #0b3954
+  style F fill: #bfd7ea, stroke: #0b3954
+  style G fill: #bfd7ea, stroke: #0b3954
+  style H fill: #bfd7ea, stroke: #0b3954
+")
+```
+
+
+The survey life cycle starts with a *research topic or question of interest* (e.g., what impact does childhood trauma have on health outcomes later in life). Researchers typically review existing data sources to determine if data are already available that can address this question, as drawing from available resources can result in a reduced burden on respondents, cheaper research costs, and faster research outcomes. However, if existing data cannot answer the nuances of the research question, a survey can be used to capture the exact data that the researcher needs through a questionnaire, or a set of questions.

To gain a deeper understanding of survey design and implementation, we recommend reviewing several pieces of existing literature in detail [e.g., @dillman2014mode; @groves2009survey; @Tourangeau2000psych; @Bradburn2004; @valliant2013practical; @biemer2003survqual].

@@ -25,23 +55,23 @@ There are multiple things to consider when starting a survey. *Errors* are the d
Generally, survey researchers consider there to be seven main sources of error that fall under either Representation or Measurement [@groves2009survey]:

-- Representation
-    - **Coverage Error**: A mismatch between the *population of interest* (also known as the target population or study population) and the sampling frame.
-    - **Sampling Error**: Error produced when selecting a *sample*, the subset of the population, from the *sampling frame*, the list from which the sample is drawn (there is no sampling error if conducting a census). This error is due to randomization, and we discuss how to quantify this error in Chapter \@ref(c10-specifying-sample-designs).
+- **Representation**
+    - **Coverage Error**: A mismatch between the *population of interest* (also known as the target population or study population) and the *sampling frame*, the list from which the sample is drawn.
+    - **Sampling Error**: Error produced when selecting a *sample*, the subset of the population, from the sampling frame. This error is due to randomization, and we discuss how to quantify this error in Chapter \@ref(c10-specifying-sample-designs). There is no sampling error in a census as there is no randomization. Sampling error reflects the variation among all possible samples that could be selected under the same sampling method.
    - **Nonresponse Error**: Differences between those who responded and did not respond to the survey (unit nonresponse) or a given question (item nonresponse).
    - **Adjustment Error**: Error introduced during post-survey statistical adjustments.
-- Measurement
+- **Measurement**
    - **Validity**: A mismatch between the topic of interest and the question(s) used to collect that information.
    - **Measurement Error**: A mismatch between what the researcher asked and how the respondent answered.
    - **Processing Error**: Edits by the researcher to responses provided by the respondent (e.g., adjustments to data based on illogical responses).

-Almost every survey has errors. Researchers attempt to conduct a survey that reduces the *total survey error*, or the accumulation of all errors that may arise throughout the survey life cycle. By assessing these different types of errors together, researchers can seek strategies to maximize the overall survey quality and improve the reliability and validity of results [@tse-doc]. However, attempts to lower individual sources errors (and therefore total survey error) come at the price of time, resources, and money. For example:
+Almost every survey has errors. Researchers attempt to conduct a survey that reduces the *total survey error*, or the accumulation of all errors that may arise throughout the survey life cycle. By assessing these different types of errors together, researchers can seek strategies to maximize the overall survey quality and improve the reliability and validity of results [@tse-doc]. However, attempts to reduce individual sources of error (and therefore total survey error) come at the price of time and money. For example:

- **Coverage Error Tradeoff**: Researchers can search for or create more accurate and updated sampling frames, but they can be difficult to construct or obtain.
- **Sampling Error Tradeoff**: Researchers can increase the sample size to reduce sampling error; however, larger samples can be expensive and time-consuming to field.
- **Nonresponse Error Tradeoff**: Researchers can increase or diversify efforts to improve survey participation but this may be resource-intensive while not entirely removing nonresponse bias.
-- **Adjustment Error Tradeoff**: *Weighting*, or a statistical technique used to adjust the contribution of individual survey responses to the final survey estimates, is typically done to make the sample more representative of the target population. However, if researchers do not carefully execute the adjustments or base them on inaccurate information, they can introduce new biases, leading to less accurate estimates.
-- **Validity Error Tradeoff**: Reseachers can increase validity through a variety of ways, such as extensive research, using established scales, or collaborating with a psychometrician during survey design. However, doing so lengthens the amount of time and resources needed to complete survey design.
+- **Adjustment Error Tradeoff**: *Weighting* is a statistical technique used to adjust the contribution of individual survey responses to the final survey estimates. It is typically done to make the sample more representative of the target population. However, if researchers do not carefully execute the adjustments or base them on inaccurate information, they can introduce new biases, leading to less accurate estimates.
+- **Validity Error Tradeoff**: Researchers can increase validity in a variety of ways, such as using established scales or collaborating with a psychometrician during survey design to pilot and evaluate questions. However, doing so lengthens the amount of time and resources needed to complete survey design.
- **Measurement Error Tradeoff**: Researchers can use techniques such as questionnaire testing and cognitive interviewing to ensure respondents are answering questions as expected. However, these activities also require time and resources to complete.
- **Processing Error Tradeoff**: Researchers can impose rigorous data cleaning and validation processes. However, this requires supervision, training, and time.

@@ -55,13 +85,17 @@ From formulating methodologies to choosing an appropriate sampling frame, the st

### Sampling Design {#overview-design-sampdesign}

-The set or group we want to survey is known as the *population of interest*. The population of interest could be broad, such as “all adults age 18+ living in the U.S.” or a specific population based on a particular characteristic or location. For example, we may want to know about "adults aged 18-24 who live in North Carolina" or "eligible voters living in Illinois." However, a *sampling frame* with contact information is needed to survey individuals in these populations of interest. If researchers are looking at eligible voters, the sampling frame could be the voting registry for a given state or area. The sampling frame is likely imperfect for more broad target populations like all adults in the United States. In these cases, researchers may choose to use a sampling frame of mailing addresses and send the survey to households, or they may choose to use random digit dialing (RDD) and call random phone numbers (that may or may not be assigned, connected, and working). These imperfect sampling frames can result in *coverage error* where there is a mismatch between the target population and the list of individuals researchers can select. For example, if a researcher is looking to obtain estimates for "all adults aged 18+ living in the U.S.", a sampling frame of mailing addresses will miss specific types of individuals, such as the homeless, transient populations, and incarcerated individuals. Additionally, many households have more than one adult living there, so researchers would need to consider how to get a specific individual to fill out the survey (called within household selection) or adjust the target population to report on "U.S. households" instead of "individuals."
+The set or group we want to survey is known as the *population of interest* or the *target population*. The population of interest could be broad, such as “all adults age 18+ living in the U.S.” or a specific population based on a particular characteristic or location. For example, we may want to know about "adults aged 18-24 who live in North Carolina" or "eligible voters living in Illinois."
+
+However, a *sampling frame* with contact information is needed to survey individuals in these populations of interest. If researchers are looking at eligible voters, the sampling frame could be the voting registry for a given state or area. If researchers are looking at broader target populations like all adults in the United States, the sampling frame is likely imperfect, as no complete list of individuals in the United States is available to serve as a sampling frame. Instead, researchers may choose to use a sampling frame of mailing addresses and send the survey to households, or they may choose to use random digit dialing (RDD) and call random phone numbers (that may or may not be assigned, connected, and working).
+
+These imperfect sampling frames can result in *coverage error*, where there is a mismatch between the target population and the list of individuals researchers can select. For example, if a researcher is looking to obtain estimates for "all adults aged 18+ living in the U.S.", a sampling frame of mailing addresses will miss specific types of individuals, such as the homeless, transient populations, and incarcerated individuals. Additionally, many households have more than one adult resident, so researchers would need to consider how to get a specific individual to fill out the survey (called *within household selection*) or adjust the target population to report on "U.S. households" instead of "individuals."
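+
+To make the frame-versus-population mismatch concrete, here is a minimal sketch with entirely invented data (a toy population of ten people, not a real frame), computing the share of a target population that a frame can reach:
+
+```{r}
+# Hypothetical example: a target population of 10 adults and a
+# mailing-address frame that happens to miss two of them
+population <- paste0("person_", 1:10)
+sampling_frame <- setdiff(population, c("person_3", "person_8"))
+
+# Share of the target population the frame can ever reach; the
+# remaining 20% can never be sampled, a potential source of
+# coverage error
+length(intersect(sampling_frame, population)) / length(population)
+```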

Once the researchers have selected the sampling frame, the next step is determining how to select individuals for the survey. In rare cases, researchers may conduct a *census* and survey everyone on the sampling frame. However, the ability to implement a questionnaire at that scale is something only some can do (e.g., government censuses). Instead, researchers typically choose to sample individuals and use weights to estimate numbers in the target population. They can use a variety of different sampling methods, and more information on these can be found in Chapter \@ref(c10-specifying-sample-designs). This decision of which sampling method to use impacts *sampling error* and can be accounted for in weighting.

#### Example: Number of Pets in a Household {.unnumbered #overview-design-sampdesign-ex}

-Let's use a simple example where a researcher is interested in the average number of pets in a household. Our researcher needs to consider the target population for this study. Specifically, are they interested in all households in a given country or households in a more local area (e.g., city or state)? Let's assume our researcher is interested in the number of pets in a U.S. household with at least one adult (18 years old or older). In this case, a sampling frame of mailing addresses would provide the least coverage error as the frame would closely match our target population. Specifically, our researcher would likely want to use the Computerized Delivery Sequence File (CDSF), which is a file of mailing addresses that the United States Postal Service (USPS) creates and covers nearly 100% of U.S. households [@harter2016address]. To sample these households, for simplicity, we use a stratified simple random sample design, where we randomly sample households within each state (i.e., we stratify by state).
+Let's use a simple example where a researcher is interested in the average number of pets in a household. Our researcher needs to consider the target population for this study. Specifically, are they interested in all households in a given country or households in a more local area (e.g., city or state)? Let's assume our researcher is interested in the number of pets in a U.S. household with at least one adult (18 years old or older). In this case, a sampling frame of mailing addresses would introduce only a small amount of coverage error as the frame would closely match our target population. Specifically, our researcher would likely want to use the Computerized Delivery Sequence File (CDSF), which is a file of mailing addresses that the United States Postal Service (USPS) creates and covers nearly 100% of U.S. households [@harter2016address]. To sample these households, for simplicity, we use a stratified simple random sample design (see Chapter \@ref(c10-specifying-sample-designs) for more information on sample designs), where we randomly sample households within each state (i.e., we stratify by state), as sketched in the code below.
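+
+The following toy sketch draws such a stratified simple random sample with the {dplyr} package; the frame here is invented (5,000 addresses in four states), not the CDSF:
+
+```{r}
+library(dplyr)
+
+# Hypothetical sampling frame: 5,000 addresses spread across four states
+set.seed(2023)
+toy_frame <- data.frame(
+  address_id = 1:5000,
+  state = sample(c("GA", "IL", "NC", "PA"), size = 5000, replace = TRUE)
+)
+
+# Stratified simple random sample: independently draw 50 addresses
+# within each state (the strata)
+sampled_households <- toy_frame %>%
+  group_by(state) %>%
+  slice_sample(n = 50) %>%
+  ungroup()
+
+count(sampled_households, state)
+```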

Throughout this chapter, we build on this example research question to plan a survey.

@@ -74,11 +108,11 @@ With the sampling design decided, researchers can then decide how to survey thes
- Computer Assisted Web Interview (CAWI; also known as web or online interviewing)
- Paper and Pencil Interview (PAPI)

-Researchers can use a single mode to collect data or multiple modes (also called *mixed modes*). Using mixed modes can allow for broader reach and increase response rates depending on the target population [@deLeeuw2005; @DeLeeuw_2018; @biemer_choiceplus]. For example, researchers could both call households to conduct a CATI survey and send mail with a PAPI survey to the household. Using both modes, researchers could gain participation through the mail from individuals who do not pick up the phone to unknown numbers or through the phone from individuals who do not open all of their mail. However, mode effects (where responses differ based on the mode of response) can be present in the data and may need to be considered during analysis.
+Researchers can use a single mode to collect data or multiple modes (also called *mixed-mode* surveys). Using mixed modes can allow for broader reach and increase response rates depending on the target population [@deLeeuw2005; @DeLeeuw_2018; @biemer_choiceplus]. For example, researchers could both call households to conduct a CATI survey and send mail with a PAPI survey to the household. Using both modes, researchers could gain participation through the mail from individuals who do not pick up the phone to unknown numbers or through the phone from individuals who do not open all of their mail. However, mode effects (where responses differ based on the mode of response) can be present in the data and may need to be considered during analysis.

When selecting which mode, or modes, to use, understanding the unique aspects of the chosen target population and sampling frame provides insight into how they can best be reached and engaged. For example, if we plan to survey adults aged 18-24 who live in North Carolina, asking them to complete a survey using CATI (i.e., over the phone) would likely not be as successful as other modes like the web. This age group does not talk on the phone as much as other generations and often does not answer their phones for unknown numbers. Additionally, the mode for contacting respondents relies on what information is available in the sampling frame. For example, if our sampling frame includes an email address, we could email our selected sample members to convince them to complete a survey. Alternatively, if the sampling frame is a list of mailing addresses, we could contact sample members with a letter.

-It is important to note that there can be a difference between the contact and survey modes. For example, if we have a sampling frame with addresses, we can send a letter to our sample members and provide information on completing a web survey. Another option is using mixed-mode surveys by sending sample members a paper and pencil survey with our letter and also asking them to complete the survey online. Combining different contact modes and different survey modes can be helpful in reducing *unit nonresponse error*--where the entire unit (e.g., a household) does not respond to the survey at all--as different sample members may respond better to different contact and survey modes. However, when considering which modes to use, it is important to make access to the survey as easy as possible for sample members to reduce burden and unit nonresponse.
+It is important to note that there can be a difference between the contact and survey modes. For example, if we have a sampling frame with addresses, we can send a letter to our sample members and provide information on completing a web survey. Another option is using mixed-mode surveys by mailing sample members a paper and pencil survey but also including instructions to complete the survey online. Combining different contact modes and different survey modes can be helpful in reducing *unit nonresponse error*--where the entire unit (e.g., a household) does not respond to the survey at all--as different sample members may respond better to different contact and survey modes. However, when considering which modes to use, it is important to make access to the survey as easy as possible for sample members to reduce burden and unit nonresponse.

Another way to reduce unit nonresponse error is by varying the language of the contact materials [@dillman2014mode]. People are motivated by different things, so constantly repeating the same message may not be helpful. Instead, mixing up the messaging and the type of contact material the sample member receives can increase response rates and reduce the unit nonresponse error. For example, instead of only sending standard letters, researchers could consider sending mailings that invoke "urgent" or "important" thoughts by sending priority letters or using other delivery services like FedEx, UPS, or DHL.

@@ -114,7 +148,7 @@ This is just one possible protocol that we can use that starts respondents with

### Questionnaire Design {#overview-design-questionnaire}

-When developing the questionnaire, it can be helpful to first outline the topics to be asked and include the "why" each question or topic is important to the research question(s). This can help researchers better tailor the questionnaire and reduce the number of questions (and thus the burden on the respondent) if topics are deemed irrelevant to the research question. When making these decisions, researchers should also consider questions needed for weighting. While we would love to have everyone in our population of interest answer our survey, this is rarely the case. Thus, including questions about demographics in the survey can assist with weighting for *nonresponse errors* (both unit and item nonresponse). Knowing the details of the sampling plan and what may impact *coverage error* and *sampling error* can help researchers determine what types of demographics to include.
+When developing the questionnaire, it can be helpful to first outline the topics to be asked and note why each question or topic is important to the research question(s). This can help researchers better tailor the questionnaire and reduce the number of questions (and thus the burden on the respondent) if topics are deemed irrelevant to the research question. When making these decisions, researchers should also consider questions needed for weighting. While we would love to have everyone in our population of interest answer our survey, this rarely happens. Thus, including questions about demographics in the survey can assist with weighting for *nonresponse errors* (both unit and item nonresponse). Knowing the details of the sampling plan and what may impact *coverage error* and *sampling error* can help researchers determine what types of demographics to include. For this reason, questionnaire design is often done in conjunction with sampling design.

Researchers can benefit from the work of others by using questions from other surveys. Demographic sections such as race, ethnicity, or education borrow questions from a government census or other official surveys. Question banks such as the [Inter-university Consortium for Political and Social Research (ICPSR) variable search](https://www.icpsr.umich.edu/web/pages/ICPSR/ssvd/) can provide additional potential questions.

@@ -158,19 +192,19 @@ This is a simple example of how the presentation of the question and options can

## Data Collection {#overview-datacollection}

-Once the data collection starts, researchers try to stick to the data collection protocol designed during pre-survey planning. However, effective researchers adjust their plans and adapt as needed to the current progress of data collection [@Schouten2018]. Some extreme examples could be natural disasters that could prevent mail or interviewers from getting to the sample members. Others could be smaller in that something newsworthy occurs connected to the survey, so researchers could choose to play this up in communication materials. In addition to these external factors, there could be factors unique to the survey, such as lower response rates for a specific sub-group, so the data collection protocol may need to find ways to improve response rates for that specific group.
+Once the data collection starts, researchers try to stick to the data collection protocol designed during pre-survey planning. However, effective researchers also prepare to adjust their plans and adapt as needed to the current progress of data collection [@Schouten2018]. For example, a natural disaster could prevent mailings from going out or interviewers from reaching sample members; an in-person survey might then need to pivot quickly to a self-administered mode, or the field period might be delayed. Other disruptions can be smaller: if something newsworthy occurs that is connected to the survey, researchers may choose to highlight it in communication materials. In addition to these external factors, there could be factors unique to the survey, such as lower response rates for a specific sub-group, so the data collection protocol may need to find ways to improve response rates for that specific group, as sketched below.
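+
+As a hypothetical illustration of this kind of monitoring (the data below are simulated, not from a real survey), researchers might tabulate interim response rates by sub-group to see where extra effort is needed:
+
+```{r}
+# Simulated interim fielding data: one row per sampled case, with an
+# indicator for whether the case has responded so far
+set.seed(815)
+fielding <- data.frame(
+  age_group = rep(c("18-24", "25+"), times = c(200, 800)),
+  responded = c(rbinom(200, 1, 0.15), rbinom(800, 1, 0.40))
+)
+
+# Interim response rate by sub-group; a markedly lower rate for one
+# group may prompt a protocol change (e.g., added reminders or a
+# different contact mode for that group)
+tapply(fielding$responded, fielding$age_group, mean)
+```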

## Post-Survey Processing {#overview-post}

-After data collection, various activities need to be completed before we can analyze the survey. Multiple decisions made during this post-survey phase can assist researchers in reducing different error sources, such as through weighting to account for the sample selection. Knowing the decisions researchers made in creating the final analytic data can impact how analysts use the data and interpret the results.
+After data collection, various activities need to be completed before we can analyze the survey. Multiple decisions made during this post-survey phase can assist researchers in reducing different error sources, such as weighting to account for the sample selection. Knowing the decisions researchers made in creating the final analytic data can impact how analysts use the data and interpret the results.

### Data Cleaning and Imputation {#overview-post-cleaning}

-Post-survey cleaning and *imputation* is one of the first steps researchers do to get the survey responses into a dataset for use by analysts. Data cleaning can consist of cleaning inconsistent data (e.g., with skip pattern errors or multiple questions throughout the survey being consistent with each other), editing numeric entries or open-ended responses for grammar and consistency, or recoding open-ended questions into categories for analysis. There is no universal set of fixed rules that every project must adhere to. Instead, each project or research study should establish its own guidelines and procedures for handling various cleaning scenarios based on its specific objectives.
+Post-survey cleaning is one of the first steps researchers take to get the survey responses into a dataset for use by analysts. Data cleaning can consist of correcting inconsistent data (e.g., resolving skip pattern errors or reconciling responses to related questions throughout the survey), editing numeric entries or open-ended responses for grammar and consistency, or recoding open-ended questions into categories for analysis. There is no universal set of fixed rules that every project must adhere to. Instead, each project or research study should establish its own guidelines and procedures for handling various cleaning scenarios based on its specific objectives.

Researchers should use their best judgment to ensure data integrity, and all decisions should be documented and available to those using the data in the analysis. Each decision a researcher makes impacts *processing error*, so often, researchers have multiple people review these rules or recode open-ended data and adjudicate any differences in an attempt to reduce this error.

-Another crucial step in post-survey processing is *imputation*. Often, there is item nonresponse where respondents do not answer specific questions. If the questions are crucial to analysis efforts or the research question, researchers may implement imputation to reduce *item nonresponse error*. Imputation is a technique for replacing missing or incomplete data values with estimated values. However, as imputation is a way of assigning a value to missing data based on an algorithm or model, it can also introduce *processing error*, so researchers should consider the overall implications of imputing data compared to having item nonresponse. There are multiple ways to impute data. We recommend reviewing other resources like @Kim2021 for more information.
+Another crucial step in post-survey processing is *imputation*. Often, there is item nonresponse where respondents do not answer specific questions. If the questions are crucial to analysis efforts or the research question, researchers may implement imputation to reduce *item nonresponse error*. Imputation is a technique for replacing missing or incomplete data values with estimated values. However, as imputation is a way of assigning a value to missing data based on an algorithm or model, it can also introduce *processing error*, so researchers should consider the overall implications of imputing data compared to having item nonresponse. There are multiple ways to impute data. We recommend reviewing other resources like @Kim2021 for more information. A toy sketch of one very simple approach follows.
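+
+Assuming a handful of invented responses, mean imputation with a flag marking imputed values might look like this (mean imputation is chosen only for simplicity; production surveys typically use richer model- or donor-based methods):
+
+```{r}
+# Toy responses to the number-of-pets question, with item nonresponse
+pets <- c(2, 0, NA, 1, 3, NA, 0)
+
+# Keep a flag identifying which values are imputed so analysts can
+# distinguish reported values from estimated ones
+imputed <- is.na(pets)
+
+# Mean imputation: fill missing values with the mean of observed values
+pets[imputed] <- mean(pets, na.rm = TRUE)
+data.frame(pets, imputed)
+```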

#### Example: Number of Pets in a Household {.unnumbered #overview-post-cleaning-ex}

Let's return to the question we created to ask about [animal preference](#overvi

### Weighting {#overview-post-weighting}

-We can address some of the error sources identified in the previous sections using *weighting*. For example, weights can address coverage, sampling, and nonresponse errors. Many published surveys include an "analysis weight" variable that combines these adjustments. However, weighting itself can also introduce *adjustment error*, so researchers need to balance which types of errors should be corrected with weighting. The construction of weights is outside the scope of this book, and researchers should reference other materials if interested in constructing their own [@Valliant2018weights]. Instead, this book assumes the survey has been completed, weights are constructed, and data is available to users. We walk users through how to read the documentation (Chapter \@ref(c03-understanding-survey-data-documentation)) and work with the data and analysis weights provided to analyze and interpret survey results correctly.
+We can address some of the error sources identified in the previous sections using *weighting*. During the weighting process, weights are created for each respondent record. These weights allow the survey responses to generalize to the population. Generally, a weight reflects how many units in the population each respondent represents, and the weights are often constructed so that they sum to the size of the population.
+
+Weights can address coverage, sampling, and nonresponse errors. Many published surveys include an "analysis weight" variable that combines these adjustments. However, weighting itself can also introduce *adjustment error*, so researchers need to balance which types of errors should be corrected with weighting. The construction of weights is outside the scope of this book, and researchers should reference other materials if interested in constructing their own [@Valliant2018weights]. Instead, this book assumes the survey has been completed, weights are constructed, and data is available to users.

#### Example: Number of Pets in a Household {.unnumbered #overview-post-weighting-ex}

-In the simple example of our survey, we decided to use a stratified sample by state to select our sample members. Knowing this sampling design, our researcher can include selection weights for analysis that account for how the sample members were selected for the survey. Additionally, the sampling frame may have the type of building associated with each address, so we could include the building type as a potential nonresponse weighting variable, along with some interviewer observations that may be related to our research topic of the average number of pets in a household. Combining these weights, we can create an analytic weight that researchers need to use when analyzing the data.
+In the simple example of our survey, we decided to obtain a random sample from each state to select our sample members. Knowing this sampling design, our researcher can include selection weights for analysis that account for how the sample members were selected for the survey, as sketched below.
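+
+Here, with invented frame counts (not actual state totals), a base selection weight under a stratified simple random sample is $N_h/n_h$, the number of frame households each sampled household in stratum $h$ represents:
+
+```{r}
+# Hypothetical stratum totals: households on the frame (N_households)
+# and households sampled (n_sampled) in each state
+strata <- data.frame(
+  state        = c("GA", "IL", "NC"),
+  N_households = c(3900000, 4900000, 4200000),
+  n_sampled    = c(50, 50, 50)
+)
+
+# Base selection weight: each sampled household represents
+# N_households / n_sampled households, so within a state the
+# weights sum back to that state's frame total
+strata$base_weight <- strata$N_households / strata$n_sampled
+strata
+```
+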
+Additionally, the sampling frame may have the type of building associated with each address, so we could include the building type as a potential nonresponse weighting variable, along with some interviewer observations that may be related to our research topic of the average number of pets in a household. Combining these weights, we can create an analytic weight that researchers need to use when analyzing the data.

### Disclosure {#overview-post-disclosure}

-Before data is released publicly, researchers need to ensure that individual respondents can not be identified by the data when confidentiality is required. There are a variety of different methods that can be used, including *data swapping*, *top or bottom coding*, *coarsening*, and *perturbation.* In data swapping, researchers may swap specific data values across different respondents so that it does not impact insights from the data but ensures that specific individuals cannot be identified. We can use *top and bottom coding* to mask extreme values. For example, researchers may top-code income values such that households with income greater than \$500,000 are coded into a single category of "\$500,000 or more". Other disclosure methods may include aggregating response categories or location information to avoid having only a few respondents in a given group and thus be identified. For example, researchers may use coarsening to display income in categories instead of as a continuous variable. We can also perturb the data by adding random noise. There is as much art as there is science to the methods used for disclosure. In the survey documentation, researchers should only provide high-level comments about the disclosure and not specific details. This ensures nobody can reverse the disclosure and thus identify individuals. For more information on different disclosure methods, please see @Skinner2009 and
-[AAPOR Standards](https://www-archive.aapor.org/Standards-Ethics/AAPOR-Code-of-Ethics/Survey-Disclosure-Checklist.aspx).
+Before data is released publicly, researchers need to ensure that individual respondents cannot be identified by the data when confidentiality is required. There are a variety of different methods that can be used. Here we describe a few of the most commonly used, with a toy sketch of top-coding and coarsening after the list:
+
+- **Data swapping**: Researchers may swap specific data values across different respondents so that it does not impact insights from the data but ensures that specific individuals cannot be identified.
+- **Top/bottom coding**: Researchers may choose top or bottom coding to mask extreme values. For example, researchers may top-code income values such that households with income greater than \$500,000 are coded as "\$500,000 or more", while other incomes are presented as integers between \$0 and \$499,999. This can impact analyses at the tails of the distribution.
+- **Coarsening**: Researchers may use coarsening to mask unique values. For example, a survey question may ask for a precise income, but the public data may include income only as a categorical variable. Another example commonly used in survey practice is to coarsen geographic variables: data collectors likely know the precise address of sample members, but the public data may include only the state or even the region of respondents.
+- **Perturbation**: Researchers may add random noise to outcomes. As with swapping, this is done so that it does not impact insights from the data but ensures that specific individuals cannot be identified.
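+
+As a toy illustration of two of these methods (the incomes below are invented, and real disclosure rules are survey-specific), top-coding and coarsening might look like:
+
+```{r}
+# Hypothetical reported household incomes, including one extreme value
+income <- c(32000, 75000, 61000, 750000, 48000)
+
+# Top-coding: cap values at $500,000 so extreme incomes are masked
+income_topcoded <- pmin(income, 500000)
+
+# Coarsening: release income only as broad categories
+income_category <- cut(
+  income,
+  breaks = c(0, 50000, 100000, 500000, Inf),
+  labels = c("Under $50K", "$50K to <$100K", "$100K to <$500K",
+             "$500K or more"),
+  right = FALSE
+)
+data.frame(income_topcoded, income_category)
+```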
+ +There is as much art as there is science to the methods used for disclosure. In the survey documentation, researchers will only provide high-level comments about the disclosure and not specific details. This ensures nobody can reverse the disclosure and thus identify individuals. For more information on different disclosure methods, please see @Skinner2009 and the [AAPOR Standards](https://www-archive.aapor.org/Standards-Ethics/AAPOR-Code-of-Ethics/Survey-Disclosure-Checklist.aspx). ### Documentation {#overview-post-documentation} diff --git a/book.bib b/book.bib index e75c4972..e09cd093 100644 --- a/book.bib +++ b/book.bib @@ -396,7 +396,7 @@ @book{Tourangeau2000psych publisher = {Cambridge University Press} } @article{Tourangeau2004spacing, - title = {Sapcing, Position, and Order: Interpretive Heuristics for Visual Features of Survey Questions}, + title = {Spacing, Position, and Order: Interpretive Heuristics for Visual Features of Survey Questions}, author = {Roger Tourangeau and Mick P. Couper and Frederick Conrad}, year = 2004, journal = {Public Opinion Quarterly}, diff --git a/renv.lock b/renv.lock index e42b3cdd..2b69c065 100644 --- a/renv.lock +++ b/renv.lock @@ -20,6 +20,34 @@ ], "Hash": "b2866e62bab9378c3cc9476a1954226b" }, + "DiagrammeR": { + "Package": "DiagrammeR", + "Version": "1.0.10", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "RColorBrewer", + "downloader", + "dplyr", + "glue", + "htmltools", + "htmlwidgets", + "igraph", + "magrittr", + "purrr", + "readr", + "rlang", + "rstudioapi", + "scales", + "stringr", + "tibble", + "tidyr", + "viridis", + "visNetwork" + ], + "Hash": "f3de4a4878163a4629a528bbcc6e655d" + }, "KernSmooth": { "Package": "KernSmooth", "Version": "2.23-22", @@ -523,6 +551,17 @@ ], "Hash": "8b708f296afd9ae69f450f9640be8990" }, + "downloader": { + "Package": "downloader", + "Version": "0.4", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "digest", + "utils" + ], + "Hash": "f4f2a915e0dedbdf016a83b63477349f" + }, "dplyr": { "Package": "dplyr", "Version": "1.1.4", @@ -1045,6 +1084,27 @@ ], "Hash": "99df65cfef20e525ed38c3d2577f7190" }, + "igraph": { + "Package": "igraph", + "Version": "1.5.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "Matrix", + "R", + "cli", + "cpp11", + "grDevices", + "graphics", + "magrittr", + "methods", + "pkgconfig", + "rlang", + "stats", + "utils" + ], + "Hash": "84818361421d5fc3ff0bf4e669524217" + }, "isoband": { "Package": "isoband", "Version": "0.2.7", @@ -2256,6 +2316,24 @@ ], "Hash": "c826c7c4241b6fc89ff55aaea3fa7491" }, + "visNetwork": { + "Package": "visNetwork", + "Version": "2.1.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "grDevices", + "htmltools", + "htmlwidgets", + "jsonlite", + "magrittr", + "methods", + "stats", + "utils" + ], + "Hash": "3e48b097e8d9a91ecced2ed4817a678d" + }, "visdat": { "Package": "visdat", "Version": "0.6.0",