diff --git a/89-Appendix-DataImport.Rmd b/89-Appendix-DataImport.Rmd
index 5301d9d..90beddd 100644
--- a/89-Appendix-DataImport.Rmd
+++ b/89-Appendix-DataImport.Rmd
@@ -25,14 +25,14 @@ This appendix guides analysts through the process of importing these various typ
 
 ## Importing delimiter-separated files into R
 
-Delimiter-separated files use specific characters, known as delimiters, to separate values within the file. For example, CSV (Comma-Separated Values) files use commas as delimiters, while TSV (Tab-Separated Values) files use tabs. These file formats are widely used because of their simplicity and compatibility with various software applications.
+Delimiter-separated files use specific characters, known as delimiters, to separate values within the file. For example, CSV (comma-separated values) files use commas as delimiters, while TSV (tab-separated values) files use tabs. These file formats are widely used because of their simplicity and compatibility with various software applications.
 
 The {readr} package, part of the tidyverse ecosystem, offers efficient ways to import delimiter-separated files into R [@R-readr]. It offers several advantages, including automatic data type detection and flexible handling of missing values, depending on one's survey analysis needs. The {readr} package includes functions for:
 
-* `read_csv()`: This function is specifically designed to read files with comma separated values (CSV).
-* `read_tsv()`: Use this function for files with tab separated values (TSV).
+* `read_csv()`: This function is specifically designed to read CSV files.
+* `read_tsv()`: Use this function for TSV files.
 * `read_delim()`: This function can handle a broader range of delimiter-separated files, including CSV and TSV. Specify the delimiter using the `delim` argument.
-* `read_fwf()`: This function is useful for importing Fixed-Width Files, where columns have predetermined widths, and values are aligned in specific positions.
+* `read_fwf()`: This function is useful for importing fixed-width files (FWF), where columns have predetermined widths, and values are aligned in specific positions.
 * `read_table()`: Use this function when dealing with whitespace-separated files, such as those with spaces or multiple spaces as delimiters.
 * `read_log()`: This function can read and parse web log files.
 
@@ -74,7 +74,7 @@ The arguments are:
 * `trim_ws`: a value of `TRUE` trims leading and trailing white space
 * `skip`: number of lines to skip before importing the data
 * `n_max`: maximum number of lines to read
-* `guess_max`: maximum number of lines use for guessing column types
+* `guess_max`: maximum number of lines used for guessing column types
 * `name_repair`: whether to check column names. By default, the column names are unique.
 * `num_threads`: the number of processing threads to use for initial parsing and lazy reading of data
 * `progress`: a value of `TRUE` displays a progress bar
@@ -163,7 +163,7 @@ The arguments are:
 * `file`: the path to the proprietary data file to import
 * `encoding`: specifies the character encoding of the data file
 * `col_select`: selects specific columns for import
-* `skip` and `n_max`: controls the number of rows skipped and the maximum number of rows imported
+* `skip` and `n_max`: control the number of rows skipped and the maximum number of rows imported
 * `.name_repair`: determines how column names are repaired if they are not valid
 
 The syntax for `read_sav()` is similar to `read_dat()`:
@@ -185,8 +185,8 @@ The arguments are:
 * `file`: the path to the proprietary data file to import
 * `encoding`: specifies the character encoding of the data file
 * `col_select`: selects specific columns for import
-* `user_na`: a value of `TRUE` reads variables with user defined missing labels into `labelled_spss()` objects
-* `skip` and `n_max`: controls the number of rows skipped and the maximum number of rows imported
+* `user_na`: a value of `TRUE` reads variables with user-defined missing labels into `labelled_spss()` objects
+* `skip` and `n_max`: control the number of rows skipped and the maximum number of rows imported
 * `.name_repair`: determines how column names are repaired if they are not valid
 
 The syntax for importing SAS files with `read_sas()` is as follows:
@@ -211,7 +211,7 @@ The arguments are:
 * `encoding`: specifies the character encoding of the data file
 * `catalog_encoding`: specifies the character encoding of the catalog file
 * `col_select`: selects specific columns for import
-* `skip` and `n_max`: controls the number of rows skipped and the maximum number of rows imported
+* `skip` and `n_max`: control the number of rows skipped and the maximum number of rows imported
 * `.name_repair`: determines how column names are repaired if they are not valid
 
 In the code examples below, we demonstrate how to import Stata, SPSS, and SAS files into R using the respective {haven} functions. The resulting data are stored in `anes_dta`, `anes_sav`, and `anes_sas` objects as tibbles, ready for use in R. For the Stata example, we show how to import the data from the {srvyrexploR} package to use in examples.
@@ -253,7 +253,7 @@ anes_sas <-
 \index{American National Election Studies (ANES)|(} \index{Categorical data|(}
 Stata, SPSS, and SAS files can contain labeled variables and values. These labels provide descriptive information about categorical data, making them easier to understand and analyze. When importing data from Stata, SPSS, or SAS, we want to preserve these labels to maintain data fidelity.
 
-Consider a variable like 'Education Level' with coded values (e.g., 1, 2, 3.) Without labels, these codes can be cryptic. However, with labels ('High School Graduate,' 'Bachelor's Degree,' 'Master's Degree'), the data become more informative and easier to work with.
+Consider a variable like 'Education Level' with coded values (e.g., 1, 2, 3). Without labels, these codes can be cryptic. However, with labels ('High School Graduate,' 'Bachelor's Degree,' 'Master's Degree'), the data become more informative and easier to work with.
 
 With the {haven} package, we have the capability to import and work with labeled data from Stata, SPSS, and SAS files. The package uses a special class of data called `haven_labelled` to store labeled variables. When a dataset label is defined in Stata, it is stored in the 'label' attribute of the tibble when imported, ensuring that the information is not lost.
 
@@ -276,7 +276,7 @@ We can confirm their label status using the `haven::is.labelled()` function.
 haven::is.labelled(anes_dta$V200002)
 ```
 
-To explore the labels further, we can use the `attributes()` function. This function provides insights into both the variable labels (`$label`) and the associated value labels (`$labels`.)
+To explore the labels further, we can use the `attributes()` function. This function provides insights into both the variable labels (`$label`) and the associated value labels (`$labels`).
 
 ```{r}
 #| label: readr-attributes
@@ -310,7 +310,7 @@ Factors are integer vectors, though they may look like character strings. We can
 glimpse(factors)
 ```
 
-R's factors differ from Stata, SPSS, or SAS' labeled vectors. However, we can convert labeled variables into factors using the `as_factor()` function.
+R's factors differ from Stata, SPSS, or SAS labeled vectors. However, we can convert labeled variables into factors using the `as_factor()` function.
 
 ```{r}
 #| label: readr-factor-create
@@ -335,9 +335,9 @@ anes_dta_factor %>%
 
 #### Option 2: Strip the labels {-}
 
-The second option is to remove the labels altogether, converting the labeled data into a regular R data frame. To remove, or 'zap', the labels from our tibble, we can use the {haven} package's `zap_label()` and `zap_labels()` functions. This approach removes the labels but retains the data values in their original form.
+The second option is to remove the labels altogether, converting the labeled data into a regular R data frame. To remove, or 'zap,' the labels from our tibble, we can use the {haven} package's `zap_label()` and `zap_labels()` functions. This approach removes the labels but retains the data values in their original form.
 
-The ANES Stata file columns contain variable labels. Using the `map()` function from {purrr}, we can review the labels using `attr`. In the example below, we list the first two variables and their labels. For instance, the label for `V200002` is "Mode of interview: pre-election interview".
+The ANES Stata file columns contain variable labels. Using the `map()` function from {purrr}, we can review the labels using `attr`. In the example below, we list the first two variables and their labels. For instance, the label for `V200002` is "Mode of interview: pre-election interview."
 
 ```{r}
 #| label: readr-label-show
@@ -398,7 +398,7 @@ In survey data analysis, dealing with missing values is a crucial aspect of data
 
 SAS and Stata use a concept known as 'tagged' missing values, which extend R's regular `NA`. A 'tagged' missing value is essentially an `NA` with an additional single-character label. These values behave identically to regular `NA` in standard R operations while preserving the informative tag associated with the missing value.
 
-Here is an example from the NORC at the University of Chicago’s 2018 General Society Survey, where Don't know (`DK`) responses are tagged as `NA(d)`, Inapplicable (`IAP`) responses are tagged as `NA(i)`, and `No answer` responses are tagged as `NA(n)` [@gss-codebook].
+Here is an example from the NORC at the University of Chicago’s 2018 General Society Survey, where Don't Know (`DK`) responses are tagged as `NA(d)`, Inapplicable (`IAP`) responses are tagged as `NA(i)`, and `No Answer` responses are tagged as `NA(n)` [@gss-codebook].
 
 ```r
 head(gss_dta$HEALTH)
@@ -441,7 +441,7 @@ head(gss_sps$HEALTH)
 
 ## Importing data from APIs into R
 
-In addition to working with data saved as files, we may also need to retrieve data through Application Programming Interfaces (APIs.) APIs provide a structured way to access data hosted on external servers and import them directly into R for analysis.
+In addition to working with data saved as files, we may also need to retrieve data through Application Programming Interfaces (APIs). APIs provide a structured way to access data hosted on external servers and import them directly into R for analysis.
 
 To access these data, we need to understand how to construct API requests. Each API has unique endpoints, parameters, and authentication requirements. Pay attention to:
 
@@ -467,7 +467,7 @@ survey_data <- fromJSON(content(response, "text"))
 
 Note that these are dummy examples. Please review the documentation to understand how to make requests from a specific API.
 
-R offers several packages that simplify API access by providing ready-to-use functions for popular APIs. These packages are called "wrappers", as they "wrap" the API in R to make it easier to use. For example, the {tidycensus} package used in this book simplifies access to U.S. Census data, allowing us to retrieve data with R commands instead of writing API requests from scratch [@R-tidycensus]. Behind the scenes, `get_pums()` is making a GET request from the Census API, and the {tidycensus} functions are converting the response into an R-friendly format. For example, if we are interested in the age, sex, race, and Hispanicity of those in the American Community Survey sample of Durham County, North Carolina^[The public use microdata areas (PUMA) for Durham County were identified using the 2020 PUMA Names File: https://www2.census.gov/geo/pdfs/reference/puma2020/2020_PUMA_Names.pdf], we can use the `get_pums()` function to extract the microdata as shown in the code below. We can then use the replicate weights to create a survey object and calculate estimates for Durham County.
+R offers several packages that simplify API access by providing ready-to-use functions for popular APIs. These packages are called "wrappers," as they "wrap" the API in R to make it easier to use. For example, the {tidycensus} package used in this book simplifies access to U.S. Census data, allowing us to retrieve data with R commands instead of writing API requests from scratch [@R-tidycensus]. Behind the scenes, `get_pums()` is making a GET request from the Census API, and the {tidycensus} functions are converting the response into an R-friendly format. For example, if we are interested in the age, sex, race, and Hispanicity of those in the American Community Survey sample of Durham County, North Carolina^[The public use microdata areas (PUMA) for Durham County were identified using the 2020 PUMA Names File: https://www2.census.gov/geo/pdfs/reference/puma2020/2020_PUMA_Names.pdf], we can use the `get_pums()` function to extract the microdata as shown in the code below. We can then use the replicate weights to create a survey object and calculate estimates for Durham County.
 
 ```{r}
 #| label: readr-pumsin
@@ -490,7 +490,7 @@ durh_pums <- get_pums(
 durh_pums
 ```
 
-In Chapter \@ref(c04-getting-started), we used the {censusapi} package to get data from the Census data API for the Current Population Survey. To discover if there's an R package that directly interfaces with a specific survey or data source, search for "[survey] R wrapper" or "[data source] R package" online.
+In Chapter \@ref(c04-getting-started), we used the {censusapi} package to get data from the Census data API for the Current Population Survey. To discover if there is an R package that directly interfaces with a specific survey or data source, search for "[survey] R wrapper" or "[data source] R package" online.
 
 ## Importing data from databases in R