-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path2_40-importing-data.qmd
697 lines (399 loc) · 17.5 KB
/
2_40-importing-data.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
# Getting Data into R
<br>
<div class="tip">
## Key Concepts
In this chapter, we'll explore the following key concepts:
* File Formats
* RData Format
* Flat or Text Files
* Wrapper Functions
## Key Takeaways
Too long; didn't read? Here's what you need to know:
* Importing data is called "reading in" data in R, and done with `read*()` functions
* Know as what format your data are stored to use the appropriate function
* Read in statistical software files with packages "foreign" or "haven"
* Read in JSON files with package "jsonlite"
* Read directly from the web with URLs
<br>
<br>
</div>
<br>
<br>
## "Reading In" Data
**Importing data** is also called **reading in data** in R.
Many base R functions begin with `read`, like `read.csv()` and `read.table()`.
Some R packages have new reading functions, like `read_csv()` in the **"readr"** package.
<br>
```{r, echo=F, fig.align="center", fig.cap="Be careful with what you read into R!",out.width="75%"}
knitr::include_graphics("https://pbfcomics.com/wp-content/uploads/2016/04/PBF130-Book_World.png")
```
<center> Source: [Perry Bible Fellowship](https://pbfcomics.com/comics/book-world/) </center>
<br>
<br>
## Know Thy Data: File Formats
> "If you know the [data] and know yourself,
> you need not fear the result of a hundred [analyses]."
> Sun Tzu, The Art of [Data Science]
<br>
If you want to get certain data into R, you must know the **file format**.
* File formats are standardized frameworks for encoding and storing information
* Some file formats are better than others, depending on the information stored
* Determine a file's format by its *filename extension*, e.g. `.csv`
* The most common data file formats include `.csv`, `.tsv`, and `.xlsx`
<br>
```{r, echo=F, fig.align="center", fig.cap='Certain formats are better for storing data.',out.width="75%"}
knitr::include_graphics("https://imgs.xkcd.com/comics/file_extensions.png")
```
<center> Source: [XKCD](https://xkcd.com/1301/) </center>
<br>
<br>
### Text or "Flat" Files
> **flatly** /ˈflatli/
> 1. Showing little interest or emotion.
> 2. In a firm and unequivocal manner; absolutely.
> 3. In a smooth and even way.
> *Oxford English Dictionary*
<br>
The far most familiar family of file formats includes **text data** or **flat files**.
* Flat files are simple, standard formats that use plain text
* Flat files are ubiquitous, non-proprietary, and accessible
* Converting a spreadsheet into a text file will "flatten it", removing all formatting
<br>
<div class="tip">
**PRO TIP:**
Flat files are ideal for sharing data with collaborators.
* Not everyone has spreadsheet software, but anyone can use flat files
* Flat files don't preserve formats, like borders and highlighted cells
* Flat files force users to store data tabularly "flatly"
</div>
<br>
**Comma-separated values** or **CSV** files store data in plain text:
* Each line in a CSV file represents one row
* Each value in a CSV file is separated by a comma
* Therefore, each comma designates a separate column
<br>
**A CSV in the Wild:** Observe the following [public construction records](https://github.com/jamisoncrawford/wealth/blob/master/Tidy%20Data/hancock_lakeview_tidy.csv) on GitHub:
```{r, echo=F, fig.align="center", fig.cap='What can you decode by observing this raw CSV file?',out.width="90%"}
knitr::include_graphics("figures/flat_file_image.jpg")
```
<br>
**A Wrangled CSV:** R (and other software) interpret and tame these formats tabularly.
```{r echo=F, warning=F, message=F}
url <- paste0("https://raw.githubusercontent.com/jamisoncrawford/",
"wealth/master/Tidy%20Data/hancock_lakeview_tidy.csv")
head(read.csv(url), 10)
```
<br>
<br>
**Tab-separated values** or **TSV** files store data in plain text:
* Like CSV, each line in a TSV file represents one row
* Also like CSV, each value in a CSV file is separated by a tab
* Each tab designates a separate column
<br>
**A TSV in the Wild:** Observe the same [public construction records](https://github.com/jamisoncrawford/wealth/blob/master/Tidy%20Data/my_tsv.tsv) on GitHub:
```{r, echo=F, fig.align="center", fig.cap='Raw TSV files are much more intepretable thanks to tabulation.',out.width="90%"}
knitr::include_graphics("figures/tsv_file_image.jpg")
```
<br>
<br>
### Excel File Formats
*Microsoft Excel* files are significantly different from text or flat files.
* All Excel file extensions begin with `.xls`; a workbook is in `.xlsx` format
* Excel workbooks with multiple sheets must be read in one sheet at a time
* Multiple packages can help R handle Excel files, e.g. `readxl`, `XLConnect`, `gdata`
<br>
<div class="caution">
**WARNING:**
Reading Excel files into R will remove all formatting, like:
* Highlighted and color-coded cells
* Font formatting, e.g. **bold** and *italics*
* Borders, merged cells, formulas, and macros
Some Excel users use formatting to represent information. For example, a user may use red, yellow, and green to represent a categorical variable like "low", "medium", and "high", respectively. In such cases, create a new column to store these data.
</div>
<br>
<br>
**Excel Files in the Wild:** It's common to see Excel files with heavy formatting:
```{r, echo=F, fig.align="center", fig.cap='If R were to import this file, as is, what information would survive?',out.width="90%"}
knitr::include_graphics("figures/excel_workbook.jpg")
```
<center> Source: [Special District Governments, CNY Vitals Legacy](https://legacy.cnyvitals.org/onondaga/civic-engagement/special-district-governments-2.html) </center>
<br>
<br>
### JSON File Format
The **JSON** format stands for "JavaScript Object Notation":
* JSON files use extension `.json`
* Like TSV files, JSON files are concise and well-structured
* Also like TSV, JSON files are generally human-readable
* JSON structure differs across files and are read into R differently
* Packages like `jsonlite` help R read in `.json` files
<br>
**JSON Files in the Wild:** Check out these [Game of Thrones subtitles](https://www.kaggle.com/gunnvant/game-of-thrones-srt#season1.json) in JSON format:
```{r, echo=F, fig.align="center", fig.cap='This JSON file has nested arrays, or sets of values within sets of values.',out.width="90%"}
knitr::include_graphics("figures/json_file.jpg")
```
<br>
<br>
### Statistical Software Formats
Each major statistical software has its own (sometimes proprietary) file format, e.g.:
* **SPSS** uses extensions `.sav` and `.por`
* **SAS** uses extensions `.sas7bdat` and `.sas7bcat`
* **STATA** uses extension `.dta`
<br>
Different packages exist for importing statistical software files, e.g.:
* Package `foreign`, created by R's core team, supports many formats
* Package `haven`, created by Hadley Wickham, is faster but for fewer formats
<br>
```{r, echo=F, fig.align="center", fig.cap='.NORM files do not exist.',out.width="75%"}
knitr::include_graphics("https://imgs.xkcd.com/comics/norm_normal_file_format.png")
```
<center> Source: [XKCD](https://xkcd.com/2116/) </center>
<br>
<br>
### RData Format
You may have noticed, when working in R, that you're asked to save your workspace.
When saving a workspace, `.rdata` or `.rda` files are stored in the working directory.
* You can manually save part or all of your workspace with function `save()`
* You can then load part or all of your workspace with function `load()`
* `.rdata` and `.rda` files store your session's objects, command history, etc.
* If creating an object is computationally expensive, consider saving it!
<br>
<br>
### Conclusions
There are a variety of file formats that you can read into R.
For virtually every file format, there are functions and packages to import them.
**You must know the file's format to decide on the best import function.**
<br>
<br>
## Key Functions for Importing Data
Once you've identified the extension type, you should consider how to read it into R.
* Is it a flat or text file?
* What character delimits each value, e.g. `,`, `;`?
* Can you read it in using base R or do you need a new package?
* What other arguments should you specify to ensure success?
We'll look at a few common functions for importing common formats.
<br>
<br>
### Workhorses & Wrapper Functions:
The *workhorse* of all base R reading functions is `read.table()`.
Thanks to its modifiability, function `read.table()` is remarkably flexible.
Not only does it import most file types, it's used by many [**wrapper functions**](https://stat.ethz.ch/pipermail/r-help/2008-March/158393.html).
<br>
<div class="quiz">
**QUESTION**
You can see whats under a function's hood by typing it without `()`.
Look at the internals of function `read.csv()`. What function is used?
</div>
```{r}
read.csv
```
<br>
A [**wrapper function**](https://stat.ethz.ch/pipermail/r-help/2008-March/158393.html) is a modified version of a more powerful and versatile function.
Each is optimized for a specific task, like `read.csv()` - a very common function.
**Wrapper functions** are like customized tools. Their toolboxes are **packages**.
<br>
<br>
## Base R: Reading CSV Files
Reading comma-separated values (CSV) files is done with base R's `read.csv()` function.
* Only requires one argument: `path =`
* Argument `path =` accepts the directory path or a web URL
```{r eval=FALSE}
read.csv(file = "~/dp4ss-textbook/my_data.csv")
read.csv(file = "http://www.ds4ps.com/textbook/my_data.csv")
```
<br>
We can read in [Syracuse, NY lead violations](http://data.syrgov.net/datasets/c15a39a8a00e48b1a60c826c8a2cb3e0_0) directly from the city's open data portal.
```{r message=F, warning=F, eval=F}
#' Put your URL in quotes; here, we name it "url"
#' Use function paste0() to join long strings of text
url <- paste0("https://opendata.arcgis.com/datasets/",
"c15a39a8a00e48b1a60c826c8a2cb3e0_0.csv")
lead_data <- read.csv(file = url, # Use object storing URL text
stringsAsFactors = FALSE) # Always include this argument!
```
<br>
Now, we have a locally-stored dataset:
```{r message=F, warning=F, eval=F}
lead_data[1:10, 1:4] # Bracks specify rows and columns to include
```
<br>
Consider the following notable arguments for `read.csv()`:
* **file =** takes a string of text, either a directory and file name or a URL
* **nrow =** take a number, setting the limit of rows to read in
* **stringsAsFactors =** converts text into categories (not cool!)
* **col.names =** takes an array made with `c()` to rename variables
* **colClasses =** also takes an array using `c()` to specify classes
<br>
<br>
<div class="tip">
**PRO TIP: READING FROM CLIPBOARDS**
Why not read the data you've copied to your clipboard?
In this approach, just change the first argument, **file =**, to "clipboard".
```{r eval=F}
read.csv(file = "clipboard")
```
</div>
<br>
<br>
## Base R: Reading TSV & Other Delimited Formats
There are many ways to delimit data in a flat (text) file.
For example, tab-delimited, semicolon-delimited, asterisk-delimited - you get it.
In these instances, you can use function `read.delim()` to specify the delimiter.
<br>
Here, we specify the delimiter as a comma, `,`:
```{r eval=F}
lead <- read.delim(file = url,
stringsAsFactors = FALSE,
sep = ',') # Specifies delimiter
```
<br>
You can specify your delimiter as any character you wish:
```{r eval=F}
read.delim(file = my_file, sep = ',')
read.delim(file = my_file, sep = ';')
read.delim(file = my_file, sep = '\t') # Tab-delimited
```
<br>
You can also read in any flat file, point-and-click, with RStudio's import wizard:
```{r, echo=F, fig.align="center", fig.cap='Open the import wizard in the Environment pane.',out.width="90%"}
knitr::include_graphics("figures/read_text_data.jpg")
```
<br>
In the import wizard, you can specify arguments with a user-friendly interface.
```{r, echo=F, fig.align="center", fig.cap='Click your way to victory!',out.width="90%"}
knitr::include_graphics("figures/import_text_ui.jpg")
```
<br>
<br>
## Package "readr" Functions
The "readr" package improves on several functions in base R's reading toolkit, e.g.:
* Built into RStudio
* All functions begin with `read_`
* All functions have more consistent argument names
* Automatically sets **stringsAsFactors =** to FALSE
* Expanded, specialized functions for specific data types
* Enhances your data frame by converting it into a "tibble"
<br>
Observe the output from the "tibble" data frame. What's different?
```{r message=F, warning=F, eval=F}
library(readr)
url <- paste0("https://opendata.arcgis.com/datasets/",
"c15a39a8a00e48b1a60c826c8a2cb3e0_0.csv")
read_csv(url)
```
<br>
<div class="tip">
**PRO TIP**
Consider getting used to **tibbles** (enhanced data frames from "readr").
* Tibbles print a lot more details about your data, e.g. variable classes
* They cooperate more easily with packages in the Rstudio ecosystem
* Also, they automatically truncate output to appear more organized
</div>
<br>
<div class="tip">
**ANOTHER PRO TIP**
You can change variable classes much more easily in "readr" functions.
**Short string representation** allows you to specify classes with one character:
* "c" for character
* "d" for double
* "i" for integer
* "l" for logical
* "_" to exclude variable
In a function, to specify 3 logical, 2 integer, and 4 character variables:
`read_csv(path = path, col_types = c("llliicccc"))`
</div>
<br>
<br>
## Importing Excel Files
There are many packages available for reading in Excel's `.xlsx` format.
Package "readxl" is part of the same package ecosystem as "readr".
* Allows new Excel-related functions, like `read_xlsx()`
* Lists all sheets in a workbook with `excel_sheets()`
* Specify certain sheets with argument **sheet =**
```{r eval=F}
library(readxl)
read_xlsx(path = url,
sheet = "FY 2018 Revenue", # Specify sheet
range = "B3:G78", # Specify range
trim_ws = TRUE, # Remove leading/trailing spaces
col_types = "dcic_cli_cc") # Specify variable classes
```
<br>
Of course, RStudio has macros set up if you'd prefer:
```{r, echo=F, fig.align="center", fig.cap="RStudio's wizard for importing from Excel.",out.width="900%"}
knitr::include_graphics("figures/excel_wizard.jpg")
```
<br>
<br>
```{r, echo=F, fig.align="center", fig.cap="Excel at times.",out.width="40%"}
knitr::include_graphics("https://i.giphy.com/media/Sb9KqeeymLlESGWZyE/giphy.webp")
```
<center> Source: [*Reddit*](https://www.reddit.com/r/gifs/comments/asanrg/nice_one_excel/) </center>
<br>
<br>
<div class="tip">
**LIFE HACK**
Using RStudio's import wizard may seem like such a noob move.
It is, but that doesn't mean you can't turn it into a clutch move.
<br>
Even when using wizards and macros, **code will still run in the R console:**
<br>
*Copy that code and throw it into your script. It's often faster than typing.*
</div>
<br>
<br>
## Importing Statistical Software Files
Two packages are commonly used to import files from SPSS, SAS, STATA and other software:
* Package "foreign" was developed by the R core team and highly versatile
* Package "haven" was developed by RStudio and only reads in SPSS, SAS, and STATA
<br>
Their differences are similar to those between base R and "readr" functions.
* `read.dta()` is equivalent to `read_dta()` in the packages, respectively
* `read.spss()` is equivalent to `read_spss()`, respectively
* `read.stata()` is equavalent to `read_stata()`, and so on
<br>
<br>
## Getting Data Off of the Web
There are a variety of methods to get data out of the web and into R.
Here are a few popular methods:
<br>
<br>
### Retrieve Download URLs
Many times, it's a matter of copying URLs from links that download datasets.
Here, we find the "Download" button on a [City Of Pheonix Open Data dataset](https://www.phoenixopendata.com/dataset/officer-involved-shooting-incidents):
<br>
```{r, echo=F, fig.align="center", fig.cap="Open data portals make it easy to download data straight into R.",out.width="90%"}
knitr::include_graphics("figures/data_from_portals.jpg")
```
<br>
```{r, echo=F, fig.align="center", fig.cap='Right-click "Download" to copy the URL.',out.width="90%"}
knitr::include_graphics("figures/data_from_portals2.jpg")
```
<br>
You can save the URL in an object or use the whole string for the **path =** argument.
<br>
<br>
### Retrieve Raw Data URLs
GitHub is riddled with datasets.
You can often find a "Download" button for them.
<br>
```{r, echo=F, fig.align="center", fig.cap="The head of a table in GitHub's data viewer.",out.width="90%"}
knitr::include_graphics("figures/github_data.jpg")
```
<br>
In the event that you cannot find a "Download" button, click on "Raw".
You can use the URL provided to read the data into R.
<br>
```{r, echo=F, fig.align="center", fig.cap='The raw CSV file, with the URL for reading into R.',out.width="90%"}
knitr::include_graphics("figures/github_raw.jpg")
```
<br>
<br>
## Further Resources
The following resources should help you further your knowledge on reading in data.
* [Official R Documentation for Function `read.table()`](https://www.rdocumentation.org/packages/utils/versions/3.6.1/topics/read.table)
* [Introduction to Package "readr"](https://cran.r-project.org/web/packages/readr/vignettes/readr.html)
* ["Reading Text Data" (Crawford, 2018)](http://rpubs.com/JamisonCrawford/readingtextdata)
* ["How to Share Data for Collaboration" (Ellis & Leek, 2017)](https://peerj.com/preprints/3139v5.pdf)
* [RStudio's "Data Import Cheat Sheet"](https://resources.rstudio.com/rstudio-cheatsheets/data-import-cheat-sheet)
<br>
<br>