-
Notifications
You must be signed in to change notification settings - Fork 0
/
04-data-io.Rmd
158 lines (97 loc) · 8.81 KB
/
04-data-io.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
---
---
---
# Data Import and Export {#data-io}
## Best Practice Summary
### Reading
When reading data the recommended approach is:
1. Read data as it is without making any changes
2. Validate the column names and types
3. Possibly also validate values
4. Apply transformations to make it tidy
Validating the data when reading it can save lots of problems later. If the data changes it should result in clear error messages so the user knows where the problem is.
Tidying the data after reading it makes it much easier to work with.
For a really robust process, you can either:
- Create a function per input containing all of the above steps
- Specify the structure of the inputs as a configuration file and create a general function that maps over all inputs reading and validating them.
## CSV
Perhaps the most convenient way to transfer tabular data between applications is the CSV (Comma-Separated Values). CSVs are just plain text files with a line per row and each column separated by commas. Reading and writing CSVs is straight forward with {readr}.
### Reading
To read a CSV use `read_csv()`, supplying the path to the file:
```{R}
library(readr)
chickens_csv <- readr_example("chickens.csv")
chickens <- read_csv(chickens_csv)
```
Notice that `read_csv()` automatically detects the type of data in each column and lets you know how it has interpreted it. It is good practice when building a process to specify the columns and types you are expecting explicitly. This way you are alerted when new data does not conform to your specification. Doing so with `read_csv()` is straight forward with the `col_types` argument. You can even use the `spec()` function to extract the types readr guessed and modify them as required:
```{R}
spec(chickens)
```
```{R}
chickens <- read_csv(
file = chickens_csv,
na = "",
col_types = cols(
chicken = col_character(),
sex = col_character(),
eggs_laid = col_integer(),
motto = col_character()
)
)
```
If you want to override the column names, you can with the `col_names` argument, but that is not recommended as it can mask unexpected data changes. It is generally better to read it in as it is, validate the columns, then rename them afterwards.
One thing to be careful of with when reading CSVs is that `read_csv()` interprets `"NA"` as a missing value. This particularly bad for Lloyd's as we have a "NA" Risk Code. To avoid this set the `na` argument to `""`, so only empty values are interpreted as missing.
One other issue you may encounter when reading tables are headers that are difficult to work with. `read_csv()` preserves the column headers as they are, but if there are spaces of special characters they have to be surrounded by ``` `` ```. Unless the preservation of names is absolutely necessary it is a good idea to replace them with tidy names. You can use the `rename()` function from {dplyr} to rename them directly, or the {janitor} package has a useful function called `clean_names()` which converts the column names into snake_case.
### Writing
Despite the name, the {readr} package also contains a function for writing CSVs. As you may expect, it is called `write_csv()`:
```{R}
write_csv(chickens, "chickens.csv", na = "")
```
As with `read_csv()` you have to be careful with missing values (`NA`). The default behaviour in `write_csv()` is to write `NA`s as "NA" in a CSV. Since we have a Risk Code called "NA" this could have unforeseen issues. It is better to specify `write_csv()` outputs `NA`s as empty strings (`""`).
## Excel
Interacting with Excel from R is not always straight forward. Excel is not the best medium for data as it combines it with formatting and presentation. As a result it can be fiddly to pick out the data you are interested in. In addition, due to the interactive nature of Excel it is unlikely data is validated and users can modify the workbook without realising it will break the downstream R processes. It is difficult to build a robust process that depends on Excel.
The following high-level rules should help working with R and Excel:
1. **Avoid using Excel**. If another data format can be used, e.g. CSV or a database, then use it. It is better to have a more robust medium and import/export into R and Excel from the simpler format. Using the data import functionality in Excel makes this very simple.
2. **Keep it simple**. Ideally you should only import/export tidy flat tables, preferably with a single table per sheet. If you must have multiple tables per worksheet then lay them out side-by-side. Avoid multiple, nested header rows. While it may look nice for presentation, it is much harder to work with. It is better to use flat tables in "input" worksheets then use pivot tables or similar to present the data in another worksheet or possibly another workbook.
3. **Validate data**. Because the structure of Excel workbooks is not locked down, the most common issue when using R is changes to the layout, which breaks the process. While validation of inputs is recommended in all processes it is even more important here. In particular, you should specify the column names and types expected, as well as allowed values. Giving a good error message if the validation fails will save lots of time debugging later on.
### Packages
#### {readxl}
The best package for reading data from Excel is {readxl}. It is part of the Tidyverse and so works in a very similar way to {readr}. The {readxl} package is intentionally simple and most of the time you will only need to use two functions:
- `excel_sheets()`: List all sheets in an excel spreadsheet.
- [`read_xlsx()`](<https://readxl.tidyverse.org/reference/read_excel.html>): Read tabular data from an xlsx file.
`read_xlsx()` has lots of options for reading the data, similar to `read_csv()`, we will summarise some of them here, but for the full list check out the documentation.
##### Table location
In order to locate the required data within the spreadsheet you have a number of options:
- `sheet`: The worksheet name. The function will extract a table based on the sheet contents. If there is only one table this will work fine, but if you have multiple tables you will need to specify a `range`.
- `skip`: Skip the specified number of rows before reading the table.
- `range`: Specify a range to read from. This can be in the format `"B3:D87"` or `"R1C3:R27C12"`. Alternately, several helper functions can be used for more flexible ranges:
- `cell_rows()`: Specify rows only, columns are determined by non-empty cells.
- `cell_cols()`: Specify columns only, rows are determined by non-empty cells.
- `cell_limits()`: Specify the upper-left and lower-right cell. Unspecified limits will be determined by non-empty cells.
##### Column Names and Types
`read_xlsx()` allows you to define the column names and types, with a few arguments:
- `col_names`: If unspecified the first row will used as column names. Otherwise, it can be used to override the column names in the spreadsheet. As with CSVs this is generally not recommended, as it may mask issues with the data, such as columns being moved.
- `col_types`: This can be used in the same way as CSVs to set the expected column types. If the data does not conform to these types a warning is raised.
##### Examples
See the [examples](https://readxl.tidyverse.org/reference/read_excel.html#ref-examples) page for the `read_xlsx()` function.
### {writexl}
Like {readxl}, {writexl} is intentionally simple. It only writes data to a new Excel workbook and the table is positioned on the top-left of the sheet.
```{R}
library(writexl)
write_xlsx(chickens, "~/chickens.xlsx")
```
If you want to write multiple table, you can used a named list, where the name is the sheet name. For example:
```{R}
write_xlsx(
x = list(iris = iris, chickens = chickens),
path = "~/chickens.xlsx"
)
```
### {openxlsx}
While most use-cases can be covered by the above packages, there are some that are not. For example, if you need to
1. Create a spreadsheet that has multiple tables per sheet, or multiple headers per table.
2. Define the formatting for the output.
3. Fill out an existing Excel template
In the above examples {writexl} will not be suitable. In these examples {openxlsx} is the package to use.
Before using {openxlsx}, you should first review alternatives. For example, it might be better to export data as a separate Excel workbook using {writexl} then use lookups in the template. Using {openxlsx} is very powerful, but it can get complicated and difficult to maintain if the spreadsheet changes regularly.
{openxlsx} can read, write and update Excel files, it can also get and set formatting and styles. It can be used to build complicated workbooks from scratch or update templates. Describing how it works is beyond the scope of this document, but if you would like to use it, it has a very good [introduction](https://ycphs.github.io/openxlsx/articles/Introduction.html).