-
Notifications
You must be signed in to change notification settings - Fork 0
/
08-tidy.Rmd
232 lines (156 loc) · 10.9 KB
/
08-tidy.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
# Tidy Data
## Readings
1. [R4DS Chapter 12](http://r4ds.had.co.nz/tidy-data.html)
1. [Wickham, Hadley, 2014. Tidy Data, Journal of Statistical Software Vol 59 (2014), Issue 10, 10.18637/jss.v059.i10](http://www.jstatsoft.org/article/view/v059i10)
## Tidy Data
There are three interrelated rules which make a dataset tidy:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
<center><img src="http://r4ds.had.co.nz/images/tidy-1.png"></center>
That interrelationship leads to an even simpler set of practical instructions:
- Put each dataset in a tibble.
- Put each variable in a column.
In this session, you will learn a consistent way to organise your data in R, an organisation called __tidy data__. Getting your data into this format requires some upfront work, but that work pays off in the long term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand. This session will give you a practical introduction to tidy data and the accompanying tools in the __tidyr__ package.
## Lesson
`table1` - `table4` (loaded with the `tidyverse` package) below shows the same data organised in four different ways. Each dataset shows the same values of four variables *country*, *year*, *population*, and *cases*, but each dataset organises the values in a different way:
```{r}
library(tidyverse)
table1
table2
table3
# Spread across two tibbles
table4a # cases
table4b # population
```
These are all representations of the same underlying data, but they are not equally easy to use. One dataset, the tidy dataset, will be much easier to work with inside the tidyverse.
### Spreading and gathering
The principles of tidy data seem so obvious that you might wonder if you'll ever encounter a dataset that isn't tidy. Unfortunately, however, most data that you will encounter will be untidy. There are two main reasons:
1. Most people aren't familiar with the principles of tidy data, and it's hard to derive them yourself unless you spend a _lot_ of time working with data.
1. Data is often organised to facilitate some use other than analysis. For example, data is often organised to make entry as easy as possible.
This means for most real analyses, you'll need to do some tidying. The first step is always to figure out what the variables and observations are. Sometimes this is easy; other times you'll need to consult with the people who originally generated the data.
The second step is to resolve one of two common problems:
1. One variable might be spread across multiple columns.
1. One observation might be scattered across multiple rows.
Typically a dataset will only suffer from one of these problems; it'll only suffer from both if you're really unlucky! To fix these problems, you'll need the two most important functions in tidyr: `gather()` and `spread()`.
#### Gathering
A common problem is a dataset where some of the column names are not names of variables, but _values_ of a variable. Take `table4a`: the column names `1999` and `2000` represent values of the `year` variable, and each row represents two observations, not one.
```{r}
table4a
```
To tidy a dataset like this, we need to __gather__ those columns into a new pair of variables. To describe that operation we need three parameters:
* The set of columns that represent values, not variables. In this example,
those are the columns `1999` and `2000`.
* The name of the variable whose values form the column names. I call that
the `key`, and here it is `year`.
* The name of the variable whose values are spread over the cells. I call
that `value`, and here it's the number of `cases`.
Together those parameters generate the call to `gather()`:
```{r}
table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")
```
The columns to gather are specified with `dplyr::select()` style notation. Here there are only two columns, so we list them individually. Note that "1999" and "2000" are non-syntactic names so we have to surround them in backticks. To refresh your memory of the other ways to select columns, see [select](#select).
```{r tidy-gather, echo = FALSE, out.width = "100%", fig.cap = "Gathering `table4` into a tidy form."}
knitr::include_graphics("fig/tidy-9.png")
```
In the final result, the gathered columns are dropped, and we get new `key` and `value` columns. Otherwise, the relationships between the original variables are preserved. Visually, this is shown in Figure \@ref(fig:tidy-gather). We can use `gather()` to tidy `table4b` in a similar fashion. The only difference is the variable stored in the cell values:
```{r}
table4b %>%
gather(`1999`, `2000`, key = "year", value = "population")
```
To combine the tidied versions of `table4a` and `table4b` into a single tibble, we need to use `dplyr::left_join()`, which you'll learn about in [relational data].
```{r}
tidy4a <- table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")
tidy4b <- table4b %>%
gather(`1999`, `2000`, key = "year", value = "population")
left_join(tidy4a, tidy4b)
```
#### Spreading
Spreading is the opposite of gathering. You use it when an observation is scattered across multiple rows. For example, take `table2`: an observation is a country in a year, but each observation is spread across two rows.
```{r}
table2
```
To tidy this up, we first analyse the representation in similar way to `gather()`. This time, however, we only need two parameters:
* The column that contains variable names, the `key` column. Here, it's
`type`.
* The column that contains values forms multiple variables, the `value`
column. Here it's `count`.
Once we've figured that out, we can use `spread()`, as shown programmatically below, and visually in Figure \@ref(fig:tidy-spread).
```{r}
spread(table2, key = type, value = count)
```
```{r tidy-spread, echo = FALSE, out.width = "100%", fig.cap = "Spreading `table2` makes it tidy"}
knitr::include_graphics("fig/tidy-8.png")
```
As you might have guessed from the common `key` and `value` arguments, `spread()` and `gather()` are complements. `gather()` makes wide tables narrower and longer; `spread()` makes long tables shorter and wider.
### Separating and uniting
So far you've learned how to tidy `table2` and `table4`, but not `table3`. `table3` has a different problem: we have one column (`rate`) that contains two variables (`cases` and `population`). To fix this problem, we'll need the `separate()` function. You'll also learn about the complement of `separate()`: `unite()`, which you use if a single variable is spread across multiple columns.
#### Separate
`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears. Take `table3`:
```{r}
table3
```
The `rate` column contains both `cases` and `population` variables, and we need to split it into two variables. `separate()` takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure \@ref(fig:tidy-separate) and the code below.
```{r}
table3 %>%
separate(rate, into = c("cases", "population"))
```
```{r tidy-separate, echo = FALSE, out.width = "75%", fig.cap = "Separating `table3` makes it tidy"}
knitr::include_graphics("fig/tidy-17.png")
```
By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter). For example, in the code above, `separate()` split the values of `rate` at the forward slash characters. If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`. For example, we could rewrite the code above as:
```{r eval = FALSE}
table3 %>%
separate(rate, into = c("cases", "population"), sep = "/")
```
(Formally, `sep` is a regular expression, which you'll learn more about in [strings].)
Look carefully at the column types: you'll notice that `case` and `population` are character columns. This is the default behaviour in `separate()`: it leaves the type of the column as is. Here, however, it's not very useful as those really are numbers. We can ask `separate()` to try and convert to better types using `convert = TRUE`:
```{r}
table3 %>%
separate(rate, into = c("cases", "population"), convert = TRUE)
```
You can also pass a vector of integers to `sep`. `separate()` will interpret the integers as positions to split at. Positive values start at 1 on the far-left of the strings; negative value start at -1 on the far-right of the strings. When using integers to separate strings, the length of `sep` should be one less than the number of names in `into`.
You can use this arrangement to separate the last two digits of each year. This make this data less tidy, but is useful in other cases, as you'll see in a little bit.
```{r}
table3 %>%
separate(year, into = c("century", "year"), sep = 2)
```
#### Unite
`unite()` is the inverse of `separate()`: it combines multiple columns into a single column. You'll need it much less frequently than `separate()`, but it's still a useful tool to have in your back pocket.
```{r tidy-unite, echo = FALSE, out.width = "75%", fig.cap = "Uniting `table5` makes it tidy"}
knitr::include_graphics("fig/tidy-18.png")
```
We can use `unite()` to rejoin the *century* and *year* columns that we created in the last example. That data is saved as `tidyr::table5`. `unite()` takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in `dplyr::select()` style:
```{r}
table5 %>%
unite(new, century, year)
```
In this case we also need to use the `sep` argument. The default will place an underscore (`_`) between the values from different columns. Here we don't want any separator so we use `""`:
```{r}
table5 %>%
unite(new, century, year, sep = "")
```
## Exercise
1. Are the bike counts data for the two bridges tidy data?
1. If not, why not? And how can we tidy it?
1. After tidying the bike counts, using functions in the `tidyr` package, create tables summarizing the average bike counts by bridge and day of week in two different formats:
```{r, echo=FALSE}
library(tidyverse)
options(knitr.kable.NA = '')
counts_dow <- tribble(
~Bridge, ~Sun, ~Mon, ~Tue, ~Wed, ~Thur, ~Fri, ~Sat,
"Hawthorne", NA, NA, NA, NA, NA, NA, NA,
"Tilikum", NA, NA, NA, NA, NA, NA, NA
)
counts_dow %>% knitr::kable(caption="Bike Counts by Day of Week and Bridge (1st Format)")
counts_dow %>%
tidyr::gather(key="Day of Week", value="counts", Sun:Sat) %>%
tidyr::spread(Bridge, counts) %>%
knitr::kable(caption="Bike Counts by Day of Week and Bridge (2nd Format)")
```
## Learning more
1. [Dataframe Manipulation with tidyr](http://swcarpentry.github.io/r-novice-gapminder/14-tidyr/)
1. [Data Wrangling Cheat sheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)
1. [Introduction to tidyr](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html)