-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy path02_isolating-data.Rmd
213 lines (140 loc) · 8.42 KB
/
02_isolating-data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
---
title: "Isolating Data"
output: github_document
---
[<<< Previous](data-wrangling.md) | [Next >>>](03_piping.md)
-----
```{r, results='hide', echo = FALSE, warning=FALSE, message=FALSE}
library(tidyverse)
```
```{r, results='hide', echo = FALSE, warning=FALSE, message=FALSE}
spotify <- read_csv("data/spotify.csv")
```
This type of task occurs often in Data Science: you need to extract data from a table before you can use it. You can do this task quickly with three functions that come in the dplyr package:
1. **select()** - which extracts columns from a data frame
1. **filter()** - which extracts rows from a data frame
Each function takes a data frame or tibble as it's first argument and returns a new data frame or tibble as its output.
## select()
[![Source: dplyr cheatsheet][1]][2]
[1]: images/select.png
[2]: https://rstudio.com/resources/cheatsheets/
`select()` extracts columns of a data frame and returns the columns as a new data frame. To use `select()`, pass it the name of a data frame to extract columns from, and then the names of the columns to extract. The column names do not need to appear in quotation marks or be prefixed with a `$`; `select()` knows to find them in the data frame that you supply.
**Exercise 1**
Use the example below to get a feel for `select()`. Can you extract just the `genre` column? How about the `genre` and `energy` columns?
```{r}
select(spotify, loudness)
```
-----
### select() helpers
You can also use a series of helpers with `select()`. For example, if you place a minus sign before a column name, `select()` will return every column but that column. The code below will return every column except lyrics.
```{r, eval=FALSE}
select(spotify, -lyrics)
```
The table below summarizes the other `select()` helpers that are available in dplyr. This is here for your reference and to demonstrate the flexibility of `select()`
Helper Function | Use | Example
----------------- | ------------------------------------------------- | -------
**-** | Columns except | `select(spotify, -loudness)`
**:** | Columns between (inclusive) | `select(spotify, loudness:danceability)`
**contains()** | Columns that contains a string | `select(spotify, contains("d"))`
**ends_with()** | Columns that ends with a string | `select(spotify, ends_with("y"))`
**matches()** | Columns that matches a regex | `select(spotify, matches("n"))`
**num_range()** | Columns with a numerical suffix in the range | Not applicable with `spotify`
**one_of()** | Columns whose name appear in the given set | `select(spotify, one_of(c("energy", "tempo")))`
**starts_with()** | Columns that starts with a string | `select(spotify, starts_with("e"))`
**Exercise 2**
Use the ```contains()``` helper function to select the columns that contain the letter "n". The structure should look like this: ```select(spotify, contains("d"))```. What variables are returned?
-----
### Go further
The `tidyr` package allows you to reshape your entire data frame, rather than perform simple operations with columns. It's super helpful if you need to, for instance, convert your *wide* data frame to a *long* (i.e. *tidy*) data frame, or convert your nested JSON data to a tidy data frame. For a short introduction, [check this out](https://tidyr.tidyverse.org/)
----
## filter()
[![Source: dplyr cheatsheet][3]][2]
[3]: images/filter.png
`filter()` extracts rows from a data frame and returns them as a new data frame. As with `select()`, the first argument of `filter()` should be a data frame to extract rows from. The arguments that follow should be logical tests; `filter()` will return every row for which the tests return `TRUE`.
### Filter in action
For example, the code chunk below returns every row with the genre "Alternative" in `spotify`.
```{r echo = TRUE}
filter(spotify, genre == "Alternative")
```
### Logical tests
To get the most from filter, you will need to know how to use R's logical test operators, which are summarised below.
Logical operator | tests | Example
---------------- | ------------------------------------ | ---------
**>** | Is `x` greater than `y`? | `x > y`
**>=** | Is `x` greater than or equal to `y`? | `x >= y`
**<** | Is `x` less than `y`? | `x < y`
**<=** | Is `x` less than or equal to `y`? | `x <= y`
**==** | Is `x` equal to `y`? | `x == y`
**!=** | Is `x` not equal to `y`? | `x != y`
**is.na()** | Is `x` an `NA`? | `is.na(x)`
**!is.na()** | Is `x` not an `NA`? | `!is.na(x)`
-----
**Exercise 3**
See if you can use the logical operators to manipulate our code below to show:
* All of the songs in the "Rap" genre
* All of the songs where tempo is less than or equal to 80
* *Stretch challenge*. All songs that have more than 40 characters in their lyrics. Hint: you will need to use the ```nchar()``` function.
```{r, eval=FALSE}
filter(spotify, genre == "Ska")
```
-----
### Two common mistakes
When you use logical tests, be sure to look out for two common mistakes. One appears in each code chunk below. Can you find them? When you spot a mistake, fix it and then run the chunk to confirm that it works.
```{r, eval = FALSE}
filter(spotify, genre = "Ska")
```
```{r, eval = FALSE}
filter(spotify, genre == Ska)
```
**Answer**
```{r}
filter(spotify, genre == "Ska")
```
### Two mistakes - Recap
When you use logical tests, be sure to look out for these two common mistakes:
1. using `=` instead of `==` to test for equality.
2. forgetting to use quotation marks when comparing strings, e.g. `genre == Ska`, instead of `genre == "Ska"`
### Boolean operators
You can find a complete list of base R's boolean operators in the table below. This is here for your reference.
Boolean operator | represents | Example
---------------- | ------------------------------------------ | ---------
**&** | Are _both_ `A` and `B` true? | `A & B`
`|` | Are _one or both_ of `A` and `B` true? | `A | B`
**!** | Is `A` _not_ true? | `!A`
**xor()** | Is _one and only one_ of `A` and `B` true? | `xor(A, B)`
**%in%** | Is `x` in the set of `a`, `b`, and `c`? | `x %in% c(a, b, c)`
**any()** | Are _any_ of `A`, `B`, or `C` true? | `any(A, B, C)`
**all()** | Are _all_ of `A`, `B`, or `C` true? | `all(A, B, C)`
These allow you to customize filtered searches according to a variety of criteria. We're not going to go into depth about all of these here. The basic Boolean operators we will focus on today are **&** and `|` (or).
If you provide more than one test to `filter()`, `filter()` will combine the tests with an **and** statement (`&`): it will only return the rows that satisfy all of the tests.
For instance,
```{r, eval=FALSE}
filter(spotify, genre == "Alternative", tempo > 80)
```
will only return songs that are classified as "Alternative" **and** are faster than 80 beats per minute. Run this code and see how many observations are returned.
Now, try this.
```{r, eval=FALSE}
filter(spotify, genre == "Alternative" | tempo > 80)
```
Many more observations are returned because the function is filtering for songs that are classified as "Alternative" **or** have a tempo greater than 80 beats per minute. A song only has to satisfy one of the criteria to pass the filter.
### Go further
The `stringr` package is very helpful for manipulating strings in R. It's part of the core tidyverse set of packages, so it is easy to integrate into your data wrangling pipeline. It takes some of the pain away from forming your own `regex` expressions, but is flexible enough to accomodate complex cases. I use `stringr` regularly within `filter()` and other data manipulation functions. If you're interested, check out [this light introduction](https://stringr.tidyverse.org/).
-----
## Answers
**Exercise 1**
```{r, eval=FALSE}
select(spotify, genre)
select(spotify, genre, energy)
```
**Exercise 2**
```{r}
spotify %>% select(contains("n"))
```
**Exercise 3**
```{r, eval=FALSE}
filter(spotify, genre == "Rap")
filter(spotify, tempo <= 80)
filter(spotify, nchar(lyrics) > 40)
```
-----
[<<< Previous](data-wrangling.md) | [Next >>>](03_piping.md)