-
Notifications
You must be signed in to change notification settings - Fork 63
/
Copy pathintro.Rmd
388 lines (291 loc) · 11.7 KB
/
intro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
---
title: "Intro to dplyr and ggplot2"
---
*Note: this tutorial originally comes from the wonderful people at [`software carpentry`](https://software-carpentry.org/lessons/).*
First, we need to make sure that we have the necessary packages installed.
```{r,eval=FALSE}
install.packages('tidyr') # Includes a lot of pacakges, including dplyr and ggplot2
install.packages('gapminder') # Has some fun data to play with.
```
Once they are, load the packages.
```{r}
library(gapminder)
library(tidyr)
library(dplyr)
library(ggplot2)
```
I always like to use the `str` (which stands for structure) command to look what's in a data frame:
```{r}
str(gapminder)
```
And some summary stats:
```{r}
summary(gapminder)
```
Manipulation of dataframes means many things to many researchers, we often
select certain observations (rows) or variables (columns), we often group the
data by a certain variable(s), or we even calculate summary statistics. We can
do these operations using the normal base R operations:
```{r}
mean(gapminder[gapminder$continent == "Africa", ]$gdpPercap)
mean(gapminder[gapminder$continent == "Americas", ]$gdpPercap)
mean(gapminder[gapminder$continent == "Asia", ]$gdpPercap)
```
But this isn't very *nice* because there is a fair bit of repetition. Repeating
yourself will cost you time, both now and later, and potentially introduce some
nasty bugs.
## The `dplyr` package
Luckily, the [`dplyr`](https://cran.r-project.org/web/packages/dplyr/dplyr.pdf)
package provides a number of very useful functions for manipulating dataframes
in a way that will reduce the above repetition, reduce the probability of making
errors, and probably even save you some typing. As an added bonus, you might
even find the `dplyr` grammar easier to read.
Here we're going to cover 6 of the most commonly used functions as well as using
pipes (`%>%`) to combine them.
1. `select()`
2. `filter()`
3. `group_by()`
4. `summarize()`
5. `mutate()`
# Using select()
If, for example, we wanted to move forward with only a few of the variables in
our dataframe we could use the `select()` function. This will keep only the
variables you select.
```{r}
select(gapminder,year,country,gdpPercap)
```
If we open up `year_country_gdp` we'll see that it only contains the year,
country and gdpPercap. Above we used 'normal' grammar, but the strengths of
`dplyr` lie in combining several functions using pipes. Since the pipes grammar
is unlike anything we've seen in R before, let's repeat what we've done above
using pipes.
```{r}
gapminder %>% select(year,country,gdpPercap)
```
To help you understand why we wrote that in that way, let's walk through it step
by step. First we summon the gapminder dataframe and pass it on, using the pipe
symbol `%>%`, to the next step, which is the `select()` function. In this case
we don't specify which data object we use in the `select()` function since in
gets that from the previous pipe. **Fun Fact**: There is a good chance you have
encountered pipes before in the shell. In R, a pipe symbol is `%>%` while in the
shell it is `|` but the concept is the same!
## Using filter()
If we now wanted to move forward with the above, but only with European
countries, we can combine `select` and `filter`
```{r}
gapminder %>%
filter(continent =="Asia") %>%
select(year,country,gdpPercap)
```
## Challenge 1
Write a command that
will produce a dataframe that has the African values for `lifeExp`, `country`
and `year`, but not for other Continents. How many rows does your dataframe
have?
## Solution to Challenge 1
```{r}
#TODO: Your code here
```
## Using group_by() and summarize()
Now, we were supposed to be reducing the error prone repetitiveness of what can
be done with base R, but up to now we haven't done that since we would have to
repeat the above for each continent. Instead of `filter()`, which will only pass
observations that meet your criteria (in the above: `continent=="Europe"`), we
can use `group_by()`, which will essentially use every unique criteria that you
could have used in filter.
```{r}
str(gapminder)
str(gapminder %>% group_by(continent))
```
You will notice that the structure of the dataframe where we used `group_by()`
(`grouped_df`) is not the same as the original `gapminder` (`data.frame`). A
`grouped_df` can be thought of as a `list` where each item in the `list` is a
`data.frame` which contains only the rows that correspond to the a particular
value `continent` (at least in the example above).
## Using summarize()
The above was a bit on the uneventful side because `group_by()` much more
exciting in conjunction with `summarize()`. This will allow use to create new
variable(s) by using functions that repeat for each of the continent-specific
data frames. That is to say, using the `group_by()` function, we split our
original dataframe into multiple pieces, then we can run functions
(e.g. `mean()` or `sd()`) within `summarize()`.
```{r}
gapminder %>%
group_by(continent) %>%
summarize(mean_gdpPercap=mean(gdpPercap))
```
That allowed us to calculate the mean gdpPercap for each continent, but it gets
even better.
## Challenge 2
Calculate the average life expectancy per country. Which has the longest average life
expectancy and which has the shortest average life expectancy?
## Solution to Challenge 2
```{r}
#TODO: Your code here
```
The function `group_by()` allows us to group by multiple variables. Let's group by `year` and `continent`.
```{r}
gapminder %>%
group_by(continent,year) %>%
summarize(mean_gdpPercap=mean(gdpPercap))
```
That is already quite powerful, but it gets even better! You're not limited to defining 1 new variable in `summarize()`.
```{r}
gapminder %>%
group_by(continent,year) %>%
summarize(mean_gdpPercap=mean(gdpPercap),
sd_gdpPercap=sd(gdpPercap),
mean_pop=mean(pop),
sd_pop=sd(pop))
```
## count() and n()
A very common operation is to count the number of observations for each
group. The `dplyr` package comes with two related functions that help with this.
For instance, if we wanted to check the number of countries included in the
dataset for the year 2002, we can use the `count()` function. It takes the name
of one or more columns that contain the groups we are interested in, and we can
optionally sort the results in descending order by adding `sort=TRUE`:
```{r}
gapminder %>%
filter(year == 2002) %>%
count(continent, sort = TRUE)
```
If we need to use the number of observations in calculations, the `n()` function
is useful. For instance, if we wanted to get the standard error of the life
expectency per continent:
```{r}
gapminder %>%
group_by(continent) %>%
summarize(se_pop = sd(lifeExp)/sqrt(n()))
```
## Using mutate()
We can also create new variables prior to (or even after) summarizing information using `mutate()`.
```{r}
gapminder %>%
mutate(gdp_billion=gdpPercap*pop/10^9) %>%
group_by(continent,year) %>%
summarize(mean_gdpPercap=mean(gdpPercap),
sd_gdpPercap=sd(gdpPercap),
mean_pop=mean(pop),
sd_pop=sd(pop),
mean_gdp_billion=mean(gdp_billion),
sd_gdp_billion=sd(gdp_billion))
```
## Connect mutate with logical filtering: ifelse
When creating new variables, we can hook this with a logical condition. A simple combination of
`mutate()` and `ifelse()` facilitates filtering right where it is needed: in the moment of creating something new.
This easy-to-read statement is a fast and powerful way of discarding certain data (even though the overall dimension
of the data frame will not change) or for updating values depending on this given condition.
```{r}
## keeping all data but "filtering" after a certain condition
# calculate GDP only for people with a life expectation above 25
gapminder %>%
mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA)) %>%
group_by(continent, year) %>%
summarize(mean_gdpPercap = mean(gdpPercap),
sd_gdpPercap = sd(gdpPercap),
mean_pop = mean(pop),
sd_pop = sd(pop),
mean_gdp_billion = mean(gdp_billion),
sd_gdp_billion = sd(gdp_billion))
## updating only if certain condition is fullfilled
# for life expectations above 40 years, the gpd to be expected in the future is scaled
gapminder %>%
mutate(gdp_futureExpectation = ifelse(lifeExp > 40, gdpPercap * 1.5, gdpPercap)) %>%
group_by(continent, year) %>%
summarize(mean_gdpPercap = mean(gdpPercap),
mean_gdpPercap_expected = mean(gdp_futureExpectation))
```
## Combining `dplyr` and `ggplot2`
`ggplot2` is a powerful graphics package for producing beautiful, comprehensive, plots and graphs. Once you understand how it works,
it's actually very simple to use.
Let's use `filter` to just give us one country's data, and then pass that along to `ggplot2` to make a line plot.
```{r}
gapminder %>%
filter(country == "Canada") %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_line()
```
We can even plot more than one country.
```{r}
gapminder %>%
filter(country %in% c("Canada", "Germany")) %>%
ggplot(aes(x = year, y = lifeExp, color=country)) +
geom_line(size=2)
```
We can plot them separately, in what `ggplot2` calls "facets."
```{r}
gapminder %>%
filter(country %in% c("Canada", "Germany")) %>%
ggplot(aes(x = year, y = lifeExp, color=continent)) +
geom_line() + facet_wrap ( ~ country)
```
```{r}
gapminder %>%
# Get the start letter of each country
mutate(startsWith = substr(country, start = 1, stop = 1)) %>%
# Filter countries that start with "A" or "Z"
filter(startsWith %in% c("A", "Z")) %>%
# Make the plot
ggplot(aes(x = year, y = lifeExp, color = continent)) +
geom_line() +
facet_wrap( ~ country)
```
Using `dplyr` functions also helps us simplify things, for example we could
combine the first two steps:
```{r}
gapminder %>%
# Filter countries that start with "A" or "Z"
filter(substr(country, start = 1, stop = 1) %in% c("A", "Z")) %>%
# Make the plot
ggplot(aes(x = year, y = lifeExp, color = continent)) +
geom_line() +
facet_wrap( ~ country)
```
## Challenge 3
Calculate the average life expectancy in 2002 of 2 randomly selected countries
for each continent. Then arrange the continent names in reverse order.
**Hint:** Use the `dplyr` functions `arrange()` and `sample_n()`, they have
similar syntax to other dplyr functions.
## Solutions to challenges
#### Challenge 1
```{r, echo=FALSE}
gapminder %>%
filter(continent=="Africa") %>%
select(year,country,lifeExp)
# As with last time, first we pass the gapminder dataframe to the `filter()`
# function, then we pass the filtered version of the gapminder dataframe to the
# `select()` function. **Note:** The order of operations is very important in this
# case. If we used 'select' first, filter would not be able to find the variable
# continent since we would have removed it in the previous step.
```
#### Challenge 2
```{r, echo=FALSE}
lifeExp_bycountry <- gapminder %>%
group_by(country) %>%
summarize(mean_lifeExp=mean(lifeExp))
lifeExp_bycountry %>%
filter(mean_lifeExp == min(mean_lifeExp) | mean_lifeExp == max(mean_lifeExp))
```
```{r}
# Another way to do this is to use the `dplyr` function `arrange()`, which
# arranges the rows in a data frame according to the order of one or more
# variables from the data frame. It has similar syntax to other functions from
# the `dplyr` package. You can use `desc()` inside `arrange()` to sort in
# descending order.
lifeExp_bycountry %>%
arrange(mean_lifeExp) %>%
head(1)
lifeExp_bycountry %>%
arrange(desc(mean_lifeExp)) %>%
head(1)
```
#### Challenge 3
```{r, echo=FALSE}
gapminder %>%
filter(year==2002) %>%
group_by(continent) %>%
sample_n(2) %>%
summarize(mean_lifeExp=mean(lifeExp)) %>%
arrange(desc(mean_lifeExp))
```