---
title: "Session 5 Summary"
author: "Batool Almarzouq"
date: "`r Sys.Date()`"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Loading packages
```{r}
# load the tidyverse (dplyr, readr, tibble, ...)
library(tidyverse)
# load here, used below to build the path to the data file
library(here)
```
## Working directory
```{r}
getwd()
```
## Loading the data
```{r}
interviews <- read_csv(here("data", "SAFI_clean.csv"), na = "NULL")
```
## Inspect the data
```{r}
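# first six rows
head(interviews)
# open the data in the spreadsheet-style viewer
view(interviews)
# one line per column, with its type and first values
glimpse(interviews)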
```
## Selecting a single column
```{r}
# `$` extracts one column as a vector
interviews$no_membrs
```
## Selecting columns and filtering rows
```{r}
# keep three columns by name
interviews %>%
    select(village, no_membrs, months_lack_food)
# keep every column from village through respondent_wall_type
interviews %>%
    select(village:respondent_wall_type)
```
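To see exactly which columns `select()` keeps, here is a minimal sketch on a small made-up tibble. The object `toy` and its values are invented for illustration and are not part of the SAFI data. Naming columns keeps just those columns, while `a:b` keeps every column from `a` through `b` in the order they are stored.

```{r}
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
    village   = c("A", "B", "A"),
    no_membrs = c(3, 7, 5),
    rooms     = c(1, 2, 1),
    no_meals  = c(2, 3, 3)
)

# keep two columns by name
by_name <- toy %>%
    select(village, no_meals)

# keep a contiguous range of columns
by_range <- toy %>%
    select(no_membrs:no_meals)

names(by_name)
names(by_range)
```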
To choose rows based on specific criteria, we can use the `filter()` function.
```{r, purl = FALSE}
# filters observations where village name is "Chirodzo"
interviews %>%
    filter(village == "Chirodzo")
```
Inside `filter()`, conditions separated by commas are combined with "and". We can also form "and" statements explicitly with the `&` operator instead of commas:
```{r}
interviews %>%
    filter(village == "Chirodzo" &
               rooms > 1 &
               no_meals > 2)
```
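A quick way to convince yourself that commas and `&` behave the same inside `filter()` is to run both forms on the same data and compare the results. The sketch below uses a made-up tibble (`toy_filter` is hypothetical, not the SAFI data).

```{r}
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy_filter <- tibble(
    village  = c("A", "A", "B", "A"),
    rooms    = c(1, 2, 3, 3),
    no_meals = c(3, 3, 3, 2)
)

# conditions separated by commas
with_commas <- toy_filter %>%
    filter(village == "A", rooms > 1, no_meals > 2)

# the same conditions joined with `&`
with_amp <- toy_filter %>%
    filter(village == "A" & rooms > 1 & no_meals > 2)

# compare the two results
all.equal(with_commas, with_amp)
```

Only rows that satisfy every condition are kept, whichever form is used.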
To form "or" statements we use the logical operator for "or," which is the vertical bar (`|`):
```{r}
# filters observations with "|" logical operator
# output dataframe satisfies AT LEAST ONE of the specified conditions
interviews %>%
    filter(village == "Chirodzo" | village == "Ruaca")
```
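When every "or" condition tests the same column, the `%in%` operator is a common shorthand. It is not used elsewhere in this session, so treat the following as an optional sketch on made-up data (`toy_villages` is hypothetical).

```{r}
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy_villages <- tibble(
    village   = c("A", "B", "C", "A"),
    no_membrs = c(3, 7, 5, 4)
)

# two conditions joined with "or"
with_or <- toy_villages %>%
    filter(village == "A" | village == "B")

# the same selection written with %in%
with_in <- toy_villages %>%
    filter(village %in% c("A", "B"))

all.equal(with_or, with_in)
```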
*Pipes* let you take the output of one function and send it directly to the
next, so several steps can be chained without creating intermediate objects.
The `%>%` pipe used here becomes available when the tidyverse is loaded.
```{r}
interviews %>%
    filter(village == "Chirodzo") %>%
    select(village:respondent_wall_type)
```
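To make the idea concrete, the sketch below writes the same two steps once as nested function calls and once with the pipe. The data are made up (`toy_pipe` is hypothetical); the point is only that both forms give the same result, while the piped form reads in the order the steps happen.

```{r}
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy_pipe <- tibble(
    village   = c("A", "B", "A"),
    no_membrs = c(3, 7, 5),
    rooms     = c(1, 2, 1)
)

# nested calls: read from the inside out
nested <- select(filter(toy_pipe, village == "A"), village, rooms)

# piped calls: read from top to bottom
piped <- toy_pipe %>%
    filter(village == "A") %>%
    select(village, rooms)

all.equal(nested, piped)
```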
If we want to create a new object with this smaller version of the data, we
can assign it a new name:
```{r, purl = FALSE}
interviews_ch <- interviews %>%
    filter(village == "Chirodzo") %>%
    select(village:respondent_wall_type)

interviews_ch
```
> ## Exercise (10 mins)
>
> Using pipes, subset the `interviews` data to include interviews
> where respondents were members of an irrigation association
> (`memb_assoc`) and retain only the columns `affect_conflicts`,
> `liv_count`, and `no_meals`.
>
> > ## Solution
> >
> > ```{r}
> > interviews %>%
> >     filter(memb_assoc == "yes") %>%
> >     select(affect_conflicts, liv_count, no_meals)
> > ```
> {: .solution}
{: .challenge}

### Mutate
Frequently you'll want to create new columns based on the values in existing
columns.
We might be interested in the ratio of number of household members
to rooms used for sleeping (i.e. the average number of people per room):
```{r, purl = FALSE}
interviews %>%
    mutate(people_per_room = no_membrs / rooms)
```
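The sketch below shows `mutate()` on a tiny made-up tibble (`toy_rooms` is hypothetical) so the arithmetic is easy to check by hand. It also shows that `mutate()` returns a new table and leaves the original unchanged unless the result is assigned back.

```{r}
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy_rooms <- tibble(
    no_membrs = c(3, 8, 6),
    rooms     = c(1, 2, 3)
)

# add a column computed row by row from existing columns
with_ratio <- toy_rooms %>%
    mutate(people_per_room = no_membrs / rooms)

with_ratio$people_per_room   # 3/1, 8/2, 6/3

# the original table still has only its two columns
names(toy_rooms)
```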
We may be interested in investigating whether being a member of an irrigation association had any effect on the ratio of household members
to rooms. To look at this relationship, we will first remove data from our dataset where the respondent didn't answer the question of whether they were a member of an irrigation association.
These cases are recorded as "NULL" in the CSV file; because we passed `na = "NULL"` to `read_csv()`, they were imported as missing values (`NA`).
To remove these cases, we could insert a `filter()` in the chain:
```{r, purl = FALSE}
interviews %>%
    filter(!is.na(memb_assoc)) %>%
    mutate(people_per_room = no_membrs / rooms)
```
`is.na(memb_assoc)` returns `TRUE` where `memb_assoc` is missing. The `!` symbol
negates that result, so we keep only the rows where `is.na()` is `FALSE`, that
is, where `memb_assoc` **is not** missing.
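A small sketch of how `is.na()` and `!` combine, using made-up data (`toy_assoc` is hypothetical). It also shows why comparing with `== NA` is not a substitute: the comparison itself evaluates to `NA`, and `filter()` drops rows whose condition is `NA`.

```{r}
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy_assoc <- tibble(
    memb_assoc = c("yes", NA, "no", NA),
    no_membrs  = c(3, 7, 5, 4)
)

# TRUE marks the missing answers
is.na(toy_assoc$memb_assoc)

# keep only rows where an answer was given
answered <- toy_assoc %>%
    filter(!is.na(memb_assoc))

# comparing with NA gives NA for every row, so no row is kept
never_matches <- toy_assoc %>%
    filter(memb_assoc == NA)

nrow(answered)
nrow(never_matches)
```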
> ## Exercise (15 mins)
>
> Create a new dataframe from the `interviews` data that meets the following
> criteria: contains only the `village` column and a new column called
> `total_meals` containing a value that is equal to the total number of meals
> served in the household per day on average (`no_membrs` times `no_meals`).
> Only the rows where `total_meals` is greater than 20 should be shown in the
> final dataframe.
>
> **Hint**: think about how the commands should be ordered to produce this data
> frame!
>
> > ## Solution
> >
> > ```{r}
> > interviews_total_meals <- interviews %>%
> >     mutate(total_meals = no_membrs * no_meals) %>%
> >     filter(total_meals > 20) %>%
> >     select(village, total_meals)
> > ```
> {: .solution}
{: .challenge}

### Counting
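When working with data, we often want to know the number of observations found
for each value of a variable or combination of variables. For this task,
**`dplyr`** provides `count()`. For example, if we wanted to count the number of
rows of data for each village, we would do: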
```{r, purl = FALSE}
interviews %>%
    count(village)
```
For convenience, `count()` provides the `sort` argument to get results in
decreasing order:
```{r, purl = FALSE}
interviews %>%
    count(village, sort = TRUE)
```
### Split-apply-combine data analysis and the `summarize()` function
Many data analysis tasks can be approached using the *split-apply-combine*
paradigm: split the data into groups, apply some analysis to each group, and
then combine the results. **`dplyr`** makes this very easy through the use of
the `group_by()` function.
`group_by()` is often used together with `summarize()`, which collapses each
group into a single-row summary of that group. So to compute the average
household size by village:
```{r, purl = FALSE}
interviews %>%
    group_by(village) %>%
    summarize(mean_no_membrs = mean(no_membrs))
```
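The split, apply, and combine steps are easier to follow on data small enough to check by hand. The sketch below uses a made-up tibble (`toy_groups` is hypothetical) and also adds `n()`, which returns the number of rows in each group.

```{r}
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy_groups <- tibble(
    village   = c("A", "B", "A", "B", "B"),
    no_membrs = c(3, 7, 5, 9, 8)
)

per_village <- toy_groups %>%
    group_by(village) %>%                           # split: one group per village
    summarize(mean_no_membrs = mean(no_membrs),     # apply: mean within each group
              n = n())                              # apply: rows within each group

# combine: one row per village
per_village
```

Village "A" has the values 3 and 5, so its mean is 4 over 2 rows; village "B" has 7, 9, and 8, so its mean is 8 over 3 rows.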
You may also have noticed that the output from these calls doesn't run off the
screen anymore. That is one of the advantages of a tibble (`tbl_df`) over a plain data frame.
You can also group by multiple columns:
```{r, purl = FALSE}
interviews %>%
    group_by(village, memb_assoc) %>%
    summarize(mean_no_membrs = mean(no_membrs))
```
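Note that the output is a grouped tibble: `summarize()` drops only the last
grouping variable (`memb_assoc`), so the result is still grouped by `village`.
To obtain an ungrouped tibble, use the `ungroup()` function: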
```{r, purl = FALSE}
interviews %>%
    group_by(village, memb_assoc) %>%
    summarize(mean_no_membrs = mean(no_membrs)) %>%
    ungroup()
```
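One way to check whether a result is still grouped is `group_vars()`, which lists the current grouping columns. The sketch below uses made-up data (`toy_two` is hypothetical). The `.groups = "drop"` argument shown at the end is an alternative to `ungroup()` and assumes dplyr 1.0.0 or later.

```{r}
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy_two <- tibble(
    village    = c("A", "A", "B", "B"),
    memb_assoc = c("yes", "no", "yes", "yes"),
    no_membrs  = c(3, 5, 7, 9)
)

# summarize() removes only the last grouping level (memb_assoc)
still_grouped <- toy_two %>%
    group_by(village, memb_assoc) %>%
    summarize(mean_no_membrs = mean(no_membrs))

group_vars(still_grouped)

# ungroup() removes the remaining grouping
ungrouped <- still_grouped %>%
    ungroup()

group_vars(ungrouped)

# alternative (assumes dplyr >= 1.0.0): drop all grouping in summarize() itself
dropped <- toy_two %>%
    group_by(village, memb_assoc) %>%
    summarize(mean_no_membrs = mean(no_membrs), .groups = "drop")

group_vars(dropped)
```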
When grouping both by `village` and `memb_assoc`, we see rows in our table for
respondents who did not specify whether they were a member of an irrigation
association. We can exclude those data from our table using a filter step.
```{r, purl = FALSE}
interviews %>%
    filter(!is.na(memb_assoc)) %>%
    group_by(village, memb_assoc) %>%
    summarize(mean_no_membrs = mean(no_membrs))
```
Once the data are grouped, you can also summarize multiple variables at the same
time (and not necessarily on the same variable). For instance, we could add a
column indicating the minimum household size for each village for each group
(members of an irrigation association vs not):
```{r, purl = FALSE}
interviews %>%
    filter(!is.na(memb_assoc)) %>%
    group_by(village, memb_assoc) %>%
    summarize(mean_no_membrs = mean(no_membrs),
              min_membrs = min(no_membrs))
```
It is sometimes useful to rearrange the result of a query to inspect the values.
For instance, we can sort on `min_membrs` to put the group with the smallest
household first:
```{r, purl = FALSE}
interviews %>%
    filter(!is.na(memb_assoc)) %>%
    group_by(village, memb_assoc) %>%
    summarize(mean_no_membrs = mean(no_membrs),
              min_membrs = min(no_membrs)) %>%
    arrange(min_membrs)
```
To sort in descending order, we need to add the `desc()` function. If we want to
sort the results by decreasing order of minimum household size:
```{r, purl = FALSE}
interviews %>%
    filter(!is.na(memb_assoc)) %>%
    group_by(village, memb_assoc) %>%
    summarize(mean_no_membrs = mean(no_membrs),
              min_membrs = min(no_membrs)) %>%
    arrange(desc(min_membrs))
```
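The difference between the two sort orders is easiest to see side by side. A minimal sketch on made-up summary values (`toy_sort` is hypothetical):

```{r}
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy_sort <- tibble(
    village    = c("A", "B", "C"),
    min_membrs = c(4, 2, 6)
)

# smallest value first (the default)
ascending <- toy_sort %>%
    arrange(min_membrs)

# largest value first
descending <- toy_sort %>%
    arrange(desc(min_membrs))

ascending$village    # order by increasing min_membrs
descending$village   # order by decreasing min_membrs
```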
> ## Exercise
>
> How many households in the survey have an average of
> two meals per day? Three meals per day? Are there any other numbers
> of meals represented?
>
> > ## Solution
> >
> > ```{r}
> > interviews %>%
> >     count(no_meals)
> > ```
> {: .solution}
>
> Use `group_by()` and `summarize()` to find the mean, min, and max
> number of household members for each village. Also add the number of
> observations (hint: see `?n`).
>
> > ## Solution
> >
> > ```{r}
> > interviews %>%
> >     group_by(village) %>%
> >     summarize(
> >         mean_no_membrs = mean(no_membrs),
> >         min_no_membrs = min(no_membrs),
> >         max_no_membrs = max(no_membrs),
> >         n = n()
> >     )
> > ```
> {: .solution}
>
> What was the largest household interviewed in each month?
>
> > ## Solution
> >
> > ```{r}
> > # if not already included, add month, year, and day columns
> > library(lubridate) # load lubridate if not already loaded
> > interviews %>%
> >     mutate(month = month(interview_date),
> >            day = day(interview_date),
> >            year = year(interview_date)) %>%
> >     group_by(year, month) %>%
> >     summarize(max_no_membrs = max(no_membrs))
> > ```
> {: .solution}
{: .challenge}