<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<title>Software Carpentry: R for reproducible scientific analysis</title>
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
<link rel="stylesheet" type="text/css" href="css/swc.css" />
<link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
<meta charset="UTF-8" />
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body class="lesson">
<div class="container card">
<div class="banner">
<a href="http://software-carpentry.org" title="Software Carpentry">
<img alt="Software Carpentry banner" src="img/software-carpentry-banner.png" />
</a>
</div>
<article>
<div class="row">
<div class="col-md-10 col-md-offset-1">
<a href="index.html"><h1 class="title">R for reproducible scientific analysis</h1></a>
<h2 class="subtitle">Dataframe manipulation with tidyr</h2>
<section class="objectives panel panel-warning">
<div class="panel-heading">
<h2 id="learning-objectives"><span class="glyphicon glyphicon-certificate"></span>Learning Objectives</h2>
</div>
<div class="panel-body">
<ul>
<li>To understand the concepts of ‘long’ and ‘wide’ data formats and be able to convert between them with <code>tidyr</code></li>
</ul>
</div>
</section>
<p>Researchers often want to manipulate their data from the ‘wide’ to the ‘long’ format, or vice-versa. The ‘long’ format is where:</p>
<ul>
<li>each column is a variable</li>
<li>each row is an observation</li>
</ul>
<p>In the ‘long’ format, you usually have 1 column for the observed variable and the other columns are ID variables.</p>
<p>In the ‘wide’ format, each row is often a site/subject/patient and you have multiple observation variables containing the same type of data. These can be either repeated observations over time, or observations of multiple variables (or a mix of both). Data entry may be simpler in the ‘wide’ format, and some other applications may prefer it. However, many of <code>R</code>’s functions have been designed assuming you have ‘long’ format data. This tutorial will help you efficiently transform your data regardless of its original format.</p>
<div class="figure">
<img src="fig/14-tidyr-fig1.png" alt="" />
</div>
<p>These data formats mainly affect readability. For humans, the wide format is often more intuitive, since its shape lets us see more of the data on the screen at once. However, the long format is more machine readable and is closer to the formatting of databases. The ID variables in our dataframes are similar to the fields in a database and observed variables are like the database values.</p>
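<p>As a minimal sketch of the two layouts, the toy data below (hypothetical sites and values, not taken from gapminder) stores the same four measurements first in the wide and then in the long format:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical toy data: two sites, each measured in two years (wide format)
toy_wide <- data.frame(
  site = c("A", "B"),
  temp_2000 = c(10.1, 12.3),
  temp_2001 = c(10.4, 12.0),
  stringsAsFactors = FALSE
)

# The same four measurements in the long format:
# two ID variables (site, year) plus one column of observed values
toy_long <- data.frame(
  site = c("A", "B", "A", "B"),
  year = c(2000L, 2000L, 2001L, 2001L),
  temp = c(10.1, 12.3, 10.4, 12.0),
  stringsAsFactors = FALSE
)

dim(toy_wide)  # 2 rows, 3 columns
dim(toy_long)  # 4 rows, 3 columns</code></pre></div>
<p>Both objects hold the same information; only the shape differs.</p>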
<h2 id="getting-started">Getting started</h2>
<p>First install the packages if you haven’t already done so (you probably installed dplyr in the previous lesson):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co">#install.packages("tidyr")</span>
<span class="co">#install.packages("dplyr")</span></code></pre></div>
<p>Load the packages:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(<span class="st">"tidyr"</span>)
<span class="kw">library</span>(<span class="st">"dplyr"</span>)</code></pre></div>
<p>First, let’s look at the structure of our original gapminder dataframe:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">str</span>(gapminder)</code></pre></div>
<pre class="output"><code>'data.frame': 1704 obs. of 6 variables:
$ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ pop : num 8425333 9240934 10267083 11537966 13079460 ...
$ continent: chr "Asia" "Asia" "Asia" "Asia" ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ gdpPercap: num 779 821 853 836 740 ...
</code></pre>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="challenge-1"><span class="glyphicon glyphicon-pencil"></span>Challenge 1</h2>
</div>
<div class="panel-body">
<p>Is gapminder a purely long, purely wide, or some intermediate format?</p>
</div>
</section>
<p>Sometimes, as with the gapminder dataset, we have multiple types of observed data. It is somewhere in between the purely ‘long’ and ‘wide’ data formats. We have 3 “ID variables” (<code>continent</code>, <code>country</code>, <code>year</code>) and 3 “Observation variables” (<code>pop</code>, <code>lifeExp</code>, <code>gdpPercap</code>). I usually prefer my data in this intermediate format, even though not ALL observations are in 1 column, because the 3 observation variables have different units. There are few operations that would need us to stretch out this dataframe any longer (i.e. 4 ID variables and 1 Observation variable).</p>
<p>Many of the functions in R are vector based, and you usually do not want to do mathematical operations on values with different units. For example, using the purely long format, a single mean for all of the values of population, life expectancy, and GDP would not be meaningful since it would return the mean of values with 3 incompatible units. The solution is that we first manipulate the data either by grouping (see the lesson on <code>dplyr</code>), or we change the structure of the dataframe. <strong>Note:</strong> Some plotting functions in R actually work better with wide format data.</p>
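<p>To illustrate the point about units, here is a small sketch with made-up values in the purely long format (the data frame <code>toy_long</code> and its columns are hypothetical, and <code>dplyr</code> is assumed to be installed). An overall mean mixes units, whereas grouping by observation type first gives one mean per variable:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library("dplyr")

# Hypothetical long-format data mixing two kinds of observations
toy_long <- data.frame(
  obs_type = c("lifeExp", "lifeExp", "pop", "pop"),
  obs_values = c(60, 70, 1000000, 3000000),
  stringsAsFactors = FALSE
)

# A single mean over all values mixes years of life with head counts
mean(toy_long$obs_values)

# Grouping first gives one mean per observation type
toy_means <- toy_long %>%
  group_by(obs_type) %>%
  summarize(means = mean(obs_values))
toy_means</code></pre></div>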
<h2 id="from-wide-to-long-format-with-gather">From wide to long format with gather()</h2>
<p>Until now, we’ve been using the nicely formatted original gapminder dataset, but ‘real’ data (i.e. our own research data) will rarely be so well organized. Here let’s start with the wide format version of the gapminder dataset.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">str</span>(gap_wide)</code></pre></div>
<pre class="output"><code>'data.frame': 142 obs. of 38 variables:
$ continent : chr "Africa" "Africa" "Africa" "Africa" ...
$ country : chr "Algeria" "Angola" "Benin" "Botswana" ...
$ gdpPercap_1952: num 2449 3521 1063 851 543 ...
$ gdpPercap_1957: num 3014 3828 960 918 617 ...
$ gdpPercap_1962: num 2551 4269 949 984 723 ...
$ gdpPercap_1967: num 3247 5523 1036 1215 795 ...
$ gdpPercap_1972: num 4183 5473 1086 2264 855 ...
$ gdpPercap_1977: num 4910 3009 1029 3215 743 ...
$ gdpPercap_1982: num 5745 2757 1278 4551 807 ...
$ gdpPercap_1987: num 5681 2430 1226 6206 912 ...
$ gdpPercap_1992: num 5023 2628 1191 7954 932 ...
$ gdpPercap_1997: num 4797 2277 1233 8647 946 ...
$ gdpPercap_2002: num 5288 2773 1373 11004 1038 ...
$ gdpPercap_2007: num 6223 4797 1441 12570 1217 ...
$ lifeExp_1952 : num 43.1 30 38.2 47.6 32 ...
$ lifeExp_1957 : num 45.7 32 40.4 49.6 34.9 ...
$ lifeExp_1962 : num 48.3 34 42.6 51.5 37.8 ...
$ lifeExp_1967 : num 51.4 36 44.9 53.3 40.7 ...
$ lifeExp_1972 : num 54.5 37.9 47 56 43.6 ...
$ lifeExp_1977 : num 58 39.5 49.2 59.3 46.1 ...
$ lifeExp_1982 : num 61.4 39.9 50.9 61.5 48.1 ...
$ lifeExp_1987 : num 65.8 39.9 52.3 63.6 49.6 ...
$ lifeExp_1992 : num 67.7 40.6 53.9 62.7 50.3 ...
$ lifeExp_1997 : num 69.2 41 54.8 52.6 50.3 ...
$ lifeExp_2002 : num 71 41 54.4 46.6 50.6 ...
$ lifeExp_2007 : num 72.3 42.7 56.7 50.7 52.3 ...
$ pop_1952 : num 9279525 4232095 1738315 442308 4469979 ...
$ pop_1957 : num 10270856 4561361 1925173 474639 4713416 ...
$ pop_1962 : num 11000948 4826015 2151895 512764 4919632 ...
$ pop_1967 : num 12760499 5247469 2427334 553541 5127935 ...
$ pop_1972 : num 14760787 5894858 2761407 619351 5433886 ...
$ pop_1977 : num 17152804 6162675 3168267 781472 5889574 ...
$ pop_1982 : num 20033753 7016384 3641603 970347 6634596 ...
$ pop_1987 : num 23254956 7874230 4243788 1151184 7586551 ...
$ pop_1992 : num 26298373 8735988 4981671 1342614 8878303 ...
$ pop_1997 : num 29072015 9875024 6066080 1536536 10352843 ...
$ pop_2002 : int 31287142 10866106 7026113 1630347 12251209 7021078 15929988 4048013 8835739 614382 ...
$ pop_2007 : int 33333216 12420476 8078314 1639131 14326203 8390505 17696293 4369038 10238807 710960 ...
</code></pre>
<div class="figure">
<img src="fig/14-tidyr-fig2.png" alt="" />
</div>
<p>The first step towards getting our nice intermediate data format is to convert from the wide to the long format. The <code>tidyr</code> function <code>gather()</code> will ‘gather’ your observation variables into a single variable.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">gap_long <-<span class="st"> </span>gap_wide %>%<span class="st"> </span><span class="kw">gather</span>(obstype_year,obs_values,<span class="kw">starts_with</span>(<span class="st">'pop'</span>),<span class="kw">starts_with</span>(<span class="st">'lifeExp'</span>),<span class="kw">starts_with</span>(<span class="st">'gdpPercap'</span>))
<span class="kw">str</span>(gap_long)</code></pre></div>
<pre class="output"><code>'data.frame': 5112 obs. of 4 variables:
$ continent : chr "Africa" "Africa" "Africa" "Africa" ...
$ country : chr "Algeria" "Angola" "Benin" "Botswana" ...
$ obstype_year: Factor w/ 36 levels "pop_1952","pop_1957",..: 1 1 1 1 1 1 1 1 1 1 ...
$ obs_values : num 9279525 4232095 1738315 442308 4469979 ...
</code></pre>
<p>Here we have used piping syntax, which is similar to what we were doing in the previous lesson with dplyr. In fact, these are compatible and you can use a mix of tidyr and dplyr functions by piping them together.</p>
<p>Inside <code>gather()</code> we first name the new column for the new ID variable (<code>obstype_year</code>), then the name for the new amalgamated observation variable (<code>obs_values</code>), then the names of the old observation variables. We could have typed out all the observation variables, but as in the <code>select()</code> function (see <code>dplyr</code> lesson), we can use the <code>starts_with()</code> helper to select all variables that start with the desired character string. <code>gather()</code> also allows the alternative syntax of using the <code>-</code> symbol to identify which variables are not to be gathered (i.e. ID variables).</p>
<div class="figure">
<img src="fig/14-tidyr-fig3.png" alt="" />
</div>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">gap_long <-<span class="st"> </span>gap_wide %>%<span class="st"> </span><span class="kw">gather</span>(obstype_year,obs_values,-continent,-country)
<span class="kw">str</span>(gap_long)</code></pre></div>
<pre class="output"><code>'data.frame': 5112 obs. of 4 variables:
$ continent : chr "Africa" "Africa" "Africa" "Africa" ...
$ country : chr "Algeria" "Angola" "Benin" "Botswana" ...
$ obstype_year: Factor w/ 36 levels "gdpPercap_1952",..: 1 1 1 1 1 1 1 1 1 1 ...
$ obs_values : num 2449 3521 1063 851 543 ...
</code></pre>
<p>That may seem trivial with this particular dataframe, but sometimes you have 1 ID variable and 40 Observation variables with irregular variable names. The flexibility is a huge time saver!</p>
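<p>A quick way to convince yourself that the two selection styles are interchangeable is to try both on a small dataframe. The sketch below uses made-up data (<code>toy_wide</code> and its columns are hypothetical) and assumes <code>tidyr</code> and <code>dplyr</code> are installed:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library("tidyr")
library("dplyr")

# Hypothetical wide data: one ID column and four observation columns
toy_wide <- data.frame(
  site = c("A", "B"),
  pop_2000 = c(100, 200),
  pop_2001 = c(110, 210),
  gdp_2000 = c(1.5, 2.5),
  gdp_2001 = c(1.6, 2.6),
  stringsAsFactors = FALSE
)

# Either name the columns to gather ...
long_by_name <- toy_wide %>%
  gather(obstype_year, obs_values, starts_with("pop"), starts_with("gdp"))

# ... or name the ID column that should be left alone
long_by_exclusion <- toy_wide %>%
  gather(obstype_year, obs_values, -site)

# Compare the two results
all.equal(long_by_name, long_by_exclusion)</code></pre></div>
<p>Here the columns are listed in the same order as they appear in the dataframe. If they were listed in a different order, the rows of the two results would be ordered differently, as in the two <code>gap_long</code> outputs above.</p>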
<p>Now <code>obstype_year</code> actually contains 2 pieces of information: the observation type (<code>pop</code>, <code>lifeExp</code>, or <code>gdpPercap</code>) and the <code>year</code>. We can use the <code>separate()</code> function to split the character strings into multiple variables.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">gap_long <-<span class="st"> </span>gap_long %>%<span class="st"> </span><span class="kw">separate</span>(obstype_year,<span class="dt">into=</span><span class="kw">c</span>(<span class="st">'obs_type'</span>,<span class="st">'year'</span>),<span class="dt">sep=</span><span class="st">"_"</span>)
gap_long$year <-<span class="st"> </span><span class="kw">as.integer</span>(gap_long$year)</code></pre></div>
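<p>As an aside, <code>separate()</code> also has a <code>convert</code> argument which, when set to <code>TRUE</code>, attempts to convert the new columns to a suitable type, so the year can come out as an integer in one step. A minimal sketch on made-up data (the dataframe <code>toy</code> is hypothetical):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library("tidyr")

# Hypothetical long data with a combined "type_year" label
toy <- data.frame(
  obstype_year = c("pop_1952", "pop_1957", "lifeExp_1952"),
  obs_values = c(100, 110, 40.5),
  stringsAsFactors = FALSE
)

# Split the label at the underscore and let separate() convert the types
toy_split <- separate(toy, obstype_year, into = c("obs_type", "year"),
                      sep = "_", convert = TRUE)
str(toy_split)</code></pre></div>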
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="challenge-2"><span class="glyphicon glyphicon-pencil"></span>Challenge 2</h2>
</div>
<div class="panel-body">
<p>Using <code>gap_long</code>, calculate the mean life expectancy, population, and gdpPercap for each continent. <strong>Hint:</strong> use the <code>group_by()</code> and <code>summarize()</code> functions we learned in the <code>dplyr</code> lesson.</p>
</div>
</section>
<h2 id="from-long-to-intermediate-format-with-spread">From long to intermediate format with spread()</h2>
<p>Now, just to double-check our work, let’s use the opposite of <code>gather()</code> to spread our observation variables back out with the aptly named <code>spread()</code>. We can spread <code>gap_long</code> either to the original intermediate format or to the widest format. Let’s start with the intermediate format.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">gap_normal <-<span class="st"> </span>gap_long %>%<span class="st"> </span><span class="kw">spread</span>(obs_type,obs_values)
<span class="kw">dim</span>(gap_normal)</code></pre></div>
<pre class="output"><code>[1] 1704 6
</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">dim</span>(gapminder)</code></pre></div>
<pre class="output"><code>[1] 1704 6
</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">names</span>(gap_normal)</code></pre></div>
<pre class="output"><code>[1] "continent" "country" "year" "gdpPercap" "lifeExp" "pop"
</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">names</span>(gapminder)</code></pre></div>
<pre class="output"><code>[1] "country" "year" "pop" "continent" "lifeExp" "gdpPercap"
</code></pre>
<p>Now we’ve got an intermediate dataframe <code>gap_normal</code> with the same dimensions as the original <code>gapminder</code>, but the order of the variables is different. Let’s fix that before checking if they are <code>all.equal()</code>.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">gap_normal <-<span class="st"> </span>gap_normal[,<span class="kw">names</span>(gapminder)]
<span class="kw">all.equal</span>(gap_normal,gapminder)</code></pre></div>
<pre class="output"><code>[1] "Component \"country\": 1704 string mismatches"
[2] "Component \"pop\": Mean relative difference: 1.634504"
[3] "Component \"continent\": 1212 string mismatches"
[4] "Component \"lifeExp\": Mean relative difference: 0.203822"
[5] "Component \"gdpPercap\": Mean relative difference: 1.162302"
</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">head</span>(gap_normal)</code></pre></div>
<pre class="output"><code> country year pop continent lifeExp gdpPercap
1 Algeria 1952 9279525 Africa 43.077 2449.008
2 Algeria 1957 10270856 Africa 45.685 3013.976
3 Algeria 1962 11000948 Africa 48.303 2550.817
4 Algeria 1967 12760499 Africa 51.407 3246.992
5 Algeria 1972 14760787 Africa 54.518 4182.664
6 Algeria 1977 17152804 Africa 58.014 4910.417
</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">head</span>(gapminder)</code></pre></div>
<pre class="output"><code> country year pop continent lifeExp gdpPercap
1 Afghanistan 1952 8425333 Asia 28.801 779.4453
2 Afghanistan 1957 9240934 Asia 30.332 820.8530
3 Afghanistan 1962 10267083 Asia 31.997 853.1007
4 Afghanistan 1967 11537966 Asia 34.020 836.1971
5 Afghanistan 1972 13079460 Asia 36.088 739.9811
6 Afghanistan 1977 14880372 Asia 38.438 786.1134
</code></pre>
<p>We’re almost there. The original was sorted by <code>country</code>, <code>continent</code>, then <code>year</code>, so let’s sort <code>gap_normal</code> the same way.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">gap_normal <-<span class="st"> </span>gap_normal %>%<span class="st"> </span><span class="kw">arrange</span>(country,continent,year)
<span class="kw">all.equal</span>(gap_normal,gapminder)</code></pre></div>
<pre class="output"><code>[1] TRUE
</code></pre>
<p>That’s great! We’ve gone from the longest format back to the intermediate and we didn’t introduce any errors in our code.</p>
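<p>The same round trip can be sketched on a small made-up dataframe (<code>toy</code> is hypothetical). Keep in mind that <code>all.equal()</code> compares the rows in the order they appear, which is why the sort order had to be restored above before the comparison succeeded:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library("tidyr")
library("dplyr")

# Hypothetical data: one ID column and two observation columns
toy <- data.frame(
  site = c("A", "B"),
  x = c(1, 2),
  y = c(3, 4),
  stringsAsFactors = FALSE
)

# Gather x and y into key/value pairs, then spread them back out
toy_back <- toy %>%
  gather(variable, value, -site) %>%
  spread(variable, value)

# Compare the result of the round trip with the starting point
all.equal(toy, toy_back)</code></pre></div>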
<p>Now let’s convert the long format all the way back to the wide. In the wide format, we will keep country and continent as ID variables and spread the observations across the 3 metrics (<code>pop</code>, <code>lifeExp</code>, <code>gdpPercap</code>) and time (<code>year</code>). First we need to create appropriate labels for all our new variables (time*metric combinations), and we also need to unify our ID variables to simplify the process of defining <code>gap_wide</code>.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">gap_temp <-<span class="st"> </span>gap_long %>%<span class="st"> </span><span class="kw">unite</span>(var_ID,continent,country,<span class="dt">sep=</span><span class="st">"_"</span>)
<span class="kw">str</span>(gap_temp)</code></pre></div>
<pre class="output"><code>'data.frame': 5112 obs. of 4 variables:
$ var_ID : chr "Africa_Algeria" "Africa_Angola" "Africa_Benin" "Africa_Botswana" ...
$ obs_type : chr "gdpPercap" "gdpPercap" "gdpPercap" "gdpPercap" ...
$ year : int 1952 1952 1952 1952 1952 1952 1952 1952 1952 1952 ...
$ obs_values: num 2449 3521 1063 851 543 ...
</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">gap_temp <-<span class="st"> </span>gap_long %>%
<span class="st"> </span><span class="kw">unite</span>(ID_var,continent,country,<span class="dt">sep=</span><span class="st">"_"</span>) %>%
<span class="st"> </span><span class="kw">unite</span>(var_names,obs_type,year,<span class="dt">sep=</span><span class="st">"_"</span>)
<span class="kw">str</span>(gap_temp)</code></pre></div>
<pre class="output"><code>'data.frame': 5112 obs. of 3 variables:
$ ID_var : chr "Africa_Algeria" "Africa_Angola" "Africa_Benin" "Africa_Botswana" ...
$ var_names : chr "gdpPercap_1952" "gdpPercap_1952" "gdpPercap_1952" "gdpPercap_1952" ...
$ obs_values: num 2449 3521 1063 851 543 ...
</code></pre>
<p>Using <code>unite()</code> we now have a single ID variable which is a combination of <code>continent</code> and <code>country</code>, and we have defined variable names. We’re now ready to pipe in <code>spread()</code>.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">gap_wide_new <-<span class="st"> </span>gap_long %>%<span class="st"> </span>
<span class="st"> </span><span class="kw">unite</span>(ID_var,continent,country,<span class="dt">sep=</span><span class="st">"_"</span>) %>%
<span class="st"> </span><span class="kw">unite</span>(var_names,obs_type,year,<span class="dt">sep=</span><span class="st">"_"</span>) %>%
<span class="st"> </span><span class="kw">spread</span>(var_names,obs_values)
<span class="kw">str</span>(gap_wide_new)</code></pre></div>
<pre class="output"><code>'data.frame': 142 obs. of 37 variables:
$ ID_var : chr "Africa_Algeria" "Africa_Angola" "Africa_Benin" "Africa_Botswana" ...
$ gdpPercap_1952: num 2449 3521 1063 851 543 ...
$ gdpPercap_1957: num 3014 3828 960 918 617 ...
$ gdpPercap_1962: num 2551 4269 949 984 723 ...
$ gdpPercap_1967: num 3247 5523 1036 1215 795 ...
$ gdpPercap_1972: num 4183 5473 1086 2264 855 ...
$ gdpPercap_1977: num 4910 3009 1029 3215 743 ...
$ gdpPercap_1982: num 5745 2757 1278 4551 807 ...
$ gdpPercap_1987: num 5681 2430 1226 6206 912 ...
$ gdpPercap_1992: num 5023 2628 1191 7954 932 ...
$ gdpPercap_1997: num 4797 2277 1233 8647 946 ...
$ gdpPercap_2002: num 5288 2773 1373 11004 1038 ...
$ gdpPercap_2007: num 6223 4797 1441 12570 1217 ...
$ lifeExp_1952 : num 43.1 30 38.2 47.6 32 ...
$ lifeExp_1957 : num 45.7 32 40.4 49.6 34.9 ...
$ lifeExp_1962 : num 48.3 34 42.6 51.5 37.8 ...
$ lifeExp_1967 : num 51.4 36 44.9 53.3 40.7 ...
$ lifeExp_1972 : num 54.5 37.9 47 56 43.6 ...
$ lifeExp_1977 : num 58 39.5 49.2 59.3 46.1 ...
$ lifeExp_1982 : num 61.4 39.9 50.9 61.5 48.1 ...
$ lifeExp_1987 : num 65.8 39.9 52.3 63.6 49.6 ...
$ lifeExp_1992 : num 67.7 40.6 53.9 62.7 50.3 ...
$ lifeExp_1997 : num 69.2 41 54.8 52.6 50.3 ...
$ lifeExp_2002 : num 71 41 54.4 46.6 50.6 ...
$ lifeExp_2007 : num 72.3 42.7 56.7 50.7 52.3 ...
$ pop_1952 : num 9279525 4232095 1738315 442308 4469979 ...
$ pop_1957 : num 10270856 4561361 1925173 474639 4713416 ...
$ pop_1962 : num 11000948 4826015 2151895 512764 4919632 ...
$ pop_1967 : num 12760499 5247469 2427334 553541 5127935 ...
$ pop_1972 : num 14760787 5894858 2761407 619351 5433886 ...
$ pop_1977 : num 17152804 6162675 3168267 781472 5889574 ...
$ pop_1982 : num 20033753 7016384 3641603 970347 6634596 ...
$ pop_1987 : num 23254956 7874230 4243788 1151184 7586551 ...
$ pop_1992 : num 26298373 8735988 4981671 1342614 8878303 ...
$ pop_1997 : num 29072015 9875024 6066080 1536536 10352843 ...
$ pop_2002 : num 31287142 10866106 7026113 1630347 12251209 ...
$ pop_2007 : num 33333216 12420476 8078314 1639131 14326203 ...
</code></pre>
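<p>To see what <code>unite()</code> and <code>spread()</code> are doing in this pipeline, it can help to run the same steps on a few rows of made-up data (<code>toy_long</code> is hypothetical):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library("tidyr")
library("dplyr")

# Hypothetical long data: two countries, one metric, two years
toy_long <- data.frame(
  country = c("A", "A", "B", "B"),
  obs_type = c("pop", "pop", "pop", "pop"),
  year = c(1952L, 1957L, 1952L, 1957L),
  obs_values = c(10, 11, 20, 22),
  stringsAsFactors = FALSE
)

toy_wide <- toy_long %>%
  unite(var_names, obs_type, year, sep = "_") %>%  # labels "pop_1952", "pop_1957"
  spread(var_names, obs_values)                    # one column per label

names(toy_wide)  # "country" "pop_1952" "pop_1957"</code></pre></div>
<p>Each distinct label produced by <code>unite()</code> becomes one column, and each remaining ID value becomes one row.</p>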
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="challenge-3"><span class="glyphicon glyphicon-pencil"></span>Challenge 3</h2>
</div>
<div class="panel-body">
<p>Take this 1 step further and create a <code>gap_ludicrously_wide</code> format dataframe by spreading over countries, year and the 3 metrics. <strong>Hint:</strong> this new dataframe should only have 5 rows.</p>
</div>
</section>
<p>Now we have a great ‘wide’ format dataframe, but the <code>ID_var</code> could be more usable. Let’s separate it into 2 variables with <code>separate()</code>:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">gap_wide_betterID <-<span class="st"> </span><span class="kw">separate</span>(gap_wide_new,ID_var,<span class="kw">c</span>(<span class="st">"continent"</span>,<span class="st">"country"</span>),<span class="dt">sep=</span><span class="st">"_"</span>)
gap_wide_betterID <-<span class="st"> </span>gap_long %>%<span class="st"> </span>
<span class="st"> </span><span class="kw">unite</span>(ID_var,continent,country,<span class="dt">sep=</span><span class="st">"_"</span>) %>%
<span class="st"> </span><span class="kw">unite</span>(var_names,obs_type,year,<span class="dt">sep=</span><span class="st">"_"</span>) %>%
<span class="st"> </span><span class="kw">spread</span>(var_names,obs_values) %>%
<span class="st"> </span><span class="kw">separate</span>(ID_var,<span class="kw">c</span>(<span class="st">"continent"</span>,<span class="st">"country"</span>),<span class="dt">sep=</span><span class="st">"_"</span>)
<span class="kw">str</span>(gap_wide_betterID)</code></pre></div>
<pre class="output"><code>'data.frame': 142 obs. of 38 variables:
$ continent : chr "Africa" "Africa" "Africa" "Africa" ...
$ country : chr "Algeria" "Angola" "Benin" "Botswana" ...
$ gdpPercap_1952: num 2449 3521 1063 851 543 ...
$ gdpPercap_1957: num 3014 3828 960 918 617 ...
$ gdpPercap_1962: num 2551 4269 949 984 723 ...
$ gdpPercap_1967: num 3247 5523 1036 1215 795 ...
$ gdpPercap_1972: num 4183 5473 1086 2264 855 ...
$ gdpPercap_1977: num 4910 3009 1029 3215 743 ...
$ gdpPercap_1982: num 5745 2757 1278 4551 807 ...
$ gdpPercap_1987: num 5681 2430 1226 6206 912 ...
$ gdpPercap_1992: num 5023 2628 1191 7954 932 ...
$ gdpPercap_1997: num 4797 2277 1233 8647 946 ...
$ gdpPercap_2002: num 5288 2773 1373 11004 1038 ...
$ gdpPercap_2007: num 6223 4797 1441 12570 1217 ...
$ lifeExp_1952 : num 43.1 30 38.2 47.6 32 ...
$ lifeExp_1957 : num 45.7 32 40.4 49.6 34.9 ...
$ lifeExp_1962 : num 48.3 34 42.6 51.5 37.8 ...
$ lifeExp_1967 : num 51.4 36 44.9 53.3 40.7 ...
$ lifeExp_1972 : num 54.5 37.9 47 56 43.6 ...
$ lifeExp_1977 : num 58 39.5 49.2 59.3 46.1 ...
$ lifeExp_1982 : num 61.4 39.9 50.9 61.5 48.1 ...
$ lifeExp_1987 : num 65.8 39.9 52.3 63.6 49.6 ...
$ lifeExp_1992 : num 67.7 40.6 53.9 62.7 50.3 ...
$ lifeExp_1997 : num 69.2 41 54.8 52.6 50.3 ...
$ lifeExp_2002 : num 71 41 54.4 46.6 50.6 ...
$ lifeExp_2007 : num 72.3 42.7 56.7 50.7 52.3 ...
$ pop_1952 : num 9279525 4232095 1738315 442308 4469979 ...
$ pop_1957 : num 10270856 4561361 1925173 474639 4713416 ...
$ pop_1962 : num 11000948 4826015 2151895 512764 4919632 ...
$ pop_1967 : num 12760499 5247469 2427334 553541 5127935 ...
$ pop_1972 : num 14760787 5894858 2761407 619351 5433886 ...
$ pop_1977 : num 17152804 6162675 3168267 781472 5889574 ...
$ pop_1982 : num 20033753 7016384 3641603 970347 6634596 ...
$ pop_1987 : num 23254956 7874230 4243788 1151184 7586551 ...
$ pop_1992 : num 26298373 8735988 4981671 1342614 8878303 ...
$ pop_1997 : num 29072015 9875024 6066080 1536536 10352843 ...
$ pop_2002 : num 31287142 10866106 7026113 1630347 12251209 ...
$ pop_2007 : num 33333216 12420476 8078314 1639131 14326203 ...
</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">all.equal</span>(gap_wide,gap_wide_betterID)</code></pre></div>
<pre class="output"><code>[1] TRUE
</code></pre>
<p>There and back again!</p>
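<p>Newer releases of <code>tidyr</code> (version 1.0.0 and later) also provide <code>pivot_longer()</code> and <code>pivot_wider()</code> as successors to <code>gather()</code> and <code>spread()</code>. The following is only a sketch on made-up data (<code>toy_wide</code> is hypothetical), assuming such a version is installed; it is not part of the workflow used in this lesson:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library("tidyr")

# Hypothetical wide data: 2 countries, 2 metrics, 2 years
toy_wide <- data.frame(
  country = c("A", "B"),
  pop_1952 = c(10, 20),
  pop_1957 = c(11, 22),
  lifeExp_1952 = c(40, 50),
  lifeExp_1957 = c(41, 52),
  stringsAsFactors = FALSE
)

# Wide to long: split each column name at "_" into obs_type and year
toy_long <- pivot_longer(
  toy_wide,
  cols = -country,
  names_to = c("obs_type", "year"),
  names_sep = "_",
  values_to = "obs_values"
)
toy_long$year <- as.integer(toy_long$year)

# Long to intermediate: one column per observation type
toy_intermediate <- pivot_wider(
  toy_long,
  names_from = obs_type,
  values_from = obs_values
)

dim(toy_long)          # 8 rows, 4 columns
dim(toy_intermediate)  # 4 rows, 4 columns</code></pre></div>
<p>Here <code>names_sep</code> does the job of the separate <code>separate()</code> step used earlier in this lesson.</p>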
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="solution-to-challenge-1"><span class="glyphicon glyphicon-pencil"></span>Solution to Challenge 1</h2>
</div>
<div class="panel-body">
<p>The original gapminder data.frame is in an intermediate format. It is not purely long since it has multiple observation variables (<code>pop</code>, <code>lifeExp</code>, <code>gdpPercap</code>).</p>
</div>
</section>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="solution-to-challenge-2"><span class="glyphicon glyphicon-pencil"></span>Solution to Challenge 2</h2>
</div>
<div class="panel-body">
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">gap_long %>%<span class="st"> </span><span class="kw">group_by</span>(continent,obs_type) %>%
<span class="st"> </span><span class="kw">summarize</span>(<span class="dt">means=</span><span class="kw">mean</span>(obs_values))</code></pre></div>
<pre class="output"><code>Source: local data frame [15 x 3]
Groups: continent [?]
continent obs_type means
(chr) (chr) (dbl)
1 Africa gdpPercap 2.193755e+03
2 Africa lifeExp 4.886533e+01
3 Africa pop 9.916003e+06
4 Americas gdpPercap 7.136110e+03
5 Americas lifeExp 6.465874e+01
6 Americas pop 2.450479e+07
7 Asia gdpPercap 7.902150e+03
8 Asia lifeExp 6.006490e+01
9 Asia pop 7.703872e+07
10 Europe gdpPercap 1.446948e+04
11 Europe lifeExp 7.190369e+01
12 Europe pop 1.716976e+07
13 Oceania gdpPercap 1.862161e+04
14 Oceania lifeExp 7.432621e+01
15 Oceania pop 8.874672e+06
</code></pre>
</div>
</section>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="solution-to-challenge-3"><span class="glyphicon glyphicon-pencil"></span>Solution to Challenge 3</h2>
</div>
<div class="panel-body">
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">gap_ludicrously_wide <-<span class="st"> </span>gap_long %>%<span class="st"> </span>
<span class="st"> </span><span class="kw">unite</span>(var_names,obs_type,year,country,<span class="dt">sep=</span><span class="st">"_"</span>) %>%
<span class="st"> </span><span class="kw">spread</span>(var_names,obs_values)</code></pre></div>
</div>
</section>
<h2 id="other-great-resources">Other great resources</h2>
<p><a href="https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf">Data Wrangling Cheat sheet</a> <a href="https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html">Introduction to tidyr</a></p>
</div>
</div>
</article>
<div class="footer">
<a class="label swc-blue-bg" href="http://software-carpentry.org">Software Carpentry</a>
<a class="label swc-blue-bg" href="https://github.com/swcarpentry/lesson-template">Source</a>
<a class="label swc-blue-bg" href="mailto:[email protected]">Contact</a>
<a class="label swc-blue-bg" href="LICENSE.html">License</a>
</div>
</div>
<!-- Javascript placed at the end of the document so the pages load faster -->
<script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
<script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
</body>
</html>