<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<title>Software Carpentry: R for reproducible scientific analysis</title>
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
<link rel="stylesheet" type="text/css" href="css/swc.css" />
<link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
<meta charset="UTF-8" />
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body class="lesson">
<div class="container card">
<div class="banner">
<a href="http://software-carpentry.org" title="Software Carpentry">
<img alt="Software Carpentry banner" src="img/software-carpentry-banner.png" />
</a>
</div>
<article>
<div class="row">
<div class="col-md-10 col-md-offset-1">
<a href="index.html"><h1 class="title">R for reproducible scientific analysis</h1></a>
<h2 class="subtitle">Split-apply-combine</h2>
<section class="objectives panel panel-warning">
<div class="panel-heading">
<h2 id="learning-objectives"><span class="glyphicon glyphicon-certificate"></span>Learning Objectives</h2>
</div>
<div class="panel-body">
<ul>
<li>To be able to use the split-apply-combine strategy for data analysis</li>
</ul>
</div>
</section>
<p>Previously we looked at how you can use functions to simplify your code. We defined the <code>calcGDP</code> function, which takes the gapminder dataset and multiplies the population and GDP per capita columns. We also defined additional arguments so we could filter by <code>year</code> and <code>country</code>:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Takes a dataset and multiplies the population column</span>
<span class="co"># with the GDP per capita column.</span>
calcGDP <-<span class="st"> </span>function(dat, <span class="dt">year=</span><span class="ot">NULL</span>, <span class="dt">country=</span><span class="ot">NULL</span>) {
  if (!<span class="kw">is.null</span>(year)) {
    dat <-<span class="st"> </span>dat[dat$year %in%<span class="st"> </span>year, ]
  }
  if (!<span class="kw">is.null</span>(country)) {
    dat <-<span class="st"> </span>dat[dat$country %in%<span class="st"> </span>country, ]
  }
  gdp <-<span class="st"> </span>dat$pop *<span class="st"> </span>dat$gdpPercap
  new <-<span class="st"> </span><span class="kw">cbind</span>(dat, <span class="dt">gdp=</span>gdp)
  <span class="kw">return</span>(new)
}</code></pre></div>
<p>A common task you’ll encounter when working with data is running calculations on different groups within the data. In the above, we were simply calculating the GDP by multiplying two columns together. But what if we wanted to calculate the mean GDP per continent?</p>
<p>We could run <code>calcGDP</code> and then take the mean of each continent:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">withGDP <-<span class="st"> </span><span class="kw">calcGDP</span>(gapminder)
<span class="kw">mean</span>(withGDP[withGDP$continent ==<span class="st"> "Africa"</span>, <span class="st">"gdp"</span>])</code></pre></div>
<pre class="output"><code>[1] 20904782844
</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">mean</span>(withGDP[withGDP$continent ==<span class="st"> "Americas"</span>, <span class="st">"gdp"</span>])</code></pre></div>
<pre class="output"><code>[1] 379262350210
</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">mean</span>(withGDP[withGDP$continent ==<span class="st"> "Asia"</span>, <span class="st">"gdp"</span>])</code></pre></div>
<pre class="output"><code>[1] 227233738153
</code></pre>
<p>But this isn’t very <em>nice</em>. Yes, by using a function, you have reduced a substantial amount of repetition. That <strong>is</strong> nice. But there is still repetition. Repeating yourself will cost you time, both now and later, and potentially introduce some nasty bugs.</p>
<p>We could write a new function that is flexible like <code>calcGDP</code>, but this also takes a substantial amount of effort and testing to get right.</p>
<p>The abstract problem we’re encountering here is known as “split-apply-combine”:</p>
<div class="figure">
<img src="fig/splitapply.png" alt="Split apply combine" />
<p class="caption">Split apply combine</p>
</div>
<p>We want to <em>split</em> our data into groups, in this case continents, <em>apply</em> some calculations on that group, then optionally <em>combine</em> the results together afterwards.</p>
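<p>To make the three steps concrete, here is a minimal sketch that performs them by hand with base R. The <code>toy</code> data frame and its values are made up for illustration and are not part of the gapminder data:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia", "Europe"),
  gdp = c(10, 30, 100, 300, 50),
  stringsAsFactors = FALSE
)

# Split: one vector of gdp values per continent
pieces <- split(toy$gdp, toy$continent)

# Apply: compute the mean of each piece
means <- lapply(pieces, mean)

# Combine: gather the per-group results into a single data.frame
result <- data.frame(
  continent = names(means),
  meanGDP = unlist(means, use.names = FALSE)
)
print(result)</code></pre></div>
<p>Each step is a separate line of bookkeeping here, which is the kind of repetition a dedicated split-apply-combine tool can take off your hands.</p>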
<h2 id="the-plyr-package">The <code>plyr</code> package</h2>
<p>For those of you who have used R before, you might be familiar with the <code>apply</code> family of functions. While R’s built-in functions do work, we’re going to introduce you to another method for solving the “split-apply-combine” problem. The <a href="http://had.co.nz/plyr/">plyr</a> package provides a set of functions that we find more user-friendly for solving this problem.</p>
<p>We installed this package in an earlier challenge. Let’s load it now:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(plyr)</code></pre></div>
<p>Plyr has functions for operating on <code>lists</code>, <code>data.frames</code> and <code>arrays</code> (matrices, or n-dimensional vectors). Each function performs:</p>
<ol style="list-style-type: decimal">
<li><strong>Split</strong> the input data into pieces.</li>
<li><strong>Apply</strong> a function to each piece in turn.</li>
<li>Re<strong>combine</strong> the output into a single data object.</li>
</ol>
<p>The functions are named based on the data structure they expect as input, and the data structure you want returned as output: [a]rray, [l]ist, or [d]ata.frame. The first letter corresponds to the input data structure, the second letter to the output data structure, and then the rest of the function is named “ply”.</p>
<p>This gives us 9 core functions **ply. There are an additional three functions which only perform the split and apply steps, without any combine step. They’re named by their input data type and represent null output by a <code>_</code> (see table).</p>
<p>Note that plyr’s use of “array” is different from R’s: an array in plyr can include a vector or matrix.</p>
<div class="figure">
<img src="fig/full_apply_suite.png" alt="Full apply suite" />
<p class="caption">Full apply suite</p>
</div>
<p>Each of the xxply functions (<code>daply</code>, <code>ddply</code>, <code>llply</code>, <code>laply</code>, …) follows the same basic pattern, with four key features:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">xxply</span>(.data, .variables, .fun)</code></pre></div>
<ul>
<li>The first letter of the function name gives the input type and the second gives the output type.</li>
<li>.data - gives the data object to be processed</li>
<li>.variables - identifies the splitting variables</li>
<li>.fun - gives the function to be called on each piece</li>
</ul>
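<p>As a small sketch of the naming convention, the three calls below take the same list as input and differ only in the second letter of the function name. The <code>scores</code> list is made up for illustration. Note that the list-input functions are called without a <code>.variables</code> argument, because each list element already is one piece:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(plyr)

# Toy list (hypothetical values, for illustration only)
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 100)

# l -> l: list in, list out
asList <- llply(.data = scores, .fun = mean)

# l -> a: list in, array out (a plain vector when each result is a single number)
asArray <- laply(.data = scores, .fun = mean)

# l -> d: list in, data.frame out (one row per list element)
asDataFrame <- ldply(.data = scores, .fun = mean)

print(asList)
print(asArray)
print(asDataFrame)</code></pre></div>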
<p>Now we can quickly calculate the mean GDP per continent:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ddply</span>(
<span class="dt">.data =</span> <span class="kw">calcGDP</span>(gapminder),
<span class="dt">.variables =</span> <span class="st">"continent"</span>,
<span class="dt">.fun =</span> function(x) <span class="kw">mean</span>(x$gdp)
)</code></pre></div>
<pre class="output"><code> continent V1
1 Africa 20904782844
2 Americas 379262350210
3 Asia 227233738153
4 Europe 269442085301
5 Oceania 188187105354
</code></pre>
<p>Let’s walk through what just happened:</p>
<ul>
<li>The <code>ddply</code> function takes a <code>data.frame</code> as input (the function name starts with <strong>d</strong>) and returns another <code>data.frame</code> (the second letter is a <strong>d</strong>).</li>
<li>The first argument we gave was the data.frame we wanted to operate on: in this case the gapminder data. We called <code>calcGDP</code> on it first so that it would have the additional <code>gdp</code> column added to it.</li>
<li>The second argument indicated our split criteria: in this case the “continent” column. Note that we just gave the name of the column, not the actual column itself like we’ve done previously with subsetting. Plyr takes care of these implementation details for you.</li>
<li>The third argument is the function we want to apply to each grouping of the data. We had to define our own short function here: each subset of the data gets stored in <code>x</code>, the first argument of our function. This is an anonymous function: we haven’t defined it elsewhere, and it has no name. It only exists in the scope of our call to <code>ddply</code>.</li>
</ul>
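<p>The function does not have to be anonymous. As a sketch, the example below passes a named function instead, and has it return a one-row <code>data.frame</code> so that the result column gets a descriptive name rather than <code>V1</code>. The <code>toy</code> data and the name <code>summariseGDP</code> are hypothetical and used only for illustration:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(plyr)

# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  gdp = c(10, 30, 100, 300)
)

# A named function: receives one group of rows, returns a one-row data.frame
summariseGDP <- function(x) {
  data.frame(meanGDP = mean(x$gdp))
}

res <- ddply(.data = toy, .variables = "continent", .fun = summariseGDP)
print(res)</code></pre></div>
<p>A named function is worth the extra lines when the same calculation is reused in several calls or needs its own tests.</p>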
<p>What if we want a different type of output data structure?</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">dlply</span>(
<span class="dt">.data =</span> <span class="kw">calcGDP</span>(gapminder),
<span class="dt">.variables =</span> <span class="st">"continent"</span>,
<span class="dt">.fun =</span> function(x) <span class="kw">mean</span>(x$gdp)
)</code></pre></div>
<pre class="output"><code>$Africa
[1] 20904782844
$Americas
[1] 379262350210
$Asia
[1] 227233738153
$Europe
[1] 269442085301
$Oceania
[1] 188187105354
attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
continent
1 Africa
2 Americas
3 Asia
4 Europe
5 Oceania
</code></pre>
<p>We called the same function again, but changed the second letter to an <code>l</code>, so the output was returned as a list.</p>
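<p>A list result is convenient when you want to pull out one group by name. A minimal sketch, again on made-up <code>toy</code> data rather than gapminder:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(plyr)

# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  gdp = c(10, 30, 100, 300)
)

byContinent <- dlply(
  .data = toy,
  .variables = "continent",
  .fun = function(x) mean(x$gdp)
)

# List elements are named after the groups, so either form works
print(byContinent$Africa)
print(byContinent[["Asia"]])</code></pre></div>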
<p>We can specify multiple columns to group by:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ddply</span>(
<span class="dt">.data =</span> <span class="kw">calcGDP</span>(gapminder),
<span class="dt">.variables =</span> <span class="kw">c</span>(<span class="st">"continent"</span>, <span class="st">"year"</span>),
<span class="dt">.fun =</span> function(x) <span class="kw">mean</span>(x$gdp)
)</code></pre></div>
<pre class="output"><code> continent year V1
1 Africa 1952 5992294608
2 Africa 1957 7359188796
3 Africa 1962 8784876958
4 Africa 1967 11443994101
5 Africa 1972 15072241974
6 Africa 1977 18694898732
7 Africa 1982 22040401045
8 Africa 1987 24107264108
9 Africa 1992 26256977719
10 Africa 1997 30023173824
11 Africa 2002 35303511424
12 Africa 2007 45778570846
13 Americas 1952 117738997171
14 Americas 1957 140817061264
15 Americas 1962 169153069442
16 Americas 1967 217867530844
17 Americas 1972 268159178814
18 Americas 1977 324085389022
19 Americas 1982 363314008350
20 Americas 1987 439447790357
21 Americas 1992 489899820623
22 Americas 1997 582693307146
23 Americas 2002 661248623419
24 Americas 2007 776723426068
25 Asia 1952 34095762661
26 Asia 1957 47267432088
27 Asia 1962 60136869012
28 Asia 1967 84648519224
29 Asia 1972 124385747313
30 Asia 1977 159802590186
31 Asia 1982 194429049919
32 Asia 1987 241784763369
33 Asia 1992 307100497486
34 Asia 1997 387597655323
35 Asia 2002 458042336179
36 Asia 2007 627513635079
37 Europe 1952 84971341466
38 Europe 1957 109989505140
39 Europe 1962 138984693095
40 Europe 1967 173366641137
41 Europe 1972 218691462733
42 Europe 1977 255367522034
43 Europe 1982 279484077072
44 Europe 1987 316507473546
45 Europe 1992 342703247405
46 Europe 1997 383606933833
47 Europe 2002 436448815097
48 Europe 2007 493183311052
49 Oceania 1952 54157223944
50 Oceania 1957 66826828013
51 Oceania 1962 82336453245
52 Oceania 1967 105958863585
53 Oceania 1972 134112109227
54 Oceania 1977 154707711162
55 Oceania 1982 176177151380
56 Oceania 1987 209451563998
57 Oceania 1992 236319179826
58 Oceania 1997 289304255183
59 Oceania 2002 345236880176
60 Oceania 2007 403657044512
</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">daply</span>(
<span class="dt">.data =</span> <span class="kw">calcGDP</span>(gapminder),
<span class="dt">.variables =</span> <span class="kw">c</span>(<span class="st">"continent"</span>, <span class="st">"year"</span>),
<span class="dt">.fun =</span> function(x) <span class="kw">mean</span>(x$gdp)
)</code></pre></div>
<pre class="output"><code> year
continent 1952 1957 1962 1967
Africa 5992294608 7359188796 8784876958 11443994101
Americas 117738997171 140817061264 169153069442 217867530844
Asia 34095762661 47267432088 60136869012 84648519224
Europe 84971341466 109989505140 138984693095 173366641137
Oceania 54157223944 66826828013 82336453245 105958863585
year
continent 1972 1977 1982 1987
Africa 15072241974 18694898732 22040401045 24107264108
Americas 268159178814 324085389022 363314008350 439447790357
Asia 124385747313 159802590186 194429049919 241784763369
Europe 218691462733 255367522034 279484077072 316507473546
Oceania 134112109227 154707711162 176177151380 209451563998
year
continent 1992 1997 2002 2007
Africa 26256977719 30023173824 35303511424 45778570846
Americas 489899820623 582693307146 661248623419 776723426068
Asia 307100497486 387597655323 458042336179 627513635079
Europe 342703247405 383606933833 436448815097 493183311052
Oceania 236319179826 289304255183 345236880176 403657044512
</code></pre>
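<p>The two calls above compute the same means in different shapes: <code>ddply</code> returns one row per continent and year, while <code>daply</code> returns an array with one dimension per grouping variable. The sketch below, on made-up <code>toy</code> data, shows how the array form can be indexed by group labels:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(plyr)

# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  year = c(1952, 2007, 1952, 2007),
  gdp = c(10, 30, 100, 300)
)

gdpArray <- daply(
  .data = toy,
  .variables = c("continent", "year"),
  .fun = function(x) mean(x$gdp)
)

# Rows are continents and columns are years; index by label (as character)
print(gdpArray["Africa", "2007"])

# All years for a single continent
print(gdpArray["Asia", ])</code></pre></div>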
<p>You can use these functions in place of <code>for</code> loops (and it’s usually faster to do so): just write the body of the <code>for</code> loop in the anonymous function:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">d_ply</span>(
  <span class="dt">.data =</span> gapminder,
  <span class="dt">.variables =</span> <span class="st">"continent"</span>,
  <span class="dt">.fun =</span> function(x) {
    meanGDPperCap <-<span class="st"> </span><span class="kw">mean</span>(x$gdpPercap)
    <span class="kw">print</span>(<span class="kw">paste</span>(
      <span class="st">"The mean GDP per capita for"</span>, <span class="kw">unique</span>(x$continent),
      <span class="st">"is"</span>, <span class="kw">format</span>(meanGDPperCap, <span class="dt">big.mark=</span><span class="st">","</span>)
    ))
  }
)</code></pre></div>
<pre class="output"><code>[1] "The mean GDP per capita for Africa is 2,193.755"
[1] "The mean GDP per capita for Americas is 7,136.11"
[1] "The mean GDP per capita for Asia is 7,902.15"
[1] "The mean GDP per capita for Europe is 14,469.48"
[1] "The mean GDP per capita for Oceania is 18,621.61"
</code></pre>
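<p>For comparison, here is a sketch of the kind of <code>for</code> loop that the <code>d_ply</code> call replaces. It uses a made-up <code>toy</code> data frame so that it runs on its own. The loop has to do the splitting by hand, whereas <code>d_ply</code> hands each group to the function for you:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  gdpPercap = c(1000, 3000, 5000, 7000),
  stringsAsFactors = FALSE
)

for (cont in unique(toy$continent)) {
  # Split: select the rows belonging to this continent
  x <- toy[toy$continent == cont, ]
  # Apply: the same body as the anonymous function passed to d_ply
  meanGDPperCap <- mean(x$gdpPercap)
  print(paste(
    "The mean GDP per capita for", cont,
    "is", format(meanGDPperCap, big.mark = ",")
  ))
}</code></pre></div>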
<aside class="callout panel panel-info">
<div class="panel-heading">
<h2 id="tip-printing-numbers"><span class="glyphicon glyphicon-pushpin"></span>Tip: printing numbers</h2>
</div>
<div class="panel-body">
<p>The <code>format</code> function can be used to make numeric values “pretty” for printing out in messages.</p>
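<p>A few illustrative calls, using made-up numbers: <code>big.mark</code> sets the thousands separator and <code>nsmall</code> sets the minimum number of digits after the decimal point.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Thousands separator
withCommas <- format(1234567, big.mark = ",")

# At least two digits after the decimal point
twoDecimals <- format(1234.5, nsmall = 2)

# Both options combined
both <- format(1234.5, big.mark = ",", nsmall = 2)

print(c(withCommas, twoDecimals, both))</code></pre></div>
<p>Note that <code>format</code> returns character strings, so format values only at the point of printing and keep the numeric version for further calculations.</p>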
</div>
</aside>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="challenge-1"><span class="glyphicon glyphicon-pencil"></span>Challenge 1</h2>
</div>
<div class="panel-body">
<p>Calculate the average life expectancy per continent. Which has the longest? Which has the shortest?</p>
</div>
</section>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="challenge-2"><span class="glyphicon glyphicon-pencil"></span>Challenge 2</h2>
</div>
<div class="panel-body">
<p>Calculate the average life expectancy per continent and year. Which had the longest and shortest in 2007? Which had the greatest change between 1952 and 2007?</p>
</div>
</section>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="advanced-challenge"><span class="glyphicon glyphicon-pencil"></span>Advanced Challenge</h2>
</div>
<div class="panel-body">
<p>Calculate the difference in mean life expectancy between the years 1952 and 2007 from the output of challenge 2 using one of the <code>plyr</code> functions.</p>
</div>
</section>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="alternate-challenge-if-class-seems-lost"><span class="glyphicon glyphicon-pencil"></span>Alternate Challenge if class seems lost</h2>
</div>
<div class="panel-body">
<p>Without running them, which of the following will calculate the average life expectancy per continent?</p>
<ol style="list-style-type: decimal">
<li></li>
</ol>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ddply</span>(
<span class="dt">.data =</span> gapminder,
<span class="dt">.variables =</span> gapminder$continent,
<span class="dt">.fun =</span> function(dataGroup) {
<span class="kw">mean</span>(dataGroup$lifeExp)
}
)</code></pre></div>
<ol start="2" style="list-style-type: decimal">
<li></li>
</ol>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ddply</span>(
<span class="dt">.data =</span> gapminder,
<span class="dt">.variables =</span> <span class="st">"continent"</span>,
<span class="dt">.fun =</span> <span class="kw">mean</span>(dataGroup$lifeExp)
)</code></pre></div>
<ol start="3" style="list-style-type: decimal">
<li></li>
</ol>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ddply</span>(
<span class="dt">.data =</span> gapminder,
<span class="dt">.variables =</span> <span class="st">"continent"</span>,
<span class="dt">.fun =</span> function(dataGroup) {
<span class="kw">mean</span>(dataGroup$lifeExp)
}
)</code></pre></div>
<ol start="4" style="list-style-type: decimal">
<li></li>
</ol>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">adply</span>(
<span class="dt">.data =</span> gapminder,
<span class="dt">.variables =</span> <span class="st">"continent"</span>,
<span class="dt">.fun =</span> function(dataGroup) {
<span class="kw">mean</span>(dataGroup$lifeExp)
}
)</code></pre></div>
</div>
</section>
</div>
</div>
</article>
<div class="footer">
<a class="label swc-blue-bg" href="http://software-carpentry.org">Software Carpentry</a>
<a class="label swc-blue-bg" href="https://github.com/swcarpentry/lesson-template">Source</a>
<a class="label swc-blue-bg" href="mailto:[email protected]">Contact</a>
<a class="label swc-blue-bg" href="LICENSE.html">License</a>
</div>
</div>
<!-- Javascript placed at the end of the document so the pages load faster -->
<script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
<script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
</body>
</html>