```{r echo = FALSE, message = FALSE}
# run setup script
source("_common.R")
library(forcats)
```
# Visualizing amounts {#visualizing-amounts}
In many scenarios, we are interested in the magnitude of some set of numbers. For example, we might want to visualize the total sales volume of different brands of cars, or the total number of people living in different cities, or the age of Olympians performing different sports. In all these cases, we have a set of categories (e.g., brands of cars, cities, or sports) and a quantitative value for each category. I refer to these cases as visualizing amounts, because the main emphasis in these visualizations will be on the magnitude of the quantitative values. The standard visualization in this scenario is the bar plot, which comes in several variations, including simple bars as well as grouped and stacked bars. Alternatives to the bar plot are the dot plot and the heatmap.
## Bar plots
To motivate the concept of a bar plot, consider the total ticket sales for the most popular movies on a given weekend. Table \@ref(tab:boxoffice-gross) shows the top-five weekend gross ticket sales on the Christmas weekend of 2017. The movie "Star Wars: The Last Jedi" was by far the most popular movie on that weekend, outselling the fourth- and fifth-ranked movies "The Greatest Showman" and "Ferdinand" by almost a factor of 10.
```{r boxoffice-gross}
# source: Box Office Mojo
# URL: http://www.boxofficemojo.com/weekend/chart/?view=&yr=2017&wknd=51&p=.htm
# downloaded: 2018-02-11
boxoffice <- data.frame(
  rank = 1:5,
  title = c("Star Wars: The Last Jedi", "Jumanji: Welcome to the Jungle", "Pitch Perfect 3", "The Greatest Showman", "Ferdinand"),
  title_short = c("Star Wars", "Jumanji", "Pitch Perfect 3", "Greatest Showman", "Ferdinand"),
  amount = c(71565498, 36169328, 19928525, 8805843, 7316746),
  amount_text = c("$71,565,498", "$36,169,328", "$19,928,525", "$8,805,843", "$7,316,746")
)
# the three empty spacer columns get distinct names (one, two, and three spaces),
# because rename() requires unique column names
boxoffice_display <- boxoffice %>%
  mutate(A = " ", B = " ", C = " ") %>%
  select(A, rank, title, amount_text, B, C) %>%
  rename(
    ` ` = A,
    Rank = rank,
    Title = title,
    `Weekend gross` = amount_text,
    `  ` = B,
    `   ` = C
  )
knitr::kable(
  boxoffice_display,
  caption = 'Highest grossing movies for the weekend of December 22-24, 2017. Data source: Box Office Mojo (http://www.boxofficemojo.com/). Used with permission',
  booktabs = TRUE,
  row.names = FALSE,
  align = c('c', 'c', 'l', 'r', 'c', 'c')#,
  #format = "html",
  #table.attr = "style = \"width: 75%\""
)
```
This kind of data is commonly visualized with vertical bars. For each movie, we draw a bar that starts at zero and extends all the way to the dollar value for that movie's weekend gross (Figure \@ref(fig:boxoffice-vertical)). This visualization is called a *bar plot* or *bar chart*.
(ref:boxoffice-vertical) Highest grossing movies for the weekend of December 22-24, 2017, displayed as a bar plot. Data source: Box Office Mojo (http://www.boxofficemojo.com/). Used with permission
```{r boxoffice-vertical, fig.width = 5*6/4.2, fig.asp = .5, fig.cap = '(ref:boxoffice-vertical)'}
boxoffice %>%
ggplot(aes(x = fct_reorder(title_short, rank), y = amount)) +
geom_col(fill = "#56B4E9", width = 0.6, alpha = 0.9) +
scale_y_continuous(expand = c(0, 0),
breaks = c(0, 2e7, 4e7, 6e7),
labels = c("0", "20", "40", "60"),
name = "weekend gross (million USD)") +
scale_x_discrete(name = NULL,
expand = c(0, 0.4)) +
coord_cartesian(clip = "off") +
theme_dviz_hgrid(12, rel_small = 1) +
theme(
#axis.ticks.length = grid::unit(0, "pt"),
axis.line.x = element_blank(),
axis.ticks.x = element_blank()
)
```
One problem we commonly encounter with vertical bars is that the labels identifying each bar take up a lot of horizontal space. In fact, I had to make Figure \@ref(fig:boxoffice-vertical) fairly wide and space out the bars so that I could place the movie titles underneath. To save horizontal space, we could place the bars closer together and rotate the labels (Figure \@ref(fig:boxoffice-rot-axis-tick-labels)). However, I am not a big proponent of rotated labels. I find the resulting plots awkward and difficult to read. And, in my experience, whenever the labels are too long to place horizontally they also don't look good rotated.
(ref:boxoffice-rot-axis-tick-labels) Highest grossing movies for the weekend of December 22-24, 2017, displayed as a bar plot with rotated axis tick labels. Rotated axis tick labels tend to be difficult to read and require awkward space use underneath the plot. For these reasons, I generally consider plots with rotated tick labels to be ugly. Data source: Box Office Mojo (http://www.boxofficemojo.com/). Used with permission
```{r boxoffice-rot-axis-tick-labels, fig.asp = 0.85, fig.cap = '(ref:boxoffice-rot-axis-tick-labels)'}
boxoffice %>%
ggplot(aes(x = fct_reorder(title_short, rank), y = amount)) +
geom_col(fill = "#56B4E9", alpha = 0.9) +
scale_y_continuous(expand = c(0, 0),
breaks = c(0, 2e7, 4e7, 6e7),
labels = c("0", "20", "40", "60"),
name = "weekend gross (million USD)") +
scale_x_discrete(name = NULL) +
coord_cartesian(clip = "off") +
theme_dviz_hgrid(rel_small = 1) +
theme(
#axis.ticks.length = grid::unit(0, "pt"),
axis.line.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
plot.margin = margin(3, 7, 3, 1.5)
) -> p_box_axrot
stamp_ugly(p_box_axrot)
```
The better solution for long labels is usually to swap the *x* and the *y* axis, so that the bars run horizontally (Figure \@ref(fig:boxoffice-horizontal)). After swapping the axes, we obtain a compact figure in which all visual elements, including all text, are horizontally oriented. As a result, the figure is much easier to read than Figure \@ref(fig:boxoffice-rot-axis-tick-labels) or even Figure \@ref(fig:boxoffice-vertical).
(ref:boxoffice-horizontal) Highest grossing movies for the weekend of December 22-24, 2017, displayed as a horizontal bar plot. Data source: Box Office Mojo (http://www.boxofficemojo.com/). Used with permission
```{r boxoffice-horizontal, fig.cap = '(ref:boxoffice-horizontal)'}
ggplot(boxoffice, aes(x = fct_reorder(title_short, desc(rank)), y = amount)) +
geom_col(fill = "#56B4E9", alpha = 0.9) +
scale_y_continuous(limits = c(0, 7.5e7),
expand = c(0, 0),
breaks = c(0, 2e7, 4e7, 6e7),
labels = c("0", "20", "40", "60"),
name = "weekend gross (million USD)") +
scale_x_discrete(name = NULL,
expand = c(0, 0.5)) +
coord_flip(clip = "off") +
theme_dviz_vgrid(rel_small = 1) +
theme(
#axis.ticks.length = grid::unit(0, "pt"),
axis.line.y = element_blank(),
axis.ticks.y = element_blank()
)
```
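With ggplot2 3.3.0 or later, horizontal bars can also be drawn without `coord_flip()`, by mapping the categories directly onto the *y* axis. The following is a minimal sketch of that alternative, not the code used for Figure \@ref(fig:boxoffice-horizontal). It rebuilds the box-office values so that it runs on its own, and the names `boxoffice_toy` and `p_horizontal` are chosen only for illustration.

```r
library(ggplot2)
library(forcats)

# same values as in the box-office table; `boxoffice_toy` is an illustrative name
boxoffice_toy <- data.frame(
  title_short = c("Star Wars", "Jumanji", "Pitch Perfect 3", "Greatest Showman", "Ferdinand"),
  amount = c(71565498, 36169328, 19928525, 8805843, 7316746)
)

# order the titles by amount; the last level ends up at the top of the y axis
boxoffice_toy$title_short <- fct_reorder(boxoffice_toy$title_short, boxoffice_toy$amount)

# categories on y, amounts on x: geom_col() then draws horizontal bars
p_horizontal <- ggplot(boxoffice_toy, aes(x = amount / 1e6, y = title_short)) +
  geom_col(fill = "#56B4E9") +
  labs(x = "weekend gross (million USD)", y = NULL)

p_horizontal
```

Whether to flip the coordinate system or to swap the aesthetics is largely a matter of preference; swapping the aesthetics keeps the axis names in the code aligned with what appears on screen.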
Regardless of whether we place bars vertically or horizontally, we need to pay attention to the order in which the bars are arranged. I often see bar plots where the bars are arranged arbitrarily or by some criterion that is not meaningful in the context of the figure. Some plotting programs arrange bars by default in alphabetic order of the labels, and other, similarly arbitrary arrangements are possible (Figure \@ref(fig:boxoffice-horizontal-bad-order)). In general, the resulting figures are more confusing and less intuitive than figures where bars are arranged in order of their size.
(ref:boxoffice-horizontal-bad-order) Highest grossing movies for the weekend of December 22-24, 2017, displayed as a horizontal bar plot. Here, the bars have been placed in descending order of the lengths of the movie titles. This arrangement of bars is arbitrary, it doesn't serve a meaningful purpose, and it makes the resulting figure much less intuitive than Figure \@ref(fig:boxoffice-horizontal). Data source: Box Office Mojo (http://www.boxofficemojo.com/). Used with permission
```{r boxoffice-horizontal-bad-order, fig.cap = '(ref:boxoffice-horizontal-bad-order)'}
p <- ggplot(boxoffice, aes(x = factor(title_short, levels = title_short[c(2, 1, 5, 3, 4)]),
y = amount)) +
geom_col(fill = "#56B4E9", alpha = 0.9) +
scale_y_continuous(limits = c(0, 7.5e7),
expand = c(0, 0),
breaks = c(0, 2e7, 4e7, 6e7),
labels = c("0", "20", "40", "60"),
name = "weekend gross (million USD)") +
scale_x_discrete(name = NULL,
expand = c(0, 0.5)) +
coord_flip(clip = "off") +
theme_dviz_vgrid(rel_small = 1) +
theme(
#axis.ticks.length = grid::unit(0, "pt"),
axis.line.y = element_blank(),
axis.ticks.y = element_blank()
)
stamp_bad(p)
```
We should only rearrange bars, however, when there is no natural ordering to the categories the bars represent. Whenever there is a natural ordering (i.e., when our categorical variable is an ordered factor) we should retain that ordering in the visualization. For example, Figure \@ref(fig:income-by-age) shows the median annual income in the U.S. by age groups. In this case, the bars should be arranged in order of increasing age. Sorting by bar height while shuffling the age groups makes no sense (Figure \@ref(fig:income-by-age-sorted)).
(ref:income-by-age) 2016 median U.S. annual household income versus age group. The 45--54 year age group has the highest median income. Data source: United States Census Bureau
```{r income-by-age, fig.cap = '(ref:income-by-age)'}
income_by_age %>% filter(race == "all") %>%
ggplot(aes(x = age, y = median_income)) +
geom_col(fill = "#56B4E9", alpha = 0.9) +
scale_y_continuous(expand = c(0, 0),
name = "median income (USD)",
breaks = c(0, 20000, 40000, 60000),
labels = c("$0", "$20,000", "$40,000", "$60,000")) +
xlab("age (years)") +
coord_cartesian(clip = "off") +
theme_dviz_hgrid() +
theme(
#axis.ticks.length = grid::unit(0, "pt"),
axis.ticks.x = element_blank(),
axis.line = element_blank(),
plot.margin = margin(3, 7, 3, 1.5)
)
```
(ref:income-by-age-sorted) 2016 median U.S. annual household income versus age group, sorted by income. While this order of bars looks visually appealing, the order of the age groups is now confusing. Data source: United States Census Bureau
```{r income-by-age-sorted, fig.cap = '(ref:income-by-age-sorted)'}
income_by_age %>% filter(race == "all") %>%
ggplot(aes(x = fct_reorder(age, desc(median_income)), y = median_income)) +
geom_col(fill = "#56B4E9", alpha = 0.9) +
scale_y_continuous(
expand = c(0, 0),
name = "median income (USD)",
breaks = c(0, 20000, 40000, 60000),
labels = c("$0", "$20,000", "$40,000", "$60,000")
) +
coord_cartesian(clip = "off") +
xlab("age (years)") +
theme_dviz_hgrid() +
theme(
#axis.ticks.length = grid::unit(0, "pt"),
axis.ticks.x = element_blank(),
axis.line = element_blank(),
plot.margin = margin(3, 7, 3, 1.5)
) -> p_income_sorted
stamp_bad(p_income_sorted)
```
```{block type='rmdtip', echo=TRUE}
Pay attention to the bar order. If the bars represent unordered categories, order them by ascending or descending data values.
```
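As a rough sketch of this rule in code, assuming the forcats package: unordered categories can be sorted by their data values with `fct_reorder()`, whereas categories with a natural order are best given explicit factor levels that are then left untouched. The variable names below are illustrative, and the age vector is deliberately shuffled to show that the levels, not the input order, determine the arrangement.

```r
library(forcats)

# unordered categories: order the levels by the data values (largest first)
titles <- c("Star Wars", "Jumanji", "Pitch Perfect 3", "Greatest Showman", "Ferdinand")
gross <- c(71565498, 36169328, 19928525, 8805843, 7316746)
titles_by_gross <- fct_reorder(titles, gross, .desc = TRUE)
levels(titles_by_gross)

# ordered categories: state the natural order explicitly and do not re-sort it
age_levels <- c("15 to 24", "25 to 34", "35 to 44", "45 to 54",
                "55 to 64", "65 to 74", "> 74")
age <- factor(c("45 to 54", "> 74", "15 to 24"), levels = age_levels)
levels(age)
```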
## Grouped and stacked bars
All examples from the previous subsection showed how a quantitative amount varied with respect to one categorical variable. Frequently, however, we are interested in two categorical variables at the same time. For example, the U.S. Census Bureau provides median income levels broken down by both age and race. We can visualize this dataset with a *grouped bar plot* (Figure \@ref(fig:income-by-age-race-dodged)). In a grouped bar plot, we draw a group of bars at each position along the *x* axis, determined by one categorical variable, and then we draw bars within each group according to the other categorical variable.
(ref:income-by-age-race-dodged) 2016 median U.S. annual household income versus age group and race. Age groups are shown along the *x* axis, and for each age group there are four bars, corresponding to the median income of Asian, white, Hispanic, and black people, respectively. Data source: United States Census Bureau
```{r income-by-age-race-dodged, fig.width = 5.5*6/4.2, fig.asp = 0.5, fig.cap = '(ref:income-by-age-race-dodged)'}
income_by_age %>% filter(race %in% c("white", "asian", "black", "hispanic")) %>%
mutate(race = fct_relevel(race, c("asian", "white", "hispanic", "black")),
race = fct_recode(race, Asian = "asian", Hispanic = "hispanic"),
age = fct_recode(age, "≥ 75" = "> 74")) -> income_df
# Take the darkest four colors from 5-class ColorBrewer palette "PuBu"
colors_four <- RColorBrewer::brewer.pal(5, "PuBu")[5:2]
ggplot(income_df, aes(x = age, y = median_income, fill = race)) +
geom_col(position = "dodge", alpha = 0.9) +
scale_y_continuous(
expand = c(0, 0),
name = "median income (USD)",
breaks = c(0, 20000, 40000, 60000, 80000, 100000),
labels = c("$0", "$20,000", "$40,000", "$60,000", "$80,000", "$100,000")
) +
scale_fill_manual(values = colors_four, name = NULL) +
coord_cartesian(clip = "off") +
xlab("age (years)") +
theme_dviz_hgrid() +
theme(
axis.line.x = element_blank(),
axis.ticks.x = element_blank()
) -> p_income_race_dodged
#stamp_ugly(p_income_race_dodged)
p_income_race_dodged
```
Grouped bar plots show a lot of information at once and they can be confusing. In fact, even though I have not labeled Figure \@ref(fig:income-by-age-race-dodged) as bad or ugly, I find it difficult to read. In particular, it is difficult to compare median incomes across age groups for a given racial group. So this figure is only appropriate if we are primarily interested in the differences in income levels among racial groups, separately for specific age groups. If we care more about the overall pattern of income levels among racial groups, it may be preferable to show race along the *x* axis and show ages as distinct bars within each racial group (Figure \@ref(fig:income-by-race-age-dodged)).
(ref:income-by-race-age-dodged) 2016 median U.S. annual household income versus age group and race. In contrast to Figure \@ref(fig:income-by-age-race-dodged), now race is shown along the *x* axis, and for each race we show seven bars according to the seven age groups. Data source: United States Census Bureau
```{r income-by-race-age-dodged, fig.width = 5.5*6/4.2, fig.asp = 0.4, fig.cap = '(ref:income-by-race-age-dodged) '}
# Take the darkest seven colors from 8-class ColorBrewer palette "PuBu"
colors_seven <- RColorBrewer::brewer.pal(8, "PuBu")[2:8]
ggplot(income_df, aes(x = race, y = median_income, fill = age)) +
geom_col(position = "dodge", alpha = 0.9) +
scale_y_continuous(
expand = c(0, 0),
name = "median income (USD)",
breaks = c(0, 20000, 40000, 60000, 80000, 100000),
labels = c("$0", "$20,000", "$40,000", "$60,000", "$80,000", "$100,000")
) +
scale_fill_manual(values = colors_seven, name = "age (yrs)") +
coord_cartesian(clip = "off") +
xlab(label = NULL) +
theme_dviz_hgrid() +
theme(
axis.line.x = element_blank(),
axis.ticks.x = element_blank(),
legend.title.align = 0.5
) -> p_income_age_dodged
p_income_age_dodged
```
Both Figures \@ref(fig:income-by-age-race-dodged) and \@ref(fig:income-by-race-age-dodged) encode one categorical variable by position along the *x* axis and the other by bar color. And in both cases, the encoding by position is easy to read while the encoding by bar color requires more mental effort, as we have to mentally match the colors of the bars against the colors in the legend. We can avoid this added mental effort by showing four separate regular bar plots rather than one grouped bar plot (Figure \@ref(fig:income-by-age-race-faceted)). Which of these various options we choose is ultimately a matter of taste. I would likely choose Figure \@ref(fig:income-by-age-race-faceted), because it circumvents the need for different bar colors.
(ref:income-by-age-race-faceted) 2016 median U.S. annual household income versus age group and race. Instead of displaying this data as a grouped bar plot, as in Figures \@ref(fig:income-by-age-race-dodged) and \@ref(fig:income-by-race-age-dodged), we now show the data as four separate regular bar plots. This choice has the advantage that we don't need to encode either categorical variable by bar color. Data source: United States Census Bureau
```{r income-by-age-race-faceted, fig.width = 5.5*6/4.2, fig.cap = '(ref:income-by-age-race-faceted)'}
income_df %>%
mutate(age = fct_recode(age, "15–24" = "15 to 24", "25–34" = "25 to 34", "35–44" = "35 to 44",
"45–54" = "45 to 54", "55–64" = "55 to 64", "65–74" = "65 to 74")) -> income_age_abbrev_df
ggplot(income_age_abbrev_df, aes(x = age, y = median_income)) +
geom_col(fill = "#56B4E9", alpha = 0.9) +
scale_y_continuous(
expand = c(0, 0),
name = "median income (USD)",
breaks = c(0, 20000, 40000, 60000, 80000, 100000),
labels = c("$0", "$20,000", "$40,000", "$60,000", "$80,000", "$100,000")
) +
coord_cartesian(clip = "off") +
xlab(label = "age (years)") +
facet_wrap(~race, scales = "free_x") +
theme_dviz_hgrid(14) +
theme(
#axis.ticks.length = grid::unit(0, "pt"),
axis.ticks.x = element_blank(),
axis.line = element_blank(),
strip.text = element_text(size = 14),
panel.spacing.y = grid::unit(14, "pt")
) -> p_income_age_faceted
p_income_age_faceted
```
Instead of drawing groups of bars side-by-side, it is sometimes preferable to stack bars on top of each other. Stacking is useful when the sum of the amounts represented by the individual stacked bars is in itself a meaningful amount. So, while it would not make sense to stack the median income values of Figure \@ref(fig:income-by-age-race-dodged) (the sum of two median income values is not a meaningful value), it might make sense to stack the weekend gross values of Figure \@ref(fig:boxoffice-vertical) (the sum of the weekend gross values of two movies is the total gross for the two movies combined). Stacking is also appropriate when the individual bars represent counts. For example, in a dataset of people, we can either count men and women separately or we can count them together. If we stack a bar representing a count of women on top of a bar representing a count of men, then the combined bar height represents the total count of people regardless of gender.
I will demonstrate this principle using a dataset about the passengers of the transatlantic ocean liner Titanic, which sank on April 15, 1912. On board were approximately 1300 passengers, not counting crew. The passengers were traveling in one of three classes (1st, 2nd, or 3rd), and there were almost twice as many male as female passengers on the ship. To visualize the breakdown of passengers by class and gender, we can draw separate bars for each class and gender and stack the bars representing women on top of the bars representing men, separately for each class (Figure \@ref(fig:titanic-passengers-by-class-sex)). The combined bars represent the total number of passengers in each class.
(ref:titanic-passengers-by-class-sex) Numbers of female and male passengers on the Titanic traveling in 1st, 2nd, and 3rd class.
```{r titanic-passengers-by-class-sex, fig.width = 5.5, fig.cap = '(ref:titanic-passengers-by-class-sex)'}
titanic_groups <- titanic_all %>%
  filter(class != "*") %>%
  select(class, sex) %>%
  group_by(class, sex) %>%
  tally() %>%
  arrange(class, desc(sex)) %>%
  mutate(sex = factor(sex, levels = c("female", "male"))) %>%
  group_by(class) %>%
  mutate(nlabel = cumsum(n) - n/2) %>%
  ungroup() %>%
  mutate(class = paste(class, "class"))
ggplot(titanic_groups, aes(x = class, y = n, fill = sex)) +
geom_col(position = "stack", color = "white", size = 1, width = 1) +
geom_text(
aes(y = nlabel, label = n), color = "white", size = 14/.pt,
family = dviz_font_family
) +
scale_x_discrete(expand = c(0, 0), name = NULL) +
scale_y_continuous(expand = c(0, 0), breaks = NULL, name = NULL) +
scale_fill_manual(
values = c("#D55E00", "#0072B2"),
breaks = c("female", "male"),
labels = c("female passengers ", "male passengers"),
name = NULL
) +
coord_cartesian(clip = "off") +
theme_dviz_grid() +
theme(
panel.grid.major = element_blank(),
axis.ticks = element_blank(),
axis.text = element_text(size = 14),
legend.position = "bottom",
legend.justification = "center",
legend.background = element_rect(fill = "white"),
legend.spacing.x = grid::unit(4.5, "pt"),
legend.spacing.y = grid::unit(0, "cm"),
legend.box.spacing = grid::unit(7, "pt")
)
```
Figure \@ref(fig:titanic-passengers-by-class-sex) differs from the previous bar plots I have shown in that there is no explicit *y* axis. I have instead shown the actual numerical values that each bar represents. Whenever a plot is meant to display only a small number of different values, it makes sense to add the actual numbers to the plot. This substantially increases the amount of information conveyed by the plot without adding much visual noise, and it removes the need for an explicit *y* axis.
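Placing a number inside each stacked segment requires the vertical midpoint of that segment, which is the running total up to and including the segment minus half of the segment's own height. The sketch below works through this arithmetic on made-up counts (the data frame `toy_counts` and its values are hypothetical, not the Titanic data), assuming that the rows of each group are listed from the bottom segment to the top one.

```r
library(dplyr)

# hypothetical counts; within each class, rows run from bottom segment to top
toy_counts <- data.frame(
  class = c("A", "A", "B", "B"),
  sex   = c("male", "female", "male", "female"),
  n     = c(10, 4, 6, 8)
)

# midpoint of each segment = cumulative height - half of the segment's height
toy_labels <- toy_counts %>%
  group_by(class) %>%
  mutate(nlabel = cumsum(n) - n / 2) %>%
  ungroup()

toy_labels$nlabel
```

For class A, the bottom segment spans 0 to 10, so its label sits at 5, and the top segment spans 10 to 14, so its label sits at 12.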
## Dot plots and heatmaps
Bars are not the only option for visualizing amounts. One important limitation of bars is that they need to start at zero, so that the bar length is proportional to the amount shown. For some datasets, this can be impractical or may obscure key features. In this case, we can indicate amounts by placing dots at the appropriate locations along the *x* or *y* axis.
Figure \@ref(fig:Americas-life-expect) demonstrates this visualization approach for a dataset of life expectancies in 25 countries in the Americas. The citizens of these countries have life expectancies between 60 and 81 years, and each individual life expectancy value is shown with a blue dot at the appropriate location along the *x* axis. By limiting the axis range to the interval from 60 to 81 years, the figure highlights the key features of this dataset: Canada has the highest life expectancy among all listed countries, and Bolivia and Haiti have much lower life expectancies than all other countries. If we had used bars instead of dots (Figure \@ref(fig:Americas-life-expect-bars)), we'd have made a much less compelling figure. Because the bars are so long in this figure, and they all have nearly the same length, the eye is drawn to the middle of the bars rather than to their end points, and the figure fails to convey its message.
(ref:Americas-life-expect) Life expectancies of countries in the Americas, for the year 2007. Data source: Gapminder project
```{r Americas-life-expect, fig.width = 6., fig.asp = .9, fig.cap = '(ref:Americas-life-expect)'}
library(gapminder)
df_Americas <- gapminder %>% filter(year == 2007, continent == "Americas")
ggplot(df_Americas, aes(x = lifeExp, y = fct_reorder(country, lifeExp))) +
geom_point(color = "#0072B2", size = 3) +
scale_x_continuous(
name = "life expectancy (years)",
limits = c(59.7, 81.5),
expand = c(0, 0)
) +
scale_y_discrete(name = NULL, expand = c(0, 0.5)) +
theme_dviz_grid(12, rel_small = 1) +
theme(
#axis.ticks.length = grid::unit(0, "pt"),
#axis.title = element_text(size = 12),
plot.margin = margin(18, 6, 3, 1.5)
)
```
(ref:Americas-life-expect-bars) Life expectancies of countries in the Americas, for the year 2007, shown as bars. This dataset is not suitable for being visualized with bars. The bars are too long and they draw attention away from the key feature of the data, the differences in life expectancy among the different countries. Data source: Gapminder project
```{r Americas-life-expect-bars, fig.width = 6., fig.asp = .9, fig.cap = '(ref:Americas-life-expect-bars)'}
life_bars <- ggplot(df_Americas, aes(y = lifeExp, x = fct_reorder(country, lifeExp))) +
geom_col(fill = "#56B4E9", alpha = 0.9) +
scale_y_continuous(
name = "life expectancy (years)",
limits = c(0, 85),
expand = c(0, 0)
) +
scale_x_discrete(name = NULL, expand = c(0, 0.5)) +
coord_flip(clip = "off") +
theme_dviz_vgrid(12, rel_small = 1) +
theme(
#axis.ticks.length = grid::unit(0, "pt"),
axis.line.y = element_blank(),
axis.ticks.y = element_blank(),
#axis.title = element_text(size = 12),
plot.margin = margin(18, 6, 3, 1.5)
)
stamp_bad(life_bars)
```
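A back-of-the-envelope calculation illustrates why the bars work poorly here. It uses the approximate data range stated above (60 to 81 years) together with the axis limits of the two figures; the helper `used_fraction()` is introduced only for this illustration.

```r
# approximate range of the life expectancies, as stated in the text
data_range <- c(60, 81)

# axis limits used for the dot plot and for the bar plot
limits_dots <- c(59.7, 81.5)
limits_bars <- c(0, 85)

# fraction of the axis length over which the data values actually vary
used_fraction <- function(limits) diff(data_range) / diff(limits)

used_fraction(limits_dots)  # about 0.96
used_fraction(limits_bars)  # about 0.25
```

In the dot plot, nearly the entire axis is available to show differences among countries, whereas with zero-based bars roughly three quarters of the axis is taken up by bar length that all countries share.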
Regardless of whether we use bars or dots, however, we need to pay attention to the ordering of the data values. In Figures \@ref(fig:Americas-life-expect) and \@ref(fig:Americas-life-expect-bars), the countries are ordered in descending order of life expectancy. If we instead ordered them alphabetically, we'd end up with a disordered cloud of points that is confusing and fails to convey a clear message (Figure \@ref(fig:Americas-life-expect-bad)).
(ref:Americas-life-expect-bad) Life expectancies of countries in the Americas, for the year 2007. Here, the countries are ordered alphabetically, which causes the dots to form a disordered cloud of points. This makes the figure difficult to read, and therefore it deserves to be labeled as "bad." Data source: Gapminder project
```{r Americas-life-expect-bad, fig.width = 6., fig.asp = .9, fig.cap = '(ref:Americas-life-expect-bad)'}
p <- ggplot(df_Americas, aes(x = lifeExp, y = fct_rev(country))) +
geom_point(color = "#0072B2", size = 3) +
scale_x_continuous(name = "life expectancy (years)",
limits = c(59.7, 81.5),
expand = c(0, 0)) +
scale_y_discrete(name = NULL, expand = c(0, 0.5)) +
theme_dviz_grid(12, rel_small = 1) +
theme(#axis.ticks.length = grid::unit(0, "pt"),
#axis.title = element_text(size = 12),
plot.margin = margin(18, 6, 3, 1.5))
stamp_bad(p)
```
All examples so far have represented amounts by location along a position scale, either through the end point of a bar or the placement of a dot. For very large datasets, neither of these options may be appropriate, because the resulting figure would become too busy. We have already seen in Figure \@ref(fig:income-by-age-race-dodged) that just seven groups of four data values can result in a figure that is complex and not that easy to read. If we had 20 groups of 20 data values, a similar figure would likely be highly confusing.
As an alternative to mapping data values onto positions via bars or dots, we can map data values onto colors. Such a figure is called a *heatmap*. Figure \@ref(fig:internet-over-time) uses this approach to show the percentage of internet users over time in 20 countries and for 23 years, from 1994 to 2016. While this visualization makes it harder to determine the exact data values shown (e.g., what's the exact percentage of internet users in the United States in 2015?), it does an excellent job of highlighting broader trends. We can see clearly in which countries internet use began early and in which it did not, and we can also see clearly which countries have high internet penetration in the final year covered by the dataset (2016).
(ref:internet-over-time) Internet adoption over time, for select countries. Color represents the percent of internet users for the respective country and year. Countries were ordered by percent internet users in 2016. Data source: World Bank
```{r internet-over-time, fig.width = 5.5*6/4.2, fig.cap = '(ref:internet-over-time)'}
country_list <- c(
  "United States", "China", "India", "Japan", "Algeria",
  "Brazil", "Germany", "France", "United Kingdom", "Italy", "New Zealand",
  "Canada", "Mexico", "Chile", "Argentina", "Norway", "South Africa", "Kenya",
  "Israel", "Iceland"
)
internet_short <- filter(internet, country %in% country_list) %>%
mutate(users = ifelse(is.na(users), 0, users))
internet_summary <- internet_short %>%
group_by(country) %>%
summarize(year1 = min(year[users > 0]),
last = users[n()]) %>%
arrange(last, desc(year1))
internet_short <- internet_short %>%
mutate(country = factor(country, levels = internet_summary$country))
ggplot(filter(internet_short, year > 1993),
aes(x = year, y = country, fill = users)) +
geom_tile(color = "white", size = 0.25) +
scale_fill_viridis_c(
option = "A", begin = 0.05, end = 0.98,
limits = c(0, 100),
name = "internet users / 100 people",
guide = guide_colorbar(
direction = "horizontal",
label.position = "bottom",
title.position = "top",
ticks = FALSE,
barwidth = grid::unit(3.5, "in"),
barheight = grid::unit(0.2, "in")
)
) +
scale_x_continuous(expand = c(0, 0), name = NULL) +
scale_y_discrete(name = NULL, position = "right") +
theme_dviz_open(12) +
theme(
axis.line = element_blank(),
axis.ticks = element_blank(),
axis.ticks.length = grid::unit(1, "pt"),
legend.position = "top",
legend.justification = "left",
legend.title.align = 0.5,
legend.title = element_text(size = 12*12/14)
)
```
As is the case with all other visualization approaches discussed in this chapter, we need to pay attention to the ordering of the categorical data values when making heatmaps. In Figure \@ref(fig:internet-over-time), countries are ordered by the percentage of internet users in 2016. This ordering places the United Kingdom, Japan, Canada, and Germany above the United States, because all these countries have higher internet penetration in 2016 than the United States does, even though the United States saw significant internet use at an earlier time. Alternatively, we could order countries by how early they started to see significant internet usage. In Figure \@ref(fig:internet-over-time2), countries are ordered by the year in which internet usage first rose to above 20%. In this figure, the United States falls into the third position from the top, and it stands out for having relatively low internet usage in 2016 compared to how early internet usage started there. A similar pattern can be seen for Italy. Israel and France, by contrast, started relatively late but gained ground rapidly.
(ref:internet-over-time2) Internet adoption over time, for select countries. Countries were ordered by the year in which their internet usage first exceeded 20%. Data source: World Bank
```{r internet-over-time2, fig.width = 5.5*6/4.2, fig.cap = '(ref:internet-over-time2)'}
internet_summary <- internet_short %>%
group_by(country) %>%
summarize(year1 = min(year[users > 20]),
last = users[n()]) %>%
arrange(desc(year1), last)
internet_short <- internet_short %>%
mutate(country = factor(country, levels = internet_summary$country))
ggplot(filter(internet_short, year > 1993),
aes(x = year, y = country, fill = users)) +
geom_tile(color = "white", size = 0.25) +
scale_fill_viridis_c(
option = "A", begin = 0.05, end = 0.98,
limits = c(0, 100),
name = "internet users / 100 people",
guide = guide_colorbar(
direction = "horizontal",
label.position = "bottom",
title.position = "top",
ticks = FALSE,
barwidth = grid::unit(3.5, "in"),
barheight = grid::unit(0.2, "in")
)
) +
scale_x_continuous(expand = c(0, 0), name = NULL) +
scale_y_discrete(name = NULL, position = "right") +
theme_dviz_open(12) +
theme(
axis.line = element_blank(),
axis.ticks = element_blank(),
axis.ticks.length = grid::unit(1, "pt"),
legend.position = "top",
legend.justification = "left",
legend.title.align = 0.5,
legend.title = element_text(size = 12*12/14)
)
```
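The ordering by first year above a threshold can be sketched on a small made-up dataset. The countries `A`, `B`, and `C` and their usage values are hypothetical and serve only to show how the factor levels come about; because the first level of a discrete *y* scale is drawn at the bottom, the latest adopter is listed first.

```r
library(dplyr)

# hypothetical usage data: three countries, four years each
toy_internet <- data.frame(
  country = rep(c("A", "B", "C"), each = 4),
  year    = rep(2000:2003, times = 3),
  users   = c( 5, 25, 40, 60,   # A first exceeds 20 in 2001
              30, 35, 40, 45,   # B first exceeds 20 in 2000
               1,  5, 10, 50)   # C first exceeds 20 in 2003
)

# per country: first year with usage above 20, and usage in the final year
toy_summary <- toy_internet %>%
  group_by(country) %>%
  summarize(year1 = min(year[users > 20]),
            last = users[n()]) %>%
  arrange(desc(year1), last)

# latest adopters come first and therefore end up at the bottom of the y axis
toy_internet$country <- factor(toy_internet$country, levels = toy_summary$country)
levels(toy_internet$country)
```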
Both Figures \@ref(fig:internet-over-time) and \@ref(fig:internet-over-time2) are valid representations of the data. Which one is preferred depends on the story we want to convey. If our story is about internet usage in 2016, then Figure \@ref(fig:internet-over-time) is probably the better choice. If, however, our story is about how early or late adoption of the internet relates to current-day usage, then Figure \@ref(fig:internet-over-time2) is preferable.