---
output:
  pdf_document: default
  html_document: default
editor_options:
  chunk_output_type: inline
---
```{r setup, include=FALSE}
library(tidyverse)
library(gapminder)
library(kableExtra)
# text width is 4.6 inch, but markdown scales down nicely even if we include widths and heights greater than 4.6
knitr::opts_chunk$set(fig.width = 2.8, fig.height = 2.8, cache = TRUE)
```
# Different types of plots {#chap04-h1}
\index{plots@\textbf{plots}}
\index{plots@\textbf{plots}!ggplot2}
> What I cannot create, I do not understand.
> Richard Feynman
There are a few different plotting packages in R, but the most elegant and versatile one is __ggplot2__^[The name of the package is __ggplot2__, but the function is called `ggplot()`. For everything you've ever wanted to know about the grammar of graphics in R, see @wickham2016.].
**gg** stands for **g**rammar of **g**raphics, which means that we can make a plot by describing it one component at a time.
In other words, we build a plot by adding layers to it.
This does not have to be many layers; the simplest `ggplot()` consists of just two components:
* the variables to be plotted;
* a geometrical object (e.g., point, line, bar, box, etc.).
`ggplot()` calls geometrical objects *geoms*.
Figure \@ref(fig:chap04-fig-steps) shows some example steps for building a scatter plot, including changing its appearance ('theme') and faceting - an efficient way of creating separate plots for subgroups.
```{r, message=F, include = FALSE}
library(tidyverse)
library(gapminder)
gapdata2007 <- gapminder %>%
  filter(year == 2007) %>%
  mutate(gdpPercap_thousands = gdpPercap/1000)
```
\index{functions@\textbf{functions}!ggplot}
```{r chap04-fig-steps, fig.height=12, fig.width=8, echo = FALSE, fig.cap = "Example steps for building and modifying a ggplot. (1) Initialising the canvas and defining variables, (2) adding points, (3) colouring points by continent, (4) changing point type, (5) faceting, (6) changing the plot theme and the scale of the x variable."}
library(patchwork)
step1 <- gapdata2007 %>%
  ggplot(aes(x = gdpPercap, y = lifeExp))
step2 <- gapdata2007 %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
step3 <- gapdata2007 %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, colour = continent)) +
  geom_point() +
  theme(legend.position = "none")
step4 <- gapdata2007 %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, colour = continent)) +
  geom_point(shape = 1)
step5 <- gapdata2007 %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, colour = continent)) +
  geom_point(shape = 1) +
  facet_wrap(~continent)
# step 6 uses theme_bw() to match the code shown in the text
step6 <- gapdata2007 %>%
  ggplot(aes(x = gdpPercap/1000, y = lifeExp, colour = continent)) +
  geom_point(shape = 1) +
  facet_wrap(~continent) +
  theme_bw()
(step1 | step2) / (step3 | step4) / step5 / step6 + plot_annotation(tag_levels = "1", tag_prefix = "(", tag_suffix = ")")
```
\clearpage
## Get the data {#chap04-data}
We are using the gapminder dataset (https://www.gapminder.org/data) that has been put into an R package by @bryan2017 so we can load it with `library(gapminder)`.
```{r, message=F, include = TRUE}
library(tidyverse)
library(gapminder)
glimpse(gapminder)
```
The dataset includes `r nrow(gapminder)` observations (rows) of `r ncol(gapminder)` variables (columns: `r colnames(gapminder)`).
`country`, `continent`, and `year` could be thought of as grouping variables, whereas `lifeExp` (life expectancy), `pop` (population), and `gdpPercap` (Gross Domestic Product per capita) are values.
The years in this dataset span `r range(gapminder$year) %>% paste(collapse = " to ")` with 5-year intervals (so a total of `r gapminder$year %>% n_distinct()` different years).
It includes `r gapminder$country %>% n_distinct()` countries from `r gapminder$continent %>% n_distinct()` continents (`r gapminder$continent %>% unique()`).
You can check that all of the numbers quoted above are correct with these lines:
```{r, results = "hide"}
library(tidyverse)
library(gapminder)
gapminder$year %>% unique()
gapminder$country %>% n_distinct()
gapminder$continent %>% unique()
```
Let's create a new shorter tibble called `gapdata2007` that only includes data for the year 2007.
```{r}
gapdata2007 <- gapminder %>%
  filter(year == 2007)
gapdata2007
```
The new tibble - `gapdata2007` - now shows up in your Environment tab, whereas `gapminder` does not.
Running `library(gapminder)` makes the dataset available to use, but to have it appear in your Environment tab you need to assign it to a new object. We do this in the line below, and the line plots later in this chapter use the resulting `gapdata`:
```{r}
# loads the gapminder dataset from the package environment
# into your Global Environment
gapdata <- gapminder
```
Both `gapdata` and `gapdata2007` now show up in the Environment tab and can be clicked on/quickly viewed as usual.
## Anatomy of ggplot explained {#chap04-gganatomy}
\index{plots@\textbf{plots}!anatomy of a plot}
\index{plots@\textbf{plots}!aes}
We will now explain the six steps shown in Figure \@ref(fig:chap04-fig-steps).
Note that you only need the first two to make a plot; the rest just show further functionality and optional customisations.
**(1)** Start by defining the variables, e.g., `ggplot(aes(x = var1, y = var2))`:
```{r, fig.keep = 'none'}
gapdata2007 %>%
  ggplot(aes(x = gdpPercap, y = lifeExp))
```
This creates the first plot in Figure \@ref(fig:chap04-fig-steps).
The above code is equivalent to:
```{r, fig.keep = 'none'}
ggplot(gapdata2007, aes(x = gdpPercap, y = lifeExp))
```
We tend to put the data first and then use the pipe (`%>%`) to send it to the `ggplot()` function.
This becomes useful when we add further data wrangling functions between the data and the `ggplot()`.
For example, our plotting pipelines often look like this:
```{r, fig.keep = 'none', eval = FALSE}
data %>%
  filter(...) %>%
  mutate(...) %>%
  ggplot(aes(...)) +
  ...
```
The lines that come before the `ggplot()` function are piped, whereas from `ggplot()` onwards you have to use `+`.
This is because we are now adding different layers and customisations to the same plot.
`aes()` stands for **aes**thetics - things we can see.
Variables are always inside the `aes()` function, which in turn is inside a `ggplot()`.
Take a moment to appreciate the double closing brackets `))` - the first one belongs to `aes()`, the second one to `ggplot()`.
**(2)** Choose and add a geometrical object
Let's ask `ggplot()` to draw a point for each observation by adding `geom_point()`:
```{r, fig.keep = 'none'}
gapdata2007 %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
```
We have now created the second plot in Figure \@ref(fig:chap04-fig-steps), a scatter plot.
If we copy the above code and change just one thing - the `x` variable from `gdpPercap` to `continent` (which is a categorical variable) - we get what's called a strip plot.
This means we are now plotting a continuous variable (`lifeExp`) against a categorical one (`continent`).
But the thing to note is that the rest of the code stays exactly the same; all we did was change the `x = `.
```{r chap04-fig-stripplot, fig.width=3, fig.cap = "A strip plot using `geom_point()`."}
gapdata2007 %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_point()
```
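Because a plot is built up with `+`, the canvas from step (1) can also be stored in an object and a geom added to it afterwards. The chunk below is a minimal sketch of this idea rather than something the rest of the chapter relies on; the object names `canvas` and `scatter` are our own and purely illustrative.

```{r, fig.keep = 'none'}
library(tidyverse)
library(gapminder)

# Same filtering as earlier in the chapter
gapdata2007 <- gapminder %>%
  filter(year == 2007)

# Step (1): the canvas with the variables defined, but no geom yet
canvas <- gapdata2007 %>%
  ggplot(aes(x = gdpPercap, y = lifeExp))

# Step (2): add a geom to the stored canvas
scatter <- canvas + geom_point()

# layer_data() returns the data ggplot2 computed for a layer:
# one row for each point drawn, so one row for each country
nrow(layer_data(scatter))
```

Storing the canvas is handy when you want to try several geoms on the same variables without retyping the `ggplot(aes())` line.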
**(3)** Specifying further variables inside `aes()`
Going back to the scatter plot (`lifeExp` vs `gdpPercap`), let's use `continent` to give the points some colour.
We can do this by adding `colour = continent` inside the `aes()`:
```{r, fig.width=4, fig.keep = 'none'}
gapdata2007 %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, colour = continent)) +
  geom_point()
```
This creates the third plot in Figure \@ref(fig:chap04-fig-steps). It uses the default colour scheme and automatically includes a legend.
We are still using just two lines of code (`ggplot(...)` + `geom_point()`).
**(4)** Specifying aesthetics outside `aes()`
It is very important to understand the difference between including `ggplot` arguments inside or outside of the `aes()` function.
<!-- Graphical example of the following suggested by SK in comments -->
The main aesthetics (things we can see) are: **x**, **y**, **colour**, **fill**, **shape**, **size**, and any of these could appear inside or outside the `aes()` function.
Press F1 on, e.g., `geom_point()`, to see the full list of aesthetics that can be used with this geom (this opens the Help tab).
If F1 is hard to summon on your keyboard, type in and run `?geom_point`.
Variables (so columns of your dataset) have to be defined inside `aes()`.
To apply a modification to everything, on the other hand, we set an aesthetic to a constant value outside of `aes()`.
For example, Figure \@ref(fig:chap04-fig-shapes) shows a selection of the point shapes built into R. The default shape used by `geom_point()` is a solid circle (shape number 19, which looks almost identical to shape 16 in the figure).
```{r chap04-fig-shapes, echo = FALSE, fig.width = 6, fig.cap="A selection of shapes for plotting. Shapes 0, 1, and 2 are hollow, whereas for shapes 21, 22, and 23 we can define both a colour and a fill (for the shapes, colour is the border around the fill).", warning = FALSE}
shapes <- tibble(shape = c(0, 1, 2, 4, 8, 15, 16, 17, 21, 22, 23), x = 0:10)
ggplot(shapes, aes(x, y = 1)) +
  geom_text(aes(label = shape), hjust = 0, nudge_x = -0.55) +
  geom_point(aes(shape = shape), size = 5, fill = "#2171b5") +
  scale_shape_identity() +
  theme_void()
```
To make all of the points in our figure hollow, let's set their shape to 1.
We do this by adding `shape = 1` inside the `geom_point()`:
```{r, fig.keep = 'none'}
gapdata2007 %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, colour = continent)) +
  geom_point(shape = 1)
```
This creates the fourth plot in Figure \@ref(fig:chap04-fig-steps).
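The difference between mapping a variable and setting a constant can be checked directly. In this sketch (the object names `p_constant` and `p_mapped` are our own), the first plot sets `colour` to a constant outside `aes()`. The second puts the same constant inside `aes()`, which is a common mistake: `ggplot()` then treats `"blue"` as a variable with a single value, picks a colour from its default palette, and adds a legend.

```{r, fig.keep = 'none'}
library(tidyverse)
library(gapminder)

gapdata2007 <- gapminder %>%
  filter(year == 2007)

# Constant set outside aes(): every point is drawn in blue
p_constant <- gapdata2007 %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(colour = "blue", shape = 1)

# Constant placed inside aes(): "blue" is treated as data, not as a colour
p_mapped <- gapdata2007 %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, colour = "blue")) +
  geom_point(shape = 1)

# Compare the colours that were actually used for the points
unique(layer_data(p_constant)$colour)
unique(layer_data(p_mapped)$colour)
```

If a plot unexpectedly shows a legend with a single entry and the wrong colour, a constant inside `aes()` is a likely cause.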
<!-- This doesn't flow nicely to my eye, perhaps an extra space before (5)? SK -->
**(5)** From one plot to multiple with a single extra line
Faceting is a way to efficiently create the same plot for subgroups within the dataset.
For example, we can separate each continent into its own facet by adding `facet_wrap(~continent)` to our plot:
```{r fig.height=3.5, fig.width=7, fig.keep = 'none'}
gapdata2007 %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, colour = continent)) +
  geom_point(shape = 1) +
  facet_wrap(~continent)
```
This creates the fifth plot in Figure \@ref(fig:chap04-fig-steps).
Note that we have to use the tilde (~) in `facet_wrap()`.
There is a similar function called `facet_grid()` that will create a grid of plots based on two grouping variables, e.g., `facet_grid(var1~var2)`.
Furthermore, facets are happy to quickly separate data based on a condition (so something you would usually use in a filter).
```{r chap04-fig-facetcond, fig.width=6, fig.cap = "Using a filtering condition (e.g., population > 50 million) directly inside a `facet_wrap()`."}
gapdata2007 %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, colour = continent)) +
  geom_point(shape = 1) +
  facet_wrap(~pop > 50000000)
```
On this plot, the facet `FALSE` includes countries with a population of 50 million or fewer, and the facet `TRUE` includes countries with a population greater than 50 million.
The tilde (~) in R denotes dependency.
It is mostly used by statistical models to define dependent and explanatory variables and you will see it a lot in the second part of this book.
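As a sketch of the `facet_grid(var1~var2)` form mentioned above, the chunk below puts years in rows and continents in columns. The choice of the years 1957 and 2007 and the object name `p_grid` are our own, made only to keep the grid small.

```{r, fig.keep = 'none'}
library(tidyverse)
library(gapminder)

# Rows are years, columns are continents: 2 x 5 = 10 panels
p_grid <- gapminder %>%
  filter(year %in% c(1957, 2007)) %>%
  ggplot(aes(x = gdpPercap/1000, y = lifeExp)) +
  geom_point(shape = 1) +
  facet_grid(year ~ continent)

p_grid
```

The variable on the left of the tilde defines the rows and the one on the right defines the columns.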
**(6)** Grey to white background - changing the theme
Overall, we can customise every single thing on a ggplot.
Font type, colour, size or thickness of any lines or numbers, background, you name it.
But a very quick way to change the appearance of a ggplot is to apply a different theme.
The signature ggplot theme has a light grey background and white grid lines (Figure \@ref(fig:chap04-fig-themes)).
```{r chap04-fig-themes, echo = FALSE, fig.width=5, fig.height=3.5, fig.cap = "Some of the built-in ggplot themes: (1) default, (2) `theme_bw()`, (3) `theme_dark()`, (4) `theme_classic()`."}
themeplot <- gapminder %>%
  ggplot(aes(x = year, y = lifeExp)) +
  facet_wrap(~"LABEL")
(themeplot + (themeplot + theme_bw())) / ((themeplot + theme_dark()) + (themeplot + theme_classic())) + plot_annotation(tag_levels = "1", tag_prefix = "(", tag_suffix = ")")
```
As a final step, we are adding `theme_bw()` ("background white") to give the plot a different look.
We have also divided the gdpPercap by 1000 (making the units "thousands of dollars per capita").
Note that you can apply calculations directly on ggplot variables (as we have done with `x = gdpPercap/1000` here).
```{r fig.height=3.5, fig.width=7, fig.keep = 'none'}
gapdata2007 %>%
  ggplot(aes(x = gdpPercap/1000, y = lifeExp, colour = continent)) +
  geom_point(shape = 1) +
  facet_wrap(~continent) +
  theme_bw()
```
This creates the last plot in Figure \@ref(fig:chap04-fig-steps).
This is how `ggplot()` works - you can build a plot by adding or modifying things one by one.
## Set your theme - grey vs white
If you find yourself always adding the same theme to your plot (e.g., we really like the `+ theme_bw()`), you can use `theme_set()` so your chosen theme is applied to every plot you draw:
```{r}
theme_set(theme_bw())
```
In fact, we usually have these two lines at the top of every script:
```{r}
library(tidyverse)
theme_set(theme_bw())
```
Furthermore, we can customise anything that appears in a `ggplot()` from axis fonts to the exact grid lines, and much more.
That's what Chapter \@ref(finetuning): Fine tuning plots is all about, but here we are focussing on the basic functionality and how different geoms work.
But from now on, `theme_bw()` is automatically applied to everything we make.
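If you ever need to go back to the grey default, note that `theme_set()` invisibly returns the theme that was active before the change. The sketch below stores it under the name `old_theme` (our own choice) and restores it later. The chunk is marked `eval = FALSE` so that it does not undo the `theme_bw()` setting used in the rest of this chapter.

```{r, eval = FALSE}
library(tidyverse)

# theme_set() invisibly returns the theme that was active before the change
old_theme <- theme_set(theme_bw())

# ... draw plots with the white background ...

# Restore the previous theme
theme_set(old_theme)
```

`theme_get()` shows the theme that is currently active, which is useful for checking what a script has set.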
## Scatter plots/bubble plots
\index{plots@\textbf{plots}!scatter}
\index{plots@\textbf{plots}!bubble}
The ggplot anatomy (Section \@ref(chap04-gganatomy)) covered both scatter and strip plots (both created with `geom_point()`).
Another cool thing about this geom is that adding a size aesthetic makes it into a bubble plot.
For example, let's size the points by population.
As you would expect from a "grammar of graphics plot", this is as simple as adding `size = pop` as an aesthetic:
```{r, fig.keep = 'none'}
gapdata2007 %>%
  ggplot(aes(x = gdpPercap/1000, y = lifeExp, size = pop)) +
  geom_point()
```
With increased bubble sizes, there is some overplotting, so let's make the points hollow (`shape = 1`) and slightly transparent (`alpha = 0.5`):
```{r, fig.keep = 'none'}
gapdata2007 %>%
  ggplot(aes(x = gdpPercap/1000, y = lifeExp, size = pop)) +
  geom_point(shape = 1, alpha = 0.5)
```
The resulting bubble plots are shown in Figure \@ref(fig:chap04-fig-bubble).
```{r chap04-fig-bubble, echo = FALSE, fig.width=0.6*10, fig.height = 0.6*4, fig.cap = "Turn the scatter plot from Figure \\@ref(fig:chap04-fig-steps):(2) to a bubble plot by (1) adding `size = pop` inside the `aes()`, (2) make the points hollow and transparent."}
theme_set(theme_bw())
p1 <- gapdata2007 %>%
  ggplot(aes(x = gdpPercap/1000, y = lifeExp, size = pop)) +
  geom_point()
p2 <- gapdata2007 %>%
  ggplot(aes(x = gdpPercap/1000, y = lifeExp, size = pop)) +
  geom_point(shape = 1, alpha = 0.5) +
  theme(legend.position = "none")
p1 + p2 + plot_annotation(tag_levels = "1", tag_prefix = "(", tag_suffix = ")")
```
Alpha is an aesthetic that makes geoms transparent; its values can range from 0 (invisible) to 1 (solid).
## Line plots/time series plots
\index{plots@\textbf{plots}!line}
\index{plots@\textbf{plots}!path}
\index{plots@\textbf{plots}!time-series}
Let's plot the life expectancy in the United Kingdom over time (Figure \@ref(fig:chap04-fig-lineplot)):
```{r chap04-fig-lineplot, fig.width = 4, fig.height = 1.5, fig.cap="`geom_line()`- Life expectancy in the United Kingdom over time."}
gapdata %>%
  filter(country == "United Kingdom") %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line()
```
As a recap, the steps in the code above are:
* send `gapdata` into a `filter()`;
* inside the `filter()`, our condition is `country == "United Kingdom"`;
* initialise `ggplot()` and define our main variables: `aes(x = year, y = lifeExp)`;
* use a new geom - `geom_line()`.
This is identical to how we used `geom_point()`.
In fact, just changing `line` to `point` in the code above works - instead of a continuous line you'll get a point for every 5 years, as in the dataset.
But what if we want to draw multiple lines, e.g., for each country in the dataset?
Let's send the whole dataset to `ggplot()` and `geom_line()`:
```{r, fig.keep = 'none'}
gapdata %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line()
```
<!-- EMH: I've used A, B, C for plot facet labels. Please can we change here as mine are in Illustrator -->
The reason you see this weird zigzag in Figure \@ref(fig:chap04-fig-zigzag) (1) is that, using the above code, `ggplot()` does not know which points to connect with which.
Yes, you know you want a line for each country, but you haven't told it that.
So for drawing multiple lines, we need to add a `group` aesthetic, in this case `group = country`:
```{r, fig.keep = 'none'}
gapdata %>%
  ggplot(aes(x = year, y = lifeExp, group = country)) +
  geom_line()
```
```{r chap04-fig-zigzag, echo = FALSE, fig.width = 7, fig.height=3, fig.cap = "The 'zig-zag plot' is a common mistake: Using `geom_line()` (1) without a `group` specified, (2) after adding `group = country`."}
theme_set(theme_bw())
p1 <- gapdata %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line()
p2 <- gapdata %>%
  ggplot(aes(x = year, y = lifeExp, group = country)) +
  geom_line()
p1 + p2 + plot_annotation(tag_levels = "1", tag_prefix = "(", tag_suffix = ")")
```
This code works as expected (Figure \@ref(fig:chap04-fig-zigzag) (2)) - yes there is a lot of overplotting but that's just because we've included `r gapdata$country %>% n_distinct()` lines on a single plot.
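To see what the `group` aesthetic changes behind the scenes, we can count how many separate groups (and therefore lines) each version of the plot contains. This is a sketch using `layer_data()`, which returns the data `ggplot()` computed for a layer; the object names are our own, and we use `gapminder` directly so the chunk runs by itself.

```{r, fig.keep = 'none'}
library(tidyverse)
library(gapminder)

# No group aesthetic: all observations are joined into one zig-zag line
p_nogroup <- gapminder %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line()

# group = country: one line for each country
p_group <- gapminder %>%
  ggplot(aes(x = year, y = lifeExp, group = country)) +
  geom_line()

# Number of separate lines ggplot2 will draw in each case
n_distinct(layer_data(p_nogroup)$group)
n_distinct(layer_data(p_group)$group)
```

Mapping a categorical variable to `colour` would also split the data into groups, but only one group per level of that variable, which is why `group = country` is still needed when colouring by continent.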
### Exercise {#chap04-ex-lineplot}
Follow the step-by-step instructions to transform Figure \@ref(fig:chap04-fig-zigzag)(2) into Figure \@ref(fig:chap04-fig-lineplot2).
```{r chap04-fig-lineplot2, fig.width=0.8*10, echo = FALSE, fig.height=0.8*4, fig.cap = "Lineplot exercise."}
gapdata %>%
  ggplot(aes(x = year, y = lifeExp, group = country, colour = continent)) +
  geom_line() +
  facet_wrap(~continent) +
  theme_bw() +
  scale_colour_brewer(palette = "Paired")
```
* Colour lines by continents: add `colour = continent` inside `aes()`;
* Continents on separate facets: `+ facet_wrap(~continent)`;
* Use a nicer colour scheme: `+ scale_colour_brewer(palette = "Paired")`.
## Bar plots
\index{plots@\textbf{plots}!bar}
\index{plots@\textbf{plots}!column}
There are two geoms for making bar plots - `geom_col()` and `geom_bar()` and the examples below will illustrate when to use which one.
In short: if your data is already summarised or includes values for `y` (height of the bars), use `geom_col()`.
If, however, you want `ggplot()` to count up the number of rows in your dataset, use `geom_bar()`.
For example, with patient-level data (each row is a patient) you'll probably want to use `geom_bar()`, whereas with data that is already somewhat aggregated, you'll use `geom_col()`.
There is no harm in trying one, and if it doesn't work, trying the other.
### Summarised data
* `geom_col()` requires two variables `aes(x = , y = )`
* `x` is categorical, `y` is continuous (numeric)
Let's plot the life expectancies in 2007 in three countries (the United Kingdom, France, and Germany):
```{r, fig.keep = 'none'}
gapdata2007 %>%
  filter(country %in% c("United Kingdom", "France", "Germany")) %>%
  ggplot(aes(x = country, y = lifeExp)) +
  geom_col()
```
This gives us Figure \@ref(fig:chap04-fig-col)(1).
We have also created another cheeky one using the same code but changing the scale of the y axis to be more dramatic (Figure \@ref(fig:chap04-fig-col)(2)).
```{r chap04-fig-col, echo = FALSE, fig.width=6.5, fig.cap = "Bar plots using `geom_col()`: (1) using the code example, (2) same plot but with `+ coord_cartesian(ylim=c(79, 81))` to manipulate the scale into something a lot more dramatic."}
theme_set(theme_bw())
p1 <- gapdata2007 %>%
  filter(country %in% c("United Kingdom", "France", "Germany")) %>%
  ggplot(aes(x = country, y = lifeExp)) +
  geom_col()
p2 <- gapdata2007 %>%
  filter(country %in% c("United Kingdom", "France", "Germany")) %>%
  ggplot(aes(x = country, y = lifeExp)) +
  geom_col() +
  coord_cartesian(ylim = c(79, 81))
p1 + p2 + plot_annotation(tag_levels = "1", tag_prefix = "(", tag_suffix = ")")
```
\FloatBarrier
### Countable data
* `geom_bar()` requires a single variable `aes(x = )`
* this `x` should be a categorical variable
* `geom_bar()` then counts up the number of observations (rows) for this variable and plots them as bars.
Our `gapdata2007` tibble has a row for each country (see end of Section \@ref(chap04-data) to remind yourself).
Therefore, if we use the `count()` function on the `continent` variable, we are counting up the number of countries on each continent (in this dataset^[The number of countries in this dataset is 142, whereas the United Nations have 193 member states.]):
```{r}
gapdata2007 %>%
  count(continent)
```
So `geom_bar()` basically runs the `count()` function and plots it (see how the bars on Figure \@ref(fig:chap04-fig-bar) are the same height as the values from `count(continent)`).
```{r chap04-fig-bar, fig.width=6.4, echo = FALSE, fig.cap = "`geom_bar()` counts up the number of observations for each group. (1) `gapdata2007 %>% ggplot(aes(x = continent)) + geom_bar()`, (2) same + a little bit of magic to reveal the underlying data."}
theme_set(theme_bw())
p1 <- gapdata2007 %>%
  ggplot(aes(x = continent)) +
  geom_bar()
p2 <- gapdata2007 %>%
  ggplot(aes(x = continent, colour = country)) +
  geom_bar(fill = NA) +
  theme(legend.position = "none")
p1 + p2 + plot_annotation(tag_levels = "1", tag_prefix = "(", tag_suffix = ")")
```
The first barplot in Figure \@ref(fig:chap04-fig-bar) is produced with just this:
```{r, fig.keep = 'none'}
gapdata2007 %>%
  ggplot(aes(x = continent)) +
  geom_bar()
```
Whereas on the second one, we've asked `geom_bar()` to reveal the components (countries) in a colourful way:
```{r, fig.keep = 'none'}
gapdata2007 %>%
  ggplot(aes(x = continent, colour = country)) +
  geom_bar(fill = NA) +
  theme(legend.position = "none")
```
We have added `theme(legend.position = "none")` to remove the legend - it includes all 142 countries and is not very informative in this case.
We're only including the colours for a bit of fun.
We're also removing the fill by setting it to NA (`fill = NA`).
Note how we defined `colour = country` inside the `aes()` (as it's a variable), but we put the fill inside `geom_bar()` as a constant.
This was explained in more detail in steps (3) and (4) of the ggplot anatomy (Section \@ref(chap04-gganatomy)).
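To check the idea that `geom_bar()` behaves like `count()` followed by `geom_col()`, the sketch below builds the plot both ways and compares the bar heights. The object names (`continent_counts`, `p_col`, `p_bar`) are our own.

```{r, fig.keep = 'none'}
library(tidyverse)
library(gapminder)

gapdata2007 <- gapminder %>%
  filter(year == 2007)

# Count the rows ourselves, then plot the counts with geom_col()
continent_counts <- gapdata2007 %>%
  count(continent)

p_col <- continent_counts %>%
  ggplot(aes(x = continent, y = n)) +
  geom_col()

# Let geom_bar() do the counting
p_bar <- gapdata2007 %>%
  ggplot(aes(x = continent)) +
  geom_bar()

# The bar heights agree
all(layer_data(p_col)$y == layer_data(p_bar)$y)
```

Summarising first and using `geom_col()` is the more flexible route when you want to plot something other than a simple count, such as a mean or a percentage.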
### `colour` vs `fill`
\index{plots@\textbf{plots}!colour}
\index{plots@\textbf{plots}!fill}
Figure \@ref(fig:chap04-fig-bar) also reveals the difference between a colour and a fill.
Colour is the border around a geom, whereas fill is inside it.
Both can either be set based on a variable in your dataset (this means `colour = ` or `fill = ` needs to be inside the `aes()` function), or they could be set to a fixed colour.
R has an amazing knowledge of colour.
In addition to knowing what is "white", "yellow", "red", "green" etc. (meaning we can simply do `geom_bar(fill = "green")`), it also knows what "aquamarine", "blanchedalmond", "coral", "deeppink", "lavender", "deepskyblue" look like (amongst many many others; search the internet for "R colours" for a full list).
<!-- HEX code website for reference include? -->
We can also use Hex colour codes, for example, `geom_bar(fill = "#FF0099")` is a very pretty pink.
Any colour that a screen can display can be represented with a Hex code, and the codes are understood by most plotting or image making programmes.
You can find Hex colour codes in a lot of places on the internet, https://www.color-hex.com to name just one.
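Base R can also list its colour names and convert between names, red-green-blue values, and Hex codes. The lines below are a small sketch using the base functions `colours()`, `col2rgb()`, and `rgb()`; the last line rebuilds the pink mentioned above from its red, green, and blue components.

```{r}
# The first few colour names that R recognises
head(colours())

# Check whether a particular name is known
"blanchedalmond" %in% colours()

# Red, green, and blue components (0-255) of a named colour
col2rgb("deepskyblue")

# Build a Hex code from red, green, and blue components
rgb(255, 0, 153, maxColorValue = 255)
```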
### Proportions {#chap04-proportions}
Whether using `geom_bar()` or `geom_col()`, we can use fill to display proportions within bars.
Furthermore, sometimes it's useful to set the x value to a constant - to get everything plotted together rather than separated by a variable.
So we are using `aes(x = "Global", fill = continent)`.
Note that "Global" could be any word - since it's quoted `ggplot()` won't go looking for it in the dataset (Figure \@ref(fig:chap04-fig-proportions)):
```{r chap04-fig-proportions, fig.cap = "Number of countries in the gapminder dataset with proportions using the `fill = continent` aesthetic."}
gapdata2007 %>%
  ggplot(aes(x = "Global", fill = continent)) +
  geom_bar()
```
There are more examples of bar plots in Chapter \@ref(chap08-h1).
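The stacked bar above still has counts on its y axis. If you want the axis itself to show proportions, one option is the `position = "fill"` argument of `geom_bar()`, which rescales each bar to a total of 1. The chunk below is a sketch of this; the object name `p_fill` is our own, and the second pipeline calculates the same proportions as a table.

```{r, fig.keep = 'none'}
library(tidyverse)
library(gapminder)

gapdata2007 <- gapminder %>%
  filter(year == 2007)

# position = "fill" stretches the stacked bar to a total height of 1
p_fill <- gapdata2007 %>%
  ggplot(aes(x = "Global", fill = continent)) +
  geom_bar(position = "fill")

p_fill

# The same proportions as numbers
continent_props <- gapdata2007 %>%
  count(continent) %>%
  mutate(proportion = n / sum(n))

continent_props
```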
### Exercise {#chap04-ex-barplot}
Create Figure \@ref(fig:chap04-fig-bar-exercise) of life expectancies in European countries (year 2007).
```{r chap04-fig-bar-exercise, fig.width = 4.5, echo = FALSE, fig.height=5, fig.cap = "Barplot exercise. Life expectancies in European countries in year 2007 from the gapminder dataset."}
gapdata %>%
  filter(year == 2007) %>%
  filter(continent == "Europe") %>%
  ggplot(aes(x = fct_reorder(country, lifeExp), y = lifeExp)) +
  geom_col(colour = "deepskyblue", fill = NA) +
  coord_flip() +
  theme_classic()
```
Hints:
* If `geom_bar()` doesn't work, try `geom_col()` or vice versa.
* Use `coord_flip()` to make the bars horizontal (it flips the `x` and `y` axes).
* `x = country` gets the country bars plotted in alphabetical order; use `x = fct_reorder(country, lifeExp)`, still inside the `aes()`, to order the bars by their `lifeExp` values. Or try one of the other variables (`pop`, `gdpPercap`) as the second argument to `fct_reorder()`.
* When using `fill = NA`, you also need to include a colour; we're using `colour = "deepskyblue"` inside the `geom_col()`.
\FloatBarrier
## Histograms
\index{plots@\textbf{plots}!histogram}
A histogram displays the distribution of values within a continuous variable.
In the example below, we are taking the life expectancy (`aes(x = lifeExp)`) and telling the histogram to count the observations up in "bins" of 10 years (`geom_histogram(binwidth = 10)`, Figure \@ref(fig:chap04-fig-hist)):
```{r include=FALSE}
# don't understand what keeps resetting it! patchwork?
theme_set(theme_bw())
```
```{r chap04-fig-hist, fig.width=4, fig.cap = "`geom_histogram()` - The distribution of life expectancies in different countries around the world in year 2007."}
gapdata2007 %>%
  ggplot(aes(x = lifeExp)) +
  geom_histogram(binwidth = 10)
```
We can see that most countries in the world have a life expectancy of ~70-80 years (in 2007), and that life expectancies globally are not normally distributed.
Setting the binwidth is optional; using just `geom_histogram()` works well too - by default, it will divide your data into 30 bins.
There are more examples of histograms in Chapter \@ref(chap06-h1). There are two other geoms that are useful for plotting distributions: `geom_density()` and `geom_freqpoly()`.
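The choice of `binwidth` changes how the histogram looks, but not what is being counted. In the sketch below (the object names and the narrower bin width of 2 years are our own choices), the same data is binned in two ways and the counts are added up to confirm that every country is still included once.

```{r, fig.keep = 'none'}
library(tidyverse)
library(gapminder)

gapdata2007 <- gapminder %>%
  filter(year == 2007)

# Wide bins: 10 years each, as in the example above
p_hist10 <- gapdata2007 %>%
  ggplot(aes(x = lifeExp)) +
  geom_histogram(binwidth = 10)

# Narrow bins: 2 years each, showing more detail
p_hist2 <- gapdata2007 %>%
  ggplot(aes(x = lifeExp)) +
  geom_histogram(binwidth = 2)

# Whatever the bin width, the bar heights add up to the number of countries
sum(layer_data(p_hist10)$count)
sum(layer_data(p_hist2)$count)
```

Trying a few bin widths is worthwhile, as very wide bins can hide features of the distribution and very narrow ones can make it look noisy.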
## Box plots
\index{plots@\textbf{plots}!boxplot}
Box plots are our go-to method for quickly visualising summary statistics of a continuous outcome variable (such as life expectancy in the gapminder dataset, Figure \@ref(fig:chap04-fig-boxplot)).
Box plots include:
* the median (middle line in the box)
* inter-quartile range (IQR, the bottom and top edges of the box - this is where the middle 50% of your data is)
* whiskers (the black lines extending to the lowest and highest values that are still within 1.5*IQR of the box edges)
* outliers (any observations outside the whiskers)
```{r chap04-fig-boxplot, fig.width=3, fig.height=2.75, fig.cap = "`geom_boxplot()` - Boxplots of life expectancies within each continent in year 2007."}
gapdata2007 %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot()
```
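The numbers behind a box plot can be calculated with `group_by()` and `summarise()`. The sketch below does this for the median and the quartiles; the column names and the object name `boxplot_stats` are our own. We assume that the default quantile method in R matches the one `geom_boxplot()` uses, so treat any small differences from the plot as a sign to check its documentation.

```{r}
library(tidyverse)
library(gapminder)

gapdata2007 <- gapminder %>%
  filter(year == 2007)

boxplot_stats <- gapdata2007 %>%
  group_by(continent) %>%
  summarise(
    median = median(lifeExp),          # middle line in the box
    q1     = quantile(lifeExp, 0.25),  # bottom edge of the box
    q3     = quantile(lifeExp, 0.75),  # top edge of the box
    iqr    = IQR(lifeExp)              # height of the box (q3 - q1)
  )

boxplot_stats
```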
## Multiple geoms, multiple `aes()`
One of the coolest things about `ggplot()` is that we can plot multiple geoms on top of each other!
Let's add individual data points on top of the box plots:
```{r include=FALSE}
# don't understand what keeps resetting it! patchwork?
theme_set(theme_bw())
```
```{r, fig.keep = 'none'}
gapdata2007 %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot() +
  geom_point()
```
This makes Figure \@ref(fig:chap04-fig-multigeoms)(1).
```{r chap04-fig-multigeoms, echo=FALSE, fig.width=0.8*10, fig.height=0.8*8, fig.cap = "Multiple geoms together. (1) `geom_boxplot() + geom_point()`, (2) `geom_boxplot() + geom_jitter()`, (3) colour aesthetic inside `ggplot(aes())`, (4) colour aesthetic inside `geom_jitter(aes())`."}
theme_set(theme_bw())
p1 <- gapdata2007 %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot() +
  geom_point()
p2 <- gapdata2007 %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot() +
  geom_jitter()
p3 <- gapdata2007 %>%
  ggplot(aes(x = continent, y = lifeExp, colour = continent)) +
  geom_boxplot() +
  geom_jitter(position = position_jitter(seed = 1)) +
  theme(legend.position = "none")
p4 <- gapdata2007 %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot() +
  geom_jitter(aes(colour = continent), position = position_jitter(seed = 1))
p1 + p2 + p3 + p4 + plot_annotation(tag_levels = "1", tag_prefix = "(", tag_suffix = ")")
```
The only thing we've changed in (2) is replacing `geom_point()` with `geom_jitter()` - this spreads the points out to reduce overplotting.
But what's really exciting is the difference between (3) and (4) in Figure \@ref(fig:chap04-fig-multigeoms). Spot it!
\index{plots@\textbf{plots}!jitter}
```{r, fig.keep = 'none'}
# (3)
gapdata2007 %>%
ggplot(aes(x = continent, y = lifeExp, colour = continent)) +
geom_boxplot() +
geom_jitter()
# (4)
gapdata2007 %>%
ggplot(aes(x = continent, y = lifeExp)) +
geom_boxplot() +
geom_jitter(aes(colour = continent))
```
This is new: `aes()` inside a geom, not just at the top!
In the code for (4) you can see `aes()` in two places - at the top and inside the `geom_jitter()`.
And `colour = continent` was only included in the second `aes()`.
This means that the jittered points get a colour, but the box plots will be drawn without (so just black).
This is exactly* what we see in Figure \@ref(fig:chap04-fig-multigeoms)(4).
*Nerd alert: the variation added by `geom_jitter()` is random, which means that when you recreate the same plots the points will appear in slightly different locations to ours. To make identical ones, add `position = position_jitter(seed = 1)` inside `geom_jitter()`.
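A short sketch of the reproducible version follows. The `seed = 1` is the value mentioned above; `width = 0.2` and `height = 0` are our own illustrative choices (setting `height = 0` means points are only spread sideways, so the life expectancy values themselves are not moved):

```r
library(tidyverse)
library(gapminder)

# Assumption: gapdata2007 is gapminder filtered to the year 2007
gapdata2007 <- gapminder %>%
  filter(year == 2007)

p_seeded <- gapdata2007 %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot() +
  # fixed seed: the points land in the same place every time the plot is built
  geom_jitter(aes(colour = continent),
              position = position_jitter(width = 0.2, height = 0, seed = 1))

# layer_data() returns the data as drawn for a layer (layer 2 is the jitter);
# building the plot twice should give identical x positions
identical(layer_data(p_seeded, 2)$x, layer_data(p_seeded, 2)$x)
```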
### Worked example - three geoms together
Let's combine three geoms by including text labels on top of the box plot + points from above.
We are creating a new tibble called `label_data` by filtering for the country with the highest life expectancy in each continent (`group_by(continent)`):
```{r}
label_data <- gapdata2007 %>%
group_by(continent) %>%
filter(lifeExp == max(lifeExp)) %>%
select(country, continent, lifeExp)
# since we filtered for lifeExp == max(lifeExp) within each continent,
# these are the countries with the highest life expectancy in each continent:
label_data
```
The first two geoms are from the previous example (`geom_boxplot()` and `geom_jitter()`).
Note that `ggplot()` plots them in the order they are in the code - so box plots at the bottom, jittered points on the top.
We are then adding `geom_label()` with its own data option (`data = label_data`) as well as a new aesthetic (`aes(label = country)`, Figure \@ref(fig:chap04-fig-labels)):
```{r include=FALSE}
# don't understand what keeps resetting it! patchwork?
theme_set(theme_bw())
```
```{r chap04-fig-labels, fig.width=5, fig.height=4, fig.cap = "Three geoms together on a single plot: `geom_boxplot()`, `geom_jitter()`, and `geom_label()`."}
gapdata2007 %>%
ggplot(aes(x = continent, y = lifeExp)) +
# First geom - boxplot
geom_boxplot() +
# Second geom - jitter with its own aes(colour = )
geom_jitter(aes(colour = continent)) +
# Third geom - label, with its own dataset (label_data) and aes(label = )
geom_label(data = label_data, aes(label = country))
```
A few suggested experiments to try with the 3-geom plot code above:
* remove `data = label_data, ` from `geom_label()` and you'll get all 142 labels (so it will plot a label for the whole `gapdata2007` dataset);
* change from `geom_label()` to `geom_text()` - it works similarly but doesn't have the border and background behind the country name;
* change `label = country` to `label = lifeExp`; this plots the maximum life expectancy value, rather than the country name.
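As a sketch, the second and third experiments combined might look like this. Rounding the label with `round(lifeExp, 1)` is our own addition to keep the labels short:

```r
library(tidyverse)
library(gapminder)

# Assumption: gapdata2007 is gapminder filtered to the year 2007
gapdata2007 <- gapminder %>%
  filter(year == 2007)

# Same label_data as above: one row per continent
label_data <- gapdata2007 %>%
  group_by(continent) %>%
  filter(lifeExp == max(lifeExp)) %>%
  select(country, continent, lifeExp)

p_text <- gapdata2007 %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot() +
  geom_jitter(aes(colour = continent)) +
  # geom_text() instead of geom_label(): no border or background,
  # and the label is now the (rounded) maximum value, not the country name
  geom_text(data = label_data, aes(label = round(lifeExp, 1)))
```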
## All other types of plots
In this chapter we have introduced some of the most common geoms, as well as explained how `ggplot()` works.
In fact, ggplot has 56 different geoms for you to use; see its documentation for a full list: https://ggplot2.tidyverse.org.
With the ability to combine multiple geoms together on the same plot, the possibilities really are endless.
Furthermore, the plotly Graphing Library (https://plot.ly/ggplot2/) can make some of your ggplots interactive, meaning you can use your mouse to hover over points or zoom and subset interactively.
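A minimal sketch of making a plot interactive, assuming the plotly package is installed (it is not part of the tidyverse). The scatter plot itself is just an illustration of ours; `ggplotly()` is the plotly function that converts a ggplot object:

```r
library(tidyverse)
library(gapminder)

# Assumption: gapdata2007 is gapminder filtered to the year 2007
gapdata2007 <- gapminder %>%
  filter(year == 2007)

# An ordinary static ggplot
p <- gapdata2007 %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, colour = continent)) +
  geom_point()

# Convert it to an interactive plot (hover, zoom, pan);
# called with plotly:: so the package does not need to be attached
p_interactive <- plotly::ggplotly(p)
```

Printing `p_interactive` in RStudio or in an HTML document shows the interactive version. Not every geom or theme setting converts perfectly, so it is worth checking the result against the static plot.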
The two most important things to understand about `ggplot()` are:
* Variables (columns in your dataset) need to be inside `aes()`;
* `aes()` can be both at the top - `data %>% ggplot(aes())` - as well as inside a geom (e.g., `geom_point(aes())`).
This distinction is useful when combining multiple geoms.
All your geoms will "know about" the top-level `aes()` variables, but including `aes()` variables inside a specific geom means it only applies to that one.
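Both points can be checked directly on a plot object. The sketch below is for illustration only (we rarely need to inspect plot objects like this in practice) and `"blue"` is an arbitrary colour:

```r
library(tidyverse)
library(gapminder)

# Assumption: gapdata2007 is gapminder filtered to the year 2007
gapdata2007 <- gapminder %>%
  filter(year == 2007)

# colour is a variable (continent), so it goes inside aes()
p_mapped <- gapdata2007 %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot() +
  geom_jitter(aes(colour = continent))

# colour is a fixed value, so it goes outside aes()
p_fixed <- gapdata2007 %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot() +
  geom_jitter(colour = "blue")

# Top-level aesthetics, known to all geoms
names(p_mapped$mapping)

# Aesthetics that only the second geom, geom_jitter(), knows about
names(p_mapped$layers[[2]]$mapping)
```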
## Solutions
Solution to Exercise \@ref(chap04-ex-lineplot):
```{r, fig.keep = 'none'}
library(tidyverse)
library(gapminder)
gapminder %>%
ggplot(aes(x = year,
y = lifeExp,
group = country,
colour = continent)) +
geom_line() +
facet_wrap(~continent) +
theme_bw() +
scale_colour_brewer(palette = "Paired")
```
Solution to Exercise \@ref(chap04-ex-barplot):
```{r, fig.keep = 'none'}
library(tidyverse)
library(gapminder)
gapminder %>%
filter(year == 2007) %>%
filter(continent == "Europe") %>%
ggplot(aes(x = fct_reorder(country, lifeExp), y = lifeExp)) +
geom_col(colour = "deepskyblue", fill = NA) +
coord_flip() +
theme_classic()
```
## Extra: Advanced examples
Here are two examples of how just a few lines of `ggplot()` code and the basic geoms introduced in this chapter can be used to make very different things.
Let your imagination fly free when using `ggplot()`!
Figure \@ref(fig:chap04-fig-adv1) shows how the life expectancies in European countries have increased by plotting a square (`geom_point(shape = 15)`) for each observation (year) in the dataset.
<!-- Explain `fun=max` below? -->
```{r chap04-fig-adv1, fig.width=7, fig.height=5, fig.cap = "Increase in European life expectancies over time. Using `fct_reorder()` to order the countries on the y-axis by life expectancy (rather than alphabetically which is the default)."}
gapdata %>%
filter(continent == "Europe") %>%
ggplot(aes(y = fct_reorder(country, lifeExp, .fun = max),
x = lifeExp,
colour = year)) +
geom_point(shape = 15, size = 2) +
scale_colour_distiller(palette = "Greens", direction = 1) +
theme_bw()
```
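The `.fun = max` argument tells `fct_reorder()` how to summarise the several `lifeExp` values each country has (one per year) into a single number to sort by. By default it uses the median. A small sketch with made-up numbers (the `toy` tibble is hypothetical, not part of gapminder) shows the difference:

```r
library(tidyverse)

# Hypothetical data: two observations per country
toy <- tibble(
  country = c("A", "A", "B", "B", "C", "C"),
  lifeExp = c(50, 90, 60, 70, 65, 66)
)

# Default: order by each country's median (A = 70, B = 65, C = 65.5)
levels(fct_reorder(toy$country, toy$lifeExp))
# → "B" "C" "A"

# .fun = max: order by each country's maximum (A = 90, B = 70, C = 66)
levels(fct_reorder(toy$country, toy$lifeExp, .fun = max))
# → "C" "B" "A"
```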
In Figure \@ref(fig:chap04-fig-adv2), we're using `group_by(continent)` followed by `mutate(country_number = seq_along(country))` to create a new column with numbers 1, 2, 3, etc., for countries within continents.
We are then using these as `y` coordinates for the text labels (`geom_text(aes(y = country_number...`).
```{r chap04-fig-adv2, fig.height=11, fig.width=10, fig.cap = "List of countries on each continent as in the gapminder dataset."}
gapdata2007 %>%
group_by(continent) %>%
mutate(country_number = seq_along(country)) %>%
ggplot(aes(x = continent)) +
geom_bar(aes(colour = continent), fill = NA, show.legend = FALSE) +
geom_text(aes(y = country_number, label = country), vjust = 1) +
geom_label(aes(label = continent), y = -1) +
theme_void()
```
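To see what `group_by()` followed by `mutate(... = seq_along(...))` does on its own, here is a sketch using a small made-up tibble (`toy` and its values are hypothetical):

```r
library(tidyverse)

# Hypothetical data: two continents with two and three countries
toy <- tibble(
  continent = c("Asia", "Asia", "Europe", "Europe", "Europe"),
  country   = c("Japan", "Nepal", "France", "Italy", "Spain")
)

toy_numbered <- toy %>%
  group_by(continent) %>%
  # numbering restarts at 1 within each continent
  mutate(country_number = seq_along(country)) %>%
  ungroup()

toy_numbered$country_number
# → 1 2 1 2 3
```

Inside `mutate()`, the dplyr function `row_number()` would give the same numbering.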