forked from jtr13/cc20
-
Notifications
You must be signed in to change notification settings - Fork 0
/
pairs_and_ggpairs.Rmd
199 lines (160 loc) · 7.45 KB
/
pairs_and_ggpairs.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
# Customized Plot Matrix: pairs and ggpairs
Yibai Liu
```{r}
library(GGally)
library(ggplot2)
library(vcd)
```
## Overview: Things we can do with pairs() and ggpairs()
When our data contains multivariate variables, it is important to evaluate associations between these variables before modeling. We can create scatterplot matrices, correlation matrix, as well as mosaic pairs plots to get an idea about if and how these variables are correlated with each other. In this tutorial, I would plot using a base r function **pairs()** and a function **ggpairs()** from the `GGally` package, which both functions provide methods to generate customized plot matrices.
Plots for different purposes:
- Scatterplot matrix: correlations between continuous variables
- Mosaic pairs plot: correlations between categorical variables
## Scatterplot matrix for continuous variables
### Plot with pairs()
#### Basic scatterplot matrix of the `mtcars dataset (all numeric variables)
``` {r}
data(mtcars)
pairs(~., data = mtcars,
main = "Scatterplot Matrix of `mtcars`")
```
We notice that there are some numeric variables actually discrete or representing categories, so we can trim all discrete and categorical variables, and only plot continuous variables in the matrix.
#### Continuous variables only
```{r}
pairs(mtcars[, c(1,3:7)], main = "Scatterplot Matrix of `mtcars`")
```
#### Change color, shape, size of points, as well as labels and gaps of the plot
```{r }
pairs(mtcars[, c(1,3:7)],
col = "blue", # Change color
pch = 19, # Change shape of points
cex = 0.8, # Change size of points
labels = c("Miles","Displacement","Horsepower",
"Rear axle ratio","Weight","1/4 mile time"), # Change labels
gap = 0.3, # Change gaps in between
main = "Scatterplot Matrix of `mtcars`")
```
#### Add a smoother to each scatterplot
```{r}
pairs(mtcars[, c(1,3:7)],
lower.panel = panel.smooth, # Add a smoother for the lower panel
col = "blue",
pch = 19,
cex = 0.8,
labels = c("Miles","Displacement","Horsepower",
"Rear axle ratio","Weight","1/4 mile time"),
gap = 0.3,
main = "Scatterplot Matrix of `mtcars`")
```
#### Separate groups using different colors
**Tip**: You can also highlight a certain level of a categorical variable by simply turn other levels to grey.
```{r}
mtcars$vs <- as.factor(mtcars$vs)
pairs(mtcars[, c(1,3:7)],
col = c("blue","red")[mtcars$vs], # Group by variable `vs`
pch = 19,
cex = 0.8,
labels = c("Miles","Displacement","Horsepower",
"Rear axle ratio","Weight","1/4 mile time"),
gap = 0.3,
main = "Scatterplot Matrix of `mtcars` Grouped by Engine")
```
By separating data points by `vs` or the engine type, we can see that two groups form distinct clusters for many of the variables.
#### Choose panel display
If the plot seems dominated by too many points, you can turn off one of the panels.
```{r}
pairs(mtcars[, c(1,3:7)],
col = c("blue","red")[mtcars$vs],
pch = 19,
cex = 0.8,
labels = c("Miles","Displacement","Horsepower",
"Rear axle ratio","Weight","1/4 mile time"),
gap = 0.3,
upper.panel = NULL, # Turn off the upper panel above the diagonal
main = "Scatterplot Matrix of `mtcars`")
```
#### Customize your own plot matrix
The plot matrix is consisted of multiple panels, e.g. the upper panel, lower panel, diagonal panel, etc.
You can customize each panel and make your own plot.
```{r}
#Panel of correlations
panel.corr <- function(x, y){
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- round(cor(x, y), digits=3)
txt <- paste0("Corr: ", r)
text(0.5, 0.5, txt, cex = 1)
}
#Panel of histograms
panel.hist <- function(x, ...){
usr <- par("usr"); on.exit(par(usr))
par(usr = c(usr[1:2], 0, 1.5) )
h <- hist(x, plot = FALSE)
breaks <- h$breaks
len <- length(breaks)
y <- h$counts/max(h$counts)
rect(breaks[-len], 0, breaks[-1], y, col = "lightblue")
}
#Panel of scatterplots
panel.scat <- function(x, y){
points(x,y, pch = 19, cex = 1, col = "coral")
}
#Plot
pairs(mtcars[, c(1,3:7)],
lower.panel = panel.scat,
upper.panel = panel.corr,
diag.panel = panel.hist,
labels = c("Miles","Displacement","Horsepower",
"Rear axle ratio","Weight","1/4 mile time"),
gap = 0.3,
main = "Scatterplot matrix of `mtcars`")
```
### Plot with ggpairs() from `GGally` package
#### Basic ggpairs() plot
```{r}
# You need both ggplot2 and GGally packages loaded to use ggpairs()
ggpairs(mtcars[, c(1,3:7)],
columnLabels = c("Miles","Displacement","Horsepower",
"Rear axle ratio","Weight","1/4 mile time"),
upper = list(continuous = wrap('cor', size = 4)),
title = "Scatterplot matrix of `mtcars`")
```
#### Separate groups using different colors
```{r}
ggpairs(mtcars[, c(1,3:7)],
columnLabels = c("Miles","Displacement","Horsepower",
"Rear axle ratio","Weight","1/4 mile time"),
aes(color = mtcars$vs), # Separate data by levels of vs
upper = list(continuous = wrap('cor', size = 3)),
lower = list(combo = wrap("facethist", bins = 30)),
diag = list(continuous = wrap("densityDiag", alpha = 0.5)),
title = "Scatterplot matrix of `mtcars` Grouped by Engine")
```
## Categorical variables
There are some categorical variables in the dataset `mtcars`. We can turn these variables and also discrete variables into factors
### Factorize discrete/categorical variables
```{r}
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$am <- as.factor(mtcars$am)
mtcars$gear <- as.factor(mtcars$gear)
```
Now we can create a mosaic pairs plot with pairs_diagonal_mosaic() in the `vcd` package
### pairs() plot for categorical variables
```{r}
# You need to load the `vcd` package
p <- mtcars[, c(2,8:10)]
pairs(table(p), # Here the data needs to be in a table format
diag_panel = pairs_diagonal_mosaic(offset_varnames=-2.5)) #move down variable labels
```
### Highlight correlations with shade colors
```{r}
pairs(table(p),
diag_panel = pairs_diagonal_mosaic(offset_varnames=-2.5), #move down variable names
upper_panel_args = list(shade = TRUE), #set shade colors
lower_panel_args = list(shade = TRUE))
```
## Outside sources
You can check out the following links to find more interesting ways to customize your plot matrix.
[R pairs & ggpairs Plot Functions] https://statisticsglobe.com/r-pairs-plot-example/#:~:text=The%20pairs%20R%20function%20returns,pairs%20command%20is%20shown%20above.
[ggpairs() r documentation] https://www.rdocumentation.org/packages/GGally/versions/1.5.0/topics/ggpairs
[Pairs plot for contingency tables] http://finzi.psych.upenn.edu/R/library/vcd/html/pairs.table.html