---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
message = FALSE,
warning = FALSE,
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
options(digits=3)
```
<!-- badges: start -->
[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-green.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)
[![License](https://img.shields.io/badge/license-GPL%20%28%3E=%202%29-brightgreen.svg?style=flat)](https://www.gnu.org/licenses/gpl-2.0.html)
[![CRAN](https://www.r-pkg.org/badges/version/VisCollin)](https://cran.r-project.org/package=VisCollin)
[![Last Commit](https://img.shields.io/github/last-commit/friendly/VisCollin)](https://github.com/friendly/VisCollin)
[![Downloads](https://cranlogs.r-pkg.org/badges/VisCollin?color=brightgreen)](https://www.r-pkg.org:443/pkg/VisCollin)
<!-- badges: end -->
# VisCollin <img src="man/figures/logo.png" style="float:right; height:200px;" />
**Visualizing Collinearity Diagnostics**
Version `r packageDescription("VisCollin")$Version`
The `VisCollin` package provides
methods to calculate diagnostics for multicollinearity among predictors in a linear or
generalized linear model. It also provides methods to visualize those diagnostics following Friendly & Kwan (2009),
"Where’s Waldo: Visualizing Collinearity Diagnostics", _The American Statistician_, **63**, 56–65.
These include:
* better **tabular presentation** of collinearity diagnostics that highlights the important numbers,
* a semi-graphic **tableplot** of the diagnostics that makes warning and danger levels more salient, and
* a **collinearity biplot** of the _smallest dimensions_ of predictor space, where collinearity is most apparent.
## Installation
+-------------------+---------------------------------------------------+
| CRAN version | `install.packages("VisCollin")` |
+-------------------+---------------------------------------------------+
| Development | `remotes::install_github("friendly/VisCollin")` |
| version | |
+-------------------+---------------------------------------------------+
## Tutorial example
```{r load}
library(VisCollin)
library(dplyr)
library(tidyr)
library(car)
library(corrplot)
```
This example uses the `cars` data set containing various measures of size and performance on 406 models of automobiles from 1982. Interest is focused on predicting gas mileage, `mpg`.
```{r cars}
data(cars)
str(cars)
```
### Fit a model
Fit a model predicting gas mileage (`mpg`) from the number of cylinders, engine displacement, horsepower, weight,
time to
accelerate from 0 -- 60 mph and model year (1970--1982). Perhaps surprisingly, only `weight` and `year` appear to
significantly predict gas mileage. What's going on here?
```{r cars-mod}
cars.mod <- lm (mpg ~ cylinder + engine + horse + weight + accel + year,
data=cars)
Anova(cars.mod)
```
`lmtest::coeftest()` shows the coefficients, $\hat{\beta_j}$, their standard errors $s(\hat{\beta_j})$
and associated $t$ statistics, $t = \hat{\beta_j} / s(\hat{\beta_j})$. As we will see,
the standard errors of the non-significant predictors have been inflated due to high multiple correlations
among the predictors, making the $t$ statistics smaller.
```{r coeftest}
lmtest::coeftest(cars.mod)
```
### Correlation matrix
It is often recommended to examine the correlation matrix of the predictors to diagnose collinearity problems.
In the general case, this advice is misguided, because it is not the 0-order correlations that matter, but rather the
**multiple correlations** predicting each independent variable from the others, $R_{x_j | \text{others}}$.
Nonetheless, it is instructive to examine the correlations.
```{r cars-R}
R <- cars |>
select(cylinder:year) |>
tidyr::drop_na() |>
cor()
100 * R |> round(digits = 2)
```
Or, better yet, use `corrplot::corrplot.mixed()` to visualize them, using color and shading of glyphs,
```{r cars-corrgram}
#| fig.width = 6,
#| fig.height = 6,
#| out.width = "60%"
corrplot.mixed(R, lower = "square", upper = "ellipse", tl.col = "black")
```
The message here seems to be that there are two clusters of predictors with
high correlations: {`cylinder`, `engine`, `horse` and `weight`},
and {`accel`, `year`}.
### Variance inflation factors
Variance inflation factors measure the effect of multicollinearity on the standard errors of the
estimated coefficients and are proportional to $1 / (1 - R^2_{x_j | \text{others}})$.
We check the variance inflation factors, using `car::vif()`. We see that most predictors have very high
VIFs, indicating moderately severe multicollinearity.
```{r cars-vif}
vif(cars.mod)
sqrt(vif(cars.mod))
```
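Since $\text{VIF}_j = 1 / (1 - R^2_{x_j \,|\, \text{others}})$, the squared multiple correlations mentioned earlier can be recovered directly from the VIFs. A minimal sketch (an illustrative addition, not part of the original tutorial):
```{r cars-vif-rsq}
# Squared multiple correlation of each predictor with the others,
# recovered from the VIFs:  R^2_{x_j | others} = 1 - 1 / VIF_j
vifs <- vif(cars.mod)
round(1 - 1 / vifs, digits = 3)
```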
According to $\sqrt{\text{VIF}}$, the standard error of `cylinder` has been
multiplied by 3.26 and its $t$-value divided by this number,
compared with the case when all predictors are
uncorrelated. `engine`, `horse` and `weight` suffer a similar fate.
### Collinearity diagnostics
The diagnostic measures introduced by Belsley (1991) are based on the eigenvalues $\lambda_1, \lambda_2, \dots \lambda_p$
of the correlation matrix $R_{X}$ of the predictors (preferably centered and scaled, and not including the constant term
for the intercept), and the corresponding eigenvectors in the columns of $\mathbf{V}_{p \times p}$.
`colldiag()` calculates:
* **Condition indices**:
The smallest of the eigenvalues, those for which $\lambda_j \approx 0$,
indicate collinearity and the number of small values indicates the number of near collinear relations.
Because the sum of the eigenvalues, $\Sigma \lambda_i = p$, increases with the number
of predictors $p$, it is useful to scale them
all in relation to the largest. This leads to _condition indices_, defined as
$\kappa_j = \sqrt{ \lambda_1 / \lambda_j}$. These have the property that the resulting numbers
have common interpretations regardless of the number of predictors.
+ For completely uncorrelated predictors, all $\kappa_j = 1$.
+ $\kappa_j \rightarrow \infty$ as any $\lambda_k \rightarrow 0$.
+ In terms of the eigen-decomposition, variance inflation factors can be expressed as
$$
\text{VIF}_j = \sum_{k=1}^{p} \frac{V^2_{jk}}{\lambda_k} \; .
$$
* **Variance decomposition proportions**:
Large VIFs indicate variables that are involved in _some_ nearly collinear
relations, but they don't indicate _which_ other variable(s) each is involved with.
For this purpose, Belsley et al. (1980) and Belsley (1991) proposed calculation of
the proportions of variance of each variable associated with each principal component
as a decomposition of the coefficient variance for each dimension.
For the current model, the usual display contains both the condition indices and
variance proportions. However, even for a small example, it is often difficult to know
what numbers to pay attention to.
```{r colldiag1}
(cd <- colldiag(cars.mod, center=TRUE))
```
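These condition indices and variance proportions can be checked against the eigen-decomposition formulas above. The sketch below (an illustrative addition) reuses the correlation matrix `R` computed earlier; small discrepancies with `colldiag()` and `vif()` are possible because the model also drops cases with missing `mpg`.
```{r colldiag-eigen}
# Reproduce the diagnostics from the eigen-decomposition of the
# predictor correlation matrix R computed earlier
eig    <- eigen(R)
lambda <- eig$values      # eigenvalues, in decreasing order
V      <- eig$vectors     # eigenvectors in columns

# Condition indices: kappa_j = sqrt(lambda_1 / lambda_j)
round(sqrt(lambda[1] / lambda), digits = 3)

# VIF_j = sum_k V_jk^2 / lambda_k  (the diagonal of R^{-1})
setNames(round(rowSums(V^2 %*% diag(1 / lambda)), digits = 3), colnames(R))
```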
Belsley (1991) recommends that the sources of collinearity be diagnosed
(a) only for those components with large $\kappa_j$, and
(b) for those components for which the variance proportion is large (say, $\ge 0.5$) on _two_ or more predictors.
The print method for `"colldiag"` objects has a `fuzz` argument controlling this.
```{r colldiag2}
print(cd, fuzz = 0.5)
```
The mystery is solved: There are two nearly collinear relations among the predictors, corresponding to the two
smallest dimensions.
* Dimension 5 reflects the high correlation between horsepower and weight,
* Dimension 6 reflects the high correlation between number of cylinders and engine displacement.
Note that the high variance proportion for `year` (0.787) on the second component creates no problem and
should be ignored because (a) the condition index is low and (b) it shares nothing with other predictors.
### Tableplot
The simplified tabular display above can be improved to make the patterns of collinearity more
visually apparent and to signify warnings directly to the eyes.
A "tableplot" (Kwan et-al., 2009) is a semi-graphic display that presents numerical information in a table
using shapes proportional to the value in a cell and other visual attributes (shape type, color fill, and so forth)
to encode other information.
For collinearity diagnostics, these show:
* the condition indices,
using _squares_ whose background color is red for condition indices > 10,
a warning color for values > 5, and green otherwise, reflecting danger, warning and OK respectively.
The value of the condition index is encoded within this using a white square whose side is proportional to the value
(up to some maximum value, `cond.max`).
* Variance decomposition proportions are shown by filled _circles_ whose radius is proportional to those values
and are filled (by default) with shades ranging from white through pink to red. Rounded values of those diagnostics
are printed in the cells.
The tableplot below encodes all the information from the values of `colldiag()` printed above
(but using `prop.col` color breaks such that variance proportions < 0.3 are shaded white).
The visual message is that one should attend to collinearities with large condition indices **and**
large variance proportions implicating two or more predictors.
<!-- ```{r cars-tableplot0} -->
<!-- knitr::include_graphics("man/figures/cars-tableplot.png") -->
<!-- ``` -->
```{r cars-tableplot}
tableplot(cd, title = "Tableplot of cars data", cond.max = 30 )
```
### Collinearity biplot
The standard biplot (Gabriel, 1971; Gower & Hand, 1996) can be regarded as
a multivariate analog of a scatterplot,
obtained by projecting a
multivariate sample into a low-dimensional space
(typically of 2 or 3 dimensions)
accounting for the greatest variance in the data.
With the symmetric (PCA) scaling used here, this is
equivalent to a plot of principal component scores of the mean-centered
matrix $\widetilde{\mathbf{X}} = \mathbf{X} - \bar{\mathbf{X}}$ of predictors
for the observations (shown as points or case labels),
together with principal component coefficients
for the variables (shown as vectors) in the same 2D (or 3D) space.
However, the standard biplot is less useful for visualizing the relations among the
predictors that lead to nearly collinear relations. Instead, biplots of
the **smallest dimensions** show these relations directly, and can show other
features of the data as well, such as outliers and leverage points.
We use `prcomp(X, scale.=TRUE)` to obtain the PCA of the correlation matrix
of the predictors:
```{r cars-pca}
cars.X <- cars |>
select(where(is.numeric)) |>
select(-mpg) |>
tidyr::drop_na()
cars.pca <- prcomp(cars.X, scale. = TRUE)
cars.pca
```
The standard deviations above are the square roots $\sqrt{\lambda_j}$ of the eigenvalues of the
correlation matrix, and are returned in the `sdev` component of the `"prcomp"` object.
The eigenvectors are returned in the `rotation` component, whose directions are arbitrary.
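As a quick check (an illustrative addition), the squared `sdev` values equal those eigenvalues and sum to $p = 6$, the number of predictors:
```{r cars-pca-check}
# sdev^2 are the eigenvalues of the correlation matrix of the predictors;
# for p standardized predictors they sum to p
cars.pca$sdev^2
sum(cars.pca$sdev^2)
```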
```{r cars-biplot-prep}
# Make labels for dimensions include % of variance
pct <- 100 *(cars.pca$sdev^2) / sum(cars.pca$sdev^2)
lab <- glue::glue("Dimension {1:6} ({round(pct, 2)}%)")
# Direction of eigenvectors is arbitrary. Reflect them
cars.pca$rotation <- -cars.pca$rotation
```
The collinearity biplot is then constructed as follows:
```{r cars-biplot}
#| fig.show = "hold",
#| out.width = "100%"
op <- par(lwd = 2, xpd = NA )
biplot(cars.pca,
choices=6:5, # only the last two dimensions
scale=0.5, # symmetric biplot scaling
cex=c(0.6, 1), # character sizes for points and vectors
col = c("black", "blue"),
expand = 1.7, # expand variable vectors for visibility
xlab = lab[6],
ylab = lab[5],
xlim = c(-0.7, 0.5),
ylim = c(-0.8, 0.5)
)
par(op)
```
The projections of the variable vectors
on the Dimension 5 and Dimension 6 axes are proportional to
their variance proportions shown above.
The relative lengths of these variable vectors can be considered
to indicate the extent to which each variable contributes to collinearity
for these two near-singular dimensions.
Thus, we see again that Dimension 6
is largely determined by `engine` size, with a substantial (negative) relation
to `cylinder`. Dimension 5 has its strongest relations to `weight`
and `horse`.
Moreover, there is one observation, #20, that stands out as
an outlier in predictor space, far from the centroid.
It turns out that this vehicle, a Buick Estate wagon, is an early-year (1970) American behemoth,
with an 8-cylinder, 455 cu. in, 225 horse-power engine, and able to go from 0 to 60 mph
in 10 sec.
(Its MPG is only slightly under-predicted from the regression model, however.)
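One way to check for such multivariate outliers numerically (an illustrative sketch, not part of the original analysis) is to compute Mahalanobis distances of the predictor values from their centroid; the Buick Estate wagon should rank among the most extreme cases:
```{r cars-mahal}
# Squared Mahalanobis distance of each case from the centroid of predictor space
D2 <- mahalanobis(cars.X, center = colMeans(cars.X), cov = cov(cars.X))
which.max(D2)                      # index of the most extreme case
head(sort(D2, decreasing = TRUE))  # the largest distances
```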
### Remedies for collinearity: What to do?
Collinearity is often a **data** problem, for which there is no magic cure. Nevertheless, there are some
general guidelines and useful techniques to address this problem.
* **Pure prediction**: If we are only interested in predicting / explaining an outcome,
and not in the model coefficients or in which predictors are "significant", collinearity can be largely ignored.
The fitted values are unaffected by collinearity.
* **Structural collinearity**: Sometimes collinearity results from structural relations among the variables:
+ For example, polynomial terms, like $x, x^2, x^3$, or interaction terms, like $x_1, x_2, x_1 * x_2$,
are necessarily correlated. A simple cure is to _center_ the predictors at their means, using
$x - \bar{x}, (x - \bar{x})^2, (x - \bar{x})^3$ or
$(x_1 - \bar{x}_1), (x_2 - \bar{x}_2), (x_1 - \bar{x}_1) * (x_2 - \bar{x}_2)$ (a small demonstration follows this list).
+ When some predictors share a common cause, as in GNP or population in time-series or cross-national data,
you can reduce collinearity by re-defining predictors to reflect _per capita measures_.
* **Model re-specification**:
+ Drop one or more regressors that have a high VIF if they are not deemed to be essential.
+ Replace highly correlated regressors with linear combination(s) of them. For example, two related variables,
$x_1$ and $x_2$, can be replaced without loss of information by their sum and
difference, $z_1 = x_1 + x_2$ and $z_2 = x_1 - x_2$.
* **Statistical remedies**:
+ Transform the predictors to uncorrelated principal components.
+ Use **regularization methods** such as ridge regression and the lasso, which counter collinearity by
shrinking the coefficients towards 0, introducing
a small amount of bias. See the [genridge](https://CRAN.R-project.org/package=genridge) package and its [`pkgdown` documentation](https://friendly.github.io/genridge/) for visualization methods.
+ Use Bayesian regression: if multicollinearity prevents a regression coefficient from being estimated precisely, then a prior on that coefficient will help to reduce its posterior variance.
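To illustrate the centering remedy for structural collinearity (an illustrative sketch using `weight` from the `cars` data, not part of the original tutorial), compare the correlation between a predictor and its square before and after centering:
```{r centering-demo}
# Correlation between linear and quadratic terms, raw vs. centered
w  <- na.omit(cars$weight)
cor(w, w^2)          # raw terms: nearly collinear
wc <- w - mean(w)
cor(wc, wc^2)        # centered terms: correlation is substantially reduced
```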
## References
Belsley, D.A., Kuh, E. and Welsch, R. (1980).
_Regression Diagnostics_, New York: John Wiley & Sons.
Belsley, D.A. (1991).
_Conditioning Diagnostics: Collinearity and Weak Data in Regression_.
New York: John Wiley & Sons.
Friendly, M., & Kwan, E. (2009).
"Where’s Waldo: Visualizing Collinearity Diagnostics." _The American Statistician_, **63**, 56–65.
Online: [https://www.datavis.ca/papers/viscollin-tast.pdf](https://www.datavis.ca/papers/viscollin-tast.pdf).
Supp. materials: [https://www.datavis.ca/papers/viscollin/](https://www.datavis.ca/papers/viscollin/)
Gabriel, K. R. (1971). The Biplot Graphic Display of Matrices with Application to Principal Components Analysis. _Biometrika_, **58**, 453–467.
Gower, J. C., & Hand, D. J. (1996). _Biplots_. London: Chapman & Hall.
Kwan, E., Lu, I. R. R., & Friendly, M. (2009). Tableplot: A new tool for assessing precise predictions. _Zeitschrift Für Psychologie / Journal of Psychology_, **217**, 38–48.