---
output: html_document
editor_options:
chunk_output_type: console
---
# Fits and residuals
```{r echo=FALSE}
source("libs/Common.R")
```
```{r echo = FALSE}
pkg_ver(c("ggplot2"))
```
Bivariate data are datasets in which two variables are measured on each observation (e.g. wind speed *and* temperature at each location). This differs from univariate data, where only one variable is measured per observation (e.g. temperature at each location). This chapter explores common fitting strategies applied to bivariate data.
## Scatter plot
A scatter plot is a popular visualization tool used to compare values between two variables. Sometimes one variable is deemed *dependent* on another variable, the latter being the *independent* variable. The dependent variable is also referred to as the *response* variable and the independent variable as the *factor* (not to be confused with the *factor* data type used in R as a grouping variable). The dependent variable is usually plotted on the y-axis and the independent variable on the x-axis. Other times, one does not seek a dependent-independent relationship between variables but is simply interested in studying the relationship between the two variables.
A scatter plot can be generated using the base plotting environment as follows. Here, we'll plot Cleveland's Ganglion dataset.[^1]
```{r fig.width=2.5, fig.height=2.5, small.mar=TRUE}
df <- read.csv("http://mgimond.github.io/ES218/Data/ganglion.csv")
plot(cp.ratio ~ area, dat = df)
```
Or, using `ggplot2`, as follows:
```{r fig.width=2.5, fig.height=2.5}
library(ggplot2)
ggplot(df, aes(x = area, y = cp.ratio)) + geom_point()
```
The data represent the ratio between the ganglion cell density of a cat's central retina to that of its peripheral density (variable `cp.ratio`), and the cat's retina surface area (`area`) during its early development (ranging from 35 to 62 days of gestation).
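Before fitting any model, it can help to confirm how the table is structured. A quick check of the data frame:
```{r}
# Inspect the structure of the dataset: number of observations and column types
str(df)
# Summary statistics for area and cp.ratio
summary(df)
```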
## Fitting the data
Scatter plots are a good first step in visualizing bivariate data, but they are sometimes not enough. Our eyes need "guidance" to help perceive patterns. Another visual aid involves **fitting** the data with a line. We will explore two fitting strategies: the *parametric* fit and the *non-parametric* fit.
### Parametric fit
A parametric fit is one where we impose a structure on the data--an example of which is a *straight* line (also referred to as a 1^st^ order polynomial fit). The parametric fit is by far the most popular fitting strategy used in the realm of data analysis and statistics.
#### Fitting a straight line
A straight line is the simplest fit one can make to bivariate data. A popular method for fitting a straight line is the least-squares method. We'll use R's `lm()` function, which provides us with the slope and intercept of the best-fit line.
This can be implemented in the base plotting environment as follows:
```{r fig.height=2.5, fig.width=2.5, small.mar=TRUE}
M <- lm(cp.ratio ~ area, dat = df)
plot(cp.ratio ~ area, dat = df)
abline(M, col = "red")
```
In the `ggplot2` plotting environment, we can make use of the `stat_smooth` function to generate the regression line.
```{r fig.width=2.5, fig.height=2.5}
library(ggplot2)
ggplot(df, aes(x = area, y = cp.ratio)) + geom_point() +
stat_smooth(method ="lm", se = FALSE)
```
The `se = FALSE` option prevents R from drawing a confidence envelope around the regression line. Confidence envelopes are used in inferential statistics (a topic not covered in this course).
The straight line is a **first order polynomial** with two parameters, $a$ and $b$, that define an equation that best describes the relationship between the two variables:
$$
y = a + bx
$$
where $a$ and $b$ can be extracted from the regression model object `M` as follows:
```{r}
coef(M)
```
Thus $a$ = 0.014 and $b$ = 0.11.
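As a quick check, the fitted equation can be used to compute a predicted CP ratio for any `area` value. In the following sketch, `area = 100` is an arbitrary value chosen purely for illustration:
```{r}
# Predicted cp.ratio at an arbitrary area value, computed two ways
a <- coef(M)[1]  # intercept
b <- coef(M)[2]  # slope
a + b * 100                         # by hand, from y = a + bx
predict(M, data.frame(area = 100))  # using R's predict() method
```
Both approaches return the same value, which confirms how the extracted coefficients are to be interpreted.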
#### Fitting a 2^nd^ order polynomial
A second order polynomial is a three-parameter function ($a$, $b$ and $c$) whose equation $y = a + bx + cx^2$ defines a **curve** that best fits the data. We define such a relationship in R using the formula `cp.ratio ~ area + I(area^2)`. The *identity* function `I()` preserves the arithmetic interpretation of `area^2` within the model formula (without it, `^` would be treated as a formula operator). Our new `lm` expression and resulting coefficients follow:
```{r}
M2 <- lm(cp.ratio ~ area + I(area^2) , dat = df)
coef(M2)
```
The quadratic fit is thus,
$$
y = 2.87 - 0.012 x + 0.000839 x^2
$$
When using the base plotting environment, we cannot use `abline` to plot the predicted 2^nd^ order polynomial curve since `abline` only draws straight lines. We need to construct the curve manually using the `predict` and `lines` functions.
```{r fig.height=2.5, fig.width=2.5, small.mar=TRUE}
plot(cp.ratio ~ area, dat=df)
x.pred <- data.frame( area = seq(min(df$area), max(df$area), length.out = 50) )
y.pred <- predict(M2, x.pred)
lines(x.pred$area, y.pred, col = "red")
```
To generate the same plot in `ggplot2`, simply pass the formula as an argument to `stat_smooth`:
```{r fig.width=2.5, fig.height=2.5}
ggplot(df, aes(x = area, y = cp.ratio)) + geom_point() +
stat_smooth(method = "lm", se = FALSE, formula = y ~ x + I(x^2) )
```
### Non-parametric fits
A non-parametric fit belongs to the family of fitting strategies that *do not* impose a structure on the data. Instead, these strategies let the dataset reveal its inherent structure. The one explored in this course is the *loess* fit.
#### Loess
A flexible curve fitting option is the **loess** curve (short for **lo**cal regr**ess**ion; also known as *local weighted regression*). Unlike the parametric approach to fitting a curve, the loess does **not** impose a structure on the data. The loess curve fits short regression line segments across the range of x-values, then links the mid-points of these regression lines to generate the *smooth* curve. The range of x-values that contributes to each localized regression line is defined by the **span** parameter, $\alpha$, which usually ranges from 0.2 to 1 (but can be greater than 1 for smaller datasets). The larger the $\alpha$ value, the smoother the curve. The other parameter that defines a loess curve is $\lambda$: it defines the **polynomial order** of the localized regression line. This is usually set to 1 (though `ggplot2`'s implementation of the loess defaults to a 2^nd^ order polynomial).
```{r echo=FALSE}
library(dplyr)
library(purrr)
alpha <- 0.5
strip.x <- nrow(df) * alpha # Number of points within band
f.plot <- function(start , line = FALSE, point = FALSE, bnds =FALSE,
w = FALSE, plot = TRUE, pts = FALSE, title = NULL){
# Find points closest to starting point
subset <- df %>%
mutate(dst = abs(area - start)) %>%
arrange(dst) %>%
mutate(j = row_number()) %>%
filter(j <= strip.x ) %>%
mutate(wt = dst / max(dst) * 3)
# Assign weights
wts <- dnorm(subset$wt)/ 0.3989423
# Regress with weights
M <- lm(cp.ratio ~ area, subset, weights = wts)
x.l <- coef(M)[1] + coef(M)[2] * start
# Plot by option
if(plot == TRUE){
plot(cp.ratio ~ area, df, yaxt='n', main = title,
axes = FALSE, pch=16, col = "grey90", cex = 1.6)
axis(side=1, at=c(seq(10,150,30)))
abline(v = start, lty = 2)
if(bnds == TRUE){
abline(v = c(min(subset$area),max(subset$area)), lty = 3, col = "grey")
rect(min(subset$area), 0, max(subset$area), 20, col = rgb(0,0,1,0.1),
border = rgb(0,0,1,0.1))
}
if(w == TRUE){
points(x = subset$area, y = subset$cp.ratio, col = rgb(0,0,1, wts),
pch = 16, cex=1.6)
}
if(line == TRUE){
clip(min(subset$area),max(subset$area),
min(df$cp.ratio),max(df$cp.ratio))
abline(M, col = "orange", lwd = 1.8)
}
if(point == TRUE){
points(x = start, y = x.l, pch = 16, col = "red", cex=1.8)
}
}
if(pts == TRUE){
return(c(start, x.l))
}
}
```
#### How a loess is constructed
Behind the scenes, each point (x~i~,y~i~) that defines the loess curve is constructed as follows:
a) A subset of data points closest to point x~i~ is identified (x~i~ is shown as the vertical dashed line in the figures below). The number of points in the subset is computed by multiplying the span $\alpha$ by the total number of observations. In our current example, $\alpha$ is set to `r alpha`. The number of points defining the subset is thus 0.5 * 14 = 7. These points are identified in the light blue area of the plot in panel (a) of the figure below.
b) The points in the subset are assigned weights: points closer to x~i~ receive greater weight than points farther away. The weights define each point's influence on the fitted line. Different weighting techniques can be implemented in a loess, with the `gaussian` weight being the most common. Another weighting strategy that we will explore later in this course is the `symmetric` weight.
c) A regression line is fit to the subset of points. Points with smaller weights will have less leverage on the fitted line than points with larger weights. The fitted line can be either a first order polynomial fit or a second order polynomial fit.
d) Next, the value y~i~ from the regression line is computed. This is shown as the red dot in panel (d). This is one of the points that will define the shape of the loess.
```{r fig.height = 2.7, fig.width = 10, echo = FALSE}
OP <- par(mfrow = c(1, 4), mar=c(2,1,1,0), pty = "s")
f.plot(start = 50, bnds = TRUE, title = "(a) Subset points")
f.plot(start = 50, bnds = TRUE, w = TRUE, title = "(b) Assign weights")
f.plot(start = 50, bnds = TRUE, w = TRUE, line = TRUE, title = "(c) Fit line")
f.plot(start = 50, bnds = TRUE, w = TRUE, line = TRUE, point = TRUE,
title = "(d) Draw point")
par(OP)
```
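To make these four steps concrete, here is a rough sketch that computes a single loess point by hand at `area = 50`. The Gaussian-style weights below are a simplification chosen for illustration (R's `loess()` function actually uses a tricube weight function), so the result should land close to, but is not guaranteed to match, the red point in panel (d).
```{r}
# Hand-computed loess point at x0 = 50 (illustrative only; loess() uses
# tricube weights rather than the Gaussian-style weights assumed here)
x0  <- 50
n   <- ceiling(0.5 * nrow(df))       # span * number of observations = 7 points
dst <- abs(df$area - x0)             # distance of each observation from x0
sub <- df[order(dst), ][1:n, ]       # (a) the n closest points
wts <- dnorm(abs(sub$area - x0) / max(abs(sub$area - x0)) * 3)  # (b) weights decay with distance
fit <- lm(cp.ratio ~ area, data = sub, weights = wts)           # (c) weighted regression
predict(fit, data.frame(area = x0))  # (d) the loess point's y-value at x0
```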
The above steps are repeated for as many x~i~ values as is practically possible. Note that when x~i~ approaches an upper or lower x limit, the subset of points becomes skewed to one side of x~i~. For example, when estimating the loess value at x~10~, the seven closest points lie to the right of x~10~. Likewise, at the upper bound x~140~, the seven closest points lie to the left of x~140~.
```{r fig.height = 2.5, fig.width = 10, echo = FALSE}
OP <- par(mfrow = c(1, 4), mar=c(2,1,1,0), pty = "s")
f.plot(start = 10, bnds = TRUE, w = TRUE, line = TRUE, point = TRUE,
title = expression(Left-most ~~ x[i] ))
f.plot(start = 140, bnds = TRUE, w = TRUE, line = TRUE, point = TRUE,
title = expression(Right-most ~~ x[i]) )
par(OP)
```
In the following example, just under 30 loess points are computed at equal intervals. This defines the shape of the loess.
```{r fig.height = 2.3, fig.width = 10, echo = FALSE}
OP <- par(mfrow = c(1, 4), mar=c(2,1,1,0), pty = "s")
l.pts <- seq(10,140,(110-20)/20) %>% map(function(x) f.plot(start = x, plot = FALSE, pts = TRUE)) %>%
do.call(rbind, .) %>% as.data.frame()
plot(cp.ratio ~ area, df, yaxt='n', main = NULL,
axes = FALSE, pch=16, col = "grey90", cex = 1.6)
axis(side=1, at=c(seq(10,150,30)))
points(l.pts$V1, l.pts$`(Intercept)`, pch = 16, col = "red",
cex=1)
par(OP)
```
It's more conventional to connect these points with line segments than to plot the points themselves.
```{r fig.height = 2.3, fig.width = 10, echo = FALSE}
OP <- par(mfrow = c(1, 4), mar=c(2,1,1,0), pty = "s")
l.pts <- seq(10,140,(110-20)/20) %>% map(function(x) f.plot(start = x, plot = FALSE, pts = TRUE)) %>%
do.call(rbind, .) %>% as.data.frame()
plot(cp.ratio ~ area, df, yaxt='n', main = NULL,
axes = FALSE, pch=16, col = "grey90", cex = 1.6)
axis(side=1, at=c(seq(10,150,30)))
lines(l.pts$V1, l.pts$`(Intercept)`, col = "red")
par(OP)
```
<br>
#### Plotting a loess in R
The loess fit can be computed in R using the `loess()` function. It takes as arguments `span` ($\alpha$) and `degree` ($\lambda$).
```{r}
# Fit loess function
lo <- loess(cp.ratio ~ area, df, span = 0.5, degree = 1)
# Predict loess values for a range of x-values
lo.x <- seq(min(df$area), max(df$area), length.out = 50)
lo.y <- predict(lo, lo.x)
```
The modeled loess curve can be added to the scatter plot using the `lines` function.
```{r fig.width=2.5, fig.height=2.5, small.mar=TRUE}
plot(cp.ratio ~ area, dat = df)
lines(lo.x, lo.y, col = "red")
```
In `ggplot2`, simply pass `method = "loess"` to the `stat_smooth` function.
```{r eval=FALSE}
ggplot(df, aes(x = area, y = cp.ratio)) + geom_point() +
stat_smooth(method = "loess", se = FALSE, span = 0.5)
```
However, `ggplot` defaults to a second degree loess (i.e. the small regression line segments that define the loess are modeled using a 2^nd^ order polynomial and not a 1^st^ order polynomial). If a first order polynomial (`degree = 1`) is desired, you need to pass an argument list in the form `method.args = list(degree = 1)` to the `stat_smooth` function.
```{r fig.width=2.5, fig.height=2.5}
ggplot(df, aes(x = area, y = cp.ratio)) + geom_point() +
stat_smooth(method = "loess", se = FALSE, span = 0.5,
method.args = list(degree = 1) )
```
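To see the effect of the span parameter described earlier, we can overlay loess fits computed with two different `span` values (the values 0.5 and 1 below are arbitrary choices used only for illustration):
```{r fig.width=2.5, fig.height=2.5}
# A smaller span (red) follows local fluctuations more closely;
# a larger span (blue) produces a smoother curve
ggplot(df, aes(x = area, y = cp.ratio)) + geom_point() +
  stat_smooth(method = "loess", se = FALSE, span = 0.5,
              method.args = list(degree = 1), color = "red") +
  stat_smooth(method = "loess", se = FALSE, span = 1,
              method.args = list(degree = 1), color = "blue")
```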
## Residuals
Fitting the data with a line is just the first step in EDA. Your next step should be to explore the residuals. The residuals are the distances (parallel to the y-axis) between the observed points and the fitted line. The closer the points are to the line (i.e. the smaller the residuals), the better the fit.
The residuals can be computed using the `residuals()` function. It takes as argument the model object. For example, to extract the residuals from the linear model `M` computed earlier type,
```{r}
residuals(M)
```
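Equivalently, a residual is simply the observed value minus the value predicted by the fitted line, so the same values can be reproduced manually:
```{r}
# Observed cp.ratio minus the fitted values reproduces residuals(M)
df$cp.ratio - fitted(M)
```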
### Residual-dependence plot
One way to visualize the residuals is to create a **residual-dependence plot**--this plots the residuals as a function of the x-values. We'll do this using `ggplot` so that we can also fit a loess curve to help discern any pattern in the residuals.
```{r fig.height=2.5, fig.width=2.5, small.mar=TRUE}
df$residuals <- residuals(M)
ggplot(df, aes(x = area, y = residuals)) + geom_point() +
stat_smooth(method = "loess", se = FALSE, span = 1,
method.args = list(degree = 1) )
```
We are interested in identifying any pattern in the residuals. **If the model does a good job in fitting the data, the points should be uniformly distributed across the plot** and the loess fit should approximate a horizontal line. With the linear model `M`, we observe a convex pattern in the residuals suggesting that the linear model is not a good fit. We say that the residuals show *dependence* on the x values.
Next, we'll look at the residuals from the second order polynomial model `M2`.
```{r fig.height=2.5, fig.width=2.5, small.mar=TRUE}
df$residuals2 <- residuals(M2)
ggplot(df, aes(x = area, y = residuals2)) + geom_point() +
stat_smooth(method = "loess", se = FALSE, span = 1,
method.args = list(degree = 1) )
```
There is no indication of dependence between the residuals and the `area` values. The second order polynomial is a large improvement over the first order polynomial. Let's look at the loess model.
```{r fig.height=2.5, fig.width=2.5, small.mar=TRUE}
df$residuals3 <- residuals(lo)
ggplot(df, aes(x = area, y = residuals3)) + geom_point() +
stat_smooth(method = "loess", se = FALSE, span = 1,
method.args = list(degree = 1) )
```
No surprise here. The loess model does a good job in capturing any overall pattern in the data. But note that this is to be expected given that the loess fit is a non-parametric model--it lets the data define its shape!
> You may ask *“if the loess model does such a good job in fitting the data, why bother with polynomial fits?”* If you are seeking to generate a predictive model that explains the relationship between the y and x variables, then a mathematically tractable model (like a polynomial model) should be sought. If the interest is simply in identifying a pattern in the data, then a loess fit is a good choice.
A model is deemed "well defined" when the distribution of its residuals remains constant over the full range of $x$ values. In essence, we expect a formula of the form:
$$
Y = a + bx + \varepsilon
$$
where $\varepsilon$ is an error term whose distribution does not vary as a function of $x$. This should sound quite familiar to you given that we've spent a good part of the univariate analysis section seeking a homogeneous spread in the data (i.e. a spread that did not change as a function of fitted group values). So what other diagnostic plots can we (and *should we*) generate from the fitted model? We explore such plots next.
### Spread-location plot
The `M2` and `lo` models do a good job of eliminating any dependence between residuals and x-values. Next, we will check that **the residuals do not show a dependence on the *fitted* y-values**. This is analogous to univariate analysis, where we checked whether residuals increased or decreased with increasing medians across categories. Here, we will compare residuals to the fitted `cp.ratio` values (for a univariate analogy, think of the fitted line as representing a *level* across different segments along the x-axis). We'll generate a spread-level plot of model `M2`'s residuals (note that in the realm of regression analysis, such a plot is often referred to as a **scale-location** plot). We'll also add a loess curve to help visualize any patterns in the plot.
```{r fig.height=2.5, fig.width=2.5, small.mar=TRUE}
sl2 <- data.frame( std.res = sqrt(abs(residuals(M2))),
fit = predict(M2))
ggplot(sl2, aes(x = fit, y =std.res)) + geom_point() +
stat_smooth(method = "loess", se = FALSE, span = 1,
method.args = list(degree = 1) )
```
The function `predict()` extracts the fitted y-values from the model `M2`; these are plotted along the x-axis. It's clear from this plot that the residuals are not homogeneous; they increase as a function of increasing *fitted* CP ratio. The "bend" observed in the loess curve is most likely due to a single point at the far (right) end of the fitted range. Given that we are working with a small batch of numbers, a loess can be easily influenced by an outlier. We may therefore want to increase the loess span by setting `span = 2`.
```{r fig.height=2.5, fig.width=2.5, small.mar=TRUE}
ggplot(sl2, aes(x = fit, y = std.res)) + geom_point() +
stat_smooth(method = "loess", se = FALSE, span = 2,
method.args = list(degree = 1) )
```
The point's influence is reduced enough to convince us that the observed monotonic increase is real.
We learned with the univariate data that re-expressing values was one way to correct for any increasing or decreasing spreads across the range of fitted values. At this point, we may want to look into re-expressing the data. But, before we do, we'll explore another diagnostic plot: the *normal q-q plot*.
### Checking residuals for normality
If you are interested in conducting a hypothesis test (i.e. addressing the question *"is the slope significantly different from 0"*) you will likely want to check the residuals for normality since this is an assumption made when computing a confidence interval and a p-value.
In `ggplot`, we learned that a normal q-q plot could be generated using the `stat_qq` and `stat_qq_line` functions.
```{r fig.height=2.5, fig.width=2.5}
ggplot(df, aes(sample = residuals2)) +
stat_qq(distribution = qnorm) +
stat_qq_line(distribution = qnorm, col = "blue")
```
Here, the residuals seem to stray a little from a normal distribution.
## Re-expressing the data
The monotone spread can be problematic if we are to characterize the spread of `cp.ratio` as being the same (homogeneous) across the full range of `area` values. To remedy this, we can re-express the `cp.ratio` values. Variables measured as ratios (as is the case with `cp.ratio`) are good candidates for a log transformation. We will therefore fit a new linear model to the data after transforming the y-values.
```{r}
df.log <- data.frame( area = df$area, cp.ratio.log = log(df$cp.ratio))
M3 <- lm(cp.ratio.log ~ area, dat = df.log)
```
Transforming a variable, whether $x$ or $y$, changes the nature of the relationship between the two variables. We therefore need to recreate the scatter plot. We'll also see how well a 1^st^ order polynomial characterizes this relationship (re-expressing values can sometimes help "straighten" the relationship between variables, which is why we are not starting off with a 2^nd^ order polynomial fit).
```{r fig.height=2.5, fig.width=2.5, small.mar=TRUE}
ggplot(df.log, aes(x = area, y = cp.ratio.log)) + geom_point() +
stat_smooth(method = "lm", se = FALSE)
```
At first glance, the log transformation seems to have done a good job at straightening the batch of values. Next, let's look at the residual-dependence plot.
```{r fig.height=2.5, fig.width=2.5, small.mar=TRUE}
df.log$residuals <- residuals(M3)
ggplot(df.log, aes(x = area, y = residuals)) + geom_point() +
stat_smooth(method = "loess", se = FALSE, span = 1,
method.args = list(degree = 1) )
```
Logging the y-values has eliminated the residuals' dependence on `area`. Next, let's assess homogeneity in the residuals using the s-l plot.
```{r fig.height=2.5, fig.width=2.5, small.mar=TRUE}
sl3 <- data.frame( std.res = sqrt(abs(residuals(M3))),
fit = predict(M3))
ggplot(sl3, aes(x = fit, y = std.res)) + geom_point() +
stat_smooth(method ="loess", se = FALSE, span = 1,
method.args=list(degree = 1) )
```
Even though we observe a mid-range peak in spread, we do not observe a **systematic increase** in spread. The log transformation seems to have removed the monotone spread.
Finally, we'll check for normality of the residuals.
```{r fig.height=2.5, fig.width=2.5}
ggplot(df.log, aes(sample = residuals)) +
geom_qq(distribution = qnorm) +
geom_qq_line(distribution = qnorm, col = "blue")
```
Re-expressing the data seems to have improved the *normality* of the residuals somewhat. Though not perfect, more points seem to follow a straight line. But, recall from chapter 18.4 that noise can be quite prominent in a normal q-q plot when working with small datasets.
[^1]: Cleveland, William. *Visualizing data*. Hobart Press. 1993. [pp 87-88]