-
Notifications
You must be signed in to change notification settings - Fork 24
/
Copy pathsl_plot.qmd
executable file
·126 lines (87 loc) · 7.08 KB
/
sl_plot.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
---
output: html_document
editor_options:
chunk_output_type: console
---
# Univariate spread-level plot
```{r echo=FALSE}
source("libs/Common.R")
```
```{r echo = FALSE}
pkg_ver(c("dplyr", "ggplot2"))
```
## Introduction
In chapter 19, we highlighted the importance of a homogeneous spread when comparing fitted values across groups. A situation that can lead to a non-homogeneous spread is one where the dataset exhibits a systematic change in spread as a function of location. In other words, the variability in each batch may be dependent on the batches' location value such as its mean or median. Such dependency is often undesirable (e.g. in an ANOVA for instance) and preferably removed in an analysis. A plot well suited for visualizing this dependency is the **spread-level** plot, **s-l** (you may sometimes see it referred to as the **spread-location** plot). Note that the s-l plot presented here is that for *univariate* data. A bivariate version of the s-l plot is covered in chapter 24.
## Constructing the (univariate) s-l plot
The s-l plot compares a measure of the spread's residual to the location (usually the median given its robustness to outliers) for each batch of data. The spread is usually distilled down to its residual (what remains after subtracting each batch value by the batch fitted location) then it's transformed by taking the square root of its absolute value. The following block of code walks you through the steps needed to create an s-l plot.
```{r, fig.height=2.5,fig.width=6}
library(dplyr)
library(ggplot2)
singer <- lattice::singer
res.sq <- singer %>% group_by(voice.part) %>%
mutate(Median = median(height),
Residual = sqrt(abs(height - Median)))
ggplot(res.sq, aes(x=Median, y=Residual)) +
geom_jitter(alpha=0.4, width=0.05, height=0) +
stat_summary(fun = median, geom = "line", col = "red") +
ylab(expression(sqrt(abs(" Residuals ")))) +
geom_text(aes(x = Median, y = 3.3, label = voice.part))
```
The red line in the plot connects the median values of each batch of residuals. It helps identify the type of relationship between spread and location. If the line increases *monotonically upward*, there is an increasing spread as a function of increasing location; if the line decreases *monotonically downward*, there is a decreasing spread as a function of increasing location; and if the line is neither increasing nor decreasing monotonically, there is no change in spread as a function of location.
> Note that if you are to rescale the y-axis when using the `stat_summary()` function, you should use the `coord_cartesian(ylim = c( .. , .. ))` function instead of the `ylim()` function. The latter will mask the values above its maximum range from the `stat_summary()` function, the former will not.
In the above example, the singer dataset does not seem to exhibit any dependence between a voice part's spread and its median value.
Next, we'll look at an example of a dataset that *does* exhibit a dependence between spread and fitted values.
## Example: the `iris` dataset
R has a built-in dataset called `iris` that provide measurements of sepal and petal dimensions for three different species of the iris family. In this next example, we will plot the spreads of the `Petal.Length` residuals (after removing their group median values) to their group medians.
```{r, fig.height=3,fig.width=3.5}
# Create two new columns: group median and group residuals
df1 <- iris %>%
group_by(Species) %>%
mutate( Median = median(Petal.Length),
Residuals = sqrt(abs( Petal.Length - Median)))
# Generate the s-l plot
ggplot(df1, aes(x = Median, y = Residuals)) +
geom_jitter(alpha = 0.4, width = 0.05, height = 0) +
stat_summary(fun = median, geom = "line", col = "red") +
ylab(expression(sqrt( abs(" Residuals ")))) +
geom_text(aes(x = Median, y = 1.3, label = Species))
```
A monotonic spread is apparent in this dataset, i.e. as the median of the Petal length increases, so does the spread.
## How can we stabilize spreads in a dataset?
A technique used to help reduce or eliminate monotonic variations in the spreads as a function of fitted values is to **re-express** the original values. Re-expression, which involves transforming values via a power transformation, will be covered in the next chapter. However, in the section that follows, we learn of a variation of the s-l plot that can identify a power transformation that can help stabilize spread.
### A variation of the s-l plot
Another version of the s-l plot pits the log of the inter-quartile spread vs the log of the median. This approach only works for positive values (this may require that values be adjusted so that the minimum value be greater than 0).
This approach is appealing in that the slope of the best fit line can guide us to a power transformation via **power = 1 - slope**.
This variant of the s-l plot can be computed in R as follows (we will use the `iris` dataset as an example).
```{r fig.height=2.8, fig.width=2.8}
sl <- iris %>%
group_by(Species) %>%
summarise (level = log(median(Petal.Length)),
IQR = IQR(Petal.Length), # Computes the interquartile range
spread = log(IQR))
ggplot(sl, aes(x = level, y = spread)) + geom_point() +
stat_smooth(method = MASS::rlm, se = FALSE) +
xlab("Median (log)") + ylab("Spread (log)") +
geom_text(aes(x = level, y = spread, label = Species), cex=2.5)
```
Note how this plot differs from our earlier s-l plot in that we are only displaying each batch's median spread value and we are *fitting* a straight line to the medians instead of *connecting* them.
The slope suggests a monotonic increase in spread vs location. We can extract the slope value from a regression model. Here, we'll adopt a robust bivariate model (resistant line is covered in chapter 25).
```{r}
coefficients(MASS::rlm(spread ~ level, sl))
```
The slope is the second coefficient in the above output. The computed slope value is `1.14`. This suggests a power transformation of `1 - 1.14` (or about `-0.14`). You will learn in the next chapter that as a power transformation value approaches `0`, one can opt to use the `log` transformation instead. We'll therefore apply a log transformation to `Petal.Length`.
```{r, fig.height=3,fig.width=3.5}
# Create two new columns: group median and group residuals
df1 <- iris %>%
group_by(Species) %>%
mutate( log.length = log(Petal.Length),
Median = median(log.length),
Residuals = sqrt(abs( log.length - Median)))
# Generate the s-l plot
ggplot(df1, aes(x = Median, y = Residuals)) +
geom_jitter(alpha = 0.4, width = 0.05, height = 0) +
stat_summary(fun = median, geom = "line", col = "red") +
ylab(expression(sqrt( abs(" Residuals ")))) +
geom_text(aes(x = Median, y = 1.3, label = Species))
```
Applying a log transformation to the petal length values seems to have helped stabilize the spread given that the connected lines are close to flat. However, there is a very slight upward slope--though probably insignificant. Would applying a power transformation of `-0.14` have helped? We will revisit this dataset in the next chapter.