---
output: html_document
editor_options:
  chunk_output_type: console
---
# Letter value summaries
```{r echo=FALSE}
source("libs/Common.R")
```
```{r echo = FALSE}
pkg_ver(c("dplyr", "tukeyedar"))
```
## Introduction
The boxplot is built on a five-number summary of a batch of values that gives us a handle on the symmetry (or lack thereof) of the data. The five numbers are the median, the lower and upper quartiles, and the lower and upper adjacent values (the ends of the whiskers). The **letter value summary**, introduced by John Tukey, extends the boxplot's five-number summary by exploring the symmetry of the batch at depth levels beyond the half (median) and the fourth (quartiles).
## Constructing the letter value summaries
```{r echo=FALSE}
n <- 11
set.seed(13)
x <- sample(1:30,n)
```
Let's start with a simple batch of numbers: `r x`.
#### Order the values
First, we order the numbers from smallest to largest.
![](./img/letter_value_summary_a.png)
#### Find the median (M)
Next, we find the median: the value that splits the ordered batch into two halves of equal size. If the number of values is odd, the median is the value furthest from both ends. If the number of values is even, the median is the average of the two middle values. A simple formula to identify the element number (or depth) of the batch associated with the median is:
$$
depth\ of\ median = \frac{n + 1}{2}
$$
where $n$ is the number of values in the batch. In our example, we have 11 values, so the depth of the median is (11 + 1)/2 = 6; the median is the 6^th^ element from the left (or the right) of the sorted values.
![](./img/letter_value_summary_median.png)
If our batch consisted of an even number of values, such as 10, the depth of the median would be 5.5, which does not coincide with an existing value. We would then find the 5th and 6th elements in the batch and compute their average to find the median.
The median is the value furthest from the extreme values; it's said to have the greatest **depth** (e.g. a depth of 6 in our example). The minimum and maximum values have the lowest depth, each with a depth of 1.
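As a minimal sketch in base R, the depth formula can be applied to both an odd-sized and an even-sized batch. The helper `depth_value` and the variable names are illustrative assumptions and are not part of `tukeyedar`.

```{r}
# Illustrative helper (an assumption, not a tukeyedar function): returns the
# value at a given depth counted from the low end of a batch. A fractional
# depth such as 5.5 averages the two neighbouring sorted values.
depth_value <- function(v, d) {
  s <- sort(v)
  mean(c(s[floor(d)], s[ceiling(d)]))
}

# 11 values (the same batch that is passed to eda_lsum() later on this page)
batch <- c(22, 8, 11, 3, 26, 1, 14, 18, 20, 25, 24)
d_med <- (length(batch) + 1) / 2        # depth of the median
d_med
depth_value(batch, d_med)               # agrees with median(batch)

batch10 <- batch[-1]                    # drop one value so that n = 10
d_med10 <- (length(batch10) + 1) / 2    # fractional depth of 5.5
depth_value(batch10, d_med10)           # average of the 5th and 6th values
```

Counting from the low end is enough for the median; the same helper applied to the reversed batch gives values counted from the high end.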
#### Find the hinges (H)
Next, we take both halves of our batch of ordered numbers and find the middle of each. These mid points are referred to as **hinges**. They can be easily computed by modifying the formula used to find the median: we simply substitute the value $n$ with the depth associated with the median, $d(M)$ (i.e. the median becomes an extreme value in this operation).
$$
depth\ of\ hinge = \frac{d(M) + 1}{2}
$$
In our working example, the depth of the median is 6, therefore the depth of the hinge is (6+1)/2 = 3.5. So the hinge is the 3.5th element from the left (or right) of the first half of the batch and the 3.5th element from the left (or right) of the second half of the batch. Since the depth does not fall on an existing value, we need to compute it using the two closest values (depth 3 and depth 4). This gives us (8+11)/2=9.5 for the left hinge and (22+24)/2=23 for the right hinge.
![](./img/letter_value_summary_hinge.png)
If our batch consisted of an even number of values, we would need to drop the ½ fraction from the depth of the median before computing the depth of the hinge. For example, if we had 10 values, the depth of the median would be 5.5 and the depth of the hinge would be calculated as (5+1)/2 = 3.
Note that the hinges are similar to the quartiles but because they are computed differently, their values may be slightly different from what you might get from a boxplot, for example.
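The hinge calculation can be sketched in base R as follows. The function name `hinges` is a hypothetical helper written for this illustration, not a `tukeyedar` function; the comparison with `quantile()` uses that function's default interpolation rule.

```{r}
# Illustrative base-R helper (an assumption): depth and values of the hinges
hinges <- function(v) {
  s   <- sort(v)
  d_M <- (length(s) + 1) / 2          # depth of the median
  d_H <- (floor(d_M) + 1) / 2         # drop any 1/2 fraction, then halve
  idx <- c(floor(d_H), ceiling(d_H))  # one index, or two neighbours to average
  c(depth = d_H, lower = mean(s[idx]), upper = mean(rev(s)[idx]))
}

h_batch <- c(22, 8, 11, 3, 26, 1, 14, 18, 20, 25, 24)
hinges(h_batch)        # 11 values: hinge depth of 3.5
hinges(h_batch[-1])    # 10 values: hinge depth of (5 + 1)/2 = 3

# Quartiles from quantile() interpolate differently and need not match
quantile(h_batch[-1], c(0.25, 0.75))
```

For the 11-value batch the hinges and the default quartiles happen to coincide; the 10-value batch shows a case where they differ.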
#### Find the other letter summaries (E, D, C, B, A, etc...)
So far, we've found the median (M) and the hinges (H). We continue by finding the middle of each outer group of values, bounded by an extreme value and the previous letter value. For example, the mid-point of the outer quarters gives us our **eighths** (**E**):
$$
depth\ of\ eighths = \frac{d(H) + 1}{2}
$$
or, after dropping the ½ fraction from the depth of the hinge, (3+1)/2=2.
This continues until we've exhausted all depths (i.e. until we reach a depth of 1 associated with the minimum and maximum values).
Once past the eighths, we label each subsequent depth using letters in reverse alphabetical order, starting with **D** (for sixteenths), then **C**, **B**, **A**, **Z**, **Y**, etc...
In our working example, we stop at the letter **D** (though some will stop at a depth of two and only report the extreme values thereafter).
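The repeated halving of depths can be sketched as a short loop. The function `letter_depths` is a hypothetical helper for illustration only, and labelling the extremes `1` is a choice made for this sketch rather than something taken from `tukeyedar`.

```{r}
# Illustrative helper (an assumption): depths of successive letter values for
# a batch of size n, stopping once a depth of 1 (the extremes) is reached.
letter_depths <- function(n) {
  d <- (n + 1) / 2                  # depth of the median
  depths <- d
  while (d > 1) {
    d <- (floor(d) + 1) / 2         # drop any 1/2 fraction, then halve
    depths <- c(depths, d)
  }
  # Labels: M, H, E, then D, C, B, A, Z, Y, ... in reverse alphabetical order
  labs <- c("M", "H", "E", "D", "C", "B", "A", rev(LETTERS))[seq_along(depths)]
  labs[depths == 1] <- "1"          # the extremes are labelled by their depth
  setNames(depths, labs)
}

letter_depths(11)    # depths for the working example
letter_depths(100)   # a larger batch reaches more letters
```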
#### The mids and spreads
Once we've identified the values associated with each depth, we compute the **middle** value for each depth pair. For example, the middle value for the paired hinges is (9.5 + 23)/2 = 16.25; the middle value for the paired eighths is 14; and so on. We can also compute the **spread** for each depth by taking the difference between the upper and lower values of each pair.
![](./img/letter_value_summary_b.png)
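A minimal base-R sketch of the mids and spreads for the working example follows. All object names are illustrative assumptions, and this is not the `tukeyedar` implementation; the depths are those derived above, with D obtained as (2 + 1)/2 = 1.5.

```{r}
# Base-R sketch; lv_* names and at_depth() are illustrative assumptions
lv_batch  <- c(22, 8, 11, 3, 26, 1, 14, 18, 20, 25, 24)
lv_sorted <- sort(lv_batch)

# Value at depth d from the low end; a fractional depth averages two neighbours
at_depth <- function(s, d) mean(s[c(floor(d), ceiling(d))])

lv_depths <- c(M = 6, H = 3.5, E = 2, D = 1.5)
lv_lower  <- sapply(lv_depths, function(d) at_depth(lv_sorted, d))
lv_upper  <- sapply(lv_depths, function(d) at_depth(rev(lv_sorted), d))

lv_table <- data.frame(depth  = lv_depths,
                       lower  = lv_lower,
                       mid    = (lv_lower + lv_upper) / 2,  # middle of each pair
                       upper  = lv_upper,
                       spread = lv_upper - lv_lower)        # width of each pair
lv_table
```

Reversing the sorted batch lets the same helper count depths from the high end, which keeps the lower and upper values symmetric in how they are computed.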
The letter value summary is usually reported in tabular form:
```{r echo=FALSE}
library(tukeyedar)
x.lsum <- eda_lsum(x)
knitr::kable(x.lsum)
```
## The `eda_lsum` function
A custom function, `eda_lsum`, is available in the `tukeyedar` package that will compute the letter value summaries.
For example, to generate the letter value summary for a batch of values `x`, type:
```{r class.source="eda"}
library(tukeyedar)
x <- c(22, 8, 11, 3, 26, 1, 14, 18, 20, 25, 24)
eda_lsum(x)
```
You can specify the number of levels with the `l=` argument. For example, to limit the output to just 3 depths, type:
```{r class.source="eda"}
eda_lsum(x, l = 3)
```
## Interpreting the letter value summaries
Let's explore the letter value summaries for five simulated distributions. We'll start with a strongly right-skewed distribution, then progress to a strongly left-skewed distribution, with a Gaussian (normal) distribution in between.
```{r, fig.width=8, fig.height=3, echo=FALSE}
n  <- 100                # Number of simulated values per distribution
fi <- (1:n - 0.5) / n    # Probabilities at which the quantiles are evaluated
b.shp1 <- c(1, 8, 50, 10, 10)
b.shp2 <- c(10, 10, 70, 8, 1)
# For each distribution: density plot (top row), mid vs. depth plot (bottom row)
OP <- par(mfcol = c(2, 5), mar = c(2, 2, 1, 1))
for (i in 1:5) {
  if (i != 3) {
    a <- qbeta(fi, shape1 = b.shp1[i], shape2 = b.shp2[i])
  } else {
    a <- qnorm(seq(0.01, 0.99, 0.01))
  }
  plot(density(a), main = NA, xlab = NA)
  a.ls <- eda_lsum(a, l = 9)
  plot(mid ~ eval(n/2 - depth), a.ls, main = NA, xlab = NA, pch = 16)
  text(eval(n/2 - a.ls$depth), a.ls$mid, a.ls$letter, col = "blue", pos = 4)
}
par(OP)
```
Note the shape of the letter summaries *vis-a-vis* the direction of the skew. Note too that the letter value summary plot is extremely sensitive to deviations from perfect symmetry. This is apparent in the middle plot, which is for a perfectly symmetrical (Gaussian) distribution, yet the mids are not perfectly constant. The reason has to do with machine precision: the range of values along the y-axis is extremely small, on the order of $10^{-16}$, which is about the limit of the computer's floating-point precision.
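The size of these deviations can be checked directly. The sketch below pairs each Gaussian quantile with its mirror image and compares the resulting mids with R's machine epsilon; the variable names are illustrative, and the exact magnitudes may vary by platform.

```{r}
# Sketch with illustrative names: mids of mirrored Gaussian quantiles
g_vals <- qnorm(seq(0.01, 0.99, 0.01))   # same values as in the middle panel
g_mids <- (g_vals + rev(g_vals)) / 2     # mid of each lower/upper pair

max(abs(g_mids))       # tiny, but not necessarily exactly zero
.Machine$double.eps    # relative spacing of double-precision numbers near 1
```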
This sensitivity has its rewards. Note the second plots from the left and from the right. The asymmetry is barely noticeable in the density plots of both distributions, yet the letter value summaries clearly reveal it. Even the boxplots do not convey this asymmetry as effectively.
```{r, fig.width=8, fig.height=1, echo=FALSE}
OP <- par(mfcol = c(1, 5), mar = c(2, 2, 1, 1))
for (i in 1:5) {
  if (i != 3) {
    a <- qbeta(fi, shape1 = b.shp1[i], shape2 = b.shp2[i])
  } else {
    a <- qnorm(seq(0.01, 0.99, 0.01))
  }
  boxplot(a, main = NA, xlab = NA, horizontal = TRUE)
}
par(OP)
```
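The drift in the mids for the two mildly skewed distributions can also be checked numerically. This is a rough sketch with illustrative names: it pairs the i-th smallest value with the i-th largest rather than using letter value depths, and takes the beta shape parameters (8, 10) and (10, 8) from the simulation code above.

```{r}
# Sketch with illustrative names: mids of mirrored value pairs
sk_f     <- (1:100 - 0.5) / 100
sk_right <- qbeta(sk_f, shape1 = 8,  shape2 = 10)   # slight right skew
sk_left  <- qbeta(sk_f, shape1 = 10, shape2 = 8)    # slight left skew

# Average the i-th smallest and i-th largest values; position 1 is the
# outermost pair and position 50 is the pair closest to the median
pair_mids <- function(v) ((v + rev(v)) / 2)[1:(length(v) / 2)]

mids_right <- pair_mids(sk_right)
mids_left  <- pair_mids(sk_left)
mids_right[c(1, 25, 50)]   # outer mids sit above the central mid
mids_left[c(1, 25, 50)]    # outer mids sit below the central mid
```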
This is not to say that the presence of asymmetry in the letter value summary necessarily signals a problem, but it may warrant further exploration before proceeding with the analysis, especially if the statistical procedures to be used are sensitive to asymmetry.