You can get help for a function, e.g., anova(), using either of:
?anova
help(anova)
CRAN is a repo for R packages
install.packages("mice")
# There are two ways to load a package into R:
library(stats)
require(stats)
## Load the built-in 'bfi' data from the 'psychTools' package
data(bfi, package = "psychTools")
## Access the documentation for the 'bfi' data
?psychTools::bfi
## Define the directory holding our data
dataDir <- "../../../data/"
## Load the 'boys' data from the R workspace
## '../../../data/boys.RData'
load(paste0(dataDir, "boys.RData"))
save(boys, file = paste0(dataDir, "tmp.RData"))
## Load the 'titanic' data stored in R data set
## '../../../data/titanic.rds'
titanic <- readRDS(paste0(dataDir, "titanic.rds"))
saveRDS(titanic, file="titanic2.rds")
## Load the 'diabetes' data from the tab-delimited file
## '../../../data/diabetes.txt'
diabetes <- read.table(paste0(dataDir, "diabetes.txt"),
header = TRUE,
sep = "\t")
write.table(boys,
paste0(dataDir, "boys.txt"),
row.names = FALSE,
sep = "\t",
na = "-999")
## Load the 2017 UTMB data from the comma-separated file
## '../../../data/utmb_2017.csv'
utmb1 <- read.csv(paste0(dataDir, "utmb_2017.csv"))
write.csv2(boys, paste0(dataDir, "boys.csv"), row.names = FALSE, na = "")
- The read.csv() function assumes the values are separated by commas
- For EU-formatted CSV files (semicolons instead of commas), we can use read.csv2()
Reading SPSS data (SAV files) is trickier. There are two popular options:
- foreign::read.spss()
- haven::read_spss()
The foreign package does not offer an equivalent function for writing SAV files.
## Load the foreign package:
library(foreign)
## Use foreign::read.spss() to read '../../../data/mtcars.sav' into a list
mtcars1 <- read.spss(paste0(dataDir, "mtcars.sav"))
## Read '../../../data/mtcars.sav' as a data frame
mtcars2 <- read.spss(paste0(dataDir, "mtcars.sav"), to.data.frame = TRUE)
## Read '../../../data/mtcars.sav' without value labels
mtcars3 <- read.spss(paste0(dataDir, "mtcars.sav"),
to.data.frame = TRUE,
use.value.labels = FALSE)
library(labelled)
## Use haven::read_spss() to read '../../../data/mtcars.sav' into a tibble
mtcars4 <- read_spss(paste0(dataDir, "mtcars.sav"))
haven::read_spss() converts any SPSS variables with value labels into labelled vectors.
- We can use the labelled::unlabelled() function to remove the value labels.
- For example, a value printed with the truncated label [V-Sha~ then shows up under its full label, V-Shaped.
The best option we have for writing SPSS data is haven::write_sav()
write_sav(mtcars2, paste0(dataDir, "mtcars2.sav"))
This will preserve label information provided by factor variables and the haven_labelled class
We have two good options for loading data from Excel spreadsheets:
- readxl::read_excel()
- openxlsx::read.xlsx()
## Load the packages:
library(readxl)
library(openxlsx)
## Use the readxl::read_excel() function to read the data from the 'titanic'
## sheet of the Excel workbook stored at '../../../data/example_data.xlsx'
titanic2 <- read_excel(paste0(dataDir, "example_data.xlsx"),
sheet = "titanic")
## Use the openxlsx::read.xlsx() function to read the data from the 'titanic'
## sheet of the Excel workbook stored at '../../../data/example_data.xlsx'
titanic3 <- read.xlsx(paste0(dataDir, "example_data.xlsx"),
sheet = "titanic")
The openxlsx package provides a powerful toolkit for programmatically building Excel workbooks in R and saving the results.
## Use the openxlsx::write.xlsx() function to write the 'diabetes' data to an
## XLSX workbook
write.xlsx(diabetes, paste0(dataDir, "diabetes.xlsx"), overwrite = TRUE)
## Use the openxlsx::write.xlsx() function to write each data frame in a list
## to a separate sheet of an XLSX workbook
write.xlsx(list(titanic = titanic, diabetes = diabetes, mtcars = mtcars),
paste0(dataDir, "example_data.xlsx"),
overwrite = TRUE)
Any R command written as a word followed by parentheses is a function
- mean()
- library()
Infix operators are aliased functions
- <-
- + - *
- > < ==
We can define our own functions using the “function()” function
square <- function(x) {
out <- x^2
out
}
square(5)
25
One-line functions don’t need braces:
square <- function(x) x^2
square(5)
25
Function arguments are not strictly typed
square <- function(x) x^2
square(TRUE) # 1
square(1:5)
## [1]  1  4  9 16 25
But there are limits:
square("bob")
Error in x^2: non-numeric argument to binary operator
Multiple arguments:
mod <- function(x, y) x %% y
mod(10, 3)
Alternatively, you can pass multiple values to a function through a single list argument:
getLsBeta <- function(datList) {
X <- datList$X
y <- datList$y
solve(crossprod(X)) %*% t(X) %*% y
}
X <- matrix(runif(500), ncol = 5)
datList <- list(y = X %*% rep(0.5, 5), X = X)
getLsBeta(datList = datList)
Functions are first-class objects in R.
- We can treat functions like any other R object.
R views an unevaluated function as an object with type ”closure”.
class(getLsBeta)
[1] “function”
typeof(getLsBeta)
[1] “closure”
An evaluated function is equivalent to the object it returns.
class(getLsBeta(datList))
[1] “matrix” “array”
typeof(getLsBeta(datList))
[1] “double”
There are three types of loops in R: for, while, and repeat. This course only covers for loops.
for(INDEX in RANGE) { Stuff To Do with the Current INDEX Value }
examples:
val <- 0
for(i in 1:100) {
val <- val + i
}
val
5050
This loop will compute the mean of every column in the mtcars data
means <- rep(0, ncol(mtcars))
for(j in 1:ncol(mtcars)) {
means[j] <- mean(mtcars[ , j])
}
means
listThing <- list(6,9,4,2)
sum <- 0
for(i in listThing) {
sum <- i + sum
}
sum
21
Usually, there is already a function that achieves what you are trying to do with the loop (sum(), for example), as illustrated below.
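For instance, the loops above can be replaced with single calls to built-in functions (a small illustration, not from the slides):
## Built-in replacements for the loops shown above:
sum(1:100)              # 5050, same as the accumulator loop
colMeans(mtcars)        # same as the column-means loop
sum(unlist(listThing))  # 21, same as the list-summing loop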
The apply() family of functions usually takes one of two forms:
apply(DATA, MARGIN, FUNCTION, ...)
lapply(DATA, FUNCTION, ...)  # the list-based members (lapply, sapply) do not take a MARGIN
The MARGIN argument specifies the dimension(s) over which the function is applied: for a matrix, 1 indicates rows, 2 indicates columns, and c(1, 2) indicates rows and columns. Where X has named dimnames, MARGIN can be a character vector selecting dimension names.
data(bfi, package = "psychTools")
dat1 <- bfi[1:5,1:3]
# | 2 | 4 | 3 |
# | 2 | 4 | 5 |
# | 5 | 4 | 5 |
# | 4 | 4 | 6 |
# | 2 | 3 | 3 |
# sum over rows with above table
apply(dat1, 1, sum)
## 9 11 14 14 8
data(bfi, package = "psychTools")
dat1 <- bfi[1:5,1:3]
# subtract 1 from every cell
apply(dat1, (1:2), function(x)x-1)
1 | 3 | 2 |
1 | 3 | 4 |
4 | 3 | 4 |
3 | 3 | 5 |
1 | 2 | 2 |
similar concept but works on lists
list1 <- list(1,3,4,2,4)
lapply(list1, function(x) x+1)
2 | 4 | 5 | 3 | 5 |
sapply() is a user-friendly wrapper for lapply() which, by default, returns a vector or matrix (see the example below).
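A quick illustration of the difference, reusing the list1 object from above:
## lapply() returns a list; sapply() simplifies the result to a vector:
lapply(list1, function(x) x + 1)  # list of length 5
sapply(list1, function(x) x + 1)  # numeric vector: 2 4 5 3 5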
- While using RStudio, press TAB to quickly access the documentation of a function's arguments
- Use the tidyverse style guide
- This can be done automatically in RStudio -> Addins -> Style selection
Vectors are the simplest kind of R object, there is no concept of a “scalar” in R
Vectors come in one of six “atomic modes”:
- numeric/double
- logical -> bool
- character
- integer
- complex
- A complex value in R is defined via the pure imaginary value i.
- raw
- Intended to hold raw bytes
c(1, 2, 3)
## [1] 1 2 3
1:5
## [1] 1 2 3 4 5
1.2:5.3
## [1] 1.2 2.2 3.2 4.2 5.2
rep(33, 4)
## [1] 33 33 33 33
rep(1:3, 3)
## [1] 1 2 3 1 2 3 1 2 3
rep(1:3, each = 2)
## [1] 1 1 2 2 3 3
seq(0, 1, 0.25)
## [1] 0.00 0.25 0.50 0.75 1.00
It is not possible to mix data types within a single vector; some values will be coerced to a common type:
a <- c(1,2,3)
b <- c("foo", "bar")
c(a,b)
[1] "1"   "2"   "3"   "foo" "bar"
Matrices generalize vectors by adding a dimension attribute.
matrix(1:5, nrow = 5, ncol = 2)
1 | 1 |
2 | 2 |
3 | 3 |
4 | 4 |
5 | 5 |
cbind(c(1,2,3),c(4,5,6))
1 | 4 |
2 | 5 |
3 | 6 |
m1 <- matrix(1:5, nrow = 5, ncol = 2)
attributes(m1)
## $dim
## [1] 5 2
Matrices are populated in column-major order, by default
matrix(1:9, 3, 3)
1 | 4 | 7 |
2 | 5 | 8 |
3 | 6 | 9 |
The byrow = TRUE option allows us to fill by row-major order
matrix(1:9, 3, 3, byrow = TRUE)
1 | 2 | 3 |
4 | 5 | 6 |
7 | 8 | 9 |
Like vectors, matrices can only hold one type of data
Lists are the workhorse of R data objects
- An R list can hold an arbitrary set of other R objects
In lists data types CAN be mixed
list(1,2, "HI mom", TRUE)
1 | 2 | HI mom | TRUE |
l3 <- list(name = "bob",
alive = TRUE,
age = 33,
relationshipStatus = 42+3i)
l3$name
bob
Can also be done post-hoc via the names()
l1 = list(3, "hi", TRUE)
names(l1) <- c("name1", "second", "last")
l1$second
hi
(l4 <- list())
l4$people <- c("Bob", "Alice", "Suzy")
l4$money <- 0
l4$logical <- FALSE
l4
$people
[1] "Bob"   "Alice" "Suzy"

$money
[1] 0

$logical
[1] FALSE
R’s way of storing rectangular data sets
- Each column of a data frame is a vector
- Each of these vectors can have a different type:
Data frames are actually lists of vectors (representing the columns)
We can create one using data.frame()
data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
1 | -1 | 0.1 |
2 | 1 | 0.2 |
3 | -1 | 0.3 |
4 | 1 | 0.4 |
5 | -1 | 0.5 |
6 | 1 | 0.6 |
Columns can also be given names:
data.frame(x = 1:6, y = c(-1, 1), z = seq(0.1, 0.6, 0.1))
  x  y   z
1 1 -1 0.1
2 2  1 0.2
3 3 -1 0.3
4 4  1 0.4
5 5 -1 0.5
6 6  1 0.6
Creating from matrix
data.frame(matrix(5, 4, 3))
5 | 5 | 5 |
5 | 5 | 5 |
5 | 5 | 5 |
5 | 5 | 5 |
sample() draws a random sample of a given size from a vector.
runif() generates a vector of uniformly distributed random numbers.
data.frame(a = sample(c(TRUE, FALSE), 5, replace = TRUE),
b = sample(c("foo", "bar"), 5, replace = TRUE),
c = runif(5)
)
TRUE | foo | 0.640138026094064 |
TRUE | bar | 0.0884113181382418 |
TRUE | bar | 0.099734982708469 |
TRUE | bar | 0.862016763305292 |
TRUE | bar | 0.991723588900641 |
Factors are R's way of representing nominal (categorical) variables
- We can create a factor using the factor() function
factor(sample(1:3, 8, TRUE), labels = c("red", "yellow", "blue"))
## [1] red    blue   red    blue   yellow blue   yellow yellow
## Levels: red yellow blue
x <- factor(sample(1:3, 8, TRUE), labels = c("red", "yellow", "blue"))
attributes(x)
$levels
[1] "red"    "yellow" "blue"

$class
[1] "factor"
Even though a factor's data are represented by an integer vector, R does not consider factors to be integer/numeric data.
In Base R, we typically use three operators to subset objects
- []
- [[]]
- $
Which we choose depends on:
- What type of object are we trying to subset?
- How much of the original typing do we want to keep in the subset?
x <- 10:20
x[2]
x[1:3]
x[c(2, 5, 7)]  # 11, 14, 16
x[c(TRUE, FALSE, TRUE, FALSE, FALSE)]
## [1] 10 12 15 17 20
The [[ ]] operator is rarely used for vectors. It differs from [ ] in that it drops any names or dimnames attributes and only allows selection of a single element.
x <- 10:20
x [[2]]
11
The [] method returns objects of class list (or data.frame if foo was a data.frame) while the [[]] method returns objects whose class is determined by the type of their values.
y <- matrix(1:24, 6, 4)
y[4:6, 2:3]
10 | 16 |
11 | 17 |
12 | 18 |
We can mix different selecting styles
y <- matrix(1:24, 6, 4)
y[c(2, 4), c(FALSE, TRUE, TRUE)]
8 | 14 |
10 | 16 |
We can use all three operators to subset lists.
list1 <- list(2, "noo", TRUE)
names(list1) <- c("first", "second", "last")
list1$first
list1[1]
list1[["first"]]
2
## Reuse the 'l4' list created above (people, money, logical):
tmp1 <- l4[1]
tmp1$people
# [1] "Bob" "Alice" "Suzy"
class(tmp1)
# [1] "list"
tmp2 <- l4[[1]]
# [1] "Bob" "Alice" "Suzy"
class(tmp2)
# [1] "character"
We can subset the columns of a data frame using list semantics
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
d3$a
1 |
2 |
3 |
4 |
5 |
6 |
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
d3[1]
1 |
2 |
3 |
4 |
5 |
6 |
Filter by rows
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
d3[(d3$a)%%2==0 & d3$c>0,]
Notice the comma at the end!
This comma is needed because R uses the following notation for subsetting data frames:
data[rows you want, columns you want]
2 | 1 | 0.2 |
4 | 1 | 0.4 |
6 | 1 | 0.6 |
The dplyr package provides several ways to subset data; two functions are used most frequently: select() (for columns) and filter() (for rows).
library(dplyr)
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
select(d3, a, b)
1 | -1 |
2 | 1 |
3 | -1 |
4 | 1 |
5 | -1 |
6 | 1 |
library(dplyr)
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
select(d3, -a)
-1 | 0.1 |
1 | 0.2 |
-1 | 0.3 |
1 | 0.4 |
-1 | 0.5 |
1 | 0.6 |
library(dplyr)
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
filter(d3, a%%2==0, c>0)
2 | 1 | 0.2 |
4 | 1 | 0.4 |
6 | 1 | 0.6 |
We can achieve the same using logical indexing in Base R
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
d3[(d3$a)%%2==0 & d3$c>0,]
Notice the comma at the end!
2 | 1 | 0.2 |
4 | 1 | 0.4 |
6 | 1 | 0.6 |
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
d4 <- d3
d4$a <- scale(d4$a)
d4$a <- as.numeric(d4$a)
d4
-1.33630620956212 | -1 | 0.1 |
-0.801783725737273 | 1 | 0.2 |
-0.267261241912424 | -1 | 0.3 |
0.267261241912424 | 1 | 0.4 |
0.801783725737273 | -1 | 0.5 |
1.33630620956212 | 1 | 0.6 |
library(dplyr)
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
mutate(d3,
d = rbinom(nrow(d3), 1, c),
e = d * c,
a = scale(a)
)
-1.33630620956212 | -1 | 0.1 | 0 | 0 |
-0.801783725737273 | 1 | 0.2 | 0 | 0 |
-0.267261241912424 | -1 | 0.3 | 1 | 0.3 |
0.267261241912424 | 1 | 0.4 | 1 | 0.4 |
0.801783725737273 | -1 | 0.5 | 0 | 0 |
1.33630620956212 | 1 | 0.6 | 1 | 0.6 |
l3 <- runif(n=8, min=1, max=20)
sort(l3)
1.14997771941125 |
6.18364422139712 |
7.40294478787109 |
8.80276804696769 |
10.6937525619287 |
14.7475167061202 |
16.3595214642119 |
16.9985088445246 |
l3 <- runif(n=8, min=1, max=20)
sort(l3, decreasing = TRUE)
15.9437932423316 |
15.0901547158137 |
9.608802491799 |
4.20592033001594 |
4.16299939691089 |
2.97972352942452 |
2.14775081258267 |
1.15068965195678 |
library(dplyr)
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
d3$d <- runif(n=6, min=1, max=20)
arrange(d3, d)
4 | 1 | 0.4 | 1.03251850209199 |
5 | -1 | 0.5 | 2.67228518775664 |
3 | -1 | 0.3 | 6.22691969852895 |
1 | -1 | 0.1 | 8.31828686594963 |
6 | 1 | 0.6 | 12.4162582552526 |
2 | 1 | 0.2 | 15.5486617612187 |
library(dplyr)
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
d3$d <- runif(n=6, min=1, max=20)
arrange(d3, a*d, c)
2 | 1 | 0.2 | 1.041215069592 |
1 | -1 | 0.1 | 19.1222469233908 |
6 | 1 | 0.6 | 8.04311123443767 |
3 | -1 | 0.3 | 18.1283924696036 |
4 | 1 | 0.4 | 15.9365976287518 |
5 | -1 | 0.5 | 15.7623822011519 |
The %>% symbol represents the pipe operator, used to compose functions
- f(x) becomes x %>% f()
- f(x, y) becomes x %>% f(y)
- h(g(f(x))) becomes x %>% f %>% g %>% h
Note that the %>% operator is not part of base R; it comes from the magrittr package and is re-exported by dplyr and the rest of the tidyverse.
Example use:
library(dplyr)
sqr <- function (x)
{
x**2
}
plus1 <- function (x)
{
x + 1
}
test <-
(1:10) %>%
sqr() %>%
plus1()
test
## [1]   2   5  10  17  26  37  50  65  82 101
This is equivalent to:
test <- plus1(sqr(1:10))
test
Use the . placeholder to specify where you want the piped value to go:
library(dplyr)
power <- function (x, y)
{
x**y
}
test <-
(1:10) %>%
power(3, y=.)
test
## [1]     3     9    27    81   243   729  2187  6561 19683 59049
The %$% (exposition) pipe exposes the names inside the LHS object to the RHS expression. This is useful when a function does not have a built-in data argument.
library(magrittr)
data(cats, package = "MASS")
## All equivalent:
cats %$% t.test(Hwt ~ Sex)
cats %>% t.test(Hwt ~ Sex, data = .)
t.test(Hwt ~ Sex, data = cats)
We can create a basic scatterplot using the plot() function
plot(runif(20, 0, 100), runif(20, 0, 10))
diabetes %$% plot(
y = tc,
x = bmi,
ylab = "Total Cholesterol",
xlab = "Body Mass Index",
main = "Relation between BMI and Cholesterol",
ylim = c(0, 350),
xlim = c(0, 50)
)
We can create a simple histogram with the hist() function
hist(runif(30,0,10))
We can create simple boxplots via the boxplot function
boxplot(runif(30,40,60),runif(60,10,50))
density(runif(5,0,100)) %>% plot()
Base R graphics are fine for quick-and-dirty visualizations, but for publication-quality graphics you should probably use ggplot2.
- GGPlot uses the ”grammar of graphics” and ”tidy data” to build up a figure from modular components.
We start by initializing the object that will define our plot
- We do this regardless of what we want to plot
- We must specify a data source and an aesthetic mapping via the aes() function
library(ggplot2)
p1 <- ggplot(data = diabetes, mapping = aes(x = bmi, y = glu))
p1
This doesn’t make any plot yet
p1 + geom_point() + geom_line()
p2 <- ggplot(diabetes, aes(tc))
p2 + geom_histogram()
p2 + geom_density()
p3 <- ggplot(diabetes, aes(sex, bmi))
p3 + geom_boxplot()
We can also add statistical summaries of the data.
p1 + geom_point() + geom_smooth()
p1 + geom_point() + geom_smooth(method = "lm")
Changing the style outside of aes() applies the styling to the entire plot
p5 <- ggplot(titanic, aes(age, survived))
p5 + geom_jitter(color = "blue", size = 2, height = 0.1)
We can apply canned themes to adjust a plot's overall appearance. Default theme:
p1.1 <- p1 + geom_point()
p1.1 + theme_classic()
You can also modify individual themes
p1.1 + theme_classic() +
theme(axis.title = element_text(size = 16,
family = "serif",
face = "bold",
color = "blue"),
aspect.ratio = 1)
Faceting allows us to make arrays of conditional plots.
p7 <- ggplot(titanic, aes(age, survived, color = class)) +
geom_jitter(height = 0.05) +
geom_smooth(method = "glm",
method.args = list(family = "binomial"),
se = FALSE)
p7 + facet_wrap(vars(sex))
p8 <- ggplot(titanic, aes(age, survived)) +
geom_jitter(height = 0.05) +
geom_smooth(method = "glm",
method.args = list(family = "binomial"),
se = FALSE)
p8 + facet_grid(vars(sex), vars(class))
If we want to paste several different plots into a single figure (without faceting), we can use the utilities in the gridExtra package.
library(gridExtra)
grid.arrange(p1 + geom_point(),
p3 + geom_boxplot(),
p1 + geom_line(),
p8 + facet_grid(vars(sex), vars(class)),
ncol = 2)
When we receive new data, they are generally messy and contaminated by various anomalies and errors
One of the first steps in processing a new set of data is cleaning
By cleaning the data, we ensure a few properties:
- The data are in an analyzable format.
- All data take legal values.
- Any outliers are located and treated.
- Any missing data are located and treated.
Empty cells in a dataset where there should be observed values
- The missing cells correspond to true population values, but we haven’t observed those values
Not every empty cell is a missing datum
- Quality-of-life ratings for dead patients in a mortality study
- Firm profitability after the company goes out of business
- Self-reported severity of menstrual cramping for men
- Empty blocks of data following “gateway” items
Missing data patterns represent unique combinations of observed and missing items
- $P$ items -> $2^P$ possible patterns
## Load the mice library:
library(mice)
## Compute missing data patterns:
pats <- md.pattern(boys)
pats
plot_pattern(boys, rotate = TRUE)
#cases | age | reg | wgt | hgt | bmi | hc | gen | phb | tv | #missing |
---|---|---|---|---|---|---|---|---|---|---|
223 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
19 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 2 |
437 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 3 |
43 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 4 |
16 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 5 |
1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 6 |
1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 5 |
1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 3 |
1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 4 |
1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 |
3 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 4 |
0 | 3 | 4 | 20 | 21 | 46 | 503 | 503 | 522 | 1622 |
- The proportion of cells containing missing data
- Should be computed for each variable, not for the entire dataset
## Load the mice library:
library(mice)
## Load some example data:
data(boys, package = "mice")
## Compute variable-wise proportions missing:
mMat <- is.na(boys)
mMat %>% colMeans() %>% round(3)
age | hgt | wgt | bmi | hc | gen | phb | tv | reg |
---|---|---|---|---|---|---|---|---|
0.000 | 0.027 | 0.005 | 0.028 | 0.061 | 0.672 | 0.672 | 0.698 | 0.004 |
## Compute observation-wise proportions missing:
pmRow <- rowMeans(mMat)
## Summarize the above:
range(pmRow)
# [1] 0.0000000 0.7777778
pmRow[pmRow > 0] %>% range()
# [1] 0.1111111 0.7777778
median(pmRow)
# [1] 0.3333333
## Compute the proportion of complete cases:
mean(pmRow == 0)
# [1] 0.2981283
- The proportion of participants that drop-out of a study at each measurement occasion
- The proportion of cases available to estimate a given pairwise relationship
- e.g., a covariance between two variables
- Very important to have adequate coverage for the parameters you want to estimate
We can use routines from the mice package to calculate covariance coverage and response patterns.
## Compute the covariance coverage:
cc <- md.pairs(boys)$rr / nrow(boys)
## Check the result:
round(cc, 2)
age | hgt | wgt | bmi | hc | gen | phb | tv | reg | |
---|---|---|---|---|---|---|---|---|---|
age | 1.00 | 0.97 | 0.99 | 0.97 | 0.94 | 0.33 | 0.33 | 0.3 | 1.00 |
hgt | 0.97 | 0.97 | 0.97 | 0.97 | 0.92 | 0.32 | 0.32 | 0.3 | 0.97 |
wgt | 0.99 | 0.97 | 0.99 | 0.97 | 0.94 | 0.32 | 0.32 | 0.3 | 0.99 |
bmi | 0.97 | 0.97 | 0.97 | 0.97 | 0.91 | 0.32 | 0.32 | 0.3 | 0.97 |
hc | 0.94 | 0.92 | 0.94 | 0.91 | 0.94 | 0.33 | 0.33 | 0.3 | 0.93 |
gen | 0.33 | 0.32 | 0.32 | 0.32 | 0.33 | 0.33 | 0.33 | 0.3 | 0.33 |
phb | 0.33 | 0.32 | 0.32 | 0.32 | 0.33 | 0.33 | 0.33 | 0.3 | 0.33 |
tv | 0.30 | 0.30 | 0.30 | 0.30 | 0.30 | 0.30 | 0.30 | 0.3 | 0.30 |
reg | 1.00 | 0.97 | 0.99 | 0.97 | 0.93 | 0.33 | 0.33 | 0.3 | 1.00 |
The ggmice package provides some nice ways to visualize incomplete data and objects created during missing data treatment.
library(ggmice); library(ggplot2)
ggmice(boys, aes(wgt, hgt)) + geom_point()
We can also create a nicer version of the response pattern plot.
plot_pattern(boys, rotate = TRUE)
The naniar package also provides some nice visualization and numerical summary routines for incomplete data.
naniar::gg_miss_upset(boys)
We only consider univariate (i.e., involving only one variable) outliers
- Extreme values with respect to the distribution of a variable’s other observations
- A human height measurement of 3 meters
- A student with an income of $250,000
- Not accounting for any particular model
A univariate outlier may, or may not be an illegal value
- Data entry errors are probably the most common cause
- Outliers can also be legal, but extreme, values
Key Point: We choose to view an outlier as arising from a different population than the one to which we want to generalize our findings
Tukey (1977) described a procedure for flagging potential outliers based on a box-and-whiskers plot.
- Does not require normally distributed X
- Not sensitive to outliers
A fence is an interval defined as the following function of the first quartile, the third quartile, and the interquartile range (IQR): $[Q_1 - C \times IQR,\ Q_3 + C \times IQR]$
- Taking C = 1.5 produces the inner fence
- Taking C = 3.0 produces the outer fence
We can use these fences to identify potential outliers (see the sketch below):
- Any value that falls outside of the inner fence is a possible outlier
- Any value that falls outside of the outer fence is a probable outlier
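As a small illustration of the fence formula (a sketch that assumes the boys data from the mice package, as in the example that follows):
## Manual Tukey fences for the 'bmi' variable:
data(boys, package = "mice")
q1  <- quantile(boys$bmi, 0.25, na.rm = TRUE)
q3  <- quantile(boys$bmi, 0.75, na.rm = TRUE)
iqr <- q3 - q1
c(q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # inner fence (C = 1.5)
c(q1 - 3.0 * iqr, q3 + 3.0 * iqr)  # outer fence (C = 3.0)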
## Find potentially outlying cases:
(out <- boys %$% boxplot.stats(bmi, coef = 3)$out)
30.62 31.34 31.74 |
## Which observations are potential outliers?
boys %$% which(bmi %in% out)
574 668 733 |
## View the potentially outlying cases:
boys %>% filter(bmi %in% out)
age | hgt | wgt | bmi | hc | gen | phb | tv | reg | |
---|---|---|---|---|---|---|---|---|---|
1 | 15.493 | 182.5 | 102.0 | 30.62 | 57.7 | <NA> | <NA> | NA | west |
2 | 17.749 | 174.0 | 94.9 | 31.34 | 56.3 | G5 | P5 | 25 | west |
3 | 19.926 | 192.3 | 117.4 | 31.74 | 57.6 | G5 | P6 | 18 | north |
To compare statistics, we consider their breakdown points.
- The breakdown point is the minimum proportion of cases that must be replaced by ∞ to cause the value of the statistic to go to ∞.
The mean has a breakdown point of 1/N.
- Replacing a single value with ∞ will produce an infinite mean.
- This is why we shouldn’t use basic z-scores to find outliers.
The median has breakdown point of 50%.
- We can replace n < N/2 of the observations with ∞ without producing an infinite median.
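A tiny demonstration of these breakdown points (illustrative values, not from the slides):
## One infinite value ruins the mean but not the median:
x <- c(1:9, Inf)
mean(x)    # Inf
median(x)  # 5.5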
Nominal, ordinal, and binary items can have outliers
- Outliers on categorical variables are often more indicative of bad variables than outlying cases.
Ordinal
- Most participants endorse one of the lowest categories on an ordinal item, but a few participants endorse the highest category.
- The participants who endorse the highest category may be outliers.
Nominal
- Groups with very low membership may be outliers on nominal grouping variables.
Binary
- If most endorse the item, the few who do not may be outliers.
If we locate any outliers, they must be treated.
- Outliers caused by errors, mistakes, or malfunctions (i.e., error outliers) should be directly corrected.
- Labeling non-error outliers is a subjective task.
- A (non-error) outlier must originate from a population separate from the one we care about.
- Don’t blindly automate the decision process.
The most direct solution is to delete any outlying observation.
- If you delete non-error outliers, the analysis should be reported twice: with outliers and without.
For univariate outliers, we can use less extreme types of deletion.
- Delete outlying values (but not the entire observation).
- These empty cells then become missing data.
- Treat the missing values along with any naturally-occurring nonresponse.
Winsorization:
- Replace the outliers with the nearest non-outlying value
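A minimal sketch of winsorizing at the Tukey inner fences (one common variant clamps values to the fences rather than to the nearest observed value; the winsorize() helper below is hypothetical, not from the slides):
## Clamp values that fall outside the inner Tukey fences:
winsorize <- function(x, coef = 1.5) {
  q  <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  lo <- q[1] - coef * diff(q)
  hi <- q[2] + coef * diff(q)
  pmin(pmax(x, lo), hi)
}
boys$bmiWins <- winsorize(boys$bmi)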
We can also use robust regression procedures to estimate the model directly in the presence of outliers.
Weight the objective function to reduce the impact of outliers
- M-estimation
Trim outlying observations during estimation
- Least trimmed squares, MCD, MVE
Take the median, instead of the mean, of the squared residuals
- Least median of squares
Model some quantile of the DV’s distribution instead of the mean
- Quantile regression
Model the outcome with a heavy-tailed distribution
- Laplacian, Student’s T
The best-fit line is defined by a simple equation: \begin{equation} \hat Y = \hat\beta_0 + \hat\beta_1 X \end{equation}
The above equation only describes the best-fit line; we also need to account for the estimation error: \begin{equation} Y = \hat\beta_0 + \hat\beta_1 X + \hat\varepsilon \end{equation}
We estimate the coefficients by minimizing the sum of squared residuals: \begin{equation} RSS = \sum_{n=1}^{N} \hat\varepsilon_n^2 \end{equation}
We may also want to know how well our model explains the outcome.
- Our model explains some proportion of the outcome’s variability.
- The residual variance, $\hat\sigma^2 = \mathrm{Var}(\hat\varepsilon)$, will be less than $\mathrm{Var}(Y)$
We quantify the proportion of the outcome's variance that is explained by our model using the $R^2$ statistic:
\begin{equation} R^2=\frac{TSS-RSS}{TSS}=1-\frac{RSS}{TSS} \end{equation}
\begin{equation}
TSS = \sum_{n=1}^{N} (Y_n - \bar Y)^2 = \mathrm{Var}(Y) \times (N - 1)
\end{equation}
When assessing predictive performance, we will most often use the mean squared error (MSE) as our criterion:
\begin{equation} MSE = \frac{1}{N} \sum_{n=1}^{N} (Y_n - \hat Y_n)^2 = \frac{RSS}{N} \end{equation}
The MSE is the preferred criterion when our goal is prediction, where we care less about explanation.
We can use information criteria to quickly compare non-nested models while accounting for model complexity.
Information criteria balance two competing forces:
- The optimized loglikelihood, which quantifies fit to the data
- A penalty term, which corrects for model complexity
Here, K is the number of estimated parameters (driven, in this case, by the number of predictor variables in the model).
These statistics are only interesting for comparing models; on their own they do not say much.
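For reference, the usual definitions (standard formulas, not shown in these notes), where $\ln\hat L$ is the maximized loglikelihood, $K$ the number of estimated parameters, and $N$ the sample size:
$$AIC = -2\ln\hat L + 2K$$
$$BIC = -2\ln\hat L + K\ln N$$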
out1 <- lm(bp ~ age, data = dDat)
coef(out1)  # extract the estimated coefficients
Just simple linear regression with multiple predictor variables.
out1 <- lm(bp ~ age, data = dDat)
out2 <- lm(bp ~ age + bmi, data = dDat)
## Extract R^2 values:
r2.1 <- summary(out1)$r.squared
r2.2 <- summary(out2)$r.squared
We test whether the increase in $R^2$ is larger than we would expect by chance.
We use the F-statistic for this test:
f1 <- summary(out1)$fstatistic
f1
value numdf dendf
55.78116 1.00000 440.00000
pf(q = f1[1], df1 = f1[2], df2 = f1[3], lower.tail = FALSE)
value 4.392569e-13
f2 <- summary(out2)$fstatistic
f2
value numdf dendf
64.6647 2.0000 439.0000
pf(f2[1], f2[2], f2[3], lower.tail = FALSE)
value 2.433518e-25
How do we quantify the additional variation explained by BMI, above and beyond age?
- Compute the $\Delta R^2$
## Compute change in R^2:
r2.2 - r2.1
[1] 0.115049
How do we know if $\Delta R^2$ represents a significantly greater degree of explained variation?
- Use an F-test for $H_0: \Delta R^2 = 0$ vs. $H_1: \Delta R^2 > 0$
## Is that increase significantly greater than zero?
anova(out1, out2)
## Analysis of Variance Table
##
## Model 1: bp ~ age
## Model 2: bp ~ age + bmi
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)
## 1    440 74873
## 2    439 65167  1    9706.1 65.386 6.057e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can also compare models based on their prediction errors.
- For OLS regression, we usually compare MSE values.
library(MLmetrics)  # provides the MSE() function used below
mse1 <- MSE(y_pred = predict(out1), y_true = dDat$bp)
mse2 <- MSE(y_pred = predict(out2), y_true = dDat$bp)
mse1
# [1] 169.3963
mse2
# [1] 147.4367
In this case, the MSE for the model with BMI included is smaller.
- We should prefer the larger model.
Finally, we can compare models based on information criteria.
AIC(out1, out2)
# df AIC
# out1 3 3528.792
# out2 4 3469.424
BIC(out1, out2)
# df BIC
# out1 3 3541.066
# out2 4 3485.789
In this case, both the AIC and the BIC for the model with BMI included are smaller.
- We should prefer the larger model.
Categorical predictors have to be converted to dummy codes:
  | drink | juice | tea |
1 | juice | 1 | 0 |
2 | coffee | 0 | 0 |
3 | tea | 0 | 1 |
4 | tea | 0 | 1 |
These dummy variables can then be entered into the model as usual: $$Y = \beta_0 + \beta_1 X_{juice} + \beta_2 X_{tea} + \varepsilon$$
All R factors have an associated contrasts attribute.
- The contrasts define a coding to represent the grouping information.
- Modeling functions code the factors using the rules defined by the contrasts.
x <- factor(c("a", "b", "c"))
contrasts(x)
b | c | |
a | 0 | 0 |
b | 1 | 0 |
c | 0 | 1 |
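A small illustration of how a factor's contrasts turn into dummy codes in the design matrix (using a hypothetical drink factor like the one in the table above):
## Dummy codes generated by model.matrix() from the factor's contrasts:
drink <- factor(c("juice", "coffee", "tea", "tea"))
model.matrix(~ drink)
##   (Intercept) drinkjuice drinktea
## 1           1          1        0
## 2           1          0        0
## 3           1          0        1
## 4           1          0        1
## (the "assign" and "contrasts" attributes are omitted from this printout)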
For variables with only two levels, we can test the overall factor’s significance by evaluating the significance of a single dummy code.
out <- lm(Price ~ Man.trans.avail, data = Cars93)
partSummary(out, -c(1, 2))
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 23.841 1.623 14.691 <2e-16
# Man.trans.availYes -6.603 2.004 -3.295 0.0014
# Residual standard error: 9.18 on 91 degrees of freedom
# Multiple R-squared: 0.1066,Adjusted R-squared: 0.09679
# F-statistic: 10.86 on 1 and 91 DF, p-value: 0.001403
For variables with more than two levels, we need to simultaneously evaluate the significance of each of the variable’s dummy codes
partSummary(out4, -c(1, 2))
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 21.7187 2.9222 7.432 6.25e-11
# Man.trans.availYes -5.8410 1.8223 -3.205 0.00187
# DriveTrainFront -0.2598 2.8189 -0.092 0.92677
# DriveTrainRear 10.5169 3.3608 3.129 0.00237
# Residual standard error: 8.314 on 89 degrees of freedom
# Multiple R-squared: 0.2834,Adjusted R-squared: 0.2592
# F-statistic: 11.73 on 3 and 89 DF, p-value: 1.51e-06
summary(out4)$r.squared - summary(out)$r.squared
# [1] 0.1767569
# anova(out, out4)
# Analysis of Variance Table
# Model 1: Price ~ Man.trans.avail
# Model 2: Price ~ Man.trans.avail + DriveTrain
# Res.Df RSS Df Sum of Sq F Pr(>F)
# 1 91 7668.9
# 2 89 6151.6 2 1517.3 10.976 5.488e-05 ***
# ---
# Signif. codes:
# 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To quantify uncertainty in our predictions, we want to use an appropriate interval estimate. There are two flavors:
- The CI for $\hat Y_m$ gives a likely range (in the sense of coverage probability and "confidence") for the $m$th value of the true conditional mean.
- CIs only account for uncertainty in the estimated regression coefficients, $\{\hat\beta_0, \ldots, \hat\beta_P\}$.
- The prediction interval for $Y_m$ gives a likely range (in the same sense as CIs) for the $m$th outcome value.
- Prediction intervals also account for the regression errors, $\varepsilon$.
CIs for $\hat Y_m$ only account for uncertainty in the best-fit line.
Prediction intervals are wider than CIs.
- They account for the additional uncertainty contributed by $\varepsilon$.
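A minimal sketch with predict() (assuming the out1 model and the dDat data used earlier; the age value is arbitrary):
## Confidence vs. prediction intervals for a new observation:
newData <- data.frame(age = 50)
predict(out1, newdata = newData, interval = "confidence")  # CI for the conditional mean
predict(out1, newdata = newData, interval = "prediction")  # wider: also reflects the error variance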
Additive models allow us to examine the partial effects of several predictors on some outcome.
- The effect of one predictor does not change based on the values of other predictors
Moderation allows us to ask when one variable, X, affects another variable, Y.
- We’re considering the conditional effects of X on Y given certain levels of a third variable Z.
- How much one variable affects the outcome depends on the level of another variable
In additive MLR we have the following equation: $Y = \beta_0 + \beta_1 X + \beta_2 Z + \varepsilon$
When X and Z are independent predictors, the following are true:
- X and Z can be correlated
- $\beta_1$ and $\beta_2$ are partial regression coefficients
- The effect of X on Y is the same at all levels of Z, and the effect of Z on Y is the same at all levels of X.
When testing for moderation, we hypothesize that the effect of X on Y varies as a function of Z:
$Y=β_0+f(Z)X+β_2Z+ε$
If we assume that Z linearly affects the relationship between X and Y:
$f(Z)=β_1+β_3Z$
Combining the two equations:
$Y=β_0+(β_1+β_3Z)X+β_2Z+ε$ $Y=β_0+β_1X+β_2Z+β_3XZ+ε$
To test for moderation, we simply need to test the significance of the interaction term, XZ.
- $\hat\beta_3$ is the expected change in the effect of X on Y for a unit change in Z.
- The lower-order coefficients are interpreted where the other predictor is zero:
  - For a unit change in X, $\hat\beta_1$ is the expected change in Y, when Z = 0.
  - For a unit change in Z, $\hat\beta_2$ is the expected change in Y, when X = 0.
## Load data:
socSup <- readRDS(paste0(dataDir, "social_support.rds"))
## Estimate the moderated regression model:
out <- lm(bdi ~ tanSat * sex, data = socSup)
partSummary(out, -c(1, 2))
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 20.8478 6.2114 3.356 0.00115
# tanSat -0.5772 0.3614 -1.597 0.11372
# sexmale 14.3667 12.2054 1.177 0.24223
# tanSat:sexmale -0.9482 0.7177 -1.321 0.18978
# Residual standard error: 9.267 on 91 degrees of freedom
# Multiple R-squared: 0.08955,Adjusted R-squared: 0.05954
# F-statistic: 2.984 on 3 and 91 DF, p-value: 0.03537
You always have to make some assumptions to create a model.
Assumption 1: The model is linear in the parameters.
- This is OK: $Y = \beta_0 + \beta_1 X + \beta_2 Z + \beta_3 X^2 + \varepsilon$
- This is not: $Y = \beta_0 X^{\beta_1} + \varepsilon$
  - The power here is also a parameter, which is not allowed.
If the model is not linear in the parameters, then we’re not even working with linear regression.
- We need to move to entirely different modeling paradigm
Each modeled X must exhibit a linear relation with Y.
- We can define X via nonlinear transformations of the original data.
out1 <- lm(MPG ~ Horsepower,
data = Cars93)
plot(Cars93$MPG, Cars93$Horsepower)
plot(out1, 1)
In multiple regression models, basic residual plots won’t tell us which predictors exhibit nonlinear associations.
We can use Component + Residual Plots (AKA, partial residual plots) to visualize the unique effects of each X variable.
library(car)
crPlots(out3)
Nonlinearity in the residual plots is usually a sign of either:
- Model misspecification
- Influential observations
This type of model misspecification usually implies omitted functions of modeled variables.
- Polynomial terms
- Interactions
The solution is to include the omitted term into the model and refit.
- This is very much easier said than done.
Assumption 2: The predictor matrix is full rank.
- $N > P$
- No $X_p$ can be a linear combination of other predictors
If the predictor matrix is not full rank, the model is not estimable.
- The parameter estimates cannot be uniquely determined from the data.
Assumption 3: The predictors are exogenous, i.e., they do not correlate with the errors.
- $\mathrm{Cov}(\hat Y, \varepsilon) = 0$
- $E[\varepsilon_n] = 0$
If the predictors are not exogenous, the estimated regression coefficients will be biased.
The most common cause of endogeneity (i.e., violating Assumption 3) is omitted variable bias.
- If we leave an important predictor variable out of our equation, some modeled predictors will become endogenous and their estimated regression slopes will be biased.
- The omitted variable must be correlated with Y and at least one of the modeled Xp, to be a problem.
Assume the following is the true regression model: $Y = \beta_0 + \beta_1 X + \beta_2 Z + \varepsilon$
Now, suppose we omit Z from the model: $Y = \beta_0 + \beta_1 X + \omega$
Our new error, $\omega = \beta_2 Z + \varepsilon$, is a combination of the true error, $\varepsilon$, and the omitted term, $\beta_2 Z$.
- Consequently, if X and Z are correlated, omitting Z induces a correlation between X and $\omega$ (i.e., endogeneity).
Omitted variable bias can have severe consequences, but you can’t really test for it.
- The errors are correlated with the predictors, but our model is estimated under the assumption of exogeneity, so the residuals from our model will generally be uncorrelated with the predictors.
- We mostly have to pro-actively work to include all relevant variables in our model.
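A small simulation sketch of omitted-variable bias (purely illustrative; the variable names and coefficients are made up):
## Z affects Y and is correlated with X:
set.seed(1)
n <- 10000
z <- rnorm(n)
x <- 0.5 * z + rnorm(n)
y <- 1 + 0.3 * x + 0.7 * z + rnorm(n)
coef(lm(y ~ x + z))["x"]  # close to the true value of 0.3
coef(lm(y ~ x))["x"]      # noticeably too large: X absorbs part of Z's effect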
Assumption 4: The errors have constant, finite variance: $\mathrm{Var}(\varepsilon_n) = \sigma^2 < \infty$ (part of the spherical errors assumption discussed below).
Non-constant error variance will not bias the parameter estimates.
- The best fit line is still correct.
- Our measure of uncertainty around that best fit line is wrong.
Heteroscedasticity will bias standard errors (usually downward).
- Test statistics will be too large.
- CIs will be too narrow.
- We will have inflated Type I error rates.
- Transform your outcome using a concave function (e.g., ln(Y), √Y).
- These transformations will shrink extreme values more than small/moderate ones.
- It’s usually a good idea to first shift the variable’s scale by setting the minimum value to 1.
- Refit the model using weighted least squares.
- Create inverse weights using functions of the residual variances or quantities highly correlated therewith.
- Use a Heteroscedasticity Consistent (HC) estimate of the asymptotic covariance matrix.
- Robust standard errors, Huber-White standard errors, Sandwich estimators
- HC estimators correct the standard errors for non-constant error variance.
## The 'sandwich' package provides several HC estimators:
library(sandwich)
## the 'lmtest' package provides fancy testing tools for linear models:
library(lmtest)
## Use sandwich estimator to compute ACOV matrix:
hcCov <- vcovHC(out1)
## Test coefficients with robust SEs:
robTest <- coeftest(out1, vcov = hcCov)
## Test coefficients with default SEs:
defTest <- summary(out1)$coefficients
We can check for non-constant error variance with a residual plot.
- Non-constant error variance violates Assumption 4 (see the spherical errors discussion below).
Errors can become correlated in two basic ways:
- Serial dependence
- When modeling longitudinal data, the errors for a given observational unit can be correlated over time
- We can detect temporal dependence by examining the autocorrelation of the residuals
- Clustering
- Your data have some important, unmodeled, grouping structure
- Children nested within classrooms
- Romantic couples
- Departments within a company
- We can detect problematic levels of clustering with the intraclass correlation coefficient (ICC)
- We need to know the clustering variable to compute the ICC
Serially dependent errors in a longitudinal model usually indicate an inadequate model.
- Your model is ignoring some important aspect of the temporal variation that is being absorbed by the error terms.
- Hopefully, you can add the missing component to your model.
Clustering can be viewed as theoretically meaningful or as a nuisance factor that needs to be controlled.
If the clustering is meaningful, you should model the data using multilevel modeling:
- Hierarchical linear regression
- Mixed models
- Random effects models
If the clustering is an uninteresting nuisance, you can use specialized HC variance estimators that deal with clustering
## Read in some data:
LeeBryk <- readRDS(paste0(dataDir, "lee_bryk.rds"))
## Estimate a linear regression model:
fit <- lm(math ~ ses + sector, data = LeeBryk)
## Calculate the residual ICC:
ICC::ICCbare(x = LeeBryk$schoolid, y = resid(fit))
0.07487712 |
## Robust tests:
coeftest(fit, vcov = vcovCL(fit, ~ schoolid))
#t test of coefficients:
#Estimate Std. Error t value Pr(>|t|)
#(Intercept) 11.79965 0.20318 58.0746 < 2.2e-16 ***
#ses 2.94860 0.12794 23.0475 < 2.2e-16 ***
#sectorprivate 1.93495 0.31717 6.1006 1.111e-09 ***
#---
#Signif. codes:
#0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If errors are non-normal, small-sample inferences may be biased.
- The justification for some tests and procedures used in regression analysis may not hold
plot(out1, 2)
One of the best ways to evaluate the normality of the error distribution with a Q-Q Plot.
- Plot the quantiles of the residual distribution against the theoretically ideal quantiles.
- We can actually use a Q-Q Plot to compare any two distributions.
In small samples, with fixed predictors, normally distributed errors imply normal sampling distributions for the regression coefficients.
- In large samples, the central limit theorem implies normal sampling distributions for the coefficients, regardless of the error distribution.
Prediction intervals require normally distributed errors.
- Confidence intervals for predictions share the same normality requirements as the coefficients’ sampling distributions.
Parameter estimates will not be fully efficient.
- Standard errors will be larger than they would have been with normally distributed errors.
We usually don’t need to do anything about non-normal errors
- The central limit theorem will protect our inferences.
We can use bootstrapping to get around the need for normality.
- Treat your sample as a synthetic population from which you draw many new samples (with replacement).
- Estimate your model in each new sample.
- The replicates of your estimated parameters generate an empirical sampling distribution that you can use for inference.
Bootstrapping can be used for inference on pretty much any estimable parameter, but it won’t work with small samples.
- Need to assume that your sample is representative of the population
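A minimal nonparametric bootstrap sketch (assuming the LeeBryk data and the math ~ ses + sector model used in the clustering example above; the boot package offers a more complete implementation):
## Resample rows, refit the model, and collect the coefficients:
set.seed(235711)
boots <- replicate(1000, {
  idx <- sample(nrow(LeeBryk), replace = TRUE)
  coef(lm(math ~ ses + sector, data = LeeBryk[idx, ]))
})
## Percentile 95% CI for the 'ses' slope:
quantile(boots["ses", ], c(0.025, 0.975))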
The assumption of spherical errors combines Assumptions 4 and 5.
If the errors are not spherical, the standard errors will be biased.
- The estimated regression coefficients will be unbiased, though.
- But test statistics and interval estimates can be biased.
After fitting the model, we check the assumptions; most assumptions cannot be checked before fitting.
If some of the assumptions are violated, the inferences we make using the model may be wrong.
- We need to check the tenability of our assumptions before leaning too heavily on the model estimates.
These checks are called regression diagnostics:
- Graphical visualizations
- Quantitative indices/measures
- Formal statistical tests
A drawback is that many of these formal tests require assumptions of their own.
One of the most useful diagnostic graphics is the plot of residuals vs. predicted values.
We can create this plot by plotting the fitted lm() object in R.
out1 <- lm(Price ~ Horsepower,
data = Cars93)
plot(out1, 1)
Influential observations contaminate analyses in two ways:
- Exert too much influence on the fitted regression model
- Invalidate estimates/inferences by violating assumptions
There are two distinct types of influential observations:
- Outliers
- Observations with extreme outcome values, relative to the other data.
- Observations with outcome values that fit the model very badly.
- High-leverage observations
- Observation with extreme predictor values, relative to other data
Outliers can be identified by scrutinizing the residuals.
- Observations with residuals of large magnitude may be outliers.
- The difficulty arises in quantifying what constitutes a “large” residual.
If the residuals do not have constant variance, then we cannot directly compare them.
- We need to standardize the residuals in some way
We are specifically interested in externally studentized residuals.
We can’t simply standardize the ordinary residuals.
- Internally studentized residuals
- Outliers can pull the regression line towards themselves.
- The internally studentized residuals for outliers will be too small.
Begin by defining the concept of a deleted residual: $$\hat\varepsilon_{(n)} = Y_n - \hat Y_{(n)}$$
- $\hat\varepsilon_{(n)}$ quantifies the distance of $Y_n$ from the regression line estimated after excluding the nth observation.
- $\hat Y_{(n)}$ is the predicted value for the nth observation from the regression line estimated after excluding the nth observation.
If we standardize the deleted residual, $\hat\varepsilon_{(n)}$, we get the externally studentized residual, $t_{(n)}$:
- Each t(n) is scaled equivalently.
- We can directly compare different t(n) .
- The t(n) are Student’s t distributed.
- We can quantify outliers in terms of quantiles of the t distribution.
- |t(n) | > 3.0 is a common rule of thumb for flagging outliers.
Index plots of the externally studentized residuals can help spotlight potential outliers.
- Look for observations that clearly “stand out from the crowd”
rstudent(out1) %>% plot()
We identify high-leverage observations through their leverage values.
- An observation's leverage, $h_n$, quantifies the extent to which its predictors affect the fitted regression model.
- Observations with $X$ values very far from the mean, $\bar X$, affect the fitted model disproportionately.
Index plots of the leverage values can help spotlight high-leverage points.
- Again, look for observations that clearly “stand out from the crowd.”
In simple linear regression, the nth leverage is given by: $$h_n = \frac{1}{N} + \frac{(X_n - \bar X)^2}{\sum_{m=1}^{N} (X_m - \bar X)^2}$$
hatvalues(out1) %>% plot()
Observations with high leverage or large (externally) studentized residuals are not necessarily influential.
- High-leverage observations tend to be more influential than outliers.
- The worst problems arise from observations that are both outliers and have high leverage.
Measures of influence simultaneously consider extremity in both X and Y dimensions.
- Observations with high measures of influence are very likely to cause problems.
All measures of influence use the same logic as the deleted residual.
- Compare models estimated from the whole sample to models estimated from samples excluding individual observations.
One of the most common measures of influence is Cook's Distance: <<cooks>> $$D_n = \frac{\sum_{m=1}^{N} (\hat Y_m - \hat Y_{m(n)})^2}{(P + 1)\hat\sigma^2} = (P + 1)^{-1}\, t_n^2\, \frac{h_n}{1 - h_n}$$
Index plots of Cook’s distances can help spotlight the influential points.
- Look for observations that clearly “stand out from the crowd.”
cd <- cooks.distance(out1)
plot(cd)
Observation number 28 was the most influential according to Cook’s Distance
(maxD <- which.max(cd))
28 |
## Exclude the influential case:
Cars93.2 <- Cars93[-maxD, ]
## Fit model with reduced sample:
out2 <- lm(Price ~ Horsepower, data = Cars93.2)
round(summary(out1)$coefficients, 6)
  | Estimate | Std. Error | t value | Pr(>|t|) |
(Intercept) | -1.398769 | 1.820016 | -0.768548 | 0.444152 |
Horsepower | 0.145371 | 0.011898 | 12.218325 | 0.000000 |
round(summary(out2)$coefficients, 6)
  | Estimate | Std. Error | t value | Pr(>|t|) |
(Intercept) | -2.837646 | 1.806418 | -1.570868 | 0.119722 |
Horsepower | 0.156750 | 0.011996 | 13.066942 | 0.000000 |
Removing the outlier did not have a massive effect on the slope, although the estimates and test statistics did change somewhat.
You could also remove the 2 most influential observations
(maxDs <- sort(cd) %>% names() %>% tail(2) %>% as.numeric())
## Exclude influential cases:
Cars93.2 <- Cars93[-maxDs, ]
## Fit model with reduced sample:
out2.2 <- lm(Price ~ Horsepower, data = Cars93.2)
The most common way to address influential observations is simply to delete them and refit the model.
- This approach is often effective—and always simple—but it is not fool-proof.
- Although an observation is influential, we may not be able to justify excluding it from the analysis.
Robust regression procedures can estimate the model directly in the presence of influential observations.
- Observations in the tails of the distribution are weighted less in the estimation process, so outliers and high-leverage points cannot exert substantial influence on the fit.
- We can do robust regression with the rlm() function from the MASS package.
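For example, a minimal sketch using the Cars93 data from the MASS package (as in the earlier diagnostics examples):
## Robust (M-estimation) alternative to the OLS fit:
library(MASS)
rlmFit <- rlm(Price ~ Horsepower, data = Cars93)
summary(rlmFit)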
So far we have only discussed general linear models:
All flavors of linear regression are general linear models:
- Simple Linear regression, Multiple Linear Regression
- t-test, ANOVA, ANCOVA
- Multilevel linear regression models
We can break our model into pieces:
- $\eta = \beta_0 + \sum_{p=1}^{P} \beta_p X_p$
- $Y = \eta + \varepsilon$
Where:
- $\eta$ is the systematic component of the model (AKA the linear predictor)
- The normal distribution, $N(\eta, \sigma^2)$, is the model's random component
The purpose of general linear modeling is to build a model of the outcome's mean.
- In this case, $\mu_Y = \eta$
- The systematic component defines the mean of $Y$
The random component quantifies variability around $\mu_Y$.
- In the general linear model, we assume that this variability follows a normal distribution.
We can generalize the models we've been using in two important ways:
- Allow for random components other than the normal distribution
- Allow for more complicated relations between $\mu_Y$ and $\eta$ by introducing a link function: $g(\mu_Y) = \eta$
In this way we obtain the class of generalized linear models (GLMs).
The random component in a GLM can be any distribution from the exponential family:
- Normality
- Binomial
- Poisson
- Many others…
The systematic component of a GLM is exactly the same as it is in general linear models.
- $\eta = \beta_0 + \sum_{p=1}^{P} \beta_p X_p$
In GLMs, we do not model $\mu_Y$ directly.
- We first transform $\mu_Y$ via a link function.
The link function serves two important purposes:
- It linearizes the association between $X$ and $Y$
  - Nonlinear: $X \to \mu_Y$
  - Linear: $X \to g(\mu_Y)$
- It allows GLMs for outcomes with restricted ranges without requiring any restrictions on the range of the $\{X_p\}$
  - In many cases, $\mu_Y$ has a restricted range
    - Counts: $\mu_Y > 0$
    - Probabilities: $\mu_Y \in [0, 1]$
  - When correctly specified, $g(\mu_Y)$ can take any value on the real line
Every GLM is built from three components:
- The systematic component, $\eta$
  - A linear function of the predictors, $\{X_p\}$
  - Describes the association between $X$ and $Y$
- The link function, $g(\mu_Y)$
  - Linearizes the relation between $X$ and $Y$
  - Transforms $\mu_Y$ so that it can take any value on the real line
- The random component, $P(Y \mid g^{-1}(\eta))$
  - The distribution of the observed $Y$
  - Quantifies the error variance around $\eta$
The General Linear Model is a special case of GLM
- Systematic component: $$\eta = \beta_0 + \sum_{p=1}^{P} \beta_p X_p$$
- Link function: $$\mu_Y = \eta$$
- Random component: $$Y \sim N(\eta, \sigma^2)$$
In R, you can fit a generalized linear model with the glm() function:
data(iris)
## Generalized linear model:
glmFit <- glm(Petal.Length ~ Petal.Width + Species,
family = gaussian(link = "identity"),
data = iris)
We want to use logistic regression for a classification task with a two-level outcome.
Using a linear model does not work well for this kind of task; logistic regression is better suited:
- The link function ensures legal predicted values (between 0 and 1)
- The sigmoid curve implies fluctuation in the effectiveness of extra study time
  - In the example from the slides, more study time is most beneficial for students with around 5.5 hours of study
In logistic regression problems we are modeling binary data: $Y \in \{1 = \text{"Success"}, 0 = \text{"Failure"}\}$
The random component in our logistic regression model will be the binomial distribution.
The mean of the binomial distribution (with a single trial) is the success probability, $\pi$.
- We are interested in modeling this success probability: $\mu_Y = \pi$
We use the logit link function to map the success probability onto the entire real line (its inverse maps the linear predictor back to a value between 0 and 1).
- Given $\pi$, the odds of success are: $$O_s = \frac{\pi}{1 - \pi}$$
- Because $\pi \in [0, 1]$, we know that $O_s \ge 0$
- We take the natural log of the odds as the last step to fully map $\pi$ onto the real line: $$\operatorname{logit}(\pi) = \ln\left(\frac{\pi}{1 - \pi}\right)$$
Note that the logit maps the probability onto the real line, not the other way around.
- For the other direction we use the logistic function: <<inverse_logit>> $$\sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{1 + e^x}$$
- Also called the sigmoid or expit function
- It is the inverse of the logit function
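In R, these two functions are available as qlogis() (the logit) and plogis() (the inverse logit); a quick check:
p <- 0.75
qlogis(p)          # log(p / (1 - p)) = 1.0986...
plogis(qlogis(p))  # back to 0.75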
Our final logistic regression model is: $Y \sim \mathrm{Bin}(\pi, 1)$ with $\operatorname{logit}(\pi) = \beta_0 + \sum_{p=1}^{P} \beta_p X_p$
The fitted model can be represented as: $$\operatorname{logit}(\hat\pi) = \hat\beta_0 + \sum_{p=1}^{P} \hat\beta_p X_p$$
Once we have a fitted model, we need to interpret the fitted coefficients, $\{\hat\beta_p\}$, which are on the log-odds (logit) scale.
Log odds do not lend themselves to interpretation.
- We can convert the effects back to an odds scale by exponentiating them
- $\hat\beta$ has log-odds units, but $e^{\hat\beta}$ has odds units
Note that for all of these interpretations, odds and probabilities are not the same thing!
Exponentiating the coefficients also converts the additive effects into multiplicative effects.
- On the log-odds scale, we can interpret $\hat\beta_p$ as we would in linear regression: a unit change in $X_p$ produces an expected change of $\hat\beta_p$ units in $\operatorname{logit}(\pi)$
- After exponentiation, however, unit changes in $X_p$ imply multiplicative changes in the odds, $O_s = \frac{\pi}{1 - \pi}$: a unit change in $X_p$ multiplies $O_s$ by $e^{\hat\beta_p}$
Due to the confusing interpretation of the coefficients, we often focus on just the direction (valence) of the effects.
- Additional study time is associated with increased odds of passing
- $\hat\beta_p > 0$ and $e^{\hat\beta_p} > 1$ both imply "increased success"
For example interpretations, see the slides (starting from slide 28).
We don't have an $R^2$ statistic or an F-test for comparing logistic regression models; instead, we can compare nested models with a likelihood-ratio test of the change in deviance:
anova(fit0, fit, test = "LRT")
## Analysis of Deviance Table
##
## Model 1: survived ~ age + sex
## Model 2: survived ~ age + sex + class
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 884 916.00
## 2 882 801.59 2 114.41 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can also use information criteria
AIC(fit0, fit)
## df AIC
## fit0 3 921.9989
## fit 5 811.5940
BIC(fit0, fit)
## df BIC
## fit0 3 936.3624
## fit 5 835.5333
Given a fitted logistic regression model, we can get predictions for new observations in two steps:
- Directly apply the systematic component to get predictions on the linear predictor (log-odds) scale, $\hat\eta'$
- Apply the inverse link function, $g^{-1}(\cdot)$, to $\hat\eta'$ to get the predicted success probabilities, $\hat\pi'$
Once we have computed the predicted success probabilities, we can classify new observations as successes or failures by choosing a threshold, $t$:
$$\hat Y' = \begin{cases} 1 & \text{if } \hat\pi' \ge t \\ 0 & \text{if } \hat\pi' < t \end{cases}$$
Note that the optimal threshold is not always $t = 0.5$.
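A minimal sketch of these steps in R (assuming a fitted logistic regression object such as the glmFit titanic model estimated below; the 0.5 threshold is just an example):
etaHat <- predict(glmFit, type = "link")      # linear predictor (log-odds) scale
piHat  <- predict(glmFit, type = "response")  # success probabilities, plogis(etaHat)
yHat   <- ifelse(piHat >= 0.5, 1, 0)          # classify with threshold t = 0.5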
As with linear regression, the model must be linear in the parameters.
- This is OK: $$\operatorname{logit}(\pi) = \beta_0 + \beta_1 X + \beta_2 Z + \beta_3 X^2$$
- This is not: $$\operatorname{logit}(\pi) = \beta_0 X^{\beta_1}$$
Assumption 1 implies a linear relation between continuous predictors and the logit of the success probability.
- We can basically evaluate the linearity assumption using the same methods we applied with linear regression.
car::crPlots(glmFit, terms = ~ age + fare)
Assumption 2 (the predictor matrix is full rank) is basically the same as in linear regression. It implies two conditions:
- $N > P$
- No $X_p$ can be a linear combination of other predictors (no severe multicollinearity)
We can quantify multicollinearity with the variance inflation factor (VIF).
car::vif(glmFit)
age | sex | fare |
1.031829 | 1.007699 | 1.026373 |
VIF > 10 indicates severe multicollinearity.
The distributional assumptions of logistic regression are not framed in terms of residuals.
This assumption fills the role played by all residual-related assumptions in linear regression.
Instead we consider the entire outcome distribution in logistic regression.
Assumption 3 implies several conditions.
- The outcome, Y, is binary.
- The linear predictor, $\eta$, can explain all the systematic trends in $\pi$:
  - No residual clustering after accounting for X.
  - No important variables omitted from X.
We can easily check the first condition with summary statistics.
levels(titanic$survived)
“no” | “yes” |
table(titanic$survived)
no | yes |
545 | 342 |
If we have a non-binary, categorical outcome, we can use a different type of model.
- Multiclass nominal variables: multinomial logistic regression via nnet::multinom()
- Ordinal variables: proportional odds logistic regression via MASS::polr()
- Counts: Poisson regression via glm() with family = "poisson"
The binomial distribution is also appropriate for modeling the proportion of successes in N trials.
For the second condition:
We can check for residual clustering by calculating the ICC using deviance residuals.
## Check for residual dependence induced by 'class':
ICC::ICCbare(x = titanic$class, y = resid(glmFit, type = "deviance"))
0.1054665 |
In logistic regression the outcome is binary, but the predicted values are continuous probabilities.
- Due to this mismatch in measurement levels, we don't have a single natural definition of a "residual" in logistic regression.
The most basic residual is the raw residual, $e_n = Y_n - \hat\pi_n$:
- The difference between the observed outcome value and the predicted probability
We train a model on the titanic dataset:
## Read the data:
titanic <- readRDS(paste0(dataDir, "titanic.rds"))
## Estimate the logistic regression model:
glmFit <- glm(survived ~ age + sex + fare,
data = titanic,
family = "binomial")
## Save the linear predictor estimates:
titanic$etaHat <- predict(glmFit, type = "link")
Raw residual plot:
library(ggplot2)
## Calculate the raw residuals:
titanic$e <- resid(glmFit, type = "response")
## Plot raw residuals vs. fitted
## linear predictor values:
ggplot(titanic, aes(etaHat, e)) +
geom_point() +
geom_smooth() +
theme_classic() +
xlab("Linear Predictor") +
ylab("Raw Residual")
Pearson residuals, $r_n$, scale the raw residuals by the standard deviation of the predicted probability, $r_n = \frac{Y_n - \hat\pi_n}{\sqrt{\hat\pi_n (1 - \hat\pi_n)}}$:
## Calculate the Pearson residuals:
titanic$r <- resid(glmFit, type = "pearson")
Deviance residuals, $d_n$, are the signed square roots of each observation's contribution to the residual deviance.
The residual deviance, $D = \sum_{n=1}^{N} d_n^2$:
- Quantifies how well the model fits the data
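A short sketch of how the deviance residuals relate to the residual deviance (assuming the glmFit model and titanic data from above):
titanic$d <- resid(glmFit, type = "deviance")
sum(titanic$d^2)  # equals the residual deviance...
deviance(glmFit)  # ...reported for the fitted model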
In addition to the preceding statistical assumptions, we must satisfy three computational requirements that were not necessary in linear regression.
Logistic regression models are estimated with numerical methods, so we need larger samples than we would for linear regression models.
- The sample size requirements increase with model complexity.
Some suggested rules of thumb:
- 10 cases for each predictor (Agresti, 2018)
- $N = 10P/\pi_0$ (Peduzzi, Concato, Kemper, Holford, & Feinstein, 1996)
  - $P$: number of predictors
  - $\pi_0$: proportion of the minority class
- $N = 100 + 50P$ (Bujang, Omar, & Baharum, 2018)
The logistic regression may not perform well when the outcome classes are severely imbalanced.
- i.e., when there are far more observations of one class than of the other
with(titanic, table(survived) / length(survived))
no | yes |
0.6144307 | 0.3855693 |
We have a few possible solutions for problematic imbalance:
- Down-sampling the majority class
  - Randomly remove some samples from the majority class
- Up-sampling the minority class
  - Randomly select samples from the minority class to use multiple times
- Using case weights when estimating the logistic regression model
  - Via the weights argument in glm()
We don't actually want to perfectly predict class membership.
- The model cannot be estimated when the classes are perfectly separable.
Why?
- Imagine the shape the logistic curve would need to separate the classes perfectly:
  - The predicted probability would have to jump from (nearly) 0 to (nearly) 1.
  - To achieve this, the slope would have to be infinite, which obviously cannot be estimated.
Model regularization (e.g., ridge or LASSO penalty) may help.
glmnet::glmnet()
As with linear regression, we need to deal with any overly influential cases.
- We can use the linear predictor values to calculate Cook’s Distances.
- Any case that exerts undue influence on the linear predictor will have the same effect on the predicted success probabilities.
cooks.distance(glmFit) %>% plot()
plot(glmFit, 4)
One of the most direct ways to evaluate classification performance is the confusion matrix.
## Add predictions to the dataset:
titanic %<>%
mutate(piHat = predict(glmFit, type = "response"),
yHat = as.factor(ifelse(piHat <= 0.5, "no", "yes")))
library(caret)
cMat <- titanic %$% confusionMatrix(data = yHat, reference = survived)
cMat$table
Reference | ||
Prediction | no | yes |
no | 458 | 106 |
yes | 87 | 236 |
Each cell in the confusion matrix represents a certain classification result:
- TP true positive: Correctly predicted positive
- TN true negative: Correctly predicted negative
- FP false positive: Falsely predicted positive
- FN false negative: Falsely predicted negative
cMat$overall
Accuracy | Kappa | AccuracyLower | AccuracyUpper | AccuracyNull | AccuracyPValue | McnemarPValue |
7.824126e-01 | 5.359709e-01 | 7.537802e-01 | 8.091549e-01 | 6.144307e-01 | 7.471405e-27 | 1.950898e-01 |
cMat$byClass
Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall |
0.8403670 | 0.6900585 | 0.8120567 | 0.7306502 | 0.8120567 | 0.8403670 |
F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy | |
0.8259693 | 0.6144307 | 0.5163472 | 0.6358512 | 0.7652127 |
Summary statistics of the confusion matrix:
- $Accuracy = (TP + TN) / (P + N)$
- $Error\ Rate = (FP + FN) / (P + N) = 1 - Accuracy$
- $Sensitivity = TP / (TP + FN)$ <<sensitivity>>
- $Specificity = TN / (TN + FP)$ <<specificity>>
- $False\ Positive\ Rate\ (FPR) = FP / (TN + FP) = 1 - Specificity$
- $Positive\ Predictive\ Value\ (PPV) = TP / (TP + FP)$
- $Negative\ Predictive\ Value\ (NPV) = TN / (TN + FN)$
A receiver operating characteristic (ROC) curve illustrates the diagnostic ability of a binary classifier for all possible values of the classification threshold.
- The ROC curve plots sensitivity against specificity at different threshold values.
The area under the ROC curve (AUC) is a one-number summary of the potential performance of the classifier.
- The AUC does not depend on the classification threshold.
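The rocData object used below can be built with the pROC package, for example (a sketch assuming the piHat predictions added to titanic above):
## Build the ROC object from the observed outcomes and predicted probabilities:
library(pROC)
rocData <- roc(response = titanic$survived, predictor = titanic$piHat)
plot(rocData)  # draw the ROC curve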
pROC::auc(rocData)
Area under the curve: 0.8298 |
Rule of thumb for AUC according to Mandrekar (2010):
- AUC value from 0.7 – 0.8: Acceptable
- AUC value from 0.8 – 0.9: Excellent
- AUC value over 0.9: Outstanding
We can use numerical methods to estimate an optimal threshold value
library(OptimalCutpoints)
ocOut <- optimal.cutpoints(X = "piHat",
status = "survived",
tag.healthy = "no",
data = titanic,
method = "ROC01"
)
partSummary(ocOut, -1)
Area under the ROC curve (AUC): 0.83 (0.802, 0.858)
CRITERION: ROC01
Number of optimal cutoffs: 1
Estimate | |
---|---|
cutoff | 0.2360978 |
Se | 0.7543860 |
Sp | 0.7761468 |
PPV | 0.6789474 |
NPV | 0.8343195 |
DLR.Positive | 3.3700029 |
DLR.Negative | 0.3164531 |
FP | 122.0000000 |
FN | 84.0000000 |
Optimal criterion | 0.1104365 |
Not exam material, but it might be useful for the practical.
Measuring classification performance from a confusion matrix can be problematic.
- It is sometimes too coarse.
We can also base our error measure on the residual deviance, via the Cross-Entropy Error: $$CEE = -N^{-1} \sum_{n=1}^{N} \left[ Y_n \ln(\hat\pi_n) + (1 - Y_n) \ln(1 - \hat\pi_n) \right]$$
- The CEE is sensitive to classification confidence.
- Stronger predictions are more heavily weighted
The misclassification rate is a naively appealing option.
- The proportion of cases assigned to the wrong group
Consider two perfect classifiers:
- $P(\hat Y_n = 1 \mid Y_n = 1) = 0.90$, $P(\hat Y_n = 1 \mid Y_n = 0) = 0.10$, $n = 1, 2, \ldots, N$
- $P(\hat Y_n = 1 \mid Y_n = 1) = 0.55$, $P(\hat Y_n = 1 \mid Y_n = 0) = 0.45$, $n = 1, 2, \ldots, N$
Both of these classifiers will have the same misclassification rate.
- Neither model ever makes an incorrect group assignment.
The first model will have a lower CEE.
- The classifications are made with higher confidence.
- $CEE_1 = 0.105$, $CEE_2 = 0.598$
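A small sketch of computing the CEE in R (assuming the titanic outcome and the piHat predictions from above; survived is recoded to 0/1):
y   <- as.numeric(titanic$survived == "yes")
cee <- -mean(y * log(titanic$piHat) + (1 - y) * log(1 - titanic$piHat))
cee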