
Fundamental Techniques in Data Science with R Notes

R basics

Getting help

You can get help for a function using

?anova        # open the help page for anova()
help(anova)   # equivalent

CRAN and packages

CRAN is a repo for R packages

install.packages("mice")
# There are two ways to load a package into R:
library(stats)
require(stats)

R Data formats and importing/writing data

Native data formats

## Load the built-in 'bfi' data from the 'psychTools' package
data(bfi, package = "psychTools")

## Access the documentation for the 'bfi' data
?psychTools::bfi

## Define the directory holding our data
dataDir <- "../../../data/"

## Load the 'boys' data from the R workspace
## '../../../data/boys.RData'
load(paste0(dataDir, "boys.RData"))
save(boys, file = paste0(dataDir, "tmp.RData"))

## Load the 'titanic' data stored in R data set
## '../../../data/titanic.rds'
titanic <- readRDS(paste0(dataDir, "titanic.rds"))
saveRDS(titanic, file="titanic2.rds")

Delimited Data types (csv etc.)

## Load the 'diabetes' data from the tab-delimited file
## '../../../data/diabetes.txt'
diabetes <- read.table(paste0(dataDir, "diabetes.txt"),
    header = TRUE,
    sep = "\t")

write.table(boys,
    paste0(dataDir, "boys.txt"),
    row.names = FALSE,
    sep = "\t",
    na = "-999")

## Load the 2017 UTMB data from the comma-separated file
## '../../../data/utmb_2017.csv'
utmb1 <- read.csv(paste0(dataDir, "utmb_2017.csv"))

write.csv2(boys, paste0(dataDir, "boys.csv"), row.names = FALSE, na = "")
  • The read.csv() function assumes the values are separated by commas
  • For EU-formatted CSV files (semicolons instead of commas), we can use read.csv2()

SPSS Data (SAV)

Reading this format is tricky, if we want to read SAV files there are two popular options:

  • foreign::read.spss()
  • haven::read_spss()

foreign does not have a convenient write equivalent (for writing SPSS data, see haven::write_sav() below)

## Load the foreign package:
library(foreign)

## Use foreign::read.spss() to read '../../../data/mtcars.sav' into a list
mtcars1 <- read.spss(paste0(dataDir, "mtcars.sav"))

## Read '../../../data/mtcars.sav' as a data frame
mtcars2 <- read.spss(paste0(dataDir, "mtcars.sav"), to.data.frame = TRUE)

## Read '../../../data/mtcars.sav' without value labels
mtcars3 <- read.spss(paste0(dataDir, "mtcars.sav"),
    to.data.frame = TRUE,
    use.value.labels = FALSE)
library(labelled)

## Use haven::read_spss() to read '../../../data/mtcars.sav' into a tibble
mtcars4 <- read_spss(paste0(dataDir, "mtcars.sav"))

haven::read_spss() converts any SPSS variables with labels into labelled vectors

  • We can use the labelled::unlabelled() function to remove the value labels
  • Truncated label displays such as [V-Sha~ then show the full value, e.g. V-Shaped

The best option we have for writing SPSS data is haven::write_sav()

write_sav(mtcars2, paste0(dataDir, "mtcars2.sav"))

This will preserve label information provided by factor variables and the haven_labelled class

Excel (xlsx)

We have two good options for loading data from Excel spreadsheets:

  • readxl::read_excel()
  • openxlsx::read.xlsx()
## Load the packages:
library(readxl)
library(openxlsx)

## Use the readxl::read_excel() function to read the data from the 'titanic'
## sheet of the Excel workbook stored at '../../../data/example_data.xlsx'
titanic2 <- read_excel(paste0(dataDir, "example_data.xlsx"),
    sheet = "titanic")

## Use the openxlsx::read.xlsx() function to read the data from the 'titanic'
## sheet of the Excel workbook stored at '../../../data/example_data.xlsx'
titanic3 <- read.xlsx(paste0(dataDir, "example_data.xlsx"),
    sheet = "titanic")

The openxlsx package provides a powerful toolkit for programmatically building Excel workbooks in R and saving the results.

## Use the openxlsx::write.xlsx() function to write the 'diabetes' data to an
## XLSX workbook
write.xlsx(diabetes, paste0(dataDir, "diabetes.xlsx"), overwrite = TRUE)

## Use the openxlsx::write.xlsx() function to write each data frame in a list

## to a separate sheet of an XLSX workbook
write.xlsx(list(titanic = titanic, diabetes = diabetes, mtcars = mtcars),
    paste0(dataDir, "example_data.xlsx"),
    overwrite = TRUE)

R Functions

Any R command written as a word followed by parentheses is a function

  • mean()
  • library()

Infix operators are aliased functions

  • <-
  • + - *
  • > < ==
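A quick illustration (my own, not from the slides): because infix operators are just functions, they can also be called with backticks like any other function.

`+`(2, 3)     # same as 2 + 3, returns 5
`<`(1, 2)     # same as 1 < 2, returns TRUE
`<-`(x, 10)   # same as x <- 10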

User-Defined Functions

We can define our own functions using the “function()” function

square <- function(x) {
    out <- x^2
    out
}

square(5)
25

One-line functions don’t need braces:

square <- function(x) x^2
square(5)
25

Not strictly typed

Function arguments are not strictly typed

square <- function(x) x^2
square(TRUE) # 1
square(1:5)
1
4
9
16
25

But there are limits:

square("bob")

Error in x^2: non-numeric argument to binary operator

Multiple/list arguments

Multiple arguments:

mod <- function(x, y) x %% y
mod(10, 3)

Or you can use a list to get multiple arguments:

getLsBeta <- function(datList) {
    X <- datList$X
    y <- datList$y
    solve(crossprod(X)) %*% t(X) %*% y
}

X <- matrix(runif(500), ncol = 5)
datList <- list(y = X %*% rep(0.5, 5), X = X)
getLsBeta(datList = datList)

Type/class of functions

Functions are first-class objects in R.

  • We can treat functions like any other R object.

R views an unevaluated function as an object with type ”closure”.

class(getLsBeta)

[1] “function”

typeof(getLsBeta)

[1] “closure”

An evaluated function is equivalent to the object it returns.

class(getLsBeta(datList))

[1] “matrix” “array”

typeof(getLsBeta(datList))

[1] “double”

Iteration

Loops

There are three types of loops in R: for, while, and repeat. This course only covers for loops (lol).

for(INDEX in RANGE) { Stuff To Do with the Current INDEX Value }

examples:

val <- 0
for(i in 1:100) {
    val <- val + i
}
val
5050

This loop will compute the mean of every column in the mtcars data

means <- rep(0, ncol(mtcars))
for(j in 1:ncol(mtcars)) {
    means[j] <- mean(mtcars[ , j])
}
means
listThing <- list(6,9,4,2)
sum <- 0
for(i in listThing) {
    sum <- i + sum
}
sum
21

Usually there is a function to achieve what you are trying to do with the loop (sum() for example)

Apply

Usually one of two forms:

apply(DATA, MARGIN, FUNCTION, ...)
# Note: apply() itself always needs a MARGIN; the two-argument (DATA, FUNCTION)
# form shown in the slides is how lapply()/sapply() are called (see below)

Margin

Margin is how the function should be applied: E.g., for a matrix 1 indicates rows, 2 indicates columns, c(1, 2) indicates rows and columns. Where X has named dimnames, it can be a character vector selecting dimension names.

data(bfi, package = "psychTools")
dat1 <- bfi[1:5,1:3]
# | 2 | 4 | 3 |
# | 2 | 4 | 5 |
# | 5 | 4 | 5 |
# | 4 | 4 | 6 |
# | 2 | 3 | 3 |
# sum over rows with above table
apply(dat1, 1, sum)
9
11
14
14
8
data(bfi, package = "psychTools")
dat1 <- bfi[1:5,1:3]
# subtract 1 from every cell
apply(dat1, 1:2, function(x) x - 1)
# | 1 | 3 | 2 |
# | 1 | 3 | 4 |
# | 4 | 3 | 4 |
# | 3 | 3 | 5 |
# | 1 | 2 | 2 |

lapply and sapply

similar concept but works on lists

list1 <- list(1,3,4,2,4)
lapply(list1, function(x) x+1)
# a list holding: 2, 4, 5, 3, 5

sapply() is a user-friendly wrapper for lapply() which, by default, returns a vector or matrix
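For example (a small illustration of the difference):

list1 <- list(1, 3, 4, 2, 4)
lapply(list1, function(x) x + 1)  # returns a list
sapply(list1, function(x) x + 1)  # simplifies the result to a vector: 2 4 5 3 5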

Some tips/ style guide

  • While using RStudio, use TAB to quickly access the documentation of a function's arguments
  • Use the tidyverse style guide
    • Styling can be applied automatically in RStudio -> Addins -> style selection

R Objects & Data Types

Vectors

Vectors are the simplest kind of R object, there is no concept of a “scalar” in R

Atomic modes

Vectors come in one of six “atomic modes”:

  • numeric/double
  • logical -> bool
  • character
  • integer
  • complex
    • A complex value in R is defined via the pure imaginary value i.
  • raw
    • Intended to hold raw bytes

Generating vectors

c(1, 2, 3)
1
2
3
1:5
1
2
3
4
5
1.2:5.3
1.2
2.2
3.2
4.2
5.2
rep(33,4)
33
33
33
33
rep(1:3,3)
1
2
3
1
2
3
1
2
3
rep(1:3,each=2)
1
1
2
2
3
3
seq(0, 1, 0.25)
0
0.25
0.5
0.75
1

Combining data types in vectors?

Not possible, some data types will be converted

a <- c(1,2,3)
b <- c("foo", "bar")
c(a,b)

[1] "1" "2" "3" "foo" "bar"

Matrices

Matrices generalize vectors by adding a dimension attribute.

Generating Matrices

matrix(1:5, nrow = 5, ncol = 2)
# | 1 | 1 |
# | 2 | 2 |
# | 3 | 3 |
# | 4 | 4 |
# | 5 | 5 |
cbind(c(1,2,3), c(4,5,6))
# | 1 | 4 |
# | 2 | 5 |
# | 3 | 6 |
m1 <- matrix(1:5, nrow = 5, ncol = 2)
attributes(m1)
# $dim
# [1] 5 2

Matrices are populated in column-major order, by default

matrix(1:9, 3, 3)
# | 1 | 4 | 7 |
# | 2 | 5 | 8 |
# | 3 | 6 | 9 |

The byrow = TRUE option allows us to fill by row-major order

matrix(1:9, 3, 3, byrow = TRUE)
# | 1 | 2 | 3 |
# | 4 | 5 | 6 |
# | 7 | 8 | 9 |

Mixing data types

Like vectors, matrices can only hold one type of data

Lists

Lists are the workhorse of R data objects

  • An R list can hold an arbitrary set of other R objects

In lists data types CAN be mixed

Creating a list

list(1, 2, "HI mom", TRUE)
# a list holding: 1, 2, "HI mom", TRUE

Named list elements

l3 <- list(name = "bob",
    alive = TRUE,
    age = 33,
    relationshipStatus = 42+3i)

l3$name
bob

Can also be done post-hoc via the names()

l1 = list(3, "hi", TRUE)
names(l1) <- c("name1", "second", "last")
l1$second
hi

Append onto list

(l4 <- list())
l4$people <- c("Bob", "Alice", "Suzy")
l4$money <- 0
l4$logical <- FALSE
l4

$people
[1] "Bob"   "Alice" "Suzy"

$money
[1] 0

$logical
[1] FALSE

Data Frames

R’s way of storing rectangular data sets

  • Each column of a data frame is a vector
  • Each of these vectors can have a different type:

Data frames are actually lists of vectors (representing the columns)

Creating a dataframe

We can create one using data.frame()

data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
# | 1 | -1 | 0.1 |
# | 2 |  1 | 0.2 |
# | 3 | -1 | 0.3 |
# | 4 |  1 | 0.4 |
# | 5 | -1 | 0.5 |
# | 6 |  1 | 0.6 |

Columns can also be given names:

data.frame(x = 1:6, y = c(-1, 1), z = seq(0.1, 0.6, 0.1))

  x  y   z
1 1 -1 0.1
2 2  1 0.2
3 3 -1 0.3
4 4  1 0.4
5 5 -1 0.5
6 6  1 0.6

Creating from matrix

data.frame(matrix(5, 4, 3))
# | 5 | 5 | 5 |
# | 5 | 5 | 5 |
# | 5 | 5 | 5 |
# | 5 | 5 | 5 |

Filling using sample or runif

sample() draws a random sample of a given size from a vector

runif() generates a vector of uniform random numbers

data.frame(a = sample(c(TRUE, FALSE), 5, replace = TRUE),
    b = sample(c("foo", "bar"), 5, replace = TRUE),
    c = runif(5)
    )
# |   a  |  b  |     c      |
# | TRUE | foo | 0.64013803 |
# | TRUE | bar | 0.08841132 |
# | TRUE | bar | 0.09973498 |
# | TRUE | bar | 0.86201676 |
# | TRUE | bar | 0.99172359 |

Factors

Factors are R’s way of representing nominal (categorical) variables

  • We can create a factor using the factor() function

Creating factors

factor(sample(1:3, 8, TRUE), labels = c("red", "yellow", "blue"))
red
blue
red
blue
yellow
blue
yellow
yellow

Factor attributes

x <- factor(sample(1:3, 8, TRUE), labels = c("red", "yellow", "blue"))
attributes(x)

$levels
[1] "red"    "yellow" "blue"

$class
[1] "factor"

Internal representation

Even though a factor’s data are represented by an integer vector, R does not consider factors to be integer/numeric data.
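A small illustration of this point (my own example):

x <- factor(c("red", "yellow", "blue"))
typeof(x)      # "integer": the data are stored as integer codes
is.numeric(x)  # FALSE: but R does not treat the factor as numeric
as.integer(x)  # the underlying codes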

Data Manipulation and pipes

Base R Subsetting

In Base R, we typically use three operators to subset objects

  • []
  • [[]]
  • $

Which we choose depends on:

  • What type of object are we trying to subset?
  • How much of the original typing do we want to keep in the subset?

[]

x <- 10:20
x[2]           # 11
x[1:3]         # 10, 11, 12
x[c(2, 5, 7)]  # 11, 14, 16
x[c(TRUE, FALSE, TRUE, FALSE, FALSE)]
10
12
15
17
20

[[]]

Rarely used for vectors; it does have some differences though: it drops any names or dimnames attribute, and it only allows selection of a single element

x <- 10:20
x [[2]]
11

The [] method returns objects of class list (or data.frame if the original object was a data.frame) while the [[]] method returns objects whose class is determined by the type of their values.

Matrices

y <- matrix(1:24, 6, 4)
y[4:6, 2:3]
# | 10 | 16 |
# | 11 | 17 |
# | 12 | 18 |

We can mix different selecting styles

y <- matrix(1:24, 6, 4)
y[c(2, 4), c(FALSE, TRUE, TRUE)]
# |  8 | 14 |
# | 10 | 16 |

$ and lists

We can use all three operators to subset lists.

list1 <- list(2, "noo", TRUE)
names(list1) <- c("first", "second", "last")

list1$first

list1[1]

list1[["first"]]
2
## Using the 'l4' list from the previous section:
l4 <- list(people = c("Bob", "Alice", "Suzy"), money = 0, logical = FALSE)

tmp1 <- l4[1]
tmp1$people
# [1] "Bob" "Alice" "Suzy"
class(tmp1)
# [1] "list"

tmp2 <- l4[[1]]
tmp2
# [1] "Bob" "Alice" "Suzy"
class(tmp2)
# [1] "character"

Data Frames

We can subset the columns of a data frame using list semantics

d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
d3$a
1
2
3
4
5
6
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
d3[1]
1
2
3
4
5
6

Data_Manipulation_and_pipes/2023-11-23_16-18-08_screenshot.png

Logical indexing

Filter by rows

d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
d3[d3$a %% 2 == 0 & d3$c > 0, ]
# | 2 | 1 | 0.2 |
# | 4 | 1 | 0.4 |
# | 6 | 1 | 0.6 |

Notice the comma at the end!

This comma is needed because R uses the following notation for subsetting data frames:

data[rows you want, columns you want]

Tidyverse subsetting

The dplyr package provides several ways to subset data; two functions are most frequently used:

select

library(dplyr)
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
select(d3, a, b)
# | 1 | -1 |
# | 2 |  1 |
# | 3 | -1 |
# | 4 |  1 |
# | 5 | -1 |
# | 6 |  1 |
library(dplyr)
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
select(d3, -a)
# | -1 | 0.1 |
# |  1 | 0.2 |
# | -1 | 0.3 |
# |  1 | 0.4 |
# | -1 | 0.5 |
# |  1 | 0.6 |

filter and logical indexing

library(dplyr)
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
filter(d3, a%%2==0, c>0)
# | 2 | 1 | 0.2 |
# | 4 | 1 | 0.4 |
# | 6 | 1 | 0.6 |

We can achieve the same using logical indexing in Base R

d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
d3[d3$a %% 2 == 0 & d3$c > 0, ]
# | 2 | 1 | 0.2 |
# | 4 | 1 | 0.4 |
# | 6 | 1 | 0.6 |

Notice the comma at the end!

Base R Variable Transformation

d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")
d4 <- d3
d4$a <- scale(d4$a)
d4$a <- as.numeric(d4$a)
d4
# | -1.3363062 | -1 | 0.1 |
# | -0.8017837 |  1 | 0.2 |
# | -0.2672612 | -1 | 0.3 |
# |  0.2672612 |  1 | 0.4 |
# |  0.8017837 | -1 | 0.5 |
# |  1.3363062 |  1 | 0.6 |

Tidyverse Variable Transformations

library(dplyr)
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")

mutate(d3,
    d = rbinom(nrow(d3), 1, c),
    e = d * c,
    a = scale(a)
)
# |      a     |  b |  c  | d |  e  |
# | -1.3363062 | -1 | 0.1 | 0 | 0.0 |
# | -0.8017837 |  1 | 0.2 | 0 | 0.0 |
# | -0.2672612 | -1 | 0.3 | 1 | 0.3 |
# |  0.2672612 |  1 | 0.4 | 1 | 0.4 |
# |  0.8017837 | -1 | 0.5 | 0 | 0.0 |
# |  1.3363062 |  1 | 0.6 | 1 | 0.6 |

Sorting & Ordering

Base R sort

l3 <- runif(n=8, min=1, max=20)
sort(l3)
1.14997771941125
6.18364422139712
7.40294478787109
8.80276804696769
10.6937525619287
14.7475167061202
16.3595214642119
16.9985088445246
l3 <- runif(n=8, min=1, max=20)
sort(l3, decreasing = TRUE)
15.9437932423316
15.0901547158137
9.608802491799
4.20592033001594
4.16299939691089
2.97972352942452
2.14775081258267
1.15068965195678

Tidyverse Ordering

library(dplyr)
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")

d3$d <- runif(n=6, min=1, max=20)
arrange(d3, d)
# | a |  b |  c  |     d     |
# | 4 |  1 | 0.4 |  1.032519 |
# | 5 | -1 | 0.5 |  2.672285 |
# | 3 | -1 | 0.3 |  6.226920 |
# | 1 | -1 | 0.1 |  8.318287 |
# | 6 |  1 | 0.6 | 12.416258 |
# | 2 |  1 | 0.2 | 15.548662 |
library(dplyr)
d3 <- data.frame(1:6, c(-1, 1), seq(0.1, 0.6, 0.1))
names(d3) <- c("a", "b", "c")

d3$d <- runif(n=6, min=1, max=20)
arrange(d3, a*d, c)
# | a |  b |  c  |     d     |
# | 2 |  1 | 0.2 |  1.041215 |
# | 1 | -1 | 0.1 | 19.122247 |
# | 6 |  1 | 0.6 |  8.043111 |
# | 3 | -1 | 0.3 | 18.128392 |
# | 4 |  1 | 0.4 | 15.936598 |
# | 5 | -1 | 0.5 | 15.762382 |

Pipes

The %>% symbol represents the pipe operator, used to compose functions

  • f(x) becomes x %>% f()
  • f(x, y) becomes x %>% f(y)
  • h(g(f(x))) becomes x %>% f %>% g %>% h

Big-brain lecturer forgot to mention that this operator is not in base R; it comes from the magrittr package and is re-exported by dplyr and the tidyverse, among others

Example use

Example use:

library(dplyr)

sqr <- function (x)
{
    x**2
}

plus1 <- function (x)
{
    x + 1
}

test <-
    (1:10) %>%
    sqr() %>%
    plus1()

test
2
5
10
17
26
37
50
65
82
101

This is equivalent to:

test <-
    plus1(
        sqr(
            (1:10)))

test

Dot (.) in a pipeline: target a specific argument

Specify where you want the piped value to go

library(dplyr)

power <- function (x, y)
{
    x**y
}

test <-
    (1:10) %>%
    power(3, y=.)

test
3
9
27
81
243
729
2187
6561
19683
59049

Exposition Pipe: %$%

[%$%] exposes the names in the lhs to the rhs expression. This is useful when functions do not have a built-in data argument. In other words, the exposition pipe exposes the contents of an object to the next function in the pipeline.

library(magrittr) # provides %$%
library(MASS)     # provides the 'cats' data

# all equivalent:

cats %$% t.test(Hwt ~ Sex)
cats %>% t.test(Hwt ~ Sex, data = .)
t.test(Hwt ~ Sex, data = cats)

Data Visualization

Base R Graphics

Scatterplots

We can create a basic scatterplot using the plot() function

plot(runif(20, 0, 100), runif(20, 0, 10))

Data_Visualization/2023-11-27_17-17-25_screenshot.png

diabetes %$% plot(
    y = tc,
    x = bmi,
    ylab = "Total Cholesterol",
    xlab = "Body Mass Index",
    main = "Relation between BMI and Cholesterol",
    ylim = c(0, 350),
    xlim = c(0, 50)
    )

Data_Visualization/2023-11-27_17-19-43_screenshot.png

Histogram

We can create a simple histogram with the hist() function

hist(runif(30,0,10))

Data_Visualization/2023-11-27_17-21-52_screenshot.png

Boxplots

We can create simple boxplots via the boxplot function

boxplot(runif(30,40,60),runif(60,10,50))

Data_Visualization/2023-11-27_17-24-41_screenshot.png

Kernel Density Plots

density(runif(5,0,100)) %>% plot()

Data_Visualization/2023-11-27_17-27-25_screenshot.png

GGPlot graphics

Base R graphics are fine for quick and dirty visualizations but for publication quality graphics, you should probably use GGPlot

  • GGPlot uses the ”grammar of graphics” and ”tidy data” to build up a figure from modular components.

Grammar of graphics

Data_Visualization/2023-11-27_17-29-27_screenshot.png

Basic Setup (Data and aesthetic)

We start by initializing the object that will define our plot

  • This is regardless of what we want to plot
  • We must specify a data source and an aesthetic mapping via the aes() function
library(ggplot2)
p1 <- ggplot(data = diabetes, mapping = aes(x = bmi, y = glu))
p1

This doesn’t draw any data yet (it only sets up an empty plotting canvas)

Data_Visualization/2023-11-27_17-34-05_screenshot.png

Geometries (line, dots etc)

Line and dots

p1 + geom_point() + geom_line()

Data_Visualization/2023-11-27_17-36-34_screenshot.png

Histogram

p2 <- ggplot(diabetes, aes(tc))
p2 + geom_histogram()

Data_Visualization/2023-11-27_17-47-28_screenshot.png

Density

p2 + geom_density()

Data_Visualization/2023-11-27_17-49-57_screenshot.png

Boxplot

p2 + geom_boxplot()

Data_Visualization/2023-11-27_17-52-44_screenshot.png

p3 <- ggplot(diabetes, aes(sex, bmi))
p3 + geom_boxplot()

Data_Visualization/2023-11-27_17-53-59_screenshot.png

Statistics

We can also add statistical summaries of the data.

p1 + geom_point() + geom_smooth()

Data_Visualization/2023-11-27_17-56-05_screenshot.png

p1 + geom_point() + geom_smooth(method = "lm")

Data_Visualization/2023-11-27_17-58-08_screenshot.png

Styling

Changing the style outside of aes() applies the styling to the entire plot

Jitter

p5 <- ggplot(titanic, aes(age, survived))
p5 + geom_jitter(color = "blue", size = 2, height = 0.1)

Data_Visualization/2023-11-27_18-02-02_screenshot.png

Themes

We can apply canned themes to adjust a plot's overall appearance. Default theme:

p1.1 <- p1 + geom_point()

Data_Visualization/2023-12-29_12-20-35_screenshot.png

p1.1 + theme_classic()

Data_Visualization/2023-12-29_12-22-38_screenshot.png

You can also modify individual themes

p1.1 + theme_classic() +
    theme(axis.title = element_text(size = 16,
                                    family = "serif",
                                    face = "bold",
                                    color = "blue"),
        aspect.ratio = 1)

Facets

Faceting allows us to make arrays of conditional plots.

p7 <- ggplot(titanic, aes(age, survived, color = class)) +
    geom_jitter(height = 0.05) +
    geom_smooth(method = "glm",
                method.args = list(family = "binomial"),
                se = FALSE)
p7 + facet_wrap(vars(sex))

Data_Visualization/2023-12-29_12-34-35_screenshot.png

p8 <- ggplot(titanic, aes(age, survived)) +
    geom_jitter(height = 0.05) +
    geom_smooth(method = "glm",
                method.args = list(family = "binomial"),
                se = FALSE)
p8 + facet_grid(vars(sex), vars(class))

Data_Visualization/2023-12-29_12-38-20_screenshot.png

Joining multiple Figures

If we want to paste several different plots into a single figure (without faceting), we can use the utilities in the gridExtra package.

library(gridExtra)

grid.arrange(p1 + geom_point(),
             p3 + geom_boxplot(),
             p1 + geom_line(),
             p8 + facet_grid(vars(sex), vars(class)),
             ncol = 2)

Data_Visualization/2023-12-29_12-43-52_screenshot.png

Data Cleaning

When we receive new data, they are generally messy and contaminated by various anomalies and errors

One of the first steps in processing a new set of data is cleaning

By cleaning the data, we ensure a few properties:

  • The data are in an analyzable format.
  • All data take legal values.
  • Any outliers are located and treated.
  • Any missing data are located and treated.

Missing data

Empty cells in a dataset where there should be observed values

  • The missing cells correspond to true population values, but we haven’t observed those values

Not every empty cell is a missing datum

  • Quality-of-life ratings for dead patients in a mortality study
  • Firm profitability after the company goes out of business
  • Self-reported severity of menstrual cramping for men
  • Empty blocks of data following “gateway” items

Missing Data Pattern

Missing data patterns represent unique combinations of observed and missing items

  • $P$ items -> $2^P$ possible patterns

Data_Cleaning/2023-12-29_14-07-36_screenshot.png

Example in R

## Load the mice library:
library(mice)

## Compute missing data patterns:
pats <- md.pattern(boys)

pats
plot_pattern(boys, rotate = TRUE)
    age reg wgt hgt bmi hc gen phb  tv
223   1   1   1   1   1  1   1   1   1    0
 19   1   1   1   1   1  1   1   1   0    1
  1   1   1   1   1   1  1   1   0   1    1
  1   1   1   1   1   1  1   0   1   0    2
437   1   1   1   1   1  1   0   0   0    3
 43   1   1   1   1   1  0   0   0   0    4
 16   1   1   1   0   0  1   0   0   0    5
  1   1   1   1   0   0  0   0   0   0    6
  1   1   1   0   1   0  1   0   0   0    5
  1   1   1   0   0   0  1   1   1   1    3
  1   1   1   0   0   0  0   1   1   1    4
  1   1   1   0   0   0  0   0   0   0    7
  3   1   0   1   1   1  1   0   0   0    4
      0   3   4  20  21 46 503 503 522 1622

Data_Cleaning/2023-12-29_14-09-04_screenshot.png

Nonresponse Rates

Proportion missing

  • The proportion of cells containing missing data
  • Should be computed for each variable, not for the entire dataset
## Load the mice library:
library(mice)

## Load some example data:
data(boys, package = "mice")
## Compute variable-wise proportions missing:
mMat <- is.na(boys)
mMat %>% colMeans() %>% round(3)
  age   hgt   wgt   bmi    hc   gen   phb    tv   reg
0.000 0.027 0.005 0.028 0.061 0.672 0.672 0.698 0.004
## Compute observation-wise proportions missing:
pmRow <- rowMeans(mMat)
## Summarize the above:
range(pmRow)
# [1] 0.0000000 0.7777778

pmRow[pmRow > 0] %>% range()
# [1] 0.1111111 0.7777778

median(pmRow)
# [1] 0.3333333

## Compute the proportion of complete cases:
mean(pmRow == 0)
# [1] 0.2981283

Attrition Rate

  • The proportion of participants that drop-out of a study at each measurement occasion

Covariance Coverage

  • The proportion of cases available to estimate a given pairwise relationship
    • e.g. a covariance between two variables
  • Very important to have adequate coverage for the parameters you want to estimate

We can use routines from the mice package to calculate covariance coverage and response patterns.

## Compute the covariance coverage:
cc <- md.pairs(boys)$rr / nrow(boys)
## Check the result:
round(cc, 2)
     age  hgt  wgt  bmi   hc  gen  phb   tv  reg
age 1.00 0.97 0.99 0.97 0.94 0.33 0.33 0.30 1.00
hgt 0.97 0.97 0.97 0.97 0.92 0.32 0.32 0.30 0.97
wgt 0.99 0.97 0.99 0.97 0.94 0.32 0.32 0.30 0.99
bmi 0.97 0.97 0.97 0.97 0.91 0.32 0.32 0.30 0.97
hc  0.94 0.92 0.94 0.91 0.94 0.33 0.33 0.30 0.93
gen 0.33 0.32 0.32 0.32 0.33 0.33 0.33 0.30 0.33
phb 0.33 0.32 0.32 0.32 0.33 0.33 0.33 0.30 0.33
tv  0.30 0.30 0.30 0.30 0.30 0.30 0.30 0.30 0.30
reg 1.00 0.97 0.99 0.97 0.93 0.33 0.33 0.30 1.00

Visualizing Incomplete Data

The ggmice package provides some nice ways to visualize incomplete data and objects created during missing data treatment.

library(ggmice); library(ggplot2)
ggmice(boys, aes(wgt, hgt)) + geom_point()

Data_Cleaning/2023-12-29_15-44-49_screenshot.png

We can also create a nicer version of the response pattern plot.

plot_pattern(boys, rotate = TRUE)

Data_Cleaning/2023-12-29_14-09-04_screenshot.png

The naniar package also provides some nice visualization and numerical summary routines for incomplete data.

naniar::gg_miss_upset(boys)

Data_Cleaning/2023-12-29_16-04-49_screenshot.png

Outliers

We only consider univariate outliers (i.e., involving only one variable)

  • Extreme values with respect to the distribution of a variable’s other observations
    • A human height measurement of 3 meters
    • A student with an income of $250,000
  • Not accounting for any particular model

A univariate outlier may, or may not be an illegal value

  • Data entry errors are probably the most common cause
  • Outliers can also be legal, but extreme, values

Key Point: We choose to view an outlier as arising from a different population than the one to which we want to generalize our findings

Finding outliers: Boxplot Method

Tukey (1977) described a procedure for flagging potential outliers based on a box-and-whiskers plot.

  • Does not require normally distributed X
  • Not sensitive to outliers

A fence is an interval defined as the following function of the first quartile, the third quartile, and the interquartile range $(IQR=Q_3-Q_1)$: $$F=\{Q_1-C\times IQR,\ Q_3+C\times IQR\}$$

  • Taking C = 1.5 produces the inner fence
  • Taking C = 3.0 produces the outer fence

We can use these fences to identify potential outliers:

  • Any value that falls outside of the inner fence is a possible outlier
  • Any value that falls outside of the outer fence is a probable outlier

Boxplot in R

## Find potentially outlying cases:
(out <- boys %$% boxplot.stats(bmi, coef = 3)$out)
30.62 31.34 31.74
## Which observations are potential outliers?
boys %$% which(bmi %in% out)
574 668 733
## View the potentially outlying cases:
boys %>% filter(bmi %in% out)
     age   hgt   wgt   bmi   hc  gen  phb tv   reg
1 15.493 182.5 102.0 30.62 57.7 <NA> <NA> NA  west
2 17.749 174.0  94.9 31.34 56.3   G5   P5 25  west
3 19.926 192.3 117.4 31.74 57.6   G5   P6 18 north

Breakdown Point

To compare statistics, we consider their breakdown points.

  • The breakdown point is the minimum proportion of cases that must be replaced by ∞ to cause the value of the statistic to go to ∞.

The mean has a breakdown point of 1/N.

  • Replacing a single value with ∞ will produce an infinite mean.
  • This is why we shouldn’t use basic z-scores to find outliers.

The median has breakdown point of 50%.

  • We can replace n < N/2 of the observations with ∞ without producing an infinite median.
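A quick illustration of these breakdown points (my own toy example):

x <- c(1:9, 1e10)  # one extreme value stands in for "infinity"
mean(x)            # dragged far away from the bulk of the data
median(x)          # barely affected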

Outliers for Categorical Data

Nominal, ordinal, and binary items can have outliers

  • Outliers on categorical variables are often more indicative of bad variables than outlying cases.

Ordinal

  • Most participants endorse one of the lowest categories on an ordinal item, but a few participants endorse the highest category.
  • The participants who endorse the highest category may be outliers.

Nominal

  • Groups with very low membership may be outliers on nominal grouping variables.

Binary

  • If most endorse the item, the few who do not may be outliers.

Treating outliers

If we locate any outliers, they must be treated.

  • Outliers caused by errors, mistakes, or malfunctions (i.e., error outliers) should be directly corrected.
  • Labeling non-error outliers is a subjective task.
    • A (non-error) outlier must originate from a population separate from the one we care about.
    • Don’t blindly automate the decision process.

The most direct solution is to delete any outlying observation.

  • If you delete non-error outliers, the analysis should be reported twice: with outliers and without.

For univariate outliers, we can use less extreme types of deletion.

  • Delete outlying values (but not the entire observation).
  • These empty cells then become missing data.
  • Treat the missing values along with any naturally-occurring nonresponse.

Winsorization:

  • Replace the outliers with the nearest non-outlying value
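A minimal winsorization sketch based on Tukey's fences described above; the winsorize() helper is my own, not a course function:

winsorize <- function(x, coef = 1.5) {
    ## boxplot.stats()$stats holds the whisker ends and quartiles:
    s <- boxplot.stats(x, coef = coef)$stats
    ## Clamp outlying values to the nearest non-outlying value (the whisker ends):
    pmin(pmax(x, s[1]), s[5])
}
bmiW <- winsorize(boys$bmi)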

We can also use robust regression procedures to estimate the model directly in the presence of outliers.

Weight the objective function to reduce the impact of outliers

  • M-estimation

Trim outlying observations during estimation

  • Least trimmed squares, MCD, MVE

Take the median, instead of the mean, of the squared residuals

  • Least median of squares

Model some quantile of the DV’s distribution instead of the mean

  • Quantile regression

Model the outcome with a heavy-tailed distribution

  • Laplacian, Student’s T

Introduction to Linear Modeling

Simple Linear Regression

Best fit line is defined by a simple equation: \begin{equation} \hat Y = \hat\beta_0+\hat\beta_1X \end{equation}

Thinking about error

The above equation only describes the best fit line. We need to account for the estimation error \begin{equation} Y = \hat\beta_0+\hat\beta_1X +\hat\epsilon \end{equation}

Estimating the Regression Coefficients

Minimize the sum of squared residuals \begin{equation} RSS =\sum_{n=1}^N \hat\epsilon^2_n \end{equation}

where $\hat\epsilon_n=Y_n-\hat Y_n$

Model Fit, measures, r-squared

We may also want to know how well our model explains the outcome.

  • Our model explains some proportion of the outcome’s variability.
  • The residual variance $\hat\sigma^2 = Var(\hat\epsilon)$ will be less than $Var(Y)$

We quantify the proportion of the outcome’s variance that is explained by our model using the R2 statistic

\begin{equation} R^2=\frac{TSS-RSS}{TSS}=1-\frac{RSS}{TSS} \end{equation}

\begin{equation} TSS=\sum_{n=1}^N(Y_n-\bar Y)^2=Var(Y)\times(N-1) \end{equation} where $\bar Y$ is the mean of $Y$

When assessing predictive performance, we will most often use the mean squared error as our criterion.

\begin{equation} MSE=\frac 1 N \sum_{n=1}^N(Y_n-\hat Y_n)^2=\frac {RSS} N \end{equation} MSE is the better criterion when the goal is prediction, since then we don't care about explanation.

$$RMSE = \sqrt{MSE}$$
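A small worked example of these fit measures, using the built-in mtcars data rather than the course data:

fit <- lm(mpg ~ wt, data = mtcars)
rss <- sum(resid(fit)^2)
tss <- var(mtcars$mpg) * (nrow(mtcars) - 1)
1 - rss / tss              # R^2, matches summary(fit)$r.squared
mse <- rss / nrow(mtcars)  # mean squared error
sqrt(mse)                  # RMSE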

Information Criteria

We can use information criteria to quickly compare non-nested models while accounting for model complexity.

Introduction_to_Linear_Modeling/2023-12-04_17-46-19_screenshot.png

Information criteria balance two competing forces.

  • Blue: The optimized loglikelihood quantifies fit to the data.
  • Red: The penalty term corrects for model complexity

K is the number of estimated parameters (here, essentially the number of X variables in the model)

These statistics are only interesting for comparing models; on their own they don't say much

In R

out1 <- lm(bp ~ age, data = dDat)
coef(out1) # extract the coefficients

Multiple Linear Regression

Just simple linear regression with multiple predictor variables.

Comparing models/ multiple R^2/ f statistic

Multiple R^2

out1 <- lm(bp ~ age, data = dDat)
out2 <- lm(bp ~ age + bmi, data = dDat)

## Extract R^2 values:
r2.1 <- summary(out1)$r.squared
r2.2 <- summary(out2)$r.squared

F-statistic/null hypothesis

We test whether the larger R-squared could have arisen by chance.

We use the F-statistic to test $H_0:R^2=0$ vs. $H_1:R^2>0$.

f1 <- summary(out1)$fstatistic
f1
value numdf dendf
55.78116 1.00000 440.00000
pf(q = f1[1], df1 = f1[2], df2 = f1[3], lower.tail = FALSE)

value 4.392569e-13

f2 <- summary(out2)$fstatistic
f2
value numdf dendf
64.6647 2.0000 439.0000
pf(f2[1], f2[2], f2[3], lower.tail = FALSE)

value 2.433518e-25

How do we quantify the additional variation explained by BMI, above and beyond age?

  • Compute the ΔR²

## Compute change in R^2:
r2.2 - r2.1

[1] 0.115049

How do we know if ΔR² represents a significantly greater degree of explained variation?

  • Use an F-test for H0: ΔR² = 0 vs. H1: ΔR² > 0


## Is that increase significantly greater than zero?
anova(out1, out2)

Analysis of Variance Table

Model 1: bp ~ age
Model 2: bp ~ age + bmi
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)
1    440 74873
2    439 65167  1    9706.1 65.386 6.057e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comparing models with MSE

We can also compare models based on their prediction errors.

  • For OLS regression, we usually compare MSE values.

## MSE() comes from the MLmetrics package:
library(MLmetrics)

mse1 <- MSE(y_pred = predict(out1), y_true = dDat$bp)
mse2 <- MSE(y_pred = predict(out2), y_true = dDat$bp)
mse1
# [1] 169.3963
mse2
# [1] 147.4367

In this case, the MSE for the model with BMI included is smaller.

  • We should prefer the larger model.

Comparing models with information criteria

Finally, we can compare models based on information criteria.

AIC(out1, out2)
# df AIC
# out1 3 3528.792
# out2 4 3469.424
BIC(out1, out2)
# df BIC
# out1 3 3541.066
# out2 4 3485.789

In this case, both the AIC and the BIC for the model with BMI included are smaller.

  • We should prefer the larger model.

Categorical Predictors

Categorical predictors have to be converted to dummy codes:

   drink juice tea
1  juice     1   0
2 coffee     0   0
3    tea     0   1
4    tea     0   1

Then these dummy variables can simply be used as normal predictors: $$Y=\beta_0+\beta_1 X_{juice}+\beta_2 X_{tea}+\epsilon$$
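A small illustration (my own example) of how R builds these dummy codes internally:

drink <- factor(c("juice", "coffee", "tea", "tea"))
model.matrix(~ drink)  # an intercept plus dummy codes for every non-reference level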

Contrast

All R factors have an associated contrasts attribute.

  • The contrasts define a coding to represent the grouping information.
  • Modeling functions code the factors using the rules defined by the contrasts.

x <- factor(c("a", "b", "c"))
contrasts(x)
  b c
a 0 0
b 1 0
c 0 1

Significance Testing

For variables with only two levels, we can test the overall factor’s significance by evaluating the significance of a single dummy code.

out <- lm(Price ~ Man.trans.avail, data = Cars93)
partSummary(out, -c(1, 2))
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 23.841 1.623 14.691 <2e-16
# Man.trans.availYes -6.603 2.004 -3.295 0.0014
# Residual standard error: 9.18 on 91 degrees of freedom
# Multiple R-squared: 0.1066,Adjusted R-squared: 0.09679
# F-statistic: 10.86 on 1 and 91 DF, p-value: 0.001403

For variables with more than two levels, we need to simultaneously evaluate the significance of each of the variable’s dummy codes

partSummary(out4, -c(1, 2))
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 21.7187 2.9222 7.432 6.25e-11
# Man.trans.availYes -5.8410 1.8223 -3.205 0.00187
# DriveTrainFront -0.2598 2.8189 -0.092 0.92677
# DriveTrainRear 10.5169 3.3608 3.129 0.00237
# Residual standard error: 8.314 on 89 degrees of freedom
# Multiple R-squared: 0.2834,Adjusted R-squared: 0.2592
# F-statistic: 11.73 on 3 and 89 DF, p-value: 1.51e-06
summary(out4)$r.squared - summary(out)$r.squared
# [1] 0.1767569
# anova(out, out4)
# Analysis of Variance Table
# Model 1: Price ~ Man.trans.avail
# Model 2: Price ~ Man.trans.avail + DriveTrain
# Res.Df RSS Df Sum of Sq F Pr(>F)
# 1 91 7668.9
# 2 89 6151.6 2 1517.3 10.976 5.488e-05 ***
# ---
# Signif. codes:
# 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model-Based Prediction

Prediction example

Introduction_to_Linear_Modeling/2023-12-09_14-20-49_screenshot.png

Interval estimates for prediction

To quantify uncertainty in our predictions, we want to use an appropriate interval estimate: There are two flavors:

Confidence interval for $\hat Y_m$

The CI for $\hat Y_m$ gives a likely range (in the sense of coverage probability and “confidence”) for the mth value of the true conditional mean.

CIs only account for uncertainty in the estimated regression coefficients, $\{\hat\beta_0, \hat\beta_p\}$.

Prediction intervals for a specific observation, $Y_m$

The prediction interval for $Y_m$ gives a likely range (in the same sense as CIs) for the mth outcome value.

Prediction intervals also account for the regression errors, $\epsilon$.

Confidence vs prediction intervals

CIs for $\hat Y$ ignore the errors, $ε$.

  • Only accounts for the best fit

Prediction intervals are wider than CIs.

  • They account for the additional uncertainty contributed by $\epsilon$.

Introduction_to_Linear_Modeling/2023-12-09_14-34-27_screenshot.png
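A minimal sketch of both interval types, using the built-in cars data rather than the course data:

fit <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = c(10, 20))
predict(fit, newdata = new, interval = "confidence")  # CI for the conditional mean
predict(fit, newdata = new, interval = "prediction")  # wider interval for a new observation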

Moderation

Additive models allow us to examine the partial effects of several predictors on some outcome.

  • The effect of one predictor does not change based on the values of other predictors

Moderation allows us to ask when one variable, X, affects another variable, Y.

  • We’re considering the conditional effects of X on Y given certain levels of a third variable Z.
  • How much one variable affects the outcome depends on another variable

Additive regression + derivation

In additive MLR we have the following equation: $Y=β_0+β_1X+β_2Z+ε$ This equation assumes that X and Z are independent predictors of Y.

When X and Z are independent predictors, the following are true

  • X and Z can be correlated
  • $β_1$ and $β_2$ are partial regression coefficients
  • The effect of X on Y is the same at all levels of Z, and the effect of Z on Y is the same at all levels of X.

When testing for moderation, we hypothesize that the effect of X on Y varies as a function of Z.

  • $Y=β_0+f(Z)X+β_2Z+ε$

If we assume that Z linearly affects the relationship between X and Y:

  • $f(Z)=β_1+β_3Z$

Combining the two equations:

  • $Y=β_0+(β_1+β_3Z)X+β_2Z+ε$
  • $Y=β_0+β_1X+β_2Z+β_3XZ+ε$

Test for moderation

$Y=β_0+β_1X+β_2Z+β_3XZ+ε$

To test for moderation we simply need to test the significance of the interaction term, XZ

$\hat\beta_3$ quantifies the effect of Z on the focal effect (the X->Y effect)

  • For a unit change in Z, it is the expected change in the effect of X on Y

$\hat\beta_1\text{ and } \hat\beta_2$ are conditional effects.

  • interpreted where the other predictor is zero
  • For a unit change in X, $\hat\beta_1$ is the expected change in Y, when Z = 0.
  • For a unit change in Z, $\hat\beta_2$ is the expected change in Y, when X = 0.

in R

## Load data:
socSup <- readRDS(paste0(dataDir, "social_support.rds"))
## Estimate the moderated regression model:
out <- lm(bdi ~ tanSat * sex, data = socSup)
partSummary(out, -c(1, 2))

# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 20.8478 6.2114 3.356 0.00115
# tanSat -0.5772 0.3614 -1.597 0.11372
# sexmale 14.3667 12.2054 1.177 0.24223
# tanSat:sexmale -0.9482 0.7177 -1.321 0.18978
# Residual standard error: 9.267 on 91 degrees of freedom
# Multiple R-squared: 0.08955,Adjusted R-squared: 0.05954
# F-statistic: 2.984 on 3 and 91 DF, p-value: 0.03537

Regression Assumptions and Diagnostics

You always have to make some assumptions to create a model.

Assumptions of MLR

1 The model is linear in the parameters <<la1>>

This is OK: $Y=β_0+β_1X+β_2Z+β_3XZ+β_4X^2+β_5X^3$

This is not: $Y=\beta_0 X^{\beta_1}+\epsilon$

  • the exponent $\beta_1$ is also an estimated parameter here -> not allowed

If the model is not linear in the parameters, then we’re not even working with linear regression.

  • We need to move to entirely different modeling paradigm

Each modeled X must exhibit a linear relation with Y.

  • We can define X via nonlinear transformations of the original data.

residuals plots

## The Cars93 data come from the MASS package:
data(Cars93, package = "MASS")

out1 <- lm(MPG ~ Horsepower,
    data = Cars93)
plot(Cars93$MPG, Cars93$Horsepower)
plot(out1, 1)

In multiple regression models, basic residual plots won’t tell us which predictors exhibit nonlinear associations.

Component + residuals plots

We can use Component + Residual Plots (AKA, partial residual plots) to visualize the unique effects of each X variable.

library(car)
crPlots(out3)

Treating Residual Nonlinearity

Nonlinearity in the residual plots is usually a sign of either:

  1. Model misspecification
  2. Influential observations

This type of model misspecification usually implies omitted functions of modeled variables.

  • Polynomial terms
  • Interactions

The solution is to include the omitted term into the model and refit.

  • This is very much easier said than done.

2 The predictor matrix is full rank <<la2>>

  • $N > P$
  • No $X_p$ can be a linear combination of other predictors

If the predictor matrix is not full rank, the model is not estimable.

  • The parameter estimates cannot be uniquely determined from the data

3 The predictors are strictly exogenous

The predictors do not correlate with the errors

  • $Cov(\hat Y,ε)=0$
  • $E[ε_n]=0$

If the predictors are not exogenous, the estimated regression coefficients will be biased.

Omitted variables

The most common cause of endogeneity (i.e., violating Assumption 3) is omitted variable bias.

  • If we leave an important predictor variable out of our equation, some modeled predictors will become endogenous and their estimated regression slopes will be biased.
  • The omitted variable must be correlated with Y and at least one of the modeled Xp, to be a problem.

Assume the following is the true regression model. $$Y=β_0+β_1X+β_2Z+ε$$

Now, suppose we omit Z from the model: $$Y=β_0+β_1X+ω$$ $$ω = ε + β_2Z$$

Our new error, 𝜔, is a combination of the true error, 𝜀, and the omitted term, $β_2Z$

  • Consequently, if X and Z are correlated, omitting Z induces a correlation between X and 𝜔 (i.e., endogeneity).

Omitted variable bias can have severe consequences, but you can’t really test for it.

  • The errors are correlated with the predictors, but our model is estimated under the assumption of exogeneity, so the residuals from our model will generally be uncorrelated with the predictors.
  • We mostly have to pro-actively work to include all relevant variables in our model.

4 The errors have constant, finite variance (no heteroscedasticity)

  • $Var(\epsilon_n)=\sigma^2<\infty$

see spherical

Consequences of heteroscedasticity

Non-constant error variance will not bias the parameter estimates.

  • The best fit line is still correct.
  • Our measure of uncertainty around that best fit line is wrong.

Heteroscedasticity will bias standard errors (usually downward).

  • Test statistics will be too large.
  • CIs will be too narrow.
  • We will have inflated Type I error rates.

Treating heteroscedasticity

  1. Transform your outcome using a concave function (e.g., ln(Y), √Y).
    • These transformations will shrink extreme values more than small/moderate ones.
    • It’s usually a good idea to first shift the variable’s scale by setting the minimum value to 1.
  2. Refit the model using weighted least squares.
    • Create inverse weights using functions of the residual variances or quantities highly correlated therewith.
  3. Use a Heteroscedasticity Consistent (HC) estimate of the asymptotic covariance matrix.
    • Robust standard errors, Huber-White standard errors, Sandwich estimators
    • HC estimators correct the standard errors for non-constant error variance.
## The 'sandwich' package provides several HC estimators:
library(sandwich)
## the 'lmtest' package provides fancy testing tools for linear models:
library(lmtest)
## Use sandwich estimator to compute ACOV matrix:
hcCov <- vcovHC(out1)
## Test coefficients with robust SEs:
robTest <- coeftest(out1, vcov = hcCov)
## Test coefficients with default SEs:
defTest <- summary(out1)$coefficients

Testing

We can create a residual plot to test for this

Non-constant error variance violates assumption 4

Regression_Assumptions_and_Diagnostics/2023-12-11_17-54-50_screenshot.png

Non-constant error variance

5 The errors are uncorrelated

$Cov(\epsilon_i,\epsilon_j)=0,\ i \neq j$

see spherical

Correlated Errors

Errors can become correlated in two basic ways:

  1. Serial dependence
    • When modeling longitudinal data, the errors for a given observational unit may be correlated across time
    • We can detect temporal dependence by examining the autocorrelation of the residuals (see the sketch after this list)
  2. Clustering
    • Your data have some important, unmodeled, grouping structure
      • Children nested within classrooms
      • Romantic couples
      • Departments within a company
    • We can detect a problematic level of clustering with the intraclass correlation coefficient (ICC)
      • We need to know the clustering variable to apply the icc
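A minimal sketch of both checks, assuming a fitted lm object called fit and a hypothetical clustering variable dat$cluster (these names are mine, not from the course):

## Serial dependence: autocorrelation of the residuals
acf(resid(fit))

## Clustering: residual ICC for the hypothetical grouping variable
ICC::ICCbare(x = dat$cluster, y = resid(fit))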

Treating Correlated Errors

Serially dependent errors in a longitudinal model usually indicate an inadequate model.

  • Your model is ignoring some important aspect of the temporal variation that is being absorbed by the error terms.
  • Hopefully, you can add the missing component to your model.

Clustering can be viewed as theoretically meaningful or as a nuisance factor that needs to be controlled

If the clustering is meaningful, you should model the data using multilevel modeling

  • Hierarchical linear regression
  • Mixed models
  • Random effects models

If the clustering is an uninteresting nuisance, you can use specialized HC variance estimators that deal with clustering

## Read in some data:
LeeBryk <- readRDS(paste0(dataDir, "lee_bryk.rds"))
## Estimate a linear regression model:
fit <- lm(math ~ ses + sector, data = LeeBryk)
## Calculate the residual ICC:
ICC::ICCbare(x = LeeBryk$schoolid, y = resid(fit))
0.07487712
## Robust tests:
coeftest(fit, vcov = vcovCL(fit, ~ schoolid))

#t test of coefficients:
#Estimate Std. Error t value Pr(>|t|)
#(Intercept) 11.79965 0.20318 58.0746 < 2.2e-16 ***
#ses 2.94860 0.12794 23.0475 < 2.2e-16 ***
#sectorprivate 1.93495 0.31717 6.1006 1.111e-09 ***
#---
#Signif. codes:
#0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

6 The errors are normally distributed.

$\epsilon \sim N(0,\sigma^2)$

If errors are non-normal, small-sample inferences may be biased.

  • The justification for some tests and procedures used in regression analysis may not hold

Test with Q-Q Plot

plot(out1, 2)

One of the best ways to evaluate the normality of the error distribution is with a Q-Q Plot.

  • Plot the quantiles of the residual distribution against the theoretically ideal quantiles.
  • We can actually use a Q-Q Plot to compare any two distributions.

Regression_Assumptions_and_Diagnostics/2024-01-02_17-46-29_screenshot.png

Consequences of Violating Normality

In small samples, with fixed predictors, normally distributed errors imply normal sampling distributions for the regression coefficients.

  • In large samples, the central limit theorem implies normal sampling distributions for the coefficients, regardless of the error distribution.

Prediction intervals require normally distributed errors.

  • Confidence intervals for predictions share the same normality requirements as the coefficients’ sampling distributions.

Parameter estimates will not be fully efficient.

  • Standard errors will be larger than they would have been with normally distributed errors.

Treating Violations of Normality

We usually don’t need to do anything about non-normal errors

  • The central limit theorem will protect our inferences.

We can use bootstrapping to get around the need for normality.

  1. Treat your sample as a synthetic population from which you draw many new samples (with replacement).
  2. Estimate your model in each new sample.
  3. The replicates of your estimated parameters generate an empirical sampling distribution that you can use for inference.

Bootstrapping can be used for inference on pretty much any estimable parameter, but it won’t work with small samples.

  • Need to assume that your sample is representative of the population
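A minimal nonparametric bootstrap sketch, reusing the bp ~ age model and dDat data from earlier (everything else here is my own):

set.seed(235711)
bootEst <- replicate(1000, {
    idx <- sample(nrow(dDat), replace = TRUE)     # resample rows with replacement
    coef(lm(bp ~ age, data = dDat[idx, ]))["age"] # refit and keep the age slope
})
quantile(bootEst, c(0.025, 0.975))                # percentile bootstrap CI for the slope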

Spherical errors (4+5)

The assumption of spherical errors combines assumptions 4 and 5: $Var(\epsilon)=\sigma^2 I_N$

If the errors are not spherical, the standard errors will be biased.

  • The estimated regression coefficients will be unbiased, though.
  • But test statistics and standard errors can be biased

Regression Diagnostics

After fitting the model we check the assumptions; most assumptions can't be checked before fitting.

If some of the assumptions are violated, the inferences we make using the model may be wrong

  • We need to check the tenability of our assumptions before leaning too heavily on the model estimates

These checks are called regression diagnostics

  • Graphical visualizations
  • Quantitative indices/measures
  • Formal statistical

A lot of these tests require assumptions of their own, which is a drawback.

Residual Plots

One of the most useful diagnostic graphics is the plot of residuals vs. predicted values.

Regression_Assumptions_and_Diagnostics/2023-12-11_17-51-01_screenshot.png

We can create this plot by plotting the fitted lm() object in R.

out1 <- lm(Price ~ Horsepower,
    data = Cars93)
plot(out1, 1)

Influential Observations

Influential observations contaminate analyses in two ways:

  1. Exert too much influence on the fitted regression model
  2. Invalidate estimates/inferences by violating assumptions

There are two distinct types of influential observations:

  1. Outliers
    • Observations with extreme outcome values, relative to the other data.
    • Observations with outcome values that fit the model very badly.
  2. High-leverage observations
    • Observation with extreme predictor values, relative to other data

Outliers

Outliers can be identified by scrutinizing the residuals.

  • Observations with residuals of large magnitude may be outliers.
  • The difficulty arises in quantifying what constitutes a “large” residual.

If the residuals do not have constant variance, then we cannot directly compare them.

  • We need to standardize the residuals in some way

We are specifically interested in externally studentized residuals.

We can’t simply standardize the ordinary residuals.

  • Internally studentized residuals
  • Outliers can pull the regression line towards themselves.
  • The internally studentized residuals for outliers will be too small.

Begin by defining the concept of a deleted residual: $$\hat\epsilon_{(n)}=Y_n-\hat Y_{(n)}$$

  • $\hat\epsilon_{(n)}$ quantifies the distance of $Y_n$ from the regression line estimated after excluding the nth observation.
  • $\hat Y_{(n)}$ is the predicted value for the nth observation from the regression line estimated after excluding the nth observation

Studentized Residuals

If we standardize the deleted residual, $\hat\epsilon_{(n)}$, we get the externally studentized residual: $$t_{(n)}=\frac{\hat\epsilon_{(n)}}{SE_{\hat\epsilon_{(n)}}}$$ The externally studentized residuals have two very useful properties:

  1. Each t(n) is scaled equivalently.
    • We can directly compare different t(n) .
  2. The t(n) are Student’s t distributed.
    • We can quantify outliers in terms of quantiles of the t distribution.
    • |t(n) | > 3.0 is a common rule of thumb for flagging outliers.

Index plots of the externally studentized residuals can help spotlight potential outliers.

  • Look for observations that clearly “stand out from the crowd”
rstudent(out1) %>% plot()

Regression_Assumptions_and_Diagnostics/2024-01-02_19-29-57_screenshot.png

High-Leverage Points

We identify high-leverage observations through their leverage values.

  • An observation’s leverage, $h_n$, quantifies the extent to which its predictors affect the fitted regression model.
  • Observations with $X$ values very far from the mean, $\bar X$, affect the fitted model disproportionately.

Index plots of the leverage values can help spotlight high-leverage points.

  • Again, look for observations that clearly “stand out from the crowd.”

In simple linear regression, the nth leverage is given by: $$h_n=\frac 1 N + \frac{(X_n-\bar X)^2}{\sum_{m=1}^N(X_m-\bar X)^2}$$

hatvalues(out1) %>% plot()

Regression_Assumptions_and_Diagnostics/2024-01-02_19-38-14_screenshot.png

Influential Points

Observations with high leverage or large (externally) studentized residuals are not necessarily influential.

  • High-leverage observations tend to be more influential than outliers.
  • The worst problems arise from observations that are both outliers and have high leverage.

Measures of influence simultaneously consider extremity in both X and Y dimensions.

  • Observations with high measures of influence are very likely to cause problems.

All measures of influence use the same logic as the deleted residual.

  • Compare models estimated from the whole sample to models estimated from samples excluding individual observations.

One of the most common measures of influence is Cook’s Distance <<cooks>> $$\text{Cook's } D_n=\frac{\sum_{m=1}^N(\hat Y_m-\hat Y_{m(n)})^2}{(P+1)\hat\sigma^2}=(P+1)^{-1}\, t^2_n\frac{h_n}{1-h_n}$$

Index plots of Cook’s distances can help spotlight the influential points.

  • Look for observations that clearly “stand out from the crowd.”
cd <- cooks.distance(out1)
plot(cd)

Regression_Assumptions_and_Diagnostics/2024-01-02_19-49-05_screenshot.png

Removing Influential Observations

Observation number 28 was the most influential according to Cook’s Distance

(maxD <- which.max(cd))
28
## Exclude the influential case:
Cars93.2 <- Cars93[-maxD, ]
## Fit model with reduced sample:
out2 <- lm(Price ~ Horsepower, data = Cars93.2)
round(summary(out1)$coefficients, 6)
              Estimate Std. Error   t value Pr(>|t|)
(Intercept) -1.398769   1.820016 -0.768548 0.444152
Horsepower   0.145371   0.011898 12.218325 0.000000
round(summary(out2)$coefficients, 6)
              Estimate Std. Error   t value Pr(>|t|)
(Intercept) -2.837646   1.806418 -1.570868 0.119722
Horsepower   0.156750   0.011996 13.066942 0.000000

Removing the influential case did not have a massive effect on the slope, but it did noticeably change the intercept and the t-values

Regression_Assumptions_and_Diagnostics/2024-01-02_20-02-02_screenshot.png

You could also remove the 2 most influential observations

(maxDs <- sort(cd) %>% names() %>% tail(2) %>% as.numeric())

## Exclude influential cases:
Cars93.2 <- Cars93[-maxDs, ]

## Fit model with reduced sample:
out2.2 <- lm(Price ~ Horsepower, data = Cars93.2)

Treating Influential Points

The most common way to address influential observations is simply to delete them and refit the model.

  • This approach is often effective—and always simple—but it is not fool-proof.
  • Although an observation is influential, we may not be able to justify excluding it from the analysis.

Robust regression procedures can estimate the model directly in the presence of influential observations.

  • Observations in the tails of the distribution are weighted less in the estimation process, so outliers and high-leverage points cannot exert substantial influence on the fit.
  • We can do robust regression with the rlm() function from the MASS package.
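A minimal sketch of that last option, reusing the Cars93 model from above:

library(MASS)

rOut <- rlm(Price ~ Horsepower, data = Cars93)  # Huber M-estimation by default
summary(rOut)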

Generalized Linear Model & Logistic regression

General Linear model

So far we have only discussed general linear models:

$$Y = β_0 + \sum_{p=1}^P β_p X_p+ε$$

All flavor of linear regression are general linear models

  • Simple Linear regression, Multiple Linear Regression
  • t-test, ANOVA, ANCOVA
  • Multilevel linear regression models

We can break our model into pieces:

  • $η = β_0 + \sum_{p=1}^P β_p X_p$
  • $Y = η + ε$

$ε ∼ N(0,σ^2)$, so we can also write: $$Y∼ N(η,σ^2)$$

Where:

  • $η$ is the systematic component of the model (AKA, the linear predictor)
  • The normal distribution, $N(…,…)$ is the model’s random component

The purpose of general linear modeling is to build a model of the outcome’s mean $μ_Y$

  • In this case $μ_Y = η$
  • The systematic component defines the mean of $Y$

The random component quantifies variability around $μ_Y$

  • In the general linear model, we assume that this error variance follows a normal distribution

Generalized Linear Model

We can generalize the models we’ve been using in two important ways:

  1. Allow for random components other than the normal distribution
  2. Allow for more complicated relations between $μ_Y$ and $η$
    • Allow a link function: $g(μ_Y)=η$

In this way we obtain the class of generalized linear models (GLMs)

The random component in a GLM can be any distribution from the exponential family:

  • Normal (Gaussian)
  • Binomial
  • Poisson
  • Many others…

The systematic component of a GLM is exactly the same as it is in general linear models.

  • $η = β_0 + \sum_{p=1}^P β_p X_p$

In GLMs, $η$ does not directly describe $μ_Y$

  • We first transform $μ_Y$ via a link function

The link function performs two important functions

  1. Linearize the association betweeen $X$ and $Y$
    • Nonlinear: $X→ μ_Y$
    • Linear: $X→ g(μ_Y)$
  2. Allows GLMs for outcomes with restricted ranges without requiring any restrictions on the range of the $\{X_p\}$
    • In many cases, $μ_Y$ has a restricted range:
      • Counts: $μ_Y > 0$
      • Probabilities: $μ_Y ∈ [0,1]$
    • When correctly specified, $g(μ_Y)$ can take any value on the real line

Every GLM is built from three components

  1. The systematic component, $η$
    • A linear function of the predictors, $\{X_P\}$
    • Describes the association between $X$ and $Y$
  2. The link function, $g(μ_Y)$
    • Linearizes the relation between $X$ and $Y$
    • Transforms $μ_Y$ so that it can take any value on the real line
  3. The random component, $P(Y|g^{-1}(η))$
    • The distribution of the observed $Y$
    • Quantifies the error variance around $η$

The General Linear Model is a special case of GLM

  1. Systematic component: $$η = β_0 + \sum_{p=1}^P β_p X_p$$
  2. Link function: $$μ_Y=η$$
  3. Random component: $$Y∼ N(η, σ^2)$$

Generalized Linear Model in R

In R you can train a generalized linear model with glm

data(iris)

## Generalized linear model:
glmFit <- glm(Petal.Length ~ Petal.Width + Species,
              family = gaussian(link = "identity"),
              data = iris)
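
Other random components and links are selected through the family argument. A brief illustrative sketch (these examples use the built-in mtcars data and are not part of the original notes):

## Logistic regression: binomial random component with a logit link
glm(am ~ wt, family = binomial(link = "logit"), data = mtcars)

## Poisson regression: Poisson random component with a log link
glm(carb ~ wt, family = poisson(link = "log"), data = mtcars)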

Logistic Regression

We want to use logistic regression for a classification task with a two-level outcome.

Using a linear model doesn’t work well for this kind of task.

  • The link function ensures legal predicted values (between 0 and 1)
  • The sigmoid curve implies that the effectiveness of extra study time varies across its range
    • example below: More study time is most beneficial for students with around 5.5 hours of study

Generalized_Linear_Model_&_Logistic_regression/2024-01-09_15-40-58_screenshot.png

In logistic regression problems we are modeling binary data:

  • $Y∈\{1=”Success”, 0=”Failure”\}$

The random component in our logistic regression model will be the binomial distribution.

The mean of the binomial distribution (with $N=1$) is the “success” probability, $π=P(Y=1)$

  • We are interested in modelling this success probability, $μ_Y=π$:

$$g(π) = β_0 + ∑_{p=1}^{P} β_p X_p$$

We use the logit link function to map the success probability, $π$, onto the whole real line.

  • Given $π$, the odds of success are: $$O_s=\frac π {1-π}$$
  • Because $π ∈ [0,1]$, we know that $O_s ≥ 0$
  • We can take the natural log of the odds as the last step to fully map $π$ to the real line $$\operatorname{logit}(π)=ln\left(\frac π {1-π}\right)$$

Note that in the above function we map the probability to the real line, not the other way around.

  • For the other way around use the logistic function: <<inverse_logit>> $$σ(x)=\frac 1 {1+e^{-x}}=\frac {e^x} {1+e^x}$$
    • Also called sigmoid or expit
    • Is the inverse of the logit function

Our final logistic regression model is:

$$Y∼ \operatorname{Bin}(π,1)$$ $$\operatorname{logit}(π)=β_0 + ∑_{p=1}^{P} β_p X_p$$

The fitted model can be represented as: $$\operatorname{logit}(\hat\pi)=\hat\beta_0 + ∑_{p=1}^{P} \hat\beta_p X_p$$

Interpreting logistic regression

We have a model:

$$\operatorname{logit}(\hat\pi)=\hat\beta_0 + ∑_{p=1}^{P} \hat\beta_p X_p$$

The fitted coefficients, $\{\hat\beta_0, \hat\beta_p\}$, are interpreted in units of log odds.

Log odds do not lend themselves to easy interpretation.

  • We can convert the effects back to an odds scale by exponentiation
  • $\hat\beta$ has log-odds units, but $e^{\hat\beta}$ has odds units.

Note: for all of these interpretations, odds and probabilities are not the same!

Exponentiating the coefficients also converts the additive effects to multiplicative effects.

  • We can interpret $\hat\beta$ as we would in linear regression:
    • A unit change in $X_p$ produces an expected change of $\hat\beta_p$ units in $\operatorname{logit}(π)$
  • After exponentiation, however, unit changes in $X_p$ imply multiplicative changes in the odds, $O_s=\frac π {1-π}$
    • A unit change in $X_p$ results in multiplying $O_s$ by $e^{\hat\beta_p}$

Due to the confusing interpretations of the coefficients, we often focus on the valence of the effects.

  • Additional study time is associated with increased odds of passing
  • $\hat\beta_p > 0$: “Increased success”; equivalently, $e^{\hat\beta_p} > 1$: “Increased success”

For example interpretations, see the slides (starting from slide 28). A brief sketch of inspecting the fitted coefficients on both scales follows.
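
The sketch assumes a fitted model object called glmFit, like the titanic model estimated later in these notes:

## Coefficients on the log-odds scale:
coef(glmFit)

## Exponentiate to get odds ratios (multiplicative effects on the odds):
exp(coef(glmFit))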

Model Comparison for logistic regression

We don’t have an $R^2$ statistic for logistic regression models, so we need to use a likelihood ratio test to compare nested models.
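
The two nested models compared below might be fit as follows (a sketch; the model formulas are taken from the anova() output, and titanic is the same data set used elsewhere in these notes):

## Restricted and full models:
fit0 <- glm(survived ~ age + sex, data = titanic, family = "binomial")
fit  <- glm(survived ~ age + sex + class, data = titanic, family = "binomial")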

anova(fit0, fit, test = "LRT")

## Analysis of Deviance Table
##
## Model 1: survived ~ age + sex
## Model 2: survived ~ age + sex + class
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 884 916.00
## 2 882 801.59 2 114.41 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can also use information criteria

AIC(fit0, fit)

## df AIC
## fit0 3 921.9989
## fit 5 811.5940

BIC(fit0, fit)

## df BIC
## fit0 3 936.3624
## fit 5 835.5333

Classification with logistic regression

Given a fitted logistic regression model, we can get predictions for new observations $X’_p$

Directly applying $\hat\beta_0,\hat\beta_p$ to $X’_p$ will produce predictions on the scale of $η$ $$\hat\eta ’ = \hat\beta_0+∑_{p=1}^{P}\hat\beta_p X’_p$$

By applying the inverse link function, $g^{-1}()$, to $\hat\eta’$, we get predicted success probabilities $$\hat\pi’=g^{-1}(\hat\eta’)$$

Once we have computed the predicted success probabilities, $\hat\pi’$, we can use them to classify new observations.

By choosing a threshold on $\hat\pi’$, say $t$, we can classify the new observations as successes or failures: $$\hat Y’ = \begin{cases} 1 &\text{ if }\hat\pi’ ≥ t \\ 0 &\text{ if }\hat\pi’ < t \end{cases}$$

Note that the optimal threshold is not always $0.5$. A brief sketch of the two prediction scales in R follows.
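
This sketch assumes the titanic model glmFit estimated later in these notes; predict() returns either scale directly:

## Linear-predictor scale (eta) and probability scale (pi):
etaHat <- predict(glmFit, type = "link")
piHat  <- predict(glmFit, type = "response")

## The logistic (inverse-logit) function connects the two scales:
all.equal(piHat, plogis(etaHat))

## Classify using a 0.5 threshold (not necessarily the optimal choice):
yHat <- ifelse(piHat >= 0.5, "yes", "no")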

Assumptions of Logistic Regression

1 The model is linear in the parameters

Same as linear regression

  • This is OK: $$\operatorname{logit}(π)=β_0 + β_1X + β_2Z + β_3X^2$$
  • This is not: $$\operatorname{logit}(π)=β_0 X^{β_1}$$

Checking Assumption 1

Assumption 1 implies a linear relation between continuous predictors and the logit of the success probability.

  • We can basically evaluate the linearity assumption using the same methods we applied with linear regression.
car::crPlots(glmFit, terms = ~ age + fare)

Assumtions_of_Logistic_Regression/2024-01-10_15-15-52_screenshot.png

2 The predictor matrix is full rank

Same as linear regression

  • $N > P$
  • No $X_p$ can be a linear combination of other predictors

Checking Assumption 2

Checking this assumption is basically the same as for linear regression. Assumption 2 implies two conditions:

  1. N > P
  2. No severe (multi)collinearity among the predictors

We can quantify multicollinearity with the variance inflation factor (VIF).

car::vif(glmFit)

##      age      sex     fare 
## 1.031829 1.007699 1.026373

VIF > 10 indicates severe multicollinearity.

3 The outcome is independently and identically binomially distributed

The distributional assumptions of logistic regression are not framed in terms of residuals.

This assumption fills the role played by all residual-related assumptions in linear regression.

Instead we consider the entire outcome distribution in logistic regression.

$$Y_n \overset{iid}{∼} \operatorname{Bin}(\hat\pi_n, 1)$$ $$\hat\pi_n=\operatorname{logistic}\left( \hat\beta_0 + ∑_{p=1}^{P} \hat\beta_p X_{np} \right)$$

Checking Assumption 3

Assumption 3 implies several conditions.

  1. The outcome, Y, is binary.
  2. The linear predictor, $η$, can explain all the systematic trends in $π$.
    • No residual clustering after accounting for X.
    • No important variables omitted from X.

We can easily check the first condition with summary statistics.

levels(titanic$survived)

## [1] "no"  "yes"

table(titanic$survived)

##  no yes 
## 545 342

If we have a non-binary, categorical outcome, we can use a different type of model.

  • Multiclass nominal variables: multinomial logistic regression
    • nnet::multinom()
  • Ordinal variables: Proportional odds logistic regression
    • MASS::polr()
  • Counts: Poisson regression
    • glm() with family = "poisson"

The binomial distribution is also appropriate for modeling the proportion of successes in N trials.

For the second condition:

We can check for residual clustering by calculating the ICC using deviance residuals.

## Check for residual dependence induced by 'class':
ICC::ICCbare(x = titanic$class, y = resid(glmFit, type = "deviance"))

## [1] 0.1054665

Residuals

In logistic regression the outcome is binary, $Y∈\{0,1\}$, but the parameter that we’re trying to model is continuous, $π ∈ (0,1)$.

  • Due to this mismatch in measurement levels, we don’t have a natural definition of a “residual” in logistic regression

The most basic residual is the raw residual, $e_n$

  • The difference between the observed outcome value and the predicted probability

$$e_n=Y_n-\hat\pi_n$$

We train a model on the titanic dataset:

## Read the data:
titanic <- readRDS(paste0(dataDir, "titanic.rds"))

## Estimate the logistic regression model:
glmFit <- glm(survived ~ age + sex + fare,
              data = titanic,
              family = "binomial")

## Save the linear predictor estimates:
titanic$etaHat <- predict(glmFit, type = "link")

Raw residual plot:

library(ggplot2)

## Calculate the raw residuals:
titanic$e <- resid(glmFit, type = "response")

## Plot raw residuals vs. fitted
## linear predictor values:
ggplot(titanic, aes(etaHat, e)) +
    geom_point() +
    geom_smooth() +
    theme_classic() +
    xlab("Linear Predictor") +
    ylab("Raw Residual")

Assumtions_of_Logistic_Regression/2024-01-10_14-31-15_screenshot.png

Pearson residuals, $r_n$, are scaled raw residuals $$r_n=\frac {e_n}{\sqrt{\hat\pi_n(1-\hat\pi_n)}}$$

## Calculate the Pearson residuals:
titanic$r <- resid(glmFit, type = "pearson")

Deviance residuals, $d_n$, are derived directly from the objective function used to estimate the model.

$$d_n=\operatorname{sign}(e_n)\sqrt{-2\left[Y_n \ln(\hat\pi_n)+(1-Y_n)\ln(1-\hat\pi_n)\right]}$$

The residual deviance, $D$, is the sum of squared deviance residuals.

  • Quantifies how well the model fits the data

$$D=∑_{n=1}^{N} d^2_n$$
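
A small sketch extracting the deviance residuals and checking that their squares sum to the residual deviance (using the glmFit object fit above):

## Calculate the deviance residuals:
titanic$d <- resid(glmFit, type = "deviance")

## The residual deviance equals the sum of the squared deviance residuals:
sum(titanic$d^2)
deviance(glmFit)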

Computational Considerations

In addition to the preceding statistical assumptions, we must satisfy three computational requirements that were not necessary in linear regression.

1 Sufficient sample size

Logistic regression models are estimated with numerical methods, so we need larger samples than we would for linear regression models.

  • The sample size requirements increase with model complexity.

Some suggested rules of thumb:

  • 10 cases for each predictor (Agresti, 2018)
  • $N = 10P/π_0$ (Peduzzi, Concato, Kemper, Holford, & Feinstein, 1996)
    • $P$: Number of predictors
    • $π_0$: Proportion of the minority class
  • $N = 100 + 50P$ (Bujang, Omar, & Baharum, 2018)
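
As a quick worked example (using the titanic minority-class proportion of roughly $0.386$ shown in the next subsection): with $P = 3$ predictors, the Peduzzi et al. rule gives $N = 10 · 3/0.386 ≈ 78$, while the Bujang et al. rule gives $N = 100 + 50 · 3 = 250$.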

2 Outcome classes are sufficiently balanced

Logistic regression may not perform well when the outcome classes are severely imbalanced.

  • e.g., far more of one class than the other
with(titanic, table(survived) / length(survived))

##        no       yes 
## 0.6144307 0.3855693

We have a few possible solutions for problematic imbalance:

  • Down-sampling the majority class
    • Randomly remove some samples from the majority class
  • Up-sampling the minority class
    • Randomly select samples from the minority class to use multiple times
  • Use weights when estimating the logistic regression model
    • weights argument in glm()

3 There is no perfect prediction

We don’t actually want to perfectly predict class membership.

  • The model cannot be estimated when the classes are perfectly separable.

Why?

  • Imagine the shape the fitted logistic curve would need in order to separate the classes perfectly
  • The slope would have to be nearly flat, then jump from 0 to 1 at the point of separation
  • To achieve this, the slope at that point would have to be infinite, which obviously cannot be estimated

Model regularization (e.g., a ridge or LASSO penalty) may help, as sketched below.

  • glmnet::glmnet()
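
A minimal sketch of a ridge-penalized logistic regression with glmnet (illustrative only; the predictor set and the ridge penalty, alpha = 0, are assumptions rather than part of the original notes):

library(glmnet)

## glmnet() needs a numeric predictor matrix and cannot handle missing values:
dat <- na.omit(titanic[c("survived", "age", "sex", "fare")])
X   <- model.matrix(survived ~ age + sex + fare, data = dat)[, -1]
y   <- dat$survived

## Ridge-penalized logistic regression (alpha = 0; alpha = 1 gives the LASSO),
## with the penalty weight chosen by cross-validation:
cvFit <- cv.glmnet(X, y, family = "binomial", alpha = 0)
coef(cvFit, s = "lambda.min")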

Influential Cases

As with linear regression, we need to deal with any overly influential cases.

  • We can use the linear predictor values to calculate Cook’s Distances.
  • Any case that exerts undue influence on the linear predictor will have the same effect on the predicted success probabilities.
cooks.distance(glmFit) %>% plot()

Assumptions_of_Logistic_Regression/2024-01-10_16-41-13_screenshot.png

plot(glmFit, 4)

Assumptions_of_Logistic_Regression/2024-01-10_16-42-16_screenshot.png

Classification Performance Measures

Confusion Matrix

One of the most direct ways to evaluate classification performance is the confusion matrix.

## Add predictions to the dataset:
titanic %<>%
    mutate(piHat = predict(glmFit, type = "response"),
           yHat = as.factor(ifelse(piHat <= 0.5, "no", "yes")))

library(caret)

cMat <- titanic %$% confusionMatrix(data = yHat, reference = survived)

cMat$table

##           Reference
## Prediction  no yes
##        no  458 106
##        yes  87 236

Each cell in the confusion matrix represents a certain classification result:

  • TP true positive: Correctly predicted positive
  • TN true negative: Correctly predicted negative
  • FP false positive: Falsely predicted positive
  • FN false negative: Falsely predicted negative
cMat$overall

##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##   7.824126e-01   5.359709e-01   7.537802e-01   8.091549e-01   6.144307e-01 
## AccuracyPValue  McnemarPValue 
##   7.471405e-27   1.950898e-01 

cMat$byClass

##          Sensitivity          Specificity       Pos Pred Value 
##            0.8403670            0.6900585            0.8120567 
##       Neg Pred Value            Precision               Recall 
##            0.7306502            0.8120567            0.8403670 
##                   F1           Prevalence       Detection Rate 
##            0.8259693            0.6144307            0.5163472 
## Detection Prevalence    Balanced Accuracy 
##            0.6358512            0.7652127 

Summary statistics of the confusion matrix:

  • $Accuracy = (TP+TN)/(P+N)$
  • $Error\ Rate = (FP+FN)/(P+N)=1-Accuracy$
  • $Sensitivity = TP/(TP+FN)$ <<sensitivity>>
  • $Specificity = TN/(TN+FP)$ <<specificity>>
  • $False\ Positive\ Rate (FPR)=FP/(TN+FP)=1-Specificity$
  • $Positive\ Predictive\ Value\ (PPV) = TP / (TP + FP)$
  • $Negative\ Predictive\ Value\ (NPV) = TN / (TN + FN)$
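
As a quick check against the caret output above (a sketch; caret treats the first factor level, “no”, as the positive class by default):

## Reproduce accuracy and sensitivity from the confusion matrix by hand:
tab <- cMat$table

sum(diag(tab)) / sum(tab)          # Accuracy    = (TP + TN) / (P + N)
tab["no", "no"] / sum(tab[, "no"]) # Sensitivity = TP / (TP + FN)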

ROC Curve

A receiver operating characteristic (ROC) curve illustrates the diagnostic ability of a binary classifier for all possible values of the classification threshold.

The area under the ROC curve (AUC) is a one-number summary of the potential performance of the classifier.

  • The AUC does not depend on the classification threshold.
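
The rocData object used below can be constructed from the observed outcomes and the predicted probabilities (a sketch, assuming the piHat values computed in the Confusion Matrix section):

library(pROC)

## Build the ROC object and plot the curve:
rocData <- roc(response = titanic$survived, predictor = titanic$piHat)
plot(rocData)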

pROC::auc(rocData)

## Area under the curve: 0.8298

Rule of thumb for AUC according to Mandrekar (2010):

  • AUC value from 0.7 – 0.8: Acceptable
  • AUC value from 0.8 – 0.9: Excellent
  • AUC value over 0.9: Outstanding

Threshold Selection

We can use numerical methods to estimate an optimal threshold value

library(OptimalCutpoints)
ocOut <- optimal.cutpoints(X = "piHat",
                           status = "survived",
                           tag.healthy = "no",
                           data = titanic,
                           method = "ROC01"
                           )

partSummary(ocOut, -1)

Area under the ROC curve (AUC): 0.83 (0.802, 0.858)

CRITERION: ROC01

Number of optimal cutoffs: 1

                     Estimate
cutoff              0.2360978
Se                  0.7543860
Sp                  0.7761468
PPV                 0.6789474
NPV                 0.8343195
DLR.Positive        3.3700029
DLR.Negative        0.3164531
FP                122.0000000
FN                 84.0000000
Optimal criterion   0.1104365

Cross-Entropy Error

Not exam material, but it might be useful for the practical.

Measuring classification performance from a confusion matrix can be problematic.

  • It is sometimes too coarse.

We can also base our error measure on the residual deviance with the Cross-Entropy Error: $$CEE = -N^{-1} ∑_{n=1}^{N} \left[ Y_n \ln(\hat\pi_n) + (1 - Y_n) \ln(1 -\hat\pi_n) \right]$$

  • The CEE is sensitive to classification confidence.
  • Stronger predictions are more heavily weighted
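
A small sketch of computing the CEE for the fitted titanic model (assuming the glmFit object from earlier; recoding survived to 0/1 is part of this example, not the original notes):

## Observed outcomes as 0/1 and predicted success probabilities:
y     <- as.numeric(titanic$survived == "yes")
piHat <- predict(glmFit, type = "response")

## Cross-entropy error:
-mean(y * log(piHat) + (1 - y) * log(1 - piHat))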

The misclassification rate is a naïvely appealing option.

  • The proportion of cases assigned to the wrong group

Consider two perfect classifiers:

  1. $P(\hat Y_n = 1|Y_n = 1) = 0.90$, $P(\hat Y_n = 1|Y_n = 0) = 0.10$, $n = 1, 2, …, N$
  2. $P(\hat Y_n = 1|Y_n = 1) = 0.55$, $P(\hat Y_n = 1|Y_n = 0) = 0.45$, $n = 1, 2, …, N$

Both of these classifiers will have the same misclassification rate.

  • Neither model ever makes an incorrect group assignment.

The first model will have a lower CEE.

  • The classifications are made with higher confidence.
  • $CEE_1 = 0.105$, $CEE_2 = 0.598$