# Principal Component Analysis
# Load packages
```{r}
library(tidyverse) # tidyverse packages (includes dplyr)
library(corrr) # correlation analysis
library(GGally) # visualizing correlation analysis
library(tidymodels) # tidymodels framework
library(here) # reproducible way to find files
theme_set(theme_minimal())
```
# Load data
Reimport the heart disease dataset.
```{r}
load(here("data", "preprocessed.RData"))
```
# Overview
## Unsupervised approaches
Unlike supervised approaches, unsupervised machine learning does not try to predict the value of a target variable. Instead, its value lies in revealing how the data separate based solely on their features: we can include all of the variables at once and see how the observations sort themselves. Unsupervised approaches are also useful as a preprocessing step for other machine learning algorithms.
Principal component analysis (PCA) is a linear transformation technique for exploring patterns in data with many, often highly correlated, variables. It distills the variation across those variables into a reduced feature space, such as a two-dimensional scatterplot.
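As a quick illustration of the idea before we build the tidymodels workflow, here is a minimal base-R sketch using the built-in `mtcars` data (not the heart disease data):
```{r}
# A minimal PCA sketch on built-in data: center and scale, then project
toy_pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Variance explained by each component
summary(toy_pca)

# Each car's coordinates on the first two components
head(toy_pca$x[, 1:2])
```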
## Correlation analysis
Run a correlation analysis and notice some problems:

- NAs
- Scaling issues
```{r}
data_original %>%
  corrr::correlate()
```
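To make those problems concrete, a quick check (a sketch using base R plus `corrr`) counts the missing values and visualizes the correlation matrix:
```{r}
# Count NAs per column -- the missing-value problem
colSums(is.na(data_original))

# Compare variable ranges -- the scaling problem
summary(data_original)

# Visualize the correlation matrix computed by correlate()
data_original %>%
  corrr::correlate() %>%
  corrr::rplot()
```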
# Preprocessing
`recipe` is essential for preprocessing multiple features at once :^)
```{r}
pca_recipe <- recipe(~., data = data_original) %>%
  # Impute NAs with the column mean
  step_impute_mean(all_predictors()) %>%
  # Center and scale the numeric variables
  step_normalize(c("age", "trestbps", "chol", "thalach", "oldpeak"))
```
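To verify the recipe behaves as expected, we can `prep()` it and inspect the processed training data (a quick sanity check; `bake(new_data = NULL)` returns the retained training set):
```{r}
pca_recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()
```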
# PCA analysis
```{r}
pca_res <- pca_recipe %>%
  step_pca(all_predictors(),
           id = "pca") %>% # the id argument identifies the PCA step
  prep()

pca_res %>%
  tidy(id = "pca")
```
## Scree plot
```{r}
# Reuse the prepped recipe from above instead of prepping again
pca_res %>%
  tidy(id = "pca", type = "variance") %>%
  filter(terms == "percent variance") %>%
  ggplot(aes(x = component, y = value)) +
  geom_col() +
  labs(x = "Principal component",
       y = "% of variance",
       title = "Scree plot")
```
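A companion view is the cumulative percent variance, which shows how many components are needed to cover a given share of the total variation (the `"cumulative percent variance"` term comes from the same `tidy()` output):
```{r}
pca_res %>%
  tidy(id = "pca", type = "variance") %>%
  filter(terms == "cumulative percent variance") %>%
  ggplot(aes(x = component, y = value)) +
  geom_line() +
  geom_point() +
  labs(x = "Principal component",
       y = "Cumulative % of variance",
       title = "Cumulative variance explained")
```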
## View component loadings
```{r}
pca_res %>%
  tidy(id = "pca") %>%
  filter(component %in% c("PC1", "PC2")) %>%
  ggplot(aes(x = fct_reorder(terms, value), y = value,
             fill = component)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(x = "Terms",
       y = "Contributions",
       fill = "Component")
```
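The overview promised a reduced two-dimensional view of the data, so we can also plot the observations themselves on the first two components (a sketch; `step_pca()` retains five components by default, and `bake(new_data = NULL)` returns the training-set scores):
```{r}
pca_res %>%
  bake(new_data = NULL) %>%
  ggplot(aes(x = PC1, y = PC2)) +
  geom_point(alpha = 0.5) +
  labs(title = "Observations projected onto PC1 and PC2")
```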
# PCA for Machine Learning
Create a 70/30 training/test split, then rebuild the recipe from above with a PCA step added.
```{r}
# Set seed for reproducibility
set.seed(1234)

# Split
split_cluster <- initial_split(data_original, prop = 0.7)

# Training set
train_set <- training(split_cluster)

# Test set
test_set <- testing(split_cluster)

# Rebuild the recipe from above on the training set, with PCA added
final_recipe <- recipe(~., data = train_set) %>%
  # Impute NAs with the column mean
  step_impute_mean(all_predictors()) %>%
  # Center and scale the numeric variables
  step_normalize(c("age", "trestbps", "chol", "thalach", "oldpeak")) %>%
  step_pca(all_predictors())

# Preprocessed training set
ggtrain <- final_recipe %>%
  prep(retain = TRUE) %>%
  juice()

# Preprocessed test set
ggtest <- final_recipe %>%
  prep() %>%
  bake(new_data = test_set)
```
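As a final sanity check (a sketch), both preprocessed sets should contain the same principal-component columns:
```{r}
# Both sets should have identical column names (PC1, PC2, ...)
identical(names(ggtrain), names(ggtest))

# Row counts should reflect the 70/30 split
dim(ggtrain)
dim(ggtest)
```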