# Principal Component Analysis
# Load packages
```{r}
library(tidyverse) # tidyverse packages (includes dplyr)
library(corrr) # correlation analysis
library(GGally) # visualizing correlation analysis
library(tidymodels) # tidymodels framework
library(here) # reproducible way to find files
theme_set(theme_minimal())
```
# Load data
Reimport the heart disease dataset.
```{r}
load(here("data", "preprocessed.RData"))
```
# Overview
## Unsupervised approaches
Unlike supervised approaches, unsupervised machine learning does not try to predict the value of a target variable. Instead, its value lies in revealing how the data separate based solely on their features: we can include all of the variables at once and see how the observations sort themselves. Unsupervised approaches are also useful as a preprocessing step for other machine learning algorithms.
Principal component analysis (PCA) is a linear transformation technique for exploring patterns in data with many, often highly correlated, variables. It distills the variation across those variables into a reduced feature space, such as a two-dimensional scatterplot.
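As a quick illustration of the idea before we build the tidymodels workflow, here is a minimal base-R sketch using the built-in `mtcars` data (not the heart disease data):
```{r}
# A minimal PCA sketch on built-in data: center and scale, then project
toy_pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Variance explained by each component
summary(toy_pca)

# Each car's coordinates on the first two components
head(toy_pca$x[, 1:2])
```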
## Correlation analysis
Run a correlation analysis and notice some problems:

- NAs
- Scaling issues
```{r}
data_original %>%
  corrr::correlate()
```
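To make those problems concrete, a quick check (a sketch using base R plus `corrr`) counts the missing values and visualizes the correlation matrix:
```{r}
# Count NAs per column -- the missing-value problem
colSums(is.na(data_original))

# Compare variable ranges -- the scaling problem
summary(data_original)

# Visualize the correlation matrix computed by correlate()
data_original %>%
  corrr::correlate() %>%
  corrr::rplot()
```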
# Preprocessing
`recipe` is essential for preprocessing multiple features at once :^)
```{r}
pca_recipe <- recipe(~., data = data_original) %>%
  # Impute NAs with the column mean
  step_impute_mean(all_predictors()) %>%
  # Center and scale the numeric variables
  step_normalize(c("age", "trestbps", "chol", "thalach", "oldpeak"))
```
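To verify the recipe behaves as expected, we can `prep()` it and inspect the processed training data (a quick sanity check; `bake(new_data = NULL)` returns the retained training set):
```{r}
pca_recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()
```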
# PCA analysis
```{r}
pca_res <- pca_recipe %>%
  step_pca(all_predictors(),
           id = "pca") %>% # the id argument identifies the PCA step
  prep()

pca_res %>%
  tidy(id = "pca")
```
## Scree plot
```{r}
# Reuse the prepped recipe from above instead of prepping again
pca_res %>%
  tidy(id = "pca", type = "variance") %>%
  filter(terms == "percent variance") %>%
  ggplot(aes(x = component, y = value)) +
  geom_col() +
  labs(x = "Principal component",
       y = "% of variance",
       title = "Scree plot")
```
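A companion view is the cumulative percent variance, which shows how many components are needed to cover a given share of the total variation (the `"cumulative percent variance"` term comes from the same `tidy()` output):
```{r}
pca_res %>%
  tidy(id = "pca", type = "variance") %>%
  filter(terms == "cumulative percent variance") %>%
  ggplot(aes(x = component, y = value)) +
  geom_line() +
  geom_point() +
  labs(x = "Principal component",
       y = "Cumulative % of variance",
       title = "Cumulative variance explained")
```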
## View component loadings
```{r}
pca_res %>%
  tidy(id = "pca") %>%
  filter(component %in% c("PC1", "PC2")) %>%
  ggplot(aes(x = fct_reorder(terms, value), y = value,
             fill = component)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(x = "Terms",
       y = "Contributions",
       fill = "Component")
```
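The overview promised a reduced two-dimensional view of the data, so we can also plot the observations themselves on the first two components (a sketch; `step_pca()` retains five components by default, and `bake(new_data = NULL)` returns the training-set scores):
```{r}
pca_res %>%
  bake(new_data = NULL) %>%
  ggplot(aes(x = PC1, y = PC2)) +
  geom_point(alpha = 0.5) +
  labs(title = "Observations projected onto PC1 and PC2")
```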
# PCA for Machine Learning
Create a 70/30 training/test split, then rebuild the recipe from above with a PCA step added.
```{r}
# Set seed for reproducibility
set.seed(1234)

# Split
split_cluster <- initial_split(data_original, prop = 0.7)

# Training set
train_set <- training(split_cluster)

# Test set
test_set <- testing(split_cluster)

# Rebuild the recipe from above on the training set, with PCA added
final_recipe <- recipe(~., data = train_set) %>%
  # Impute NAs with the column mean
  step_impute_mean(all_predictors()) %>%
  # Center and scale the numeric variables
  step_normalize(c("age", "trestbps", "chol", "thalach", "oldpeak")) %>%
  step_pca(all_predictors())

# Preprocessed training set
ggtrain <- final_recipe %>%
  prep(retain = TRUE) %>%
  juice()

# Preprocessed test set
ggtest <- final_recipe %>%
  prep() %>%
  bake(new_data = test_set)
```
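As a final sanity check (a sketch), both preprocessed sets should contain the same principal-component columns:
```{r}
# Both sets should have identical column names (PC1, PC2, ...)
identical(names(ggtrain), names(ggtest))

# Row counts should reflect the 70/30 split
dim(ggtrain)
dim(ggtest)
```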