# R machine learning workflow tutorial
Tongli Su, Yifan Lu
```{r, include=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```
### Motivation
Our project demonstrates the standard workflow of a machine learning prediction task in R with the `caret` package. It can serve as a cheat sheet for people who are unfamiliar with the essential steps of a machine learning task: a quick guide to the R implementation and a reference for checking whether any necessary step is missing. Many simple machine learning tasks can be completed by copy-pasting the code sections from this cheat sheet with minor tweaks to the parameters.

In completing the project, we reviewed the standard machine learning workflow and strengthened our knowledge of its R implementation. Both of us were CS majors in college, so we are familiar with machine learning, but neither of us had done a machine learning project in R before. We therefore explored the functions in the `caret` package and learned the implementation of each step of the workflow, from data preprocessing to model evaluation.
### Machine Learning in R
Machine learning is one of the most popular tools for uncovering the information carried by a dataset. Well-established machine learning packages make prediction with machine learning models a simple and standardized process. In this blog post, we walk through the steps of a machine learning prediction task and provide examples using the `caret` package in R.
Steps in carrying out a machine learning prediction task:
1. Data pre-processing
2. Feature engineering
3. Feature selection
4. Model selection
5. Hyperparameter tuning
6. Model evaluation
### Import Packages and Example Dataset
```{r}
library(ggplot2)
library(dplyr)
library(tidyr)
library(carData)
library(caret)
# example data: 2008-09 nine-month salaries of professors at a US college
df <- Salaries
head(df)
```
### Preprocessing
Data preprocessing includes transforming the data into the form the model expects as input and scaling the data for models that are sensitive to scale (most regression models are). One common trick for handling categorical data is to transform it into binary variables. This can be done in many ways, including one-hot encoding, target encoding, and ordinal encoding. Here we demonstrate one-hot encoding, which assigns a binary variable to each class of a categorical feature; the number of new binary variables created therefore equals the number of levels of the original categorical variable. For example, if a feature 'Gender' has values 'Male' and 'Female', one-hot encoding transforms it into two variables, 'is_Male' and 'is_Female'.
```{r}
# transform categorical features to binary features with one-hot encoding
dummy <- dummyVars(" ~ .", data=df)
newdata <- data.frame(predict(dummy, newdata = df))
head(newdata)
```
Example preprocessing functions:

- `dummyVars`: transforms the input into one-hot encoded format.
- `findCorrelation`: takes a correlation matrix as input and helps decide which predictors to remove to reduce collinearity.
- `findLinearCombos`: uses QR decomposition to enumerate sets of linearly dependent columns.
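The last two functions can be chained to screen out redundant columns before modeling. Below is a minimal sketch, assuming all columns of `newdata` are numeric (which holds after one-hot encoding); the 0.9 correlation cutoff is an illustrative choice, not a recommendation.

```{r}
# screen the encoded data for highly correlated and linearly dependent columns
corMatrix <- cor(newdata)
highCor   <- findCorrelation(corMatrix, cutoff = 0.9) # indices of columns to drop
combos    <- findLinearCombos(as.matrix(newdata))     # exact linear dependencies

dropCols <- unique(c(highCor, combos$remove))
if (length(dropCols) > 0) {
  newdataReduced <- newdata[, -dropCols]
} else {
  newdataReduced <- newdata
}
```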
### Data Splitting
Before training a machine learning model, we split the dataset into a train set and a test set as a standard measure. If the dataset is large enough, we might instead split it into a development set and a test set, and further divide the development set into a train set and a validation set. In most cases, however, we simply split the data into train and test sets and use more data-efficient methods, such as k-fold cross validation or leave-one-out cross validation, to validate the model.
```{r}
set.seed(1)  # for a reproducible split
trainIndex <- createDataPartition(df$salary, p = .8, # proportion of the train set
                                  list = FALSE)      # return a matrix, not a list
# train-test split
train <- df[ trainIndex, ]
test  <- df[-trainIndex, ]
```
Example splitting functions:

- `createDataPartition`: creates balanced splits of the data.
- `groupKFold`: splits the data into folds based on a grouping variable, so that all rows from one group fall into the same fold.
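As a sketch of `groupKFold`, the folds it returns can be passed to `trainControl` through its `index` argument. Using `rank` as the grouping variable here is purely illustrative.

```{r}
# build folds in which each rank level is held out exactly once
groupFolds <- groupKFold(df$rank, k = 3)  # rank has three levels, so k <= 3
groupCtrl  <- trainControl(method = "cv", index = groupFolds)
```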
### Train Linear Regression Model
Train a linear regression model on the train set.
```{r}
linearModel <- train(salary ~ ., data = train,
                     method = "lm") # linear regression model
linearModel
```
Here we chose a linear regression model. `caret` has 238 available models; https://topepo.github.io/caret/available-models.html provides more information about model selection.
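To try a different model, look up its tuning parameters with `modelLookup` and swap the `method` string in `train`. A random forest (`method = "rf"`) is an illustrative choice here; it requires the randomForest package to be installed, so the training call is left commented out.

```{r}
modelLookup("rf")  # list the tunable parameters of the "rf" model
# rfModel <- train(salary ~ ., data = train, method = "rf")
```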
### Cross Validation
k-fold cross validation is often used for model selection and performance evaluation. It works by splitting the train set into k equal-sized partitions, training the model on k-1 partitions, and evaluating its performance on the remaining partition. This process is repeated k times, with each partition serving as the validation set once.
```{r}
fitControl <- trainControl(## 10-fold CV
method = "repeatedcv",
number = 10,
## repeated ten times
repeats = 10)
linearModel2 <- train(salary ~ ., data = train,
                      method = "lm",
                      trControl = fitControl)
linearModel2
```
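The workflow lists hyperparameter tuning as its own step, but plain linear regression has essentially nothing to tune. As a sketch with a model that does, the grid below tunes an elastic net via `method = "glmnet"`; the grid values are illustrative and the glmnet package must be installed, so the chunk is not evaluated here.

```{r, eval=FALSE}
# grid search over elastic-net hyperparameters with repeated CV
glmnetGrid <- expand.grid(alpha  = c(0, 0.5, 1),     # mix of ridge and lasso
                          lambda = c(0.01, 0.1, 1))  # regularization strength
glmnetModel <- train(salary ~ ., data = train,
                     method = "glmnet",
                     trControl = fitControl,
                     tuneGrid = glmnetGrid)
```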
### Make Predictions
Make predictions on the test set with the trained model.
```{r}
prediction <- predict(linearModel, newdata = test)
prediction
```
### Evaluate Performance
Model performance on the test set can be measured by various metrics. For regression tasks like ours, the most common are RMSE, MAE, and R-squared, which `postResample` reports. For classification, frequently used metrics include precision, recall, F1-score, and AUC. Worth noting: accuracy is often a poor metric when the dataset is highly imbalanced, since predicting every sample as the majority class would still produce a nice-looking accuracy score. In that case, metrics like the ROC curve and AUC score are a better option.
```{r}
postResample(pred = prediction, obs = test$salary)
```
Other performance evaluation functions:

- `confusionMatrix`: measures the performance of a classification algorithm by tabulating the true positives, true negatives, false positives, and false negatives of the predicted outcomes.
- `twoClassSummary`: summarizes performance on two-class data with the ROC AUC, sensitivity, and specificity.
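Since our salary example is a regression task, here is a small classification sketch so `confusionMatrix` has something to evaluate. Predicting `sex` from the other columns is purely illustrative; note that this outcome is heavily imbalanced in the Salaries data, which is exactly the situation discussed above.

```{r}
# illustrative classification: logistic regression predicting 'sex'
clfModel <- train(sex ~ ., data = train, method = "glm")
clfPred  <- predict(clfModel, newdata = test)
confusionMatrix(clfPred, test$sex)
```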
### Citation
https://topepo.github.io/caret/