---
title: "Chapter 3: Logistic regression"
date: "*2020-11-12*"
output:
  html_document:
    toc: true
    toc_float:
      smooth_scroll: true
      collapsed: false
    toc_depth: 2
    number_sections: true
---
```{r, echo=FALSE, results=FALSE}
#Let's make our reports reproducible
set.seed(134856791)
```
***
# Description of data
The data set used in this exercise has been previously parsed from [this](https://archive.ics.uci.edu/ml/datasets/Student+Performance) source using an R script available [here](https://github.com/MB-Finski/IODS-project/data).
The original purpose of the study was to find factors associated with the students' school performance. The parsed data consists of students in two Portuguese schools. For the purposes of the study, the background and alcohol consumption of the students were recorded through questionnaires, while the students' performance in mathematics and Portuguese language was collected from school reports. The records for the two subjects were combined in the above-mentioned data handling script. More exhaustive metadata are available [here](https://archive.ics.uci.edu/ml/datasets/Student+Performance).
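Roughly, the wrangling amounts to joining the two course files on the shared background variables and combining the course-specific columns. Below is a minimal sketch of such a merge (file and column names are assumed from the UCI archive rather than taken from the linked script, hence the chunk is not evaluated):
```{r eval=FALSE}
#Sketch only: join the math and Portuguese course files on shared background columns
math <- read.csv("student-mat.csv", sep = ";")
por <- read.csv("student-por.csv", sep = ";")
join_cols <- c("school", "sex", "age", "address", "famsize", "Pstatus",
               "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet")
joined <- merge(math, por, by = join_cols, suffixes = c(".math", ".por"))
#Combine the subject-specific grades, e.g. by averaging the final grade
joined$G3 <- (joined$G3.math + joined$G3.por) / 2
```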
***
## Variable decoding
For an explanation of the individual variables, please refer to the following:
***
Variable name | Explanation
--- | -------------
*school* | Student's school (either GP or MS)
*sex* | Gender (binary: F or M)
*age* | Age of the student
*address* | Student's residence (binary: urban (U) versus rural (R))
*famsize* | Family size (binary: less than or equal to three (LE3) versus greater than three (GT3))
*Pstatus* | Parents' cohabitation status: together (T) versus apart (A)
*Medu* | Mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
*Fedu* | Father's education (numeric: categories identical to above)
*Mjob* | Mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
*Fjob* | Father's job (nominal: see categories above)
*reason* | Reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
*guardian* | Student's guardian (nominal: 'mother', 'father' or 'other')
*traveltime* | Home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
*studytime* | Weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
*failures* | Number of past class failures (numeric: n if 1<=n<3, else 4)
*schoolsup* | Extra educational support (binary: yes or no)
*famsup* | Family educational support (binary: yes or no)
*paid* | Extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
*activities* | Extra-curricular activities (binary: yes or no)
*nursery* | Attended nursery school (binary: yes or no)
*higher* | Wants to take higher education (binary: yes or no)
*internet* | Internet access at home (binary: yes or no)
*romantic* | In a romantic relationship (binary: yes or no)
*famrel* | Quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
*freetime* | Free time after school (numeric: from 1 - very low to 5 - very high)
*goout* | Going out with friends (numeric: from 1 - very low to 5 - very high)
*Dalc* | Workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
*Walc* | Weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
*health* | Current health status (numeric: from 1 - very bad to 5 - very good)
*absences* | Number of school absences (numeric: from 0 to 93)
*G1* | First period combined grade (numeric: from 0 to 20)
*G2* | Second period grade (numeric: from 0 to 20)
*G3* | Final grade (numeric: from 0 to 20, output target)
*alc_use* | Average of Dalc and Walc
*high_use* | Binary: TRUE if alc_use > 2, FALSE otherwise
***
```{r out.width="100%"}
#Read the alc_data table
alc_data <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", stringsAsFactors = TRUE, sep = ",", header = TRUE)
#... and print out a summary of the variable names as requested, although
#the text output is quite ugly:
str(alc_data)
```
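The two derived columns at the bottom of the variable table can be recomputed from *Dalc* and *Walc*, which doubles as a quick sanity check on the parsed data:
```{r}
#Sanity check: recompute the derived columns defined in the variable table
all(alc_data$alc_use == (alc_data$Dalc + alc_data$Walc) / 2)
all(alc_data$high_use == (alc_data$alc_use > 2))
```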
***
## Summary table and data exploration
Let's scope out the variables, using "high_use" as the contrast, to identify variables that
may be associated with alcohol use.
Please note that I have elected to treat the Likert-scale variables
as continuous variables here, so I won't factorize them.
```{r out.width="100%", warning=FALSE, message=FALSE}
library(gtsummary)
continuous_vars <- list(c(age, Medu, Fedu, failures, famrel, freetime, goout, Dalc, Walc,
                          health, absences, G1, G2, G3, alc_use) ~ "continuous")
#Print a summary table of the data with high_use as the contrast. Also add an "overall" column
my_contrast_table <- tbl_summary(alc_data,
                                 by = high_use,
                                 digits = all_continuous() ~ 2,
                                 statistic = list(all_continuous() ~ "{mean} ({sd})",
                                                  all_categorical() ~ "{n} / {N} ({p}%)"),
                                 type = continuous_vars)
#Add some missing elements to the table and print it out
my_contrast_table %>% add_overall() %>% bold_labels() %>% add_p()
```
***
### Interpretation
***
First, it should be emphasized that while I did print out the p-values for the explorative statistical analyses,
the p-values should not be mistaken for effect sizes! I.e. direct comparisons of the variables' performance as regressors for "high_use"
should not be made based on the p-values alone.
However, odds ratios are reversible for outcome and potential exposure (i.e. the OR for an outcome given an
exposure is equal to the OR for an exposure given an outcome). This means that the strategy of finding a "reverse association" between our response and potential regressors
is useful for identifying good regressors.
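To see why, consider a small numeric sketch (hypothetical counts, not from our data): both conditional odds ratios reduce to the same cross-product ratio of the 2x2 table.
```{r}
#Hypothetical 2x2 table of exposure (rows) vs outcome (columns)
tab <- matrix(c(30, 10, 20, 40), nrow = 2,
              dimnames = list(exposure = c("yes", "no"),
                              outcome = c("yes", "no")))
#OR of the outcome given the exposure...
(tab["yes", "yes"] / tab["yes", "no"]) / (tab["no", "yes"] / tab["no", "no"])
#...equals the OR of the exposure given the outcome (both are 30*40 / (20*10) = 6)
(tab["yes", "yes"] / tab["no", "yes"]) / (tab["yes", "no"] / tab["no", "no"])
```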
As a side note, and beyond the scope of this chapter, we may be missing potential interactions in the data
with this type of modeling strategy. For example, the data come from two different schools, which suggests that we would be better off using a hierarchical
modeling strategy. Yet, as mentioned, this is beyond the scope of the assignment.
Finally, the type of *a priori* exploration we used for identifying the potential regressors increases the risk of finding spurious associations in our
data. Therefore, we should clearly state that the findings of any subsequent analyses are explorative in nature.
***
# Choosing the potential regressors
***
Based on the explorative analyses above, I chose the following variables as regressors:
1. sex
2. absences
3. goout
4. studytime
***
A viable hypothesis for how each of these variables might be associated with alcohol use can be formed:
1. Males have traditionally been viewed as more prone to drinking alcohol.
2. Absences may be the result of or the enabler of excessive drinking.
3. Going out with friends and socializing may be linked with increased alcohol use.
4. Students who study hard don't have time to drink.
***
## (Graphical) Inspection of regressors
***
Let's print a more compact table than the one above, with _cross tabulations_ of the categorical regressors:
***
```{r out.width="100%", warning=FALSE, message=FALSE}
#dplyr is needed for select() and the pipe
library(dplyr)
#Create a subset table of the variables of interest
alc_subset <- alc_data %>% select(sex, high_use, absences, goout, studytime)
continuous_vars <- list(c(absences) ~ "continuous")
#Print a summary table of the data with high_use as the contrast. Also add an "overall" column
my_contrast_table <- tbl_summary(alc_subset,
                                 by = high_use,
                                 digits = all_continuous() ~ 2,
                                 statistic = list(all_continuous() ~ "{mean} ({sd})",
                                                  all_categorical() ~ "{n} / {N} ({p}%)"),
                                 type = continuous_vars)
#Add some missing elements to the table and print it out
my_contrast_table %>% add_overall() %>% bold_labels()
```
***
### Boxplot of absences
***
Print a boxplot of absences:
```{r out.width="100%", warning=FALSE, message=FALSE}
library(ggplot2)
#Create the boxplot (fill is mapped to high_use so the fill scale below takes effect)
my_boxplot <- ggplot(alc_data, aes(x = high_use, y = absences, fill = high_use)) + theme(legend.position = "none")
my_boxplot <- my_boxplot + geom_boxplot(outlier.colour = "black", outlier.shape = 16, outlier.size = 2, notch = FALSE)
#Adjust outline color
my_boxplot <- my_boxplot + scale_color_manual(values = c("black", "black"))
#Adjust fill color
my_boxplot <- my_boxplot + scale_fill_manual(values = c("#b4b4b4", "#4e4e4e"))
#Set proper labels for both axes
my_boxplot <- my_boxplot + labs(x = "High alcohol use", y = "Absences", fill = "")
#Set the graph title
my_boxplot <- my_boxplot + ggtitle("Absences vs high alcohol use")
#Print the plot
my_boxplot
```
***
### Interpretation of cross tabulations and boxplot
***
Our hypothesis forming was exploratory (see above), so we have by and large already covered this topic.
The cross tabulations and the boxplot support the presented hypotheses for all our variables: there is a clear association between high_use and each of the chosen regressors.
The distribution of absences is understandably quite skewed, reminiscent of a chi-square distribution. This is usually not an issue with parametric statistical methods. There are also some potential outliers in both groups for absences.
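As an optional check (not required by the assignment), a quick histogram makes the skew visible:
```{r out.width="100%", warning=FALSE, message=FALSE}
#Optional check: the right-skewed distribution of absences in both groups
ggplot(alc_data, aes(x = absences)) +
  geom_histogram(binwidth = 2, fill = "#b4b4b4", color = "black") +
  facet_wrap(~high_use, labeller = label_both) +
  labs(x = "Absences", y = "Count")
```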
***
# Logistic regression
***
Create the logistic regression model:
```{r out.width="100%", warning=FALSE, message=FALSE}
my_reg_model <- glm(high_use ~ absences + sex + goout + studytime, data = alc_data, family = "binomial")
tbl_regression(my_reg_model, exponentiate = TRUE) %>% bold_labels()
summary(my_reg_model)
```
***
## Interpretation of the model
***
In this case the intercept of the model is of no particular interest to us. All regressors have a statistically significant association with the response.
As an example of interpreting the odds ratios: each increase of one in absences was associated with an odds ratio (OR) of 1.06 (95% CI 1.03 -- 1.10) for high alcohol use.
For the categorical variables we interpret the results as follows: male students had 2.25 (95% CI 1.33 -- 3.84) times greater odds
of high alcohol use than female students.
It is crucial to note that this **does not** mean that a student with one absence was 1.06 times more likely to
have high alcohol use than a student with none! Odds ratios (usually) cannot be used to make inferences about relative probabilities; only in some edge cases can you equate RRs and ORs (e.g. under the rare disease assumption).
We can simply state that the *odds* of high alcohol use were 1.06 times greater for a student with one absence than for a student with no absences.
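A hypothetical numeric example (not from our data) shows how far apart the two measures can be when the outcome is common:
```{r}
#Hypothetical outcome probabilities in two groups
p_exposed <- 0.50
p_unexposed <- 0.25
#Risk ratio: the ratio of the probabilities
p_exposed / p_unexposed
#Odds ratio: the ratio of the odds p / (1 - p); here 3 rather than 2
(p_exposed / (1 - p_exposed)) / (p_unexposed / (1 - p_unexposed))
```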
The numerical value of an OR also does not equate to effect size. For example, the per-unit odds ratio for absences is very small, but the effect size in this case
also depends on the distribution of absences in the data: a student with 10 absences has an odds ratio of 1.06^10 = 1.79 for high alcohol use.
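This multiplicative scaling can be read straight off the fitted coefficients (the exact figures depend on the model fit above):
```{r}
#OR for one additional absence, and for ten additional absences
exp(coef(my_reg_model)["absences"])
exp(coef(my_reg_model)["absences"] * 10)
```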
While exact odds ratios and even risk ratios can be calculated within our data sample, there remains uncertainty when making inferences from our data
to more general populations. This uncertainty is described by the confidence intervals (CI) of our OR estimates. Various factors, such as regressor variance and sample size,
affect the width of the confidence intervals.
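For reference, these intervals can also be pulled directly from the model object; exponentiating the profile-likelihood intervals gives them on the OR scale (they should match the regression table above):
```{r warning=FALSE, message=FALSE}
#Profile-likelihood confidence intervals on the odds-ratio scale
exp(confint(my_reg_model))
```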
In conclusion, the data mostly support our hypotheses: absences, male gender, and going out were associated (OR > 1) with high alcohol consumption,
while increased study time was associated with lower odds (OR < 1) of high alcohol consumption.
***
# Model predictions and performance
***
Print a cross tabulation of outcome vs model prediction, with row and column margins:
```{r out.width="100%", warning=FALSE, message=FALSE}
library("dplyr")
#Add a column model_probability for the outcome:
alc_data <- mutate(alc_data,model_probability = predict(my_reg_model, type = "response"))
#Add a column with binary model_prediction:
alc_data <- mutate(alc_data,model_prediction = model_probability > 0.5)
#Print a cross tabulation
tbl_cross(alc_data,row = high_use, col = model_prediction, percent ="cell")
```
***
Print a scatter plot visualizing predictions vs outcome:
```{r out.width="100%", warning=FALSE, message=FALSE}
my_scatter_plot <- ggplot(alc_data, aes(x = model_probability, y = high_use, color = (model_prediction==high_use)))
my_scatter_plot <- my_scatter_plot+geom_point() + labs(x = "Model probability", y = "High alcohol use",color= "Model prediction correct")
my_scatter_plot
```
***
Calculate the training error and compare the model to a guessing strategy:
***
```{r out.width="100%", warning=FALSE, message=FALSE}
#Calculate the percentage of wrong predictions:
mean(alc_data$high_use != alc_data$model_prediction)
#Just for "shits and giggles", run a rudimentary Monte Carlo simulation
#with 1000000 runs to calculate the percentage of wrong predictions made
#by guessing whether a student has high alcohol use, based on knowledge of
#the high_use prevalence in our data.
#I.e. we draw a "guess_vector" with the same proportion of TRUE/FALSE
#guesses as the prevalence of high_use in our data.
#Place the relevant vectors into separate variables to streamline everything
#for the Monte Carlo loop:
high_use <- alc_data$high_use
probability_of_heavy_alc <- mean(high_use)
n_of_students <- length(high_use)
i <- 0
mc_runs <- 1000000
mc_mean_pred_error <- 0
while (i < mc_runs) {
  i <- i + 1
  #Create a vector of uniform random numbers between 0 and 1;
  #essentially a guess on the high_use status of each student
  random_vector <- runif(n_of_students, 0, 1)
  #Convert the random numbers to the corresponding random binary guesses
  guess_vector <- (random_vector <= probability_of_heavy_alc)
  #Calculate the proportion of wrong guesses
  mc_mean_pred_error <- mc_mean_pred_error + mean(high_use != guess_vector)
}
#Calculate the mean over all runs
mc_mean_pred_error <- mc_mean_pred_error / mc_runs
#Expected result derived arithmetically (see the derivation below):
expected_result <- 2 * (probability_of_heavy_alc * (1 - probability_of_heavy_alc))
expected_result
#Print the result of the Monte Carlo run:
mc_mean_pred_error
#For comparison: the proportion of students without high alcohol use,
#i.e. the accuracy of always guessing FALSE
1 - mean(alc_data$high_use)
```
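The closed-form expectation used above follows directly: a guess is wrong either when the student has high use (probability $p$) and we guess FALSE (probability $1-p$), or the other way around, so

$$P(\text{wrong}) = p(1-p) + (1-p)p = 2p(1-p).$$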
***
## Interpretation of model performance
***
The training error of our model was about 21%. Our guessing strategy was intentionally non-optimal and produced the right guess about 59% of the time.
The optimal guessing strategy would be to guess that no student has high alcohol use, which would yield a right guess about 71% of the time (or a
wrong guess 29% of the time). So, our model made roughly 28% fewer wrong predictions than the "optimal" guessing strategy (0.21 vs. 0.29). However, this holds only for the data our
model was trained on; on fresh data the performance would likely be inferior.
***
# BONUS 1: Ten-fold cross validation
***
```{r out.width="100%", warning=FALSE, message=FALSE}
#Cost function: proportion of misclassifications at a 0.5 probability threshold
cost_function <- function(response, probability) {
  mean(response != (probability > 0.5))
}
library(boot)
cross_validation <- cv.glm(data = alc_data, cost = cost_function, glmfit = my_reg_model, K = 10)
#Average proportion of wrong predictions in the cross validation:
cross_validation$delta[1]
```
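A single 10-fold CV estimate depends on the random fold assignment; an optional way to stabilize it is to average over a few repeats:
```{r warning=FALSE, message=FALSE}
#Optional: average the CV error over 20 random fold assignments
repeated_cv <- replicate(20, cv.glm(data = alc_data, cost = cost_function,
                                    glmfit = my_reg_model, K = 10)$delta[1])
mean(repeated_cv)
```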
***
Our logistic regression model had a smaller testing error (about 21%) than the
DataCamp example (about 26%) under 10-fold cross-validation.
There are various methods for selecting the best subset of variables for a GLM, and it is likely that an even better model could be found.
However, (especially step-wise) selection of regression models from large sets of candidate variables is extremely prone to bias!
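One less bias-prone alternative is penalized regression, where cross-validation tunes a single regularization parameter instead of greedily searching over variable subsets. A sketch with the glmnet package follows (my own assumption, not part of the assignment; the chunk is not evaluated):
```{r eval=FALSE}
library(glmnet)
#Design matrix from all candidate predictors (leaking variables removed)
predictors <- select(alc_data, -c(Dalc, Walc, alc_use, model_probability, model_prediction))
x <- model.matrix(high_use ~ ., data = predictors)[, -1]
y <- factor(predictors$high_use)
#LASSO-penalized logistic regression, 10-fold CV over the penalty lambda
lasso_cv <- cv.glmnet(x, y, family = "binomial", type.measure = "class")
#Misclassification error at the CV-optimal penalty, and the surviving coefficients
min(lasso_cv$cvm)
coef(lasso_cv, s = "lambda.min")
```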
***
# BONUS 2: Training vs testing error
***
```{r}
library("tibble")
#Strip the alc_data table down to the predictor variables and the response variable
bare_dataset <- select(alc_data, -c(Dalc, Walc, alc_use, model_probability, model_prediction))
#Create a regression model with all possible predictors
my_huge_reg_model <- glm(high_use ~ ., data = bare_dataset, family = "binomial")
#Prepare a table to hold our pairs of training and testing errors
comparison_table <- data.frame(matrix(ncol = 3, nrow = 0, dimnames = list(NULL, c("error", "error_type", "n_regressors"))))
comparison_table$error_type <- as.factor(comparison_table$error_type)
#A loop for backward stepping our regression model
i <- 1
while (i < ncol(bare_dataset)) {
  i <- i + 1
  #Calculate the training error for the current regression model
  probability <- predict(my_huge_reg_model, type = "response")
  training_error <- cost_function(bare_dataset$high_use, probability)
  #Calculate the testing error for the current regression model
  #using 10-fold cross validation
  testing_error <- cv.glm(data = bare_dataset,
                          cost = cost_function,
                          glmfit = my_huge_reg_model,
                          K = 10)$delta[1]
  num_of_regressors <- length(coef(my_huge_reg_model)) - 1
  #Add the relevant data to our table
  comparison_table <- add_row(comparison_table, error = training_error,
                              error_type = "Training error",
                              n_regressors = num_of_regressors)
  comparison_table <- add_row(comparison_table, error = testing_error,
                              error_type = "Testing error",
                              n_regressors = num_of_regressors)
  #Drop one variable from our regression equation with backward stepping
  #based on the Akaike Information Criterion (AIC).
  #Set k = 100 so we continue stepping even if we drop statistically
  #significant regressors
  my_huge_reg_model <- step(my_huge_reg_model, direction = "backward", steps = 1, k = 100, trace = 0)
}
#Print the requested plot comparing training and testing errors
comparison_plot <- ggplot(data = comparison_table, aes(x = n_regressors, y = error, group = error_type, color = error_type))
comparison_plot <- comparison_plot + geom_line() + geom_point()
comparison_plot <- comparison_plot + ggtitle("Training and testing errors vs number of regressors")
comparison_plot <- comparison_plot + labs(x = "Number of predictors", y = "Prediction error", color = "")
comparison_plot
```
***
As is evident from the comparison graph, an increasing number of regressors leads to overfitting, i.e.
the models become too complex and begin to describe spurious rather than true associations in the data. As a result, the training error grows ever so slightly smaller with
each added regressor (this would be true even with random variables as predictors).
Yet the testing error starts to grow with larger numbers of
regressors after reaching a minimum at around 5 regressors. This results from the decreasing external validity of the overfitted models: the predictions
of these models do not generalize, despite the very low training errors.
The "sweet spot" for the number of regressors in our data seems to be around 5 or 6.