Merge pull request UBC-MDS#85 from jianridine/main
proofreading and render the LaTeX well
jianructose authored Dec 13, 2020
2 parents 08c4af5 + 1520428 commit 22089da
Showing 4 changed files with 113 additions and 114 deletions.
7 changes: 5 additions & 2 deletions doc/wine_quality_prediction_report.Rmd
@@ -1,6 +1,9 @@
---
title: "Predicting Wine Quality Using Physicochemical Properties"

author: "Contributors: Jianru Deng, Yuyan Guo, Vignesh Lakshmi Rajakumar, Cameron Harris"

date: December 12th, 2020
always_allow_html: true
output:
html_document:
@@ -60,13 +63,13 @@ For this classification problem we determined that we could ignore class imbalan

## Preprocessing of Features

In this analysis there are eleven numeric features and one categorical feature. For this analysis, the numeric features were scaled using a standard scalar transformer, which involves removing the mean and scaling to unit variance ([\@sk-learn]). The categorical feature, wine type, only contained two values (red and white) and was treated as a binary feature.
In this analysis there are eleven numeric features and one categorical feature. The numeric features were scaled using a standard scaler transformer, which involves removing the mean and scaling to unit variance [@sk-learn]. The categorical feature, wine type, only contained two values (red and white) and was treated as a binary feature.

## Modelling and Cross Validation Scores

All models were evaluated through five-fold cross-validation on the training data set. Accuracy was used as the primary metric to evaluate model performance. For this classification problem, there is low consequence to false negative or false positive classifications; therefore, recall and precision are of low importance to us. Accuracy provides a simple, clear way to compare model performance.

$$ \\text{Accuracy} = \\frac{\\text{Number of correct predictions}}{\\text{Number of examples}} $$ The following models were evaluated to predict the wine quality label:
$$ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Number of examples}} $$ The following models were evaluated to predict the wine quality label:

- Dummy classifier
- Decision tree
31 changes: 16 additions & 15 deletions doc/wine_quality_prediction_report.html

Large diffs are not rendered by default.

187 changes: 91 additions & 96 deletions doc/wine_quality_prediction_report.md
@@ -2,23 +2,27 @@ Predicting Wine Quality Using Physicochemical Properties
================
Contributors: Jianru Deng, Yuyan Guo, Vignesh Lakshmi Rajakumar, Cameron
Harris
December 12th, 2020

- [Acknowledgements](#acknowledgements)
- [Data Summary](#data-summary)
- [Summary](#summary)
- [Introduction of the Data](#introduction-of-the-data)
- [Preprocessing of Features](#preprocessing-of-features)
- [Modelling and Cross Validation
Scores](#modelling-and-cross-validation-scores)
- [Hyperparameter Optimization](#hyperparameter-optimization)
- [Test Data Scores and
Conclusions](#test-data-scores-and-conclusions)
- [Further Improvements and Potential
Problems](#further-improvements-and-potential-problems)
- [References](#references)

## Acknowledgements

The data set was produced by P. Cortez, A. Cerdeira, F. Almeida, T.
Matos and J. Reis. Modeling wine preferences by data mining from
physicochemical properties. In Decision Support Systems, Elsevier,
47(4):547-553, 2009. It was sourced from (Dua and Graff 2017) Dua, D.
47(4):547-553, 2009. It was sourced from (Dua and Graff 2019) Dua, D.
and Graff, C. (2019). UCI Machine Learning Repository
\[<http://archive.ics.uci.edu/ml>\]. Irvine, CA: University of
California, School of Information and Computer Science.
Expand All @@ -30,17 +34,35 @@ help of the Knitr (Xie 2014), kableExtra (Zhu 2020), docopt(de Jonge
2018), pandas (team 2020) package. File paths were managed using the
Here (Müller 2020) package.

## Data Summary
## Summary

In this project, we aim to build a supervised machine learning pipeline
and fit a best-performing classification model to predict whether a wine
is “good” (6 or higher out of 10) or “bad” (5 or lower out of 10) based
on its physicochemical properties. After carrying out the model
comparison, the best-performing model turned out to be the random forest
classifier. It performed reasonably on the unseen test set, with a test
accuracy of 0.85: the test set consisted of 1300 examples, and the model
correctly predicted 85% of them. The test set is also fairly large, so
the test accuracy can be trusted as a good approximation of the model's
performance on deployment data. Whether this accuracy is sufficient
depends on the specific application of the model; if the downstream
application requires higher accuracy, we propose to keep improving the
model with more advanced techniques. More suggestions can be found in
the last section of the report, “Further Improvements and Potential
Problems”.

## Introduction of the Data

The [dataset](https://archive.ics.uci.edu/ml/datasets/Wine+Quality) used
for this prediction task contains physicochemical properties (features)
of nearly 5000 wines and their wine quality as a score from 0-10
(targets). The wine quality scores were determined by human wine taste
preference (median of at least 3 evaluations made by wine experts). Each
expert graded the wine quality between 0 (very bad) and 10 (very
excellent). The features (physicochemical properties) were determined by
analytic tests. Overall there were eleven numeric features and one
categorical feature (see
expert graded the wine quality between 0 (very bad) and 10 (excellent).
The features (physicochemical properties) were determined by analytic
tests. Overall there were eleven numeric features and one categorical
feature (see
[README.md](https://github.com/UBC-MDS/dsci-522-group14/blob/main/README.md)
file for more details on features).

Expand All @@ -61,12 +83,12 @@ a binary classification one (i.e. good vs. bad wine). The first
observation, which came from a distribution of the training target
values, was that not all classes were well represented for the target
values. The second observation was that the majority of wine quality
scores were between 4-6. From this information, we decided that a
multi-class classification approach would yield poor prediction results
and provide little in value in identifying important features for
determining wine quality. Whereas with a binary classification, we could
better predict which types of chemical and physical characteristics can
be attributed to good wines.
scores were between 4-6. From this information, we decided that
multi-class classification would yield poor prediction results and
provide little value in identifying important features for determining
wine quality, whereas a binary classification allows us to make better
predictions and determine which types of chemical and physical
characteristics can be attributed to good wines.
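
A minimal pandas sketch of this binarization (the file name and the
`quality` column name are assumptions based on the report's description;
the repo's actual script may differ):

```python
import pandas as pd

# Hypothetical data path; the raw UCI data stores quality as an integer 0-10.
wine = pd.read_csv("winequality.csv")

# Per the report: quality >= 6 is labelled "good", 5 or lower is "bad".
wine["label"] = (wine["quality"] >= 6).map({True: "good", False: "bad"})
```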


@@ -1009,12 +1031,12 @@ train and test sets. For this task, an 80/20 train/test split was used.

## Preprocessing of Features

As mentioned, there were eleven numeric features and one categorical
feature. For this analysis, the eleven numeric features were scaled
using a standard scalar, which involves removing the mean and scaling to
unit variance (\[@sk-learn\]). The categorical feature, wine type, only
contained two values (red and white) so it was treated as a binary
feature.
In this analysis there are eleven numeric features and one categorical
feature. The numeric features were scaled using a standard scaler
transformer, which involves removing the mean and scaling to unit
variance (Pedregosa et al. 2011). The categorical feature, wine type,
only contained two values (red and white) and was treated as a binary
feature.
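
A sketch of this preprocessing and the 80/20 split with scikit-learn
(feature names and the random seed are assumptions; `wine` is the data
frame from the earlier sketch):

```python
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# 80/20 train/test split, as described in the report.
train_df, test_df = train_test_split(wine, test_size=0.2, random_state=123)

# Feature names assumed from the UCI wine-quality data.
numeric_features = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide",
    "density", "pH", "sulphates", "alcohol",
]
binary_features = ["type"]  # red or white

preprocessor = ColumnTransformer([
    # StandardScaler removes the mean and scales to unit variance.
    ("scale", StandardScaler(), numeric_features),
    # With only two categories, drop="if_binary" yields a single 0/1 column.
    ("binary", OneHotEncoder(drop="if_binary", dtype=int), binary_features),
])
```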

## Modelling and Cross Validation Scores

Expand All @@ -1025,12 +1047,12 @@ consequence to a false negative or false positive classifications,
therefore recall and precision are of low importance to us. Accuracy
provides a simple, clear way to compare model performance.

\[ \\text{Accuracy} = \\frac{\\text{Number of correct predictions}}{\\text{Number of examples}} \]
\[ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Number of examples}} \]
The following models were evaluated to predict the wine quality label (see the cross-validation sketch after this list):

- Dummy classifier
- Decision tree
- RBF SVM: SVC
- Support Vector Classification with Radial Basis Function
- Logistic Regression
- Random Forest
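
For concreteness, here is a sketch of how these five models could be
compared with five-fold cross-validation (model hyperparameters are
assumptions; `preprocessor` and `train_df` come from the earlier
sketches):

```python
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = {
    "Dummy classifier": DummyClassifier(strategy="most_frequent"),
    "Decision tree": DecisionTreeClassifier(random_state=123),
    "RBF SVM": SVC(kernel="rbf"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=123),
}

X_train = train_df.drop(columns=["label", "quality"])
y_train = train_df["label"]

for name, model in models.items():
    pipe = make_pipeline(preprocessor, model)  # preprocessor from the sketch above
    cv = cross_validate(pipe, X_train, y_train, cv=5, scoring="accuracy")
    print(f"{name}: {cv['test_score'].mean():.3f}")
```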

Five hunks update the Table 2 HTML (cross-validation results), changing
what appear to be the fit-time and score-time cells for each model row:

| Model | fit time (s), old → new | score time (s), old → new |
|:--|--:|--:|
| DummyClassifier | 0.0034 → 0.0039 | 0.0014 → 0.0018 |
| Decision Tree | 0.0606 → 0.0463 | 0.0088 → 0.0087 |
| RBF SVM | 0.7542 → 0.5288 | 0.0760 → 0.0703 |
| Logistic Regression | 0.0396 → 0.0372 | 0.0084 → 0.0068 |
| Random Forest | 0.8782 → 0.7427 | 0.0382 → 0.0340 |

@@ -1328,19 +1350,18 @@ Random Forest

</table>

The results in Table 2 show that the Random Forest classified the wine
quality with the highest accuracy on the training data with a validation
score of 0.82. This was not surprising to our team as Random Forest is
one of the most widely used and powerful model for classification
problems.
The results in Table 2 show that the Random Forest predicts wine quality
with the highest accuracy on the training data, with a validation score
of 0.82. This was not surprising to our team, as Random Forest is one of
the most widely used and powerful models for classification problems.

The SVC with RBF SVM model performed the next best with a validation
score of 0.769.
The SVC model with an RBF kernel performed the next best, with a
validation score of 0.769.

## Hyperparameter Optimization

Given the cross validation results above, hyperparameter optimization
was carried out on our Random Forest model on the number of trees and
was carried out on the Random Forest model on the number of trees and
maximum tree depth parameters.
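
A sketch of how that search might look with scikit-learn (grid values
are illustrative assumptions; the actual search space is not shown in
this diff):

```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    "randomforestclassifier__n_estimators": [100, 300, 500],  # number of trees
    "randomforestclassifier__max_depth": [5, 10, 20, None],   # maximum tree depth
}

rf_pipe = make_pipeline(preprocessor, RandomForestClassifier(random_state=123))
search = GridSearchCV(rf_pipe, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```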

@@ -1575,69 +1596,43 @@ Finally, with a Random Forest model containing optimized
hyperparameters, we were able to test the accuracy of our model on the
test data set.

(removed) Table 4. Random Forest scores on test data set: test\_score = 0.8476923

With optimized hyperparameters, the Random Forest model achieved a test
accuracy of 0.848 on the test data set. This is slightly higher than the
validation scores on the training data, which suggests we may have
gotten a bit lucky with the test data set. Overall, though, the model
does a decent job of predicting the wine label of “good” or “bad” given
the physicochemical properties as features.


With an optimized, Random Forest model, we were able to achieve an
accuracy of 0.848 on the test data set. This is slightly higher than the
validation scores using the training data, which tells us that we may
have got a bit lucky on the test data set. But overall, the model is
doing a decent job at predicting the wine label of “good” or “bad” given
the physicochemical properties as features.

If we recall the main research question:
Recalling the main research question:

> Can we predict if a wine is “good” (6 or higher out of 10) or “bad” (5
> or lower out of 10) based on its physicochemical properties alone?
The results show that with about 85% accuracy, it is possible to predict
whether a wine may be considered “good” (6/10 or higher) or “bad” (5/10
or lower).
The results show a test accuracy of 85%, meaning the fitted model
correctly predicted 85% of the test examples. So it is possible to
predict whether a wine may be considered “good” (6/10 or higher) or
“bad” (5/10 or lower) given the features we have.

## Further Improvements and Potential Problems

Some further work that may result in higher prediction accuracy and
higher model interpretability could include feature selection and
feature engineering. Based on the EDA results, some of the features
appear to be correlated (e.g. free sulphur dioxide, total sulphur
dioxide, sulphates), and the redundant ones could be removed through
feature selection. Feature engineering would require some domain
knowledge, as the features are very domain-specific, but adding more
relevant features could be helpful.

Some further work that may result in higher prediction accuracy could
include feature selection optimization. For example, some of the
features seem like they could be correlated (e.g. free sulphur dioxide,
total sulphur dioxide, sulphates).
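
Those suspected correlations are easy to check (a sketch; column names
assumed as in the earlier sketches):

```python
# Pairwise Pearson correlations among the suspect features.
suspect = ["free sulfur dioxide", "total sulfur dioxide", "sulphates"]
print(train_df[suspect].corr())
```
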
Beyond the random forest and the other classifiers we tested, there are
other powerful classification models that were not evaluated in this
project but might improve the prediction accuracy (e.g. XGBoost,
LightGBM, CatBoost).
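
For example, a gradient-boosted model could be dropped into the same
pipeline (a sketch, assuming the `xgboost` package is available and
reusing names from the earlier sketches):

```python
from xgboost import XGBClassifier

# XGBoost expects numeric labels, so encode "good" as 1 and "bad" as 0.
y_train_num = (y_train == "good").astype(int)

xgb_pipe = make_pipeline(preprocessor, XGBClassifier(n_estimators=300,
                                                     random_state=123))
cv = cross_validate(xgb_pipe, X_train, y_train_num, cv=5, scoring="accuracy")
print(cv["test_score"].mean())
```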

The original data-set has an quantitative output metric: a rating
between 0-10. This problem could be a candidate for a regression model.
It would be interesting to compare the effectiveness and usefulness of
this consideration and could be explored in a future iteration.
The original data set has a quantitative output metric: a rating between
0-10. This problem could therefore be a candidate for a regression
model. It would be interesting to compare the effectiveness and
usefulness of this approach, which could be explored in a future
iteration.
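
A sketch of that regression variant, reusing the same preprocessing and
regressing directly on the 0-10 quality score:

```python
from sklearn.ensemble import RandomForestRegressor

reg_pipe = make_pipeline(preprocessor, RandomForestRegressor(random_state=123))
cv = cross_validate(reg_pipe, X_train, train_df["quality"], cv=5,
                    scoring="neg_mean_absolute_error")
print(-cv["test_score"].mean())  # mean absolute error, in quality points
```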

Another point of interest in this problem is the subjectivity of wine
quality. The current data set uses a median rating from multiple
Expand All @@ -1658,7 +1653,7 @@ Language*. <https://CRAN.R-project.org/package=docopt>.

<div id="ref-Dua:2019">

Dua, Dheeru, and Casey Graff. 2017. “UCI Machine Learning Repository.”
Dua, Dheeru, and Casey Graff. 2019. “UCI Machine Learning Repository.”
University of California, Irvine, School of Information; Computer
Sciences. <http://archive.ics.uci.edu/ml>.

2 changes: 1 addition & 1 deletion doc/wine_refs.bib
@@ -49,7 +49,7 @@ @Manual{kableExtra
@misc{Dua:2019 ,
author = "Dua, Dheeru and Graff, Casey",
year = "2017",
year = "2019",
title = "{UCI} Machine Learning Repository",
url = "http://archive.ics.uci.edu/ml",
institution = "University of California, Irvine, School of Information and Computer Sciences"
