Merge pull request #48 from jianridine/main
add/format citations
jianructose authored Nov 29, 2020
2 parents 0fae19d + e6254a4 commit e94213d
Showing 4 changed files with 182 additions and 47 deletions.
46 changes: 23 additions & 23 deletions doc/wine_quality_prediction_report.Rmd
@@ -6,7 +6,6 @@ output:
html_document:
toc: true
bibliography: wine_refs.bib

---

```{r setup, include=FALSE}
@@ -18,12 +17,12 @@ library(here)
library(kableExtra)
```

## Acknowledgements

The data set was produced by P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. It was sourced from the UCI Machine Learning Repository [@Dua:2019]: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [<http://archive.ics.uci.edu/ml>]. Irvine, CA: University of California, School of Information and Computer Science.

All machine learning processing and analysis was done using scikit-learn [@sk-learn], Python [@Python], and R [@R]. Tables and figures in this report were created with the help of the knitr [@knitr], kableExtra [@kableExtra], docopt [@docopt], and pandas [@reback2020pandas] packages. File paths were managed using the here [@here] package.

## Data Summary

@@ -47,30 +46,29 @@ kable(head(wine_data, 10),
```

Next we analyze the target classes for this data set in the figure below.

```{r, fig.align='center', out.width='15%', fig.cap='Figure 1. Target Class Distribution', echo=FALSE}
knitr::include_graphics(here('results', 'eda_target.png'))
```

For this classification problem, we determined that we could ignore class-imbalance considerations and proceeded to split the data into train and test sets. An 80/20 train/test split was used.
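A minimal sketch of this step with scikit-learn is shown below; the file path and the `quality_label` target column name are illustrative assumptions, not taken from the project's scripts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical processed-data path and target column name, for illustration only.
wine = pd.read_csv("data/processed/wine_data.csv")

# 80/20 train/test split, matching the split described above.
train_df, test_df = train_test_split(wine, test_size=0.2, random_state=123)

X_train, y_train = train_df.drop(columns=["quality_label"]), train_df["quality_label"]
X_test, y_test = test_df.drop(columns=["quality_label"]), test_df["quality_label"]
```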

## Preprocessing of Features

As mentioned, there were eleven numeric features and one categorical feature. For this analysis, the eleven numeric features were scaled using a standard scaler, which involves removing the mean and scaling to unit variance [@sk-learn]. The categorical feature, wine type, only contained two values (red and white), so it was treated as a binary feature.
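A scikit-learn sketch of this preprocessing follows; the column names and the `ColumnTransformer` layout are assumptions for illustration rather than the project's actual code.

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed names for the eleven numeric features and the wine-type column.
numeric_features = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide",
    "density", "pH", "sulphates", "alcohol",
]
binary_features = ["type"]  # "red" or "white"

preprocessor = ColumnTransformer(
    transformers=[
        # Standard scaler: remove the mean and scale to unit variance.
        ("num", StandardScaler(), numeric_features),
        # Only two categories, so a single 0/1 column is enough.
        ("bin", OneHotEncoder(drop="if_binary", dtype=int), binary_features),
    ]
)
```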

## Modelling and Cross Validation Scores

All models were evaluated through five-fold cross validation on the training data set. Accuracy was used as the primary metric to evaluate model performance. For this classification problem, there is low consequence to a false negative or a false positive classification, therefore recall and precision are of low importance to us. Accuracy provides a simple, clear way to compare model performance.

$$ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Number of examples}} $$

The following models were evaluated to predict the wine quality label (a brief cross-validation sketch follows the list):

- Dummy classifier
- Decision tree
- RBF SVM: SVC
- Logistic Regression
- Random Forest
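As a rough illustration of how this comparison could be run with scikit-learn, the sketch below reuses the `preprocessor`, `X_train`, and `y_train` objects from the earlier sketches; the model settings are illustrative, not the project's actual configuration.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

models = {
    "Dummy classifier": DummyClassifier(strategy="most_frequent"),
    "Decision tree": DecisionTreeClassifier(random_state=123),
    "RBF SVM": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=123),
}

# Five-fold cross-validation on the training split, scored by accuracy.
mean_valid_accuracy = {
    name: cross_validate(
        make_pipeline(preprocessor, model), X_train, y_train,
        cv=5, scoring="accuracy",
    )["test_score"].mean()
    for name, model in models.items()
}

print(pd.Series(mean_valid_accuracy, name="mean_valid_accuracy"))
```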

```{r model results, echo=FALSE, warnings=FALSE, messages=FALSE}
model_results <- read_csv(here("results", "model_comparison.csv"), col_types = cols())
@@ -80,13 +78,13 @@ kable(model_results,
kable_styling(full_width = F)
```

The results in Table 2 show that the Random Forest classified the wine quality with the highest accuracy on the training data, with a validation score of `r round(model_results$mean_valid_accuracy[5], 3)`. This was not surprising to our team, as Random Forest is one of the most widely used and powerful models for classification problems.

The RBF SVM (SVC) model performed the next best, with a validation score of `r round(model_results$mean_valid_accuracy[3], 3)`.

## Hyperparameter Optimization

Given the cross-validation results above, hyperparameter optimization was carried out on our Random Forest model over the number of trees and the maximum tree depth.
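A sketch of such a search with scikit-learn's `GridSearchCV` is shown below; the grid values, and the `preprocessor`, `X_train`, and `y_train` objects carried over from the earlier sketches, are illustrative assumptions rather than the values used in the project.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Illustrative grid over the number of trees and the maximum tree depth.
param_grid = {
    "randomforestclassifier__n_estimators": [100, 300, 500],
    "randomforestclassifier__max_depth": [10, 20, None],
}

search = GridSearchCV(
    make_pipeline(preprocessor, RandomForestClassifier(random_state=123)),
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.best_score_, 3))  # comparable to the "valid score" reported above
```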

```{r hyperparameter results, messages=FALSE, warnings=FALSE, echo=FALSE}
hyperparameter_results <- read_csv(here("results", "hyperparameter_result.csv"), col_types = cols())
@@ -95,11 +93,12 @@ kable(hyperparameter_results, caption = "Table 3. Hyperparameter optimization re
kable_styling(full_width = F)
```

Optimizing these two key hyperparameters resulted in slightly improved validation results (note that in Table 3, "test score" is comparable to "valid score" from Table 2).

## Test Data Scores and Conclusions

Finally, with a Random Forest model containing optimized hyperparameters, we were able to test the accuracy of our model on the test data set.
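For illustration, a minimal scikit-learn sketch of this final step is shown below, assuming the fitted `search` object and the `X_test`/`y_test` split from the earlier sketches; in the project itself, this score is read from `results/test_score.csv` in the chunk that follows.

```python
# Score the tuned model once on the held-out 20% test split.
best_model = search.best_estimator_
test_score = best_model.score(X_test, y_test)  # accuracy
print(round(test_score, 3))
```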

```{r test scores, warnings=FALSE, messages=FALSE, echo=FALSE}
test_scores <- read_csv(here("results", "test_score.csv"), col_types = cols())
@@ -108,18 +107,19 @@ kable(test_scores, caption = "Table 4. Random Forest scores on test data set") %
kable_styling(full_width = F)
```
With an optimized Random Forest model, we were able to achieve an accuracy of `r round(test_scores$test_score, 3)` on the test data set. This is slightly higher than the validation scores using the training data, which suggests we may have gotten a bit lucky on the test data set. Overall, though, the model does a decent job of predicting the wine label of "good" or "bad" given the physicochemical properties as features.

If we recall the main research question:

> Can we predict if a wine is "good" (6 or higher out of 10) or "bad" (5 or lower out of 10) based on its physicochemical properties alone?

The results show that, with about 85% accuracy, it is possible to predict whether a wine may be considered "good" (6/10 or higher) or "bad" (5/10 or lower).

Some further work that may result in higher prediction accuracy could include feature selection. For example, some of the features appear as though they could be correlated (e.g., free sulphur dioxide, total sulphur dioxide, sulphates).

The original data set has a quantitative output metric: a rating between 0 and 10. This problem could therefore be a candidate for a regression model. It would be interesting to compare the effectiveness and usefulness of that approach, which could be explored in a future iteration.

Another point of interest in this problem is the subjectivity of wine quality. The current data set uses a median rating from multiple tastings by multiple wine experts as an estimate of quality. While we feel that this estimate is a good enough proxy, it is something to be aware of when using this model.

## References
78 changes: 69 additions & 9 deletions doc/wine_quality_prediction_report.html

Large diffs are not rendered by default.

72 changes: 59 additions & 13 deletions doc/wine_quality_prediction_report.md
@@ -18,15 +18,17 @@ Harris
The data set was produced by P. Cortez, A. Cerdeira, F. Almeida, T.
Matos and J. Reis. Modeling wine preferences by data mining from
physicochemical properties. In Decision Support Systems, Elsevier,
47(4):547-553, 2009. It was sourced from the UCI Machine Learning
Repository (Dua and Graff 2017): Dua, D. and Graff, C. (2019). UCI
Machine Learning Repository \[<http://archive.ics.uci.edu/ml>\]. Irvine,
CA: University of California, School of Information and Computer
Science.

All machine learning processing and analysis was done using scikit-learn
(Pedregosa et al. 2011), Python (Van Rossum and Drake 2009), and R (R
Core Team 2020). Tables and figures in this report were created with the
help of the knitr (Xie 2014), kableExtra (Zhu 2020), docopt (de Jonge
2018), and pandas (team 2020) packages. File paths were managed using
the here (Müller 2020) package.

## Data Summary

@@ -991,7 +993,7 @@ below.

<div class="figure" style="text-align: center">

<img src="C:/Users/vignesh/career/dsci-522-group14/results/eda_target.png" alt="Figure 1. Target Class Distribution" width="15%" />
<img src="C:/Users/dengj/ubc_mds_21/ds_block3/tiff_522_ds_workflows/dsci-522-group14/results/eda_target.png" alt="Figure 1. Target Class Distribution" width="15%" />

<p class="caption">

@@ -1010,9 +1012,9 @@ train and test sets. For this task, an 80/20 train/test split was used.
As mentioned, there were eleven numeric features and one categorical
feature. For this analysis, the eleven numeric features were scaled
using a standard scaler, which involves removing the mean and scaling to
unit variance (Pedregosa et al. 2011). The categorical feature, wine
type, only contained two values (red and white), so it was treated as a
binary feature.

## Modelling and Cross Validation Scores

@@ -1023,7 +1025,7 @@ consequence to false negative or false positive classifications,
therefore recall and precision are of low importance to us. Accuracy
provides a simple, clear way to compare model performance.

\[ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Number of examples}} \]
The following models were evaluated to predict the wine quality label:

- Dummy classifier
@@ -1645,7 +1647,22 @@ aware of when using this model.

## References

<div id="refs" class="references">
<div id="refs" class="references hanging-indent">

<div id="ref-docopt">

de Jonge, Edwin. 2018. *Docopt: Command-Line Interface Specification
Language*. <https://CRAN.R-project.org/package=docopt>.

</div>

<div id="ref-Dua:2019">

Dua, Dheeru, and Casey Graff. 2017. “UCI Machine Learning Repository.”
University of California, Irvine, School of Information; Computer
Sciences. <http://archive.ics.uci.edu/ml>.

</div>

<div id="ref-here">

Expand All @@ -1662,6 +1679,28 @@ Python.” *Journal of Machine Learning Research* 12: 2825–30.

</div>

<div id="ref-R">

R Core Team. 2020. *R: A Language and Environment for Statistical
Computing*. Vienna, Austria: R Foundation for Statistical Computing.
<https://www.R-project.org/>.

</div>

<div id="ref-reback2020pandas">

team, The pandas development. 2020. *Pandas-Dev/Pandas: Pandas* (version
1.1.1). Zenodo. <https://doi.org/10.5281/zenodo.3993412>.

</div>

<div id="ref-Python">

Van Rossum, Guido, and Fred L. Drake. 2009. *Python 3 Reference Manual*.
Scotts Valley, CA: CreateSpace.

</div>

<div id="ref-knitr">

Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research
Expand All @@ -1671,4 +1710,11 @@ Hall/CRC. <http://www.crcpress.com/product/isbn/9781466561595>.

</div>

<div id="ref-kableExtra">

Zhu, Hao. 2020. *KableExtra: Construct Complex Table with ’Kable’ and
Pipe Syntax*. <https://CRAN.R-project.org/package=kableExtra>.

</div>

</div>
33 changes: 31 additions & 2 deletions doc/wine_refs.bib
@@ -16,7 +16,7 @@ @Article{sk-learn
volume={12},
pages={2825--2830},
year={2011}
}
@InCollection{knitr,
booktitle = {Implementing Reproducible Computational Research},
@@ -54,4 +54,33 @@ @misc{Dua:2019
url = "http://archive.ics.uci.edu/ml",
institution = "University of California, Irvine, School of Information and Computer Sciences"
}
@book{Python,
author = {Van Rossum, Guido and Drake, Fred L.},
title = {Python 3 Reference Manual},
year = {2009},
isbn = {1441412697},
publisher = {CreateSpace},
address = {Scotts Valley, CA}
}


@software{reback2020pandas,
author = {The pandas development team},
title = {pandas-dev/pandas: Pandas},
month = {Aug},
year = {2020},
publisher = {Zenodo},
version = {1.1.1},
doi = {10.5281/zenodo.3993412},
url = {https://doi.org/10.5281/zenodo.3993412}
}

@Manual{R,
title = {R: A Language and Environment for Statistical Computing},
author = {{R Core Team}},
organization = {R Foundation for Statistical Computing},
address = {Vienna, Austria},
year = {2020},
url = {https://www.R-project.org/},
}
