Merge remote-tracking branch 'upstream/main' into main
athy9193 committed Nov 28, 2020
2 parents 0af83db + 9c028da commit 25461bb
Showing 1 changed file with 3 additions and 7 deletions.
10 changes: 3 additions & 7 deletions README.md
@@ -10,19 +10,15 @@ Sang Yoon Lee |[rissangs](https://github.com/rissangs)

Second milestone of a data analysis project for DSCI 522: Data Science Workflows, part of the Master of Data Science program at the University of British Columbia.

## Introduction
## About

In this project we are trying to predict the quality of a given wine sample from its features, composition and characteristics. Traditional methods of categorizing wine are prone to human error and can vary drastically from expert to expert. We propose a data mining approach that predicts human wine taste preferences using data analysis algorithms and classification models. The resulting metric, unbiased and free of human error, offers a standardized measure that can be used for personalized wine recommendation, quality assessment, and as a unit of comparison. Wineries could also use it to inform business decisions and strategy.
Here we attempt to build a model to predict the quality of a given wine sample from its features, composition and characteristics. Traditional methods of categorizing wine are prone to human error and can vary drastically from expert to expert. We propose a data mining approach that predicts human wine taste preferences using data analysis algorithms and classification models. The resulting metric, unbiased and free of human error, offers a standardized measure that can be used for personalized wine recommendation, quality assessment, and as a unit of comparison. Wineries could also use it to inform business decisions and strategy.

The data set used in this project was created by Paulo Cortez from the University of Minho in Guimarães, Portugal, and A. Cerdeira, F. Almeida, T. Matos and J. Reis from the Viticulture Commission of the Vinho Verde Region in Porto, Portugal. It comprises two datasets, one for red and one for white vinho verde wine samples from the north of Portugal. The data was sourced from the UCI Machine Learning Repository and can be found [here](https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/). Each row represents one wine sample, described by the results of physicochemical tests: fixed acidity, volatile acidity, citric acid, residual sugar, pH, etc.
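
As a rough illustration of how the red and white files might be read and combined, the sketch below parses semicolon-delimited text and tags each row with its wine type. It assumes the files use `;` as the delimiter with a quoted header row; the function name `parse_wine_csv`, the added `type` column, and the shortened inline sample are hypothetical and not taken from this repository.

```python
import csv
import io


def parse_wine_csv(text, wine_type):
    """Parse semicolon-delimited wine data and tag each row with its type."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    rows = []
    for raw in reader:
        # Every physicochemical measurement becomes a float.
        row = {name: float(value) for name, value in raw.items()}
        # The quality score is an integer rating.
        row["quality"] = int(row["quality"])
        # Hypothetical extra column so red and white rows can be merged.
        row["type"] = wine_type
        rows.append(row)
    return rows


# Shortened, illustrative sample in the assumed layout (not the full column set).
SAMPLE = (
    '"fixed acidity";"volatile acidity";"pH";"alcohol";"quality"\n'
    "7.4;0.7;3.51;9.4;5\n"
    "7.8;0.88;3.2;9.8;6\n"
)

red_rows = parse_wine_csv(SAMPLE, "red")
print(red_rows[0]["quality"], red_rows[0]["type"])
```

Rows parsed from the white-wine file with `wine_type="white"` could then be concatenated with `red_rows` to form a single table.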

We built a classification model using a multi-layer perceptron (MLP) classifier to predict the class of a given wine. In summary, our model classifies each wine into one of three classes: bad, normal, or good. The model achieved good prediction accuracy, and our analysis suggests that it generalizes well. For the complete report, please see the Report section.
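
To make the three-class target concrete, here is a minimal sketch of how a numeric quality score could be binned into bad, normal, and good. The cut-offs `low=4` and `high=7` and the function name `quality_to_class` are placeholders chosen for illustration; the thresholds actually used in the analysis are not stated here.

```python
def quality_to_class(score, low=4, high=7):
    """Map a numeric quality score to 'bad', 'normal', or 'good'.

    `low` and `high` are hypothetical cut-offs, not the project's values.
    """
    if score <= low:
        return "bad"
    if score >= high:
        return "good"
    return "normal"


labels = [quality_to_class(s) for s in (3, 5, 6, 8)]
print(labels)
```

Keeping the cut-offs as parameters makes it easy to check how sensitive the class balance is to where the boundaries are drawn.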

{TO CHANGE TO MODEL FINDINGS
We plan to build a predictive classification model to provide the standardized metric discussed above. To respect the golden rule (the test data must not influence model building), we plan to split the data into train and test sets (80% and 20% respectively) and perform exploratory data analysis on the training set to assess any class imbalance or outliers that need to be considered when choosing the model that best fits our needs. The EDA suggests that wine quality ranking is more strongly associated with alcohol, density, free sulfur dioxide, volatile acidity, and wine type than with the remaining input features. Hence a multiclass linear classifier could be appropriate for estimating the impact each feature has on wine quality ranking.

The outcome, or standardized metric, we are trying to establish is a classification of all wines into three classes (Poor, Normal, Excellent). One model likely to suit this task is linear regression with a threshold set for each class on the model's predictions. Since our data set is reasonably sized with 1598 observations, we can use a higher number of cross-validation folds (~50). We will use the cross-validation accuracy to tune our model for the best fit. After doing so, we will re-fit the model on the entire training data set and then evaluate its performance on the test data set. This gives a deeper understanding of our model. We will use this information to examine classification errors and report them as a table in the final report.}
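
The split-then-cross-validate procedure described above can be sketched with the standard library alone. This is only an outline under stated assumptions: the helper names `train_test_split` and `k_fold_indices`, the seed, and the integer stand-in data are hypothetical, and the project itself may well rely on a library implementation instead.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=123):
    """Shuffle a copy of `rows` and hold out `test_fraction` as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) index lists for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size


data = list(range(1598))  # stand-in for the 1598 observations mentioned above
train, test = train_test_split(data)  # 80% / 20% split
folds = list(k_fold_indices(len(train), k=50))  # ~50 folds on the training set only
print(len(train), len(test), len(folds))
```

Cross-validation runs only on the training portion, so the test set stays untouched until the final evaluation.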

For this milestone we performed an EDA on the data set, which can be found <a href="https://github.com/UBC-MDS/Wine_Quality_Predictor/blob/main/eda/wine_EDA.md">here</a>.

## Report
