From c263db6f41c706f4c2f32aeeb204da86439750a8 Mon Sep 17 00:00:00 2001
From: Bruhat Musunuru <61818239+BruhatM@users.noreply.github.com>
Date: Sat, 28 Nov 2020 09:21:12 -0800
Subject: [PATCH] Update README.md

---
 README.md | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index ade6189..da21fc2 100644
--- a/README.md
+++ b/README.md
@@ -10,19 +10,15 @@ Sang Yoon Lee |[rissangs](https://github.com/rissangs)
 
 Second milestone of a data analysis project for DSCI 522: Data Science workflows, part of Master of Data Science program at the University of British Columbia.
 
-## Introduction
+## About
 
-In this project we are trying to predict the quality of a given wine sample using its features, composition and characteristics. Traditional methods of categorizing wine are prone to human error and can vary drastically from expert to expert. We propose a data mining approach to predict human wine taste preferences based on complex data analytical algorithms and classification models. This unbiased and human error free metric can provide a standardized metric that can be used for personalized wine recommendation, Quality assessment and comparison unit. It can also be used by wineries as an important metric which could aid in important business decisions and strategies.
+Here we attempt to build a model to predict the quality of a given wine sample from its composition and physicochemical characteristics. Traditional methods of categorizing wine are prone to human error and can vary drastically from expert to expert. We propose a data mining approach that predicts human wine taste preferences using data analytical algorithms and classification models. This unbiased, human-error-free approach provides a standardized metric that can be used for personalized wine recommendation, quality assessment and comparison. It can also serve wineries as an important input to business decisions and strategy.
 
 The data set used in this project is created by Paulo Cortez from the University of Minho in Guimarães, Portugal, and A. Cerdeira, F. Almeida, T. Matos and J. Reis from the Viticulture Commission of the Vinho Verde Region in Porto, Portugal. The two datasets are included are related to red and white vinho verde wine samples, from the north of Portugal. It was sourced from the UCI Machine Learning Repository and can be found [here](https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/). Each row in the data set represents summary statistics from a sample of wine based on physicochemical tests with attributes fixed acidity, volatile acidity, citric acid, residual sugar, pH, etc.
 
+We built a classification model using a Multi-layer Perceptron classifier to predict the class of a given wine. In short, our model classifies a wine into one of three classes: bad, normal or good. The model achieved good prediction accuracy, and our analysis suggests it generalizes well. For the complete report, please see the report section.
-{TO CHANGE TO MODEL FINDINGS - We plan to build a predictive classification model to provide the standardized metric discussed above. In order for the model to abide by the golden rule, we plan to split the data into train and test sets (80% - 20% respectively) and perform exploratory data analysis in order to assess any class imbalance, outliers that needs to be considered when scouting for best model to fit our needs.
-After the EDA, we see that wine quality ranking seems to be more likely to associate with alcohol, density, free sulfur dioxide, volatile acidity, wine type than the rest of the input features. Hence a multiclass linear classification could be appropriate to estimate the impact each features have on wine quality ranking.
-
- The outcome or the Standardized metric we are trying to establish is to classify all wines into three classes (Poor, Normal, Excellent). One likely model suitable for this classification is linear regression and set a threshold for each class in the predicted probabilities. Since our data set is reasonably sized with 1598 observations, we can choose a higher cross-validation of ~50 folds. We will use this accuracy to tune our model for the best fit. After doing so, we re-fit the model on the entire training data set, and then evaluate it’s performance on the test data set. This gives a deeper understanding of our model. We will use this information to address classification errors and report them as a table in the final report.}
-For this Milestone we have performed an EDA on the data set which can be found here
 
 ## Report
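To make the modelling summary in the added `## About` text concrete, a minimal sketch of the described workflow (a Multi-layer Perceptron classifier, three quality classes, an 80%/20% train-test split) might look like the following. This is not the project's actual code: the red-wine file name within the cited UCI directory, the bin thresholds used to form the bad/normal/good classes, and the MLP hyperparameters are all assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed location of the red wine file inside the UCI directory cited in the README.
URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-red.csv")
wine = pd.read_csv(URL, sep=";")

# Collapse the 0-10 quality score into three labels (thresholds are assumptions).
wine["label"] = pd.cut(wine["quality"], bins=[0, 4, 6, 10],
                       labels=["bad", "normal", "good"])

X = wine.drop(columns=["quality", "label"])
y = wine["label"]

# 80% / 20% train-test split, stratified to preserve class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=522)

# Scale features before fitting the MLP, which is sensitive to feature magnitudes.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000, random_state=522),
)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```

The project's actual choice of class thresholds, hyperparameters and evaluation metrics is documented in the report referenced by the `## Report` section.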