Merge pull request #59 from athy9193/main
Update report to address Issue #42 and #46 & Dependency Diagram
athy9193 authored Dec 11, 2020
2 parents 4eb7789 + 0dbcf5a commit 78a1ccc
Showing 9 changed files with 172 additions and 853 deletions.
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -3,3 +3,4 @@
**.Rproj
**.Rproj.user/
.Rproj.user
.Rhistory
43 changes: 43 additions & 0 deletions Makefile.dot
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
digraph G {
n2[label="all", color="red"];
n5[label="data/processed/processed.csv", color="red"];
n13[label="data/processed/processed_test.csv", color="red"];
n12[label="data/processed/processed_train.csv", color="red"];
n7[label="data/raw/winequality-red.csv", color="green"];
n9[label="data/raw/winequality-white.csv", color="green"];
n4[label="eda/wine_eda.py", color="green"];
n17[label="reports/reports.Rmd", color="green"];
n16[label="reports/reports.md", color="green"];
n18[label="reports/wine_refs.bib", color="green"];
n14[label="results/best_Model.pkl", color="red"];
n10[label="results/final_model_quality.png", color="red"];
n3[label="results/wine_quality_rank_per_feature.png", color="red"];
n8[label="src/download_data.py", color="green"];
n15[label="src/fit_wine_quality_predict_model.py", color="green"];
n6[label="src/pre_processing_wine.py", color="green"];
n11[label="src/wine_quality_test_results.py", color="green"];
n16 -> n2 ;
n10 -> n2 ;
n3 -> n2 ;
n7 -> n5 ;
n9 -> n5 ;
n6 -> n5 ;
n7 -> n13 ;
n9 -> n13 ;
n6 -> n13 ;
n7 -> n12 ;
n9 -> n12 ;
n6 -> n12 ;
n8 -> n7 ;
n8 -> n9 ;
n17 -> n16 ;
n18 -> n16 ;
n12 -> n14 ;
n15 -> n14 ;
n13 -> n10 ;
n12 -> n10 ;
n14 -> n10 ;
n11 -> n10 ;
n5 -> n3 ;
n4 -> n3 ;
}
Binary file added Makefile.png
8 changes: 8 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -60,6 +60,14 @@ conda activate wine_env
- tidyverse==1.3.0


## Dependency Diagram

The diagram below shows how the project repository is structured to produce the final results.

![](Makefile.png)



## License

The Wine Quality Predictor materials here are licensed under the
Empty file removed reports/.Rhistory
Empty file.
20 changes: 15 additions & 5 deletions reports/reports.Rmd
Original file line number Diff line number Diff line change
@@ -38,15 +38,25 @@ The data set used in this project is the results of a chemical analysis of the P

There are two datasets, for red and white wine samples. For each wine sample observation, the inputs contain measurements from various objective physicochemical tests, and the output is the median wine quality rating given by experts on a scale from 0 (very bad) to 10 (very excellent). The author notes that data on grape types, wine brand, and wine selling price, among others, are not available due to privacy and logistics issues. There are 1599 observations for red wine and 4898 observations for white wine.

```{r, fig.cap = "Figure 1: Distribution of type of wine", out.width= "50%", fig.align='center'}
knitr::include_graphics("../eda/wine_EDA_files/distribution_of_type_of_wine.png")
```

### Analysis

At the preprocessing stage, we decided to combine the red and white data sets and to group the data into broader classes, namely "poor", "normal" and "excellent" for score ranges 1-4, 5-6 and 7-9, so as to have a bigger sample size per class. We acknowledge that the data is imbalanced; hence, instead of judging model performance on accuracy alone, we also include the f1-score and use it as our main assessment metric. The f1-score combines precision and recall, which account for the false-negative and false-positive rates, making it appropriate for an imbalanced data set.
At the preprocessing stage, we decided to combine the red and white data sets and to group the data into broader classes, namely "poor", "normal" and "excellent" for score ranges 1-4, 5-6 and 7-9, so as to have a bigger sample size per class (as per Figure 2). We acknowledge that the data is imbalanced; hence, instead of judging model performance on accuracy alone, we also include the f1-score and use it as our main assessment metric. The f1-score combines precision and recall, which account for the false-negative and false-positive rates, making it appropriate for an imbalanced data set. {Bruhat: to add more justification for f-1 micro score}

```{r, fig.cap = "Figure 2: Regrouping of wine quality classification", out.width= "50%", fig.align='center'}
knitr::include_graphics("wine_classification.png")
```
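The regrouping described above can be sketched in plain Python. This is a hypothetical helper for illustration only; the project's actual logic lives in `src/pre_processing_wine.py`, and the function name here is assumed.

```python
# Hypothetical sketch of the quality regrouping described above; not the
# project's actual pre_processing_wine.py code.

def regroup_quality(score: int) -> str:
    """Map a raw 0-10 expert rating to a coarse quality class."""
    if score <= 4:
        return "poor"        # ratings 1-4
    if score <= 6:
        return "normal"      # ratings 5-6
    return "excellent"       # ratings 7-9

print([regroup_quality(s) for s in (3, 5, 6, 7)])
# ['poor', 'normal', 'normal', 'excellent']
```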

In this project we are trying to predict the quality of a given wine sample using wine attributes obtained from various physicochemical tests. Based on our literature review, we found that researchers from Karadeniz Technical University had used the Random Forest algorithm to classify red versus white wine on the same dataset [@er2016classification]. They further used three different data-mining algorithms, namely k-nearest neighbours, random forests and support vector machines, to classify the quality of both red and white wine. This motivated us to use cross-validation to select the best model for our analysis.

We eventually picked the neural network Multi-layer Perceptron (MLP) model as the one yielding the best results after running various machine learning models on the training set, comparing their performance based on f1-score and checking consistency across cross-validation runs. We noticed that random forest recorded a high f1 validation score of 0.84; however, it also had a large gap between train and validation scores, with a perfect train score of 1, which made us think the model had overfitted. Logistic regression also showed a promising f1 validation score, yet these high results were not consistent across cross-validation splits. Hence, with most models struggling to reach the 0.8 f1-score mark without significantly overfitting on the train set, while MLP showed consistent results across all cross-validation splits, our final choice landed on the MLP model because we think it will generalize better.
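The reliance on f1 rather than accuracy alone can be illustrated with a small pure-Python toy (made-up labels, not the project's evaluation code): a predictor that always outputs the majority class still scores a high accuracy but a zero f1 on the minority classes.

```python
# Toy illustration (made-up labels; not the project's actual evaluation
# code) of why per-class f1 is more informative than accuracy on
# imbalanced data.

def f1_per_class(y_true, y_pred, label):
    """F1 score for a single class in a multi-class problem."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = ["normal"] * 8 + ["poor", "excellent"]
y_pred = ["normal"] * 10          # always predict the majority class
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                                    # 0.8
print([round(f1_per_class(y_true, y_pred, c), 2)
       for c in ("poor", "normal", "excellent")])  # [0.0, 0.89, 0.0]
```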

```{r, fig.cap = "Table 1: Score results among the different machine learning models we have explored"}
```{r, fig.cap = "Table 1: Score results among the different machine learning models we have explored", fig.align='center'}
knitr::include_graphics("models_c_revised.png")
```

@@ -58,22 +68,22 @@ The Python and R programming languages [@R; @Python] and the following Python an

Looking at the distribution plots of the respective wine quality groups against each explanatory feature, we can see that higher quality wine seems to be associated with a higher `alcohol` level and a lower `density`. A lower `volatile acidity` also seems to be indicative of better wine. Better-ranked wine also seems to have a higher `free sulfur dioxide` level than poor wine, though the relationship is not that clear from the plot. The rest of the features do not seem very distinguishable among the different quality groups.

```{r distribution plot, fig.cap = "Figure 1: Distribution plots of wine quality against various attributes from physicochemical tests"}
```{r distribution plot, fig.cap = "Figure 3: Distribution plots of wine quality against various attributes from physicochemical tests", fig.align='center'}
knitr::include_graphics("../eda/wine_EDA_files/wine_quality_rank_per_feature.png")
```

Since this is a multi-class classification problem, our goal was to find a model that was consistent and able to recognize patterns in our data. We chose the neural network Multi-layer Perceptron (MLP) model as it was consistent and showed promising results. If we look at the accuracy and f1 scores across cross-validation splits, we can see that it is quite consistent, which was not the case with many other models.

```{r, echo=FALSE,out.width="50%", out.height="20%",fig.cap="Figure 2: Accuracy scores and f1 scores across cross-validation splits for the neural network Multi-layer Perceptron (MLP) model",fig.show='hold',fig.align='center'}
```{r, echo=FALSE,out.width="50%", out.height="20%",fig.cap="Figure 4: Accuracy scores and f1 scores across cross-validation splits for the neural network Multi-layer Perceptron (MLP) model",fig.show='hold',fig.align='center'}
knitr::include_graphics(c("f1_revised.png","accuracy_plot_revised.png"))
```


Our model also performed quite well on the test data. As we discussed earlier, some misclassification at the lower end of the wine quality spectrum is acceptable, and the confusion matrix below shows a ~13% error rate at that end of the spectrum along with a very acceptable number of false classifications at the high end.
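A per-class error rate like the ~13% quoted above can be read off the rows of a confusion matrix. A minimal pure-Python sketch, using illustrative counts rather than the project's actual test results:

```python
# Minimal sketch of reading per-class error rates off a confusion matrix
# (illustrative counts only, not the project's actual test results).

def row_error_rates(matrix):
    """Fraction of each true class (row) that was misclassified."""
    return [(sum(row) - row[i]) / sum(row) for i, row in enumerate(matrix)]

# Rows = true class, columns = predicted class: poor, normal, excellent.
cm = [
    [52, 8, 0],     # true "poor": 8 of 60 misclassified, ~13%
    [10, 300, 15],  # true "normal"
    [0, 20, 95],    # true "excellent"
]
print([round(r, 2) for r in row_error_rates(cm)])  # [0.13, 0.08, 0.17]
```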

```{r, fig.cap = "Figure 3: Confusion Matrix"}
```{r, fig.cap = "Figure 5: Confusion Matrix", fig.align='center'}
knitr::include_graphics("../results/final_model_quality.png")
```

