Merge pull request #48 from jianridine/main
add/format citations
jianructose authored Nov 29, 2020
2 parents 0fae19d + e6254a4 commit e94213d
Showing 4 changed files with 182 additions and 47 deletions.
46 changes: 23 additions & 23 deletions doc/wine_quality_prediction_report.Rmd
@@ -6,7 +6,6 @@ output:
html_document:
toc: true
bibliography: wine_refs.bib

---

```{r setup, include=FALSE}
@@ -18,12 +17,12 @@ library(here)
library(kableExtra)
```

## Acknowledgements

The data set was produced by P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. It was sourced from the UCI Machine Learning Repository [@Dua:2019]: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [<http://archive.ics.uci.edu/ml>]. Irvine, CA: University of California, School of Information and Computer Science.

All machine learning processing and analysis was done using scikit-learn [@sk-learn], Python [@Python], and R [@R]. Tables and figures in this report were created with the help of the knitr [@knitr], kableExtra [@kableExtra], docopt [@docopt], and pandas [@reback2020pandas] packages. File paths were managed using the here [@here] package.

## Data Summary

@@ -47,30 +46,29 @@ kable(head(wine_data, 10),
```

Next we analyze the target classes for this data set in the figure below.

```{r, fig.align='center', out.width='15%', fig.cap='Figure 1. Target Class Distribution', echo=FALSE}
knitr::include_graphics(here('results', 'eda_target.png'))
```

For this classification problem, we determined that we could ignore class-imbalance considerations and proceeded to split the data into train and test sets. An 80/20 train/test split was used.
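A minimal sketch of this step with scikit-learn is shown below; the file path and the `quality_label` target column name are illustrative assumptions, not taken from the project's scripts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical processed-data path and target column name, for illustration only.
wine = pd.read_csv("data/processed/wine_data.csv")

# 80/20 train/test split, matching the split described above.
train_df, test_df = train_test_split(wine, test_size=0.2, random_state=123)

X_train, y_train = train_df.drop(columns=["quality_label"]), train_df["quality_label"]
X_test, y_test = test_df.drop(columns=["quality_label"]), test_df["quality_label"]
```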

## Preprocessing of Features

As mentioned, there were eleven numeric features and one categorical feature. For this analysis, the eleven numeric features were scaled using a standard scaler, which involves removing the mean and scaling to unit variance [@sk-learn]. The categorical feature, wine type, only contained two values (red and white), so it was treated as a binary feature.
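A scikit-learn sketch of this preprocessing follows; the column names and the `ColumnTransformer` layout are assumptions for illustration rather than the project's actual code.

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed names for the eleven numeric features and the wine-type column.
numeric_features = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide",
    "density", "pH", "sulphates", "alcohol",
]
binary_features = ["type"]  # "red" or "white"

preprocessor = ColumnTransformer(
    transformers=[
        # Standard scaler: remove the mean and scale to unit variance.
        ("num", StandardScaler(), numeric_features),
        # Only two categories, so a single 0/1 column is enough.
        ("bin", OneHotEncoder(drop="if_binary", dtype=int), binary_features),
    ]
)
```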

## Modelling and Cross Validation Scores

All models were evaluated through five-fold cross validation on the training data set. Accuracy was used as the primary metric to evaluate model performance. For this classification problem, there is low consequence to a false negative or a false positive classification, therefore recall and precision are of low importance to us. Accuracy provides a simple, clear way to compare model performance.

$$ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Number of examples}} $$

The following models were evaluated to predict the wine quality label (a brief cross-validation sketch follows the list):

- Dummy classifier
- Decision tree
- RBF SVM: SVC
- Logistic Regression
- Random Forest
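As a rough illustration of how this comparison could be run with scikit-learn, the sketch below reuses the `preprocessor`, `X_train`, and `y_train` objects from the earlier sketches; the model settings are illustrative, not the project's actual configuration.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

models = {
    "Dummy classifier": DummyClassifier(strategy="most_frequent"),
    "Decision tree": DecisionTreeClassifier(random_state=123),
    "RBF SVM": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=123),
}

# Five-fold cross-validation on the training split, scored by accuracy.
mean_valid_accuracy = {
    name: cross_validate(
        make_pipeline(preprocessor, model), X_train, y_train,
        cv=5, scoring="accuracy",
    )["test_score"].mean()
    for name, model in models.items()
}

print(pd.Series(mean_valid_accuracy, name="mean_valid_accuracy"))
```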

```{r model results, echo=FALSE, warnings=FALSE, messages=FALSE}
model_results <- read_csv(here("results", "model_comparison.csv"), col_types = cols())
@@ -80,13 +78,13 @@ kable(model_results,
kable_styling(full_width = F)
```

The results in Table 2 show that the Random Forest classified the wine quality with the highest accuracy on the training data, with a validation score of `r round(model_results$mean_valid_accuracy[5], 3)`. This was not surprising to our team, as Random Forest is one of the most widely used and powerful models for classification problems.

The RBF SVM (SVC) model performed the next best, with a validation score of `r round(model_results$mean_valid_accuracy[3], 3)`.

## Hyperparameter Optimization

Given the cross-validation results above, hyperparameter optimization was carried out on our Random Forest model over the number of trees and the maximum tree depth.
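A sketch of such a search with scikit-learn's `GridSearchCV` is shown below; the grid values, and the `preprocessor`, `X_train`, and `y_train` objects carried over from the earlier sketches, are illustrative assumptions rather than the values used in the project.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Illustrative grid over the number of trees and the maximum tree depth.
param_grid = {
    "randomforestclassifier__n_estimators": [100, 300, 500],
    "randomforestclassifier__max_depth": [10, 20, None],
}

search = GridSearchCV(
    make_pipeline(preprocessor, RandomForestClassifier(random_state=123)),
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.best_score_, 3))  # comparable to the "valid score" reported above
```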

```{r hyperparameter results, messages=FALSE, warnings=FALSE, echo=FALSE}
hyperparameter_results <- read_csv(here("results", "hyperparameter_result.csv"), col_types = cols())
@@ -95,11 +93,12 @@ kable(hyperparameter_results, caption = "Table 3. Hyperparameter optimization re
kable_styling(full_width = F)
```

Optimizing these two key hyperparameters resulted in slightly improved validation results (note that in Table 3, "test score" is comparable to "valid score" from Table 2).

## Test Data Scores and Conclusions

Finally, with a Random Forest model containing optimized hyperparameters, we were able to test the accuracy of our model on the test data set.
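For illustration, a minimal scikit-learn sketch of this final step is shown below, assuming the fitted `search` object and the `X_test`/`y_test` split from the earlier sketches; in the project itself, this score is read from `results/test_score.csv` in the chunk that follows.

```python
# Score the tuned model once on the held-out 20% test split.
best_model = search.best_estimator_
test_score = best_model.score(X_test, y_test)  # accuracy
print(round(test_score, 3))
```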

```{r test scores, warnings=FALSE, messages=FALSE, echo=FALSE}
test_scores <- read_csv(here("results", "test_score.csv"), col_types = cols())
@@ -108,18 +107,19 @@ kable(test_scores, caption = "Table 4. Random Forest scores on test data set") %
kable_styling(full_width = F)
```
With an optimized Random Forest model, we were able to achieve an accuracy of `r round(test_scores$test_score, 3)` on the test data set. This is slightly higher than the validation scores using the training data, which suggests we may have gotten a bit lucky on the test data set. Overall, though, the model does a decent job of predicting the wine label of "good" or "bad" given the physicochemical properties as features.

If we recall the main research question:

> Can we predict if a wine is "good" (6 or higher out of 10) or "bad" (5 or lower out of 10) based on its physicochemical properties alone?

The results show that, with about 85% accuracy, it is possible to predict whether a wine may be considered "good" (6/10 or higher) or "bad" (5/10 or lower).

Some further work that may result in higher prediction accuracy could include feature selection. For example, some of the features appear as though they could be correlated (e.g., free sulphur dioxide, total sulphur dioxide, sulphates).

The original data set has a quantitative output metric: a rating between 0 and 10. This problem could therefore be a candidate for a regression model. It would be interesting to compare the effectiveness and usefulness of that approach, which could be explored in a future iteration.

Another point of interest in this problem is the subjectivity of wine quality. The current data set uses a median rating from multiple tastings by multiple wine experts as an estimate of quality. While we feel that this estimate is a good enough proxy, it is something to be aware of when using this model.

## References
78 changes: 69 additions & 9 deletions doc/wine_quality_prediction_report.html

Large diffs are not rendered by default.

72 changes: 59 additions & 13 deletions doc/wine_quality_prediction_report.md
@@ -18,15 +18,17 @@ Harris
The data set was produced by P. Cortez, A. Cerdeira, F. Almeida, T.
Matos and J. Reis. Modeling wine preferences by data mining from
physicochemical properties. In Decision Support Systems, Elsevier,
47(4):547-553, 2009. It was sourced from the UCI Machine Learning
Repository (Dua and Graff 2017): Dua, D. and Graff, C. (2019). UCI
Machine Learning Repository \[<http://archive.ics.uci.edu/ml>\]. Irvine,
CA: University of California, School of Information and Computer
Science.

All machine learning processing and analysis was done using scikit-learn
(Pedregosa et al. 2011), Python (Van Rossum and Drake 2009), and R (R
Core Team 2020). Tables and figures in this report were created with the
help of the knitr (Xie 2014), kableExtra (Zhu 2020), docopt (de Jonge
2018), and pandas (team 2020) packages. File paths were managed using
the here (Müller 2020) package.

## Data Summary

@@ -991,7 +993,7 @@ below.

<div class="figure" style="text-align: center">

<img src="C:/Users/vignesh/career/dsci-522-group14/results/eda_target.png" alt="Figure 1. Target Class Distribution" width="15%" />
<img src="C:/Users/dengj/ubc_mds_21/ds_block3/tiff_522_ds_workflows/dsci-522-group14/results/eda_target.png" alt="Figure 1. Target Class Distribution" width="15%" />

<p class="caption">

@@ -1010,9 +1012,9 @@ train and test sets. For this task, an 80/20 train/test split was used.
As mentioned, there were eleven numeric features and one categorical
feature. For this analysis, the eleven numeric features were scaled
using a standard scaler, which involves removing the mean and scaling to
unit variance (Pedregosa et al. 2011). The categorical feature, wine
type, only contained two values (red and white), so it was treated as a
binary feature.

## Modelling and Cross Validation Scores

@@ -1023,7 +1025,7 @@ consequence to false negative or false positive classifications,
therefore recall and precision are of low importance to us. Accuracy
provides a simple, clear way to compare model performance.

\[ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Number of examples}} \]
The following models were evaluated to predict the wine quality label:

- Dummy classifier
@@ -1645,7 +1647,22 @@ aware of when using this model.

## References

<div id="refs" class="references">
<div id="refs" class="references hanging-indent">

<div id="ref-docopt">

de Jonge, Edwin. 2018. *Docopt: Command-Line Interface Specification
Language*. <https://CRAN.R-project.org/package=docopt>.

</div>

<div id="ref-Dua:2019">

Dua, Dheeru, and Casey Graff. 2017. “UCI Machine Learning Repository.”
University of California, Irvine, School of Information; Computer
Sciences. <http://archive.ics.uci.edu/ml>.

</div>

<div id="ref-here">

Expand All @@ -1662,6 +1679,28 @@ Python.” *Journal of Machine Learning Research* 12: 2825–30.

</div>

<div id="ref-R">

R Core Team. 2020. *R: A Language and Environment for Statistical
Computing*. Vienna, Austria: R Foundation for Statistical Computing.
<https://www.R-project.org/>.

</div>

<div id="ref-reback2020pandas">

team, The pandas development. 2020. *Pandas-Dev/Pandas: Pandas* (version
1.1.1). Zenodo. <https://doi.org/10.5281/zenodo.3993412>.

</div>

<div id="ref-Python">

Van Rossum, Guido, and Fred L. Drake. 2009. *Python 3 Reference Manual*.
Scotts Valley, CA: CreateSpace.

</div>

<div id="ref-knitr">

Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research
Expand All @@ -1671,4 +1710,11 @@ Hall/CRC. <http://www.crcpress.com/product/isbn/9781466561595>.

</div>

<div id="ref-kableExtra">

Zhu, Hao. 2020. *KableExtra: Construct Complex Table with ’Kable’ and
Pipe Syntax*. <https://CRAN.R-project.org/package=kableExtra>.

</div>

</div>
33 changes: 31 additions & 2 deletions doc/wine_refs.bib
@@ -16,7 +16,7 @@ @Article{sk-learn
volume={12},
pages={2825--2830},
year={2011}
}
@InCollection{knitr,
booktitle = {Implementing Reproducible Computational Research},
@@ -54,4 +54,33 @@ @misc{Dua:2019
url = "http://archive.ics.uci.edu/ml",
institution = "University of California, Irvine, School of Information and Computer Sciences"
}
@book{Python,
author = {Van Rossum, Guido and Drake, Fred L.},
title = {Python 3 Reference Manual},
year = {2009},
isbn = {1441412697},
publisher = {CreateSpace},
address = {Scotts Valley, CA}
}


@software{reback2020pandas,
author = {The pandas development team},
title = {pandas-dev/pandas: Pandas},
month = {Aug},
year = {2020},
publisher = {Zenodo},
version = {1.1.1},
doi = {10.5281/zenodo.3993412},
url = {https://doi.org/10.5281/zenodo.3993412}
}

@Manual{R,
title = {R: A Language and Environment for Statistical Computing},
author = {{R Core Team}},
organization = {R Foundation for Statistical Computing},
address = {Vienna, Austria},
year = {2020},
url = {https://www.R-project.org/},
}
