diff --git a/reports/Accuracy_plot.png b/reports/accuracy_plot_revised.png
similarity index 100%
rename from reports/Accuracy_plot.png
rename to reports/accuracy_plot_revised.png
diff --git a/reports/cf_matrix.png b/reports/cf_matrix_revised.png
similarity index 100%
rename from reports/cf_matrix.png
rename to reports/cf_matrix_revised.png
diff --git a/reports/f1.png b/reports/f1_revised.png
similarity index 100%
rename from reports/f1.png
rename to reports/f1_revised.png
diff --git a/reports/models_c.png b/reports/models_c_revised.png
similarity index 100%
rename from reports/models_c.png
rename to reports/models_c_revised.png
diff --git a/reports/reports.Rmd b/reports/reports.Rmd
index f3bab57..d329fcb 100644
--- a/reports/reports.Rmd
+++ b/reports/reports.Rmd
@@ -46,7 +46,7 @@ In this project we are trying to predict the quality of a given wine sample usin
 We eventually decided to pick the neural network Multi-layer Perceptron (MLP) model as the one that yielded the best results after running the various machine learning models on the training data set, comparing their performance based on f1-score and checking consistency across cross-validation runs. We noticed that random forest recorded a high f1 validation score of 0.84; however, it also had a large gap between train and validation scores, with a perfect train score of 1, which led us to believe the model had overfitted. Logistic regression also showed a promising f1 validation score in our case, yet these high results were not consistent across cross-validation splits. Hence, with most models struggling to reach the 0.8 f1-score mark without significantly overfitting on the training set, while MLP showed consistent results across all cross-validation splits, our final choice landed on the MLP model because we think it would generalize better.
 
 ```{r, fig.cap = "Table 1: Score results among the different machine learning models we explored"}
-knitr::include_graphics("models_c.png")
+knitr::include_graphics("models_c_revised.png")
 ```
 
 The Python and R programming languages [@R; @Python] and the following Python and R packages were used to perform the analysis: scikit-learn [@scikit-learn], docoptpython [@docoptpython], docopt [@docopt], altair [@altair], vega-lite [@vega-lite], IPython-ipykernel [@IPython], matplotlib [@matplotlib], scipy [@SciPy], numpy [@harris2020array], pandas [@pandas], graphviz [@graphviz], pandas-profiling [@pandasprofiling2019], knitr [@knitr], tidyverse [@tidyverse], kableExtra [@kableExtra]. The code used to perform the analysis and re-create this report can be found [here](https://github.com/UBC-MDS/Wine_Quality_Predictor#usage).
@@ -64,11 +64,11 @@ knitr::include_graphics("../eda/wine_EDA_files/wine_quality_rank_per_feature.png
 Since this is a multi-class classification problem, our goal was to find a model that was consistent and able to recognize patterns in our data. We chose the neural network Multi-layer Perceptron (MLP) model as it was consistent and showed promising results. If we look at the accuracy scores and f1 scores across cross-validation splits, we can see that it is quite consistent, which was not the case with many other models.
 
 ```{r}
-knitr::include_graphics("f1.png")
+knitr::include_graphics("f1_revised.png")
 ```
 
 ```{r}
-knitr::include_graphics("Accuracy_plot.png")
+knitr::include_graphics("accuracy_plot_revised.png")
 ```
 
@@ -77,7 +77,7 @@ Figure 2: Accuracy scores and f1 scores across cross validation splits for neutr
 Our model performed quite well on the test data as well.
 As we discussed earlier, prediction at the higher end of the wine quality spectrum is acceptable: the confusion matrix below shows a \~15% error rate at the higher end of the spectrum and very few false classifications at the lower end.
 
 ```{r, fig.cap = "Figure 3: Confusion Matrix"}
-knitr::include_graphics("cf_matrix.png")
+knitr::include_graphics("cf_matrix_revised.png")
 ```
 
 Having said that, the research also needs further improvement in terms of obtaining a more balanced data set for training and cross-validation. More feature engineering and selection could be conducted to minimize the effect of correlation among the explanatory variables. Furthermore, in order to assess the robustness of the predictive model, we need to test it on real-world deployment data in addition to our test data.
diff --git a/reports/reports.md b/reports/reports.md
index 7e926ea..316d661 100644
--- a/reports/reports.md
+++ b/reports/reports.md
@@ -112,7 +112,7 @@ because we think it would generalize better.
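
For reviewers who want to sanity-check the model comparison the revised report describes (cross-validated f1 scores for logistic regression, random forest, and the MLP, with the train/validation gap used to flag overfitting), a minimal sketch of that workflow in scikit-learn is shown below. It is illustrative only: the data file path, the `quality` column name, the `f1_weighted` scoring choice, and the hyperparameters are assumptions, not values taken from the repository's scripts.

```python
# Illustrative sketch of the model comparison described in the report;
# file name, column names, and hyperparameters are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed layout: physicochemical features plus a multi-class "quality" label.
train_df = pd.read_csv("data/processed/wine_train.csv")
X_train = train_df.drop(columns=["quality"])
y_train = train_df["quality"]

models = {
    "logistic regression": LogisticRegression(max_iter=2000),
    "random forest": RandomForestClassifier(random_state=123),
    "MLP": MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=123),
}

for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_validate(
        pipe, X_train, y_train, cv=5,
        scoring="f1_weighted",      # the report only says "f1-score"; weighted f1 is one option for imbalanced classes
        return_train_score=True,
    )
    # A perfect train score with a lower, more variable validation score
    # (as the report observed for random forest) suggests overfitting,
    # while validation scores that stay close together across splits
    # are the consistency argument made for the MLP.
    print(name,
          "train f1:", scores["train_score"].round(2),
          "valid f1:", scores["test_score"].round(2))
```

With output of this kind, a classifier whose validation scores stay close together across the five splits is easier to trust than one with a perfect train score and a noticeably lower, more variable validation score, which is the reasoning the report gives for preferring the MLP.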