diff --git a/temp/model+test.html b/temp/model+test.html new file mode 100644 index 0000000..d8f0e41 --- /dev/null +++ b/temp/model+test.html @@ -0,0 +1,15899 @@ + + + + + +model+test + + + + + + + + + + + + + + + + + + + + + + +
+
+ +
+ +
+
+
+

Summary of the data set

The data set used in this project comes from a chemical analysis of Portuguese "Vinho Verde" wines conducted by Paulo Cortez (University of Minho, Guimarães, Portugal) together with A. Cerdeira, F. Almeida, T. Matos and J. Reis (Viticulture Commission of the Vinho Verde Region (CVRVV), Porto, Portugal), 2009. It was sourced from the UCI Machine Learning Repository.

+

There are two data sets, one for red and one for white wine samples. For each wine sample, the inputs are measurements from various objective physicochemical tests, and the output is the median of the wine quality ratings given by experts on a scale from 0 (very bad) to 10 (very excellent). The authors note that data on grape types, wine brand, and wine selling price, among others, are not available due to privacy and logistics issues. There are 1599 observations for red wine and 4898 observations for white wine.

+ +
+
+
+
+

Data Import

+
+
+
+ +
+ +
+
+
+

After importing the downloaded data, the table below shows the summary statistics of all numeric features in the white wine data set.

+ +
+
+
+ +
+ +
+ + + + +
+ +
+
+
+

The same summary for the red wine data set:

+ +
+
+
+ +
+ +
+ + + + +
+ +
+
+
+

Based on the brief summary of the data above, there are no missing values and all features are numeric, so no major preprocessing is needed. We decided to combine the red and white wine data sets so that wine type (i.e. red or white) can be considered as another feature that could be linked to wine quality. Below is the combined data set.
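
The combination step described above can be sketched on two tiny made-up frames instead of the real CSV files. The toy values, the reduced column set, and the `ignore_index=True` option are assumptions made for illustration only; the sketch also repeats the two checks the text relies on (no missing values, all measured features numeric).

```python
import pandas as pd

# Toy stand-ins for the red and white wine tables (values are invented).
red_wine = pd.DataFrame(
    {"alcohol": [9.4, 9.8], "pH": [3.51, 3.20], "quality": [5, 6]}
)
white_wine = pd.DataFrame(
    {"alcohol": [8.8, 9.5, 10.1], "pH": [3.00, 3.30, 3.26], "quality": [6, 6, 7]}
)

# Label each table with its wine type before stacking them.
red_wine["type"] = "red"
white_wine["type"] = "white"

# ignore_index=True (an assumption of this sketch) gives the combined frame
# a fresh 0..n-1 index instead of repeating the row labels of both tables.
wine_df = pd.concat([white_wine, red_wine], axis=0, ignore_index=True)

# Checks matching the claims in the text: no missing values, and every
# column except the added "type" label is numeric.
n_missing = int(wine_df.isna().sum().sum())
numeric_cols = wine_df.select_dtypes(include="number").columns.tolist()
print(wine_df["type"].value_counts().to_dict(), n_missing, numeric_cols)
```

Resetting the index is a design choice rather than a requirement: without it, row labels from the two source tables overlap, which only matters if rows are later selected by label.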

+ +
+
+
+ +
+ +
+ + + + +
+ +
+
+ +
+ +
+
+ +
+ +
+
+ +
+ +
+ + + + +
+ +
+
+ +
+ +
+ + + + +
+ +
+
+
+

PreProcessor

+
+
+
+ +
+ +
+
+ +
+ +
+
+
+ + +

Model Selection

+
+
+
+
+

Stacking Classifier

+
+
+
+ +
+ +
+
+ +
+ +
+
+
+

ALL other classifiers

+
+
+
+ +
+ +
+ + + + +
+ +
+
+
+

Plotting results

All classifiers

+
+
+
+ +
+ +
+ + + + +
+ +
+
+
+

Plotting stability across cv folds

+
+
+
+ +
+ +
+
+ +
+ +
+ + + + +
+ +
+
+ +
+ +
+
+ +
+ +
+ + + + +
+ +
+
+
+

Here, we can observe that the random forest is more consistent across CV folds than the stacking classifier.
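
One way to put a number on "more consistent" is to compare the spread of the per-fold scores, not only their histograms. The sketch below uses invented fold scores for two hypothetical models; they are not the notebook's cross-validation results, and the names `fold_scores` and `summarize` are introduced here purely for illustration.

```python
from statistics import mean, stdev

# Hypothetical per-fold F1 scores for two models (invented numbers,
# not the notebook's actual cross-validation output).
fold_scores = {
    "random forest": [0.84, 0.85, 0.83, 0.84, 0.85],
    "stacking": [0.80, 0.87, 0.82, 0.86, 0.79],
}


def summarize(scores):
    """Return (mean, sample standard deviation) of a list of fold scores."""
    return mean(scores), stdev(scores)


summary = {name: summarize(scores) for name, scores in fold_scores.items()}
for name, (avg, spread) in summary.items():
    print(f"{name}: mean={avg:.3f}, std={spread:.3f}")

# The model with the smallest standard deviation varies least across folds.
most_consistent = min(summary, key=lambda name: summary[name][1])
print("most consistent:", most_consistent)
```

With real results, the same summary could be computed from the `test_f1_micro` column returned by `cross_validate`.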

+
+
+
+
+

HyperParameter Optimization

+
+
+
+ +
+ +
+
+ +
+ +
+
+ +
+ +
+ + + + +
+ +
+
+
+

Interpreting Our model

+
+
+
+ +
+ +
+ + + + +
+ +
+
+ +
+ +
+ + + + +
+ +
+ + + + + + + + + diff --git a/temp/model+test.ipynb b/temp/model+test.ipynb new file mode 100644 index 0000000..29151a2 --- /dev/null +++ b/temp/model+test.ipynb @@ -0,0 +1,1602 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 185, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "%matplotlib inline\n", + "import string\n", + "from collections import deque\n", + "\n", + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", + "import pandas as pd\n", + "import altair as alt\n", + "\n", + "# data\n", + "from sklearn import datasets\n", + "from sklearn.compose import ColumnTransformer, make_column_transformer\n", + "from sklearn.dummy import DummyClassifier, DummyRegressor\n", + "from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor\n", + "from sklearn.feature_extraction.text import CountVectorizer\n", + "\n", + "# Feature selection\n", + "from sklearn.feature_selection import RFE, RFECV\n", + "from sklearn.impute import SimpleImputer\n", + "\n", + "# classifiers / models\n", + "from sklearn.linear_model import RidgeClassifier\n", + "from sklearn.linear_model import LogisticRegression\n", + "\n", + "# other\n", + "from sklearn.metrics import accuracy_score, log_loss, make_scorer, mean_squared_error\n", + "from sklearn.model_selection import (\n", + " GridSearchCV,\n", + " RandomizedSearchCV,\n", + " ShuffleSplit,\n", + " cross_val_score,\n", + " cross_validate,\n", + " train_test_split,\n", + ")\n", + "from sklearn.pipeline import Pipeline, make_pipeline\n", + "from sklearn.preprocessing import (\n", + " OneHotEncoder,\n", + " OrdinalEncoder,\n", + " PolynomialFeatures,\n", + " StandardScaler,\n", + ")\n", + "from sklearn.neighbors import KNeighborsClassifier\n", + "from sklearn.neural_network import MLPClassifier\n", + "from sklearn.neighbors import NearestCentroid\n", + "from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis\n", + "\n", + "from sklearn.model_selection import 
cross_val_predict\n", + "from sklearn.metrics import plot_precision_recall_curve, plot_roc_curve\n", + "from sklearn.metrics import accuracy_score, log_loss, make_scorer, mean_squared_error, confusion_matrix\n", + "\n", + "\n", + "#New import statements to add\n", + "from sklearn.svm import SVC\n", + "from sklearn.ensemble import StackingClassifier, VotingClassifier\n", + "from catboost import CatBoostClassifier\n", + "from sklearn.metrics import (plot_confusion_matrix)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Summary of the data set\n", + "The data set used in this project comes from a chemical analysis of Portuguese \"Vinho Verde\" wines conducted by [Paulo Cortez, University of Minho, Guimarães,\n", + "Portugal](http://www3.dsi.uminho.pt/pcortez) together with A. Cerdeira, F. Almeida, T.\n", + "Matos and J. Reis, Viticulture Commission of the Vinho Verde Region (CVRVV), Porto, Portugal, 2009. It was sourced from the [UCI Machine\n", + "Learning Repository](https://archive.ics.uci.edu/ml/datasets/wine+quality).\n", + "\n", + "There are two data sets, one for red and one for white wine samples. For each wine sample, the inputs are measurements from various objective physicochemical tests, and the output is the median of the wine quality ratings given by experts on a scale from 0 (very bad) to 10 (very excellent). The authors note that data on grape types, wine brand, and wine selling price, among others, are not available due to privacy and logistics issues. There are 1599 observations for red wine and 4898 observations for white wine." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Data Import" + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "metadata": { + "jupyter": { + "source_hidden": true + } + }, + "outputs": [], + "source": [ + "red_wine = pd.read_csv('data/raw/winequality-red.csv')\n", + "white_wine = pd.read_csv('data/raw/winequality-white.csv')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "jupyter": { + "source_hidden": true + } + }, + "source": [ + "After importing the downloaded data, the below tables show the summary statistics of all numeric features in the white wine data set. " + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "metadata": { + "collapsed": true, + "jupyter": { + "outputs_hidden": true, + "source_hidden": true + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
fixed acidityvolatile aciditycitric acidresidual sugarchloridesfree sulfur dioxidetotal sulfur dioxidedensitypHsulphatesalcoholquality
count4898.0000004898.0000004898.0000004898.0000004898.0000004898.0000004898.0000004898.0000004898.0000004898.0000004898.0000004898.000000
mean6.8547880.2782410.3341926.3914150.04577235.308085138.3606570.9940273.1882670.48984710.5142675.877909
std0.8438680.1007950.1210205.0720580.02184817.00713742.4980650.0029910.1510010.1141261.2306210.885639
min3.8000000.0800000.0000000.6000000.0090002.0000009.0000000.9871102.7200000.2200008.0000003.000000
25%6.3000000.2100000.2700001.7000000.03600023.000000108.0000000.9917233.0900000.4100009.5000005.000000
50%6.8000000.2600000.3200005.2000000.04300034.000000134.0000000.9937403.1800000.47000010.4000006.000000
75%7.3000000.3200000.3900009.9000000.05000046.000000167.0000000.9961003.2800000.55000011.4000006.000000
max14.2000001.1000001.66000065.8000000.346000289.000000440.0000001.0389803.8200001.08000014.2000009.000000
\n", + "
" + ], + "text/plain": [ + " fixed acidity volatile acidity citric acid residual sugar \\\n", + "count 4898.000000 4898.000000 4898.000000 4898.000000 \n", + "mean 6.854788 0.278241 0.334192 6.391415 \n", + "std 0.843868 0.100795 0.121020 5.072058 \n", + "min 3.800000 0.080000 0.000000 0.600000 \n", + "25% 6.300000 0.210000 0.270000 1.700000 \n", + "50% 6.800000 0.260000 0.320000 5.200000 \n", + "75% 7.300000 0.320000 0.390000 9.900000 \n", + "max 14.200000 1.100000 1.660000 65.800000 \n", + "\n", + " chlorides free sulfur dioxide total sulfur dioxide density \\\n", + "count 4898.000000 4898.000000 4898.000000 4898.000000 \n", + "mean 0.045772 35.308085 138.360657 0.994027 \n", + "std 0.021848 17.007137 42.498065 0.002991 \n", + "min 0.009000 2.000000 9.000000 0.987110 \n", + "25% 0.036000 23.000000 108.000000 0.991723 \n", + "50% 0.043000 34.000000 134.000000 0.993740 \n", + "75% 0.050000 46.000000 167.000000 0.996100 \n", + "max 0.346000 289.000000 440.000000 1.038980 \n", + "\n", + " pH sulphates alcohol quality \n", + "count 4898.000000 4898.000000 4898.000000 4898.000000 \n", + "mean 3.188267 0.489847 10.514267 5.877909 \n", + "std 0.151001 0.114126 1.230621 0.885639 \n", + "min 2.720000 0.220000 8.000000 3.000000 \n", + "25% 3.090000 0.410000 9.500000 5.000000 \n", + "50% 3.180000 0.470000 10.400000 6.000000 \n", + "75% 3.280000 0.550000 11.400000 6.000000 \n", + "max 3.820000 1.080000 14.200000 9.000000 " + ] + }, + "execution_count": 72, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "white_wine.describe()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "jupyter": { + "source_hidden": true + } + }, + "source": [ + "Similar table for red wine data set" + ] + }, + { + "cell_type": "code", + "execution_count": 73, + "metadata": { + "collapsed": true, + "jupyter": { + "outputs_hidden": true, + "source_hidden": true + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
fixed acidityvolatile aciditycitric acidresidual sugarchloridesfree sulfur dioxidetotal sulfur dioxidedensitypHsulphatesalcoholquality
count1599.0000001599.0000001599.0000001599.0000001599.0000001599.0000001599.0000001599.0000001599.0000001599.0000001599.0000001599.000000
mean8.3196370.5278210.2709762.5388060.08746715.87492246.4677920.9967473.3111130.65814910.4229835.636023
std1.7410960.1790600.1948011.4099280.04706510.46015732.8953240.0018870.1543860.1695071.0656680.807569
min4.6000000.1200000.0000000.9000000.0120001.0000006.0000000.9900702.7400000.3300008.4000003.000000
25%7.1000000.3900000.0900001.9000000.0700007.00000022.0000000.9956003.2100000.5500009.5000005.000000
50%7.9000000.5200000.2600002.2000000.07900014.00000038.0000000.9967503.3100000.62000010.2000006.000000
75%9.2000000.6400000.4200002.6000000.09000021.00000062.0000000.9978353.4000000.73000011.1000006.000000
max15.9000001.5800001.00000015.5000000.61100072.000000289.0000001.0036904.0100002.00000014.9000008.000000
\n", + "
" + ], + "text/plain": [ + " fixed acidity volatile acidity citric acid residual sugar \\\n", + "count 1599.000000 1599.000000 1599.000000 1599.000000 \n", + "mean 8.319637 0.527821 0.270976 2.538806 \n", + "std 1.741096 0.179060 0.194801 1.409928 \n", + "min 4.600000 0.120000 0.000000 0.900000 \n", + "25% 7.100000 0.390000 0.090000 1.900000 \n", + "50% 7.900000 0.520000 0.260000 2.200000 \n", + "75% 9.200000 0.640000 0.420000 2.600000 \n", + "max 15.900000 1.580000 1.000000 15.500000 \n", + "\n", + " chlorides free sulfur dioxide total sulfur dioxide density \\\n", + "count 1599.000000 1599.000000 1599.000000 1599.000000 \n", + "mean 0.087467 15.874922 46.467792 0.996747 \n", + "std 0.047065 10.460157 32.895324 0.001887 \n", + "min 0.012000 1.000000 6.000000 0.990070 \n", + "25% 0.070000 7.000000 22.000000 0.995600 \n", + "50% 0.079000 14.000000 38.000000 0.996750 \n", + "75% 0.090000 21.000000 62.000000 0.997835 \n", + "max 0.611000 72.000000 289.000000 1.003690 \n", + "\n", + " pH sulphates alcohol quality \n", + "count 1599.000000 1599.000000 1599.000000 1599.000000 \n", + "mean 3.311113 0.658149 10.422983 5.636023 \n", + "std 0.154386 0.169507 1.065668 0.807569 \n", + "min 2.740000 0.330000 8.400000 3.000000 \n", + "25% 3.210000 0.550000 9.500000 5.000000 \n", + "50% 3.310000 0.620000 10.200000 6.000000 \n", + "75% 3.400000 0.730000 11.100000 6.000000 \n", + "max 4.010000 2.000000 14.900000 8.000000 " + ] + }, + "execution_count": 73, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "red_wine.describe()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "jupyter": { + "source_hidden": true + } + }, + "source": [ + "Base on the brief summary of the data above, there is no missing value, all the features have numeric values, hence there is no major preprocessing needed. We decide to combine the two data sets of red wine and white wine to consider wine type (i.e. 
red or white) as another possible feature that could be linked to wine quality. Below is the combined data set." + ] + }, + { + "cell_type": "code", + "execution_count": 74, + "metadata": { + "collapsed": true, + "jupyter": { + "outputs_hidden": true, + "source_hidden": true + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
fixed acidityvolatile aciditycitric acidresidual sugarchloridesfree sulfur dioxidetotal sulfur dioxidedensitypHsulphatesalcoholqualitytype
07.00.2700.3620.70.04545.0170.01.001003.000.458.86white
16.30.3000.341.60.04914.0132.00.994003.300.499.56white
28.10.2800.406.90.05030.097.00.995103.260.4410.16white
37.20.2300.328.50.05847.0186.00.995603.190.409.96white
47.20.2300.328.50.05847.0186.00.995603.190.409.96white
..........................................
15946.20.6000.082.00.09032.044.00.994903.450.5810.55red
15955.90.5500.102.20.06239.051.00.995123.520.7611.26red
15966.30.5100.132.30.07629.040.00.995743.420.7511.06red
15975.90.6450.122.00.07532.044.00.995473.570.7110.25red
15986.00.3100.473.60.06718.042.00.995493.390.6611.06red
\n", + "

6497 rows × 13 columns

\n", + "
" + ], + "text/plain": [ + " fixed acidity volatile acidity citric acid residual sugar chlorides \\\n", + "0 7.0 0.270 0.36 20.7 0.045 \n", + "1 6.3 0.300 0.34 1.6 0.049 \n", + "2 8.1 0.280 0.40 6.9 0.050 \n", + "3 7.2 0.230 0.32 8.5 0.058 \n", + "4 7.2 0.230 0.32 8.5 0.058 \n", + "... ... ... ... ... ... \n", + "1594 6.2 0.600 0.08 2.0 0.090 \n", + "1595 5.9 0.550 0.10 2.2 0.062 \n", + "1596 6.3 0.510 0.13 2.3 0.076 \n", + "1597 5.9 0.645 0.12 2.0 0.075 \n", + "1598 6.0 0.310 0.47 3.6 0.067 \n", + "\n", + " free sulfur dioxide total sulfur dioxide density pH sulphates \\\n", + "0 45.0 170.0 1.00100 3.00 0.45 \n", + "1 14.0 132.0 0.99400 3.30 0.49 \n", + "2 30.0 97.0 0.99510 3.26 0.44 \n", + "3 47.0 186.0 0.99560 3.19 0.40 \n", + "4 47.0 186.0 0.99560 3.19 0.40 \n", + "... ... ... ... ... ... \n", + "1594 32.0 44.0 0.99490 3.45 0.58 \n", + "1595 39.0 51.0 0.99512 3.52 0.76 \n", + "1596 29.0 40.0 0.99574 3.42 0.75 \n", + "1597 32.0 44.0 0.99547 3.57 0.71 \n", + "1598 18.0 42.0 0.99549 3.39 0.66 \n", + "\n", + " alcohol quality type \n", + "0 8.8 6 white \n", + "1 9.5 6 white \n", + "2 10.1 6 white \n", + "3 9.9 6 white \n", + "4 9.9 6 white \n", + "... ... ... ... 
\n", + "1594 10.5 5 red \n", + "1595 11.2 6 red \n", + "1596 11.0 6 red \n", + "1597 10.2 5 red \n", + "1598 11.0 6 red \n", + "\n", + "[6497 rows x 13 columns]" + ] + }, + "execution_count": 74, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "white_wine['type'] = 'white'\n", + "red_wine['type'] = 'red'\n", + "wine_df = pd.concat([white_wine, red_wine], axis = 0)\n", + "wine_df" + ] + }, + { + "cell_type": "code", + "execution_count": 75, + "metadata": { + "jupyter": { + "source_hidden": true + } + }, + "outputs": [], + "source": [ + "bins = (1, 4, 6, 9)\n", + "rating_groups = ['poor','normal','excellent']\n", + "wine_df['quality'] = pd.cut(wine_df['quality'], bins = bins, labels = rating_groups)" + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "metadata": { + "jupyter": { + "source_hidden": true + } + }, + "outputs": [], + "source": [ + "train_df, test_df = train_test_split(wine_df,test_size = 0.2 , random_state = 123)\n", + "\n", + "#train_df['type'] = train_df['type'].astype('category')\n", + "#test_df['type'] = test_df['type'].astype('category')\n", + "X_train = train_df.drop(columns = ['quality'], axis=1)\n", + "y_train = train_df['quality']\n", + "\n", + "X_test = test_df.drop(columns = ['quality'], axis=1)\n", + "y_test = test_df['quality']\n" + ] + }, + { + "cell_type": "code", + "execution_count": 77, + "metadata": { + "collapsed": true, + "jupyter": { + "outputs_hidden": true, + "source_hidden": true + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "normal 3967\n", + "excellent 1028\n", + "poor 202\n", + "Name: quality, dtype: int64" + ] + }, + "execution_count": 77, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "y_train.value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "metadata": { + "collapsed": true, + "jupyter": { + "outputs_hidden": true, + "source_hidden": true + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", 
+ "text": [ + "\n", + "Int64Index: 5197 entries, 1554 to 3582\n", + "Data columns (total 13 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 fixed acidity 5197 non-null float64 \n", + " 1 volatile acidity 5197 non-null float64 \n", + " 2 citric acid 5197 non-null float64 \n", + " 3 residual sugar 5197 non-null float64 \n", + " 4 chlorides 5197 non-null float64 \n", + " 5 free sulfur dioxide 5197 non-null float64 \n", + " 6 total sulfur dioxide 5197 non-null float64 \n", + " 7 density 5197 non-null float64 \n", + " 8 pH 5197 non-null float64 \n", + " 9 sulphates 5197 non-null float64 \n", + " 10 alcohol 5197 non-null float64 \n", + " 11 quality 5197 non-null category\n", + " 12 type 5197 non-null object \n", + "dtypes: category(1), float64(11), object(1)\n", + "memory usage: 533.0+ KB\n" + ] + } + ], + "source": [ + "train_df.info()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# PreProcessor" + ] + }, + { + "cell_type": "code", + "execution_count": 207, + "metadata": {}, + "outputs": [], + "source": [ + "numeric_features = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', \n", + " 'chlorides', 'free sulfur dioxide','total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']\n", + "binary_features = ['type']\n", + "\n", + "numeric_transformer = make_pipeline(SimpleImputer(), StandardScaler())\n", + "binary_transformer = make_pipeline(OneHotEncoder(drop=\"if_binary\", dtype=int))\n", + "\n", + "preprocessor = ColumnTransformer(\n", + " transformers = [\n", + " ('num', numeric_transformer, numeric_features),\n", + " ('bin', binary_transformer, binary_features)\n", + " ]\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 211, + "metadata": {}, + "outputs": [], + "source": [ + "#DataStructure to store results\n", + "results={}\n", + "# helper function from lectures 573, UBC MDS\n", + "def mean_std_cross_val_scores(model, X_train, y_train, 
**kwargs):\n", + " \"\"\"\n", + " Returns mean and std of cross validation\n", + " \"\"\"\n", + " scores = cross_validate(model, \n", + " X_train, y_train, n_jobs=-1, \n", + " **kwargs) \n", + " \n", + " mean_scores = pd.DataFrame(scores).mean()\n", + " #std_scores = pd.DataFrame(scores).std()\n", + " out_col = []\n", + "\n", + " for i in range(len(mean_scores)): \n", + " out_col.append(mean_scores[i])\n", + "\n", + " return pd.Series(data = out_col, index = mean_scores.index)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + " \n", + "# Model Selection\n", + " \n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Stacking Classifier" + ] + }, + { + "cell_type": "code", + "execution_count": 212, + "metadata": {}, + "outputs": [], + "source": [ + "#Stacking Classifier\n", + "scoring_metric = {'f1_micro'}\n", + "pipe_rf = make_pipeline(preprocessor, RandomForestClassifier(bootstrap=False, max_depth=20,\n", + " max_features='sqrt', n_estimators=1800,\n", + " random_state=123))\n", + "pipe_catboost = make_pipeline(preprocessor, CatBoostClassifier(verbose=0, random_state=123))\n", + "pipe_svc = make_pipeline(preprocessor, SVC(probability=True))\n", + "classifiers = {\n", + " \"svm\": pipe_svc,\n", + " 'random forest' : pipe_rf,\n", + " 'CatBoost' : pipe_catboost,\n", + "}" + ] + }, + { + "cell_type": "code", + "execution_count": 213, + "metadata": {}, + "outputs": [], + "source": [ + "#Testing stacking model\n", + "stacking_model = StackingClassifier(list(classifiers.items()), \n", + " final_estimator=RandomForestClassifier())\n", + "\n", + "results['StackingClf'] = mean_std_cross_val_scores(stacking_model, X_train, y_train, return_train_score=True, scoring=scoring_metric)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### ALL other classifiers" + ] + }, + { + "cell_type": "code", + "execution_count": 214, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
StackingClfRidgeClassifierRandom ForestKNNMLP ClassifierNearest CentroidQDA
fit_time172.5207580.03120318.6037350.02517814.3872230.0145310.020165
score_time0.7251420.0096270.5309300.1113330.0125450.0124880.011142
test_f1_micro0.8343280.7791050.8431770.7977690.8079610.5020230.715223
train_f1_micro0.9989900.7823271.0000000.8531850.9873490.5040400.723205
\n", + "
" + ], + "text/plain": [ + " StackingClf RidgeClassifier Random Forest KNN \\\n", + "fit_time 172.520758 0.031203 18.603735 0.025178 \n", + "score_time 0.725142 0.009627 0.530930 0.111333 \n", + "test_f1_micro 0.834328 0.779105 0.843177 0.797769 \n", + "train_f1_micro 0.998990 0.782327 1.000000 0.853185 \n", + "\n", + " MLP Classifier Nearest Centroid QDA \n", + "fit_time 14.387223 0.014531 0.020165 \n", + "score_time 0.012545 0.012488 0.011142 \n", + "test_f1_micro 0.807961 0.502023 0.715223 \n", + "train_f1_micro 0.987349 0.504040 0.723205 " + ] + }, + "execution_count": 214, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#Model selection plot \n", + "\n", + "classifiers_plot = {\n", + " \"RidgeClassifier\": RidgeClassifier(random_state=123),\n", + " \"Random Forest\":RandomForestClassifier(bootstrap=False, max_depth=20,\n", + " max_features='sqrt', n_estimators=1800,\n", + " random_state=123),\n", + " \"KNN\": KNeighborsClassifier(n_neighbors=5),\n", + " \"MLP Classifier\":MLPClassifier(alpha=0.05, hidden_layer_sizes=(50, 100, 50),\n", + " learning_rate='adaptive', max_iter=1000,random_state=123),\n", + " \"Nearest Centroid\": NearestCentroid(),\n", + " \"QDA\" :QuadraticDiscriminantAnalysis()\n", + "}\n", + "for (name, model) in classifiers_plot.items():\n", + " pipe_iter = make_pipeline(preprocessor, model)\n", + " results[name] = mean_std_cross_val_scores(pipe_iter, X_train, y_train, return_train_score=True, scoring=scoring_metric)\n", + "pd.DataFrame(results)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Plotting results\n", + "### All classifiers" + ] + }, + { + "cell_type": "code", + "execution_count": 220, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n", + "" + ], + "text/plain": [ + "alt.Chart(...)" + ] + }, + "execution_count": 220, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#alt.Chart(results).encode\n", + "plot_results = pd.DataFrame(results).T\n", + "plot_results =plot_results.reset_index()\n", + "bar = alt.Chart(plot_results).mark_bar().encode(\n", + " alt.X('test_f1_micro', axis=alt.Axis(title='F1 Micro score')),\n", + " alt.Y('index', sort='-x', axis=alt.Axis(title='Classifier')),\n", + ").properties(\n", + " width=alt.Step(40) # controls width of bar.\n", + ")\n", + "bar" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Plotting stability across cv folds" + ] + }, + { + "cell_type": "code", + "execution_count": 256, + "metadata": {}, + "outputs": [], + "source": [ + "scores_rf = cross_validate(pipe_rf, X_train, y_train, \n", + " return_train_score=True,scoring = scoring_metric, n_jobs=-1, cv=20 )" + ] + }, + { + "cell_type": "code", + "execution_count": 265, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n", + "" + ], + "text/plain": [ + "alt.Chart(...)" + ] + }, + "execution_count": 265, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "plot_rf = pd.DataFrame(scores_rf)\n", + "bar = alt.Chart(plot_rf).mark_bar().encode(\n", + " x= alt.X('test_f1_micro', axis=alt.Axis(title='F1 Micro score'), bin=alt.Bin(maxbins=6)),\n", + " y= alt.Y('count()'),\n", + ")\n", + "bar" + ] + }, + { + "cell_type": "code", + "execution_count": 267, + "metadata": {}, + "outputs": [], + "source": [ + "#pipe_mlp = make_pipeline(preprocessor, model)\n", + "scores_sml = cross_validate(stacking_model, X_train, y_train, \n", + " return_train_score=True,scoring = scoring_metric, n_jobs=-1, cv=20 )" + ] + }, + { + "cell_type": "code", + "execution_count": 270, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n", + "" + ], + "text/plain": [ + "alt.Chart(...)" + ] + }, + "execution_count": 270, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "plot_sml = pd.DataFrame(scores_sml)\n", + "alt.Chart(plot_sml).mark_bar().encode(\n", + " x= alt.X('test_f1_micro', axis=alt.Axis(title='F1 Micro score'), bin=alt.Bin(maxbins=6)),\n", + " y= alt.Y('count()'),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Here, we can observe that Random forests are more consistent across CV folds" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# HyperParameter Optimization" + ] + }, + { + "cell_type": "code", + "execution_count": 160, + "metadata": {}, + "outputs": [], + "source": [ + "param_dist = {\n", + " 'randomforestclassifier__max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100,300,500,450,200,250, 600, 700],\n", + " 'randomforestclassifier__bootstrap': [True, False],\n", + " 'randomforestclassifier__max_features': ['auto', 'sqrt'],\n", + " 'randomforestclassifier__min_samples_leaf': [1, 2, 4],\n", + " 'randomforestclassifier__min_samples_split': [2, 5, 10],\n", + " 'randomforestclassifier__n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]\n", + "}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "random_search = RandomizedSearchCV(pipe_rf, param_distributions=param_dist, n_jobs=-1, n_iter=20, cv=10, scoring = scoring_metric)\n", + "random_search.fit(X_train, y_train)" + ] + }, + { + "cell_type": "code", + "execution_count": 271, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Best cv score from grid search: 0.829\n" + ] + }, + { + "data": { + "text/plain": [ + "{'randomforestclassifier__n_estimators': 1800,\n", + " 'randomforestclassifier__min_samples_split': 2,\n", + " 'randomforestclassifier__min_samples_leaf': 1,\n", + " 
'randomforestclassifier__max_features': 'sqrt',\n", + " 'randomforestclassifier__max_depth': 20,\n", + " 'randomforestclassifier__bootstrap': False}" + ] + }, + "execution_count": 271, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "print(\"Best cv score from grid search: %.3f\" % random_search.best_score_)\n", + "random_search.best_params_" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Interpreting Our model" + ] + }, + { + "cell_type": "code", + "execution_count": 273, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.8523076923076923" + ] + }, + "execution_count": 273, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "best_model_pipe = random_search.best_estimator_\n", + "best_model_pipe.fit(X_train, y_train)\n", + "best_model_pipe.score(X_test, y_test)" + ] + }, + { + "cell_type": "code", + "execution_count": 275, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "from sklearn.metrics import (plot_confusion_matrix)\n", + "plot_confusion_matrix(best_model_pipe, X_test, y_test, cmap = plt.cm.Blues, normalize='true')\n", + "predictions_m = best_model_pipe.predict(X_test)\n", + "cm = confusion_matrix(y_test, predictions_m)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python [conda env:573]", + "language": "python", + "name": "conda-env-573-py" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.6" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}