From feb3a44c23bf4d8493525bc7067b5b75df9c327f Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sat, 31 Dec 2022 14:20:28 -0800 Subject: [PATCH 01/45] starting work on ch5+6; categorical type change; remove commented out R code --- source/classification1.md | 67 ++++++++++----------------------------- 1 file changed, 17 insertions(+), 50 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index db09a47a..fe974ac1 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -83,7 +83,7 @@ By the end of the chapter, readers will be able to do the following: ```{index} see: feature ; predictor ``` -In many situations, we want to make `predictions` based on the current situation +In many situations, we want to make predictions based on the current situation as well as past experiences. For instance, a doctor may want to diagnose a patient as either diseased or healthy based on their symptoms and the doctor's past experience with patients; an email provider might want to tag a given @@ -206,65 +206,32 @@ total set of variables per image in this data set is: ```{index} info ``` -Below we use `.info()` to preview the data frame. This method can -make it easier to inspect the data when we have a lot of columns, -as it prints the data such that the columns go down -the page (instead of across). +Below we use the `info` method to preview the data frame. This method can +make it easier to inspect the data when we have a lot of columns: +it prints only the column names down the page (instead of across), +as well as their data types and the number of non-missing entries. ```{code-cell} ipython3 cancer.info() ``` -From the summary of the data above, we can see that `Class` is of type object. - -+++ - -Given that we only have two different values in our `Class` column (B for benign and M -for malignant), we only expect to get two names back. +From the summary of the data above, we can see that `Class` is of type `object`. 
+We can use the `unique` method on the `Class` column to see all unique values +present in that column. We see that there are two diagnoses: +benign, represented by `'B'`, and malignant, represented by `'M'`. ```{code-cell} ipython3 cancer['Class'].unique() ``` -```{code-cell} ipython3 -:tags: [remove-cell] +Since we will be working with `Class` as a categorical statistical variable, +it is a good idea to convert it to the `category` type using the `astype` method +on the `cancer` data frame. We will verify the result using the `info` method +again. -## The above was based on the following text and code in R textbook. ## -####################################################################### -# Below we use `glimpse` \index{glimpse} to preview the data frame. This function can -# make it easier to inspect the data when we have a lot of columns, -# as it prints the data such that the columns go down -# the page (instead of across). - -# ```{r 05-glimpse} -# glimpse(cancer) -# ``` - -# From the summary of the data above, we can see that `Class` is of type character -# (denoted by ``). Since we will be working with `Class` as a -# categorical statistical variable, we will convert it to a factor using the -# function `as_factor`. \index{factor!as\_factor} - -# ```{r 05-class} -# cancer <- cancer |> -# mutate(Class = as_factor(Class)) -# glimpse(cancer) -# ``` - -# Recall that factors have what are called "levels", which you can think of as categories. We -# can verify the levels of the `Class` column by using the `levels` \index{levels}\index{factor!levels} function. -# This function should return the name of each category in that column. Given -# that we only have two different values in our `Class` column (B for benign and M -# for malignant), we only expect to get two names back. 
Note that the `levels` function requires a *vector* argument; -# so we use the `pull` function to extract a single column (`Class`) and -# pass that into the `levels` function to see the categories -# in the `Class` column. - -# ```{r 05-levels} -# cancer |> -# pull(Class) |> -# levels() -# ``` +```{code-cell} ipython3 +cancer['Class'] = cancer['Class'].astype('category') +cancer.info() ``` ### Exploring the cancer data @@ -273,7 +240,7 @@ cancer['Class'].unique() ``` Before we start doing any modeling, let's explore our data set. Below we use -the `.groupby()`, `.count()` methods to find the number and percentage +the `groupby` and `count` methods to find the number and percentage of benign and malignant tumor observations in our data set. When paired with `.groupby()`, `.count()` counts the number of observations in each `Class` group. Then we calculate the percentage in each group by dividing by the total number of observations. We have 357 (63\%) benign and 212 (37\%) malignant tumor observations. 
From a507994626778752ce61d8906858aa95ceb52165 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sat, 31 Dec 2022 15:21:22 -0800 Subject: [PATCH 02/45] value counts, class name remap, replace in ch5 --- source/classification1.md | 119 ++++++++++++++++++++------------------ 1 file changed, 64 insertions(+), 55 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index fe974ac1..c8818220 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -15,38 +15,6 @@ kernelspec: (classification)= # Classification I: training & predicting -```{code-cell} ipython3 -:tags: [remove-cell] - -import random - -import altair as alt -from altair_saver import save -import numpy as np -import pandas as pd -import sklearn -from sklearn.compose import make_column_transformer -from sklearn.neighbors import KNeighborsClassifier -from sklearn.pipeline import Pipeline, make_pipeline -from sklearn.metrics.pairwise import euclidean_distances -from sklearn.preprocessing import StandardScaler -import plotly.express as px -import plotly.graph_objs as go -from plotly.offline import iplot, plot -from IPython.display import HTML -from myst_nb import glue - -alt.data_transformers.disable_max_rows() - -# alt.renderers.enable('altair_saver', fmts=['vega-lite', 'png']) - -# # Handle large data sets by not embedding them in the notebook -# alt.data_transformers.enable('data_server') - -# # Save a PNG blob as a backup for when the Altair plots do not render -# alt.renderers.enable('mimetype') -``` - ## Overview In previous chapters, we focused solely on descriptive and exploratory data analysis questions. @@ -155,10 +123,11 @@ guide patient treatment. Our first step is to load, wrangle, and explore the data using visualizations in order to better understand the data we are working with. We start by -loading the `pandas` package needed for our analysis. +loading the `pandas` and `altair` packages needed for our analysis. 
```{code-cell} ipython3 import pandas as pd +import altair as alt ``` In this case, the file containing the breast cancer data set is a `.csv` @@ -215,6 +184,9 @@ as well as their data types and the number of non-missing entries. cancer.info() ``` +```{index} unique +``` + From the summary of the data above, we can see that `Class` is of type `object`. We can use the `unique` method on the `Class` column to see all unique values present in that column. We see that there are two diagnoses: @@ -224,59 +196,95 @@ benign, represented by `'B'`, and malignant, represented by `'M'`. cancer['Class'].unique() ``` -Since we will be working with `Class` as a categorical statistical variable, +We will also improve the readability of our analysis +by renaming `'M'` to `'Malignant'` and `'B'` to `'Benign'` using the `replace` +method. The `replace` method takes one argument: a dictionary that maps +previous values to desired new values. +Furthermore, since we will be working with `Class` as a categorical statistical variable, it is a good idea to convert it to the `category` type using the `astype` method -on the `cancer` data frame. We will verify the result using the `info` method -again. +on the `cancer` data frame. We will verify the result using the `info` +and `unique` methods again. 
+ +```{index} replace +``` ```{code-cell} ipython3 +cancer['Class'] = cancer['Class'].replace({ + 'M' : 'Malignant', + 'B' : 'Benign' + }) cancer['Class'] = cancer['Class'].astype('category') cancer.info() ``` +```{code-cell} ipython3 +cancer['Class'].unique() +``` + ### Exploring the cancer data ```{index} groupby, count ``` +```{code-cell} ipython3 +:tags: [remove-cell] +from myst_nb import glue +import numpy as np +glue("benign_count", cancer['Class'].value_counts()['Benign']) +glue("benign_pct", int(np.round(100*cancer['Class'].value_counts(normalize=True)['Benign']))) +glue("malignant_count", cancer['Class'].value_counts()['Malignant']) +glue("malignant_pct", int(np.round(100*cancer['Class'].value_counts(normalize=True)['Malignant']))) +``` + Before we start doing any modeling, let's explore our data set. Below we use the `groupby` and `count` methods to find the number and percentage -of benign and malignant tumor observations in our data set. When paired with `.groupby()`, `.count()` counts the number of observations in each `Class` group. -Then we calculate the percentage in each group by dividing by the total number of observations. We have 357 (63\%) benign and 212 (37\%) malignant tumor observations. +of benign and malignant tumor observations in our data set. When paired with +`groupby`, `count` counts the number of observations for each value of the `Class` +variable. Then we calculate the percentage in each group by dividing by the total +number of observations and multiplying by 100. We have +{glue:}`benign_count` ({glue:}`benign_pct`\%) benign and +{glue:}`malignant_count` ({glue:}`malignant_pct`\%) malignant +tumor observations. 
```{code-cell} ipython3 -num_obs = len(cancer) explore_cancer = pd.DataFrame() explore_cancer['count'] = cancer.groupby('Class')['ID'].count() -explore_cancer['percentage'] = explore_cancer['count'] / num_obs * 100 +explore_cancer['percentage'] = 100 * explore_cancer['count']/len(cancer) explore_cancer ``` -```{index} visualization; scatter +```{index} value_counts ``` -Next, let's draw a scatter plot to visualize the relationship between the -perimeter and concavity variables. Rather than use `altair's` default palette, -we select our own colorblind-friendly colors—`"#efb13f"` -for light orange and `"#86bfef"` for light blue—and - pass them as the `scale` argument in the `color` argument. -We also make the category labels ("B" and "M") more readable by -changing them to "Benign" and "Malignant" using `.apply()` method on the dataframe. +The `pandas` package also has a more convenient specialized `value_counts` method for +counting the number of occurrences of each value in a column. If we pass no arguments +to the method, it outputs a series containing the number of occurrences +of each value. If we instead pass the argument `normalize=True`, it outputs the fraction +of occurrences of each value. ```{code-cell} ipython3 -:tags: [] +cancer['Class'].value_counts() +``` -colors = ["#86bfef", "#efb13f"] -cancer["Class"] = cancer["Class"].apply( - lambda x: "Malignant" if (x == "M") else "Benign" -) +```{code-cell} ipython3 +cancer['Class'].value_counts(normalize=True) +``` + +```{index} visualization; scatter +``` + +Next, let's draw a colored scatter plot to visualize the relationship between the +perimeter and concavity variables. Recall that `altair`'s default palette +is colorblind-friendly, so we can stick with that here. 
+ +```{code-cell} ipython3 perim_concav = ( alt.Chart(cancer) .mark_point(opacity=0.6, filled=True, size=40) .encode( x=alt.X("Perimeter", title="Perimeter (standardized)"), y=alt.Y("Concavity", title="Concavity (standardized)"), - color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"), + color=alt.Color("Class", title="Diagnosis"), ) ) perim_concav @@ -305,7 +313,8 @@ you classify that new observation? If the standardized concavity and perimeter values are 1 and 1 respectively, the point would lie in the middle of the orange cloud of malignant points and thus we could probably classify it as malignant. Based on our visualization, it seems like -the *prediction of an unobserved label* might be possible. +it may be possible to make accurate predictions of the `Class` variable (i.e., a diagnosis) for +tumor images with unknown diagnoses. +++ From fd9a88212cffc27f4b6a36089a47dc872105f9ad Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sat, 31 Dec 2022 15:28:42 -0800 Subject: [PATCH 03/45] remove warnings --- source/classification1.md | 8 ++++++++ source/classification2.md | 8 ++++++++ 2 files changed, 16 insertions(+) diff --git a/source/classification1.md b/source/classification1.md index c8818220..c64de132 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -12,6 +12,14 @@ kernelspec: name: python3 --- +```{code-cell} ipython3 +:tags: [remove-cell] +import warnings +def warn(*args, **kwargs): + pass +warnings.warn = warn +``` + (classification)= # Classification I: training & predicting diff --git a/source/classification2.md b/source/classification2.md index 8daf6c38..8087cfb4 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -15,6 +15,14 @@ kernelspec: (classification2)= # Classification II: evaluation & tuning +```{code-cell} ipython3 +:tags: [remove-cell] +import warnings +def warn(*args, **kwargs): + pass +warnings.warn = warn +``` + ```{code-cell} ipython3 :tags: [remove-cell] From 
8b20e7f19c58a2a42964310242c30909ce09ced9 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sat, 31 Dec 2022 15:42:46 -0800 Subject: [PATCH 04/45] polished ch5+6 up to euclidean dist --- source/classification1.md | 78 +++++++++++++++++++++------------------ source/classification2.md | 14 +++---- 2 files changed, 49 insertions(+), 43 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index c64de132..063b0356 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -12,12 +12,16 @@ kernelspec: name: python3 --- +%```{code-cell} ipython3 +%:tags: [remove-cell] +%import warnings +%def warn(*args, **kwargs): +% pass +%warnings.warn = warn +%``` + ```{code-cell} ipython3 -:tags: [remove-cell] -import warnings -def warn(*args, **kwargs): - pass -warnings.warn = warn +from sklearn.metrics.pairwise import euclidean_distances ``` (classification)= @@ -332,6 +336,8 @@ tumor images with unknown diagnoses. :tags: [remove-cell] new_point = [2, 4] +glue("new_point_1_0", new_point[0]) +glue("new_point_1_1", new_point[1]) attrs = ["Perimeter", "Concavity"] points_df = pd.DataFrame( {"Perimeter": new_point[0], "Concavity": new_point[1], "Class": ["Unknown"]} @@ -342,8 +348,6 @@ perim_concav_with_new_point_df = pd.concat((cancer, points_df), ignore_index=Tru my_distances = euclidean_distances(perim_concav_with_new_point_df.loc[:, attrs])[ len(cancer) ][:-1] -glue("1-new_point_0", new_point[0]) -glue("1-new_point_1", new_point[1]) ``` ```{index} K-nearest neighbors; classification @@ -361,8 +365,11 @@ $K$ for us. We will cover how to choose $K$ ourselves in the next chapter. To illustrate the concept of $K$-nearest neighbors classification, we will walk through an example. Suppose we have a -new observation, with standardized perimeter of {glue:}`1-new_point_0` and standardized concavity of {glue:}`1-new_point_1`, whose -diagnosis "Class" is unknown. This new observation is depicted by the red, diamond point in {numref}`fig:05-knn-2`. 
+new observation, with standardized perimeter +of {glue:}`new_point_1_0` and standardized concavity +of {glue:}`new_point_1_1`, whose +diagnosis "Class" is unknown. This new observation is +depicted by the red, diamond point in {numref}`fig:05-knn-2`. ```{code-cell} ipython3 :tags: [remove-cell] @@ -370,21 +377,16 @@ diagnosis "Class" is unknown. This new observation is depicted by the red, diamo perim_concav_with_new_point = ( alt.Chart( perim_concav_with_new_point_df, - # title="Scatter plot of concavity versus perimeter with new observation represented as a red diamond.", ) .mark_point(opacity=0.6, filled=True, size=40) .encode( x=alt.X("Perimeter", title="Perimeter (standardized)"), y=alt.Y("Concavity", title="Concavity (standardized)"), - color=alt.Color( - "Class", - scale=alt.Scale(range=["#86bfef", "#efb13f", "red"]), - title="Diagnosis", - ), + color=alt.Color("Class", title="Diagnosis"), shape=alt.Shape( "Class", scale=alt.Scale(range=["circle", "circle", "diamond"]) ), - size=alt.condition("datum.Class == 'Unknown'", alt.value(80), alt.value(30)) + size=alt.condition("datum.Class == 'Unknown'", alt.value(80), alt.value(30)), ) ) glue('fig:05-knn-2', perim_concav_with_new_point, display=True) @@ -410,10 +412,11 @@ glue("1-neighbor_per", round(near_neighbor_df.iloc[0, :]['Perimeter'], 1)) glue("1-neighbor_con", round(near_neighbor_df.iloc[0, :]['Concavity'], 1)) ``` -{numref}`fig:05-knn-3` shows that the nearest point to this new observation is **malignant** and -located at the coordinates ({glue:}`1-neighbor_per`, {glue:}`1-neighbor_con`). The idea here is that if a point is close to another in the scatter plot, -then the perimeter and concavity values are similar, and so we may expect that -they would have the same diagnosis. +{numref}`fig:05-knn-3` shows that the nearest point to this new observation is +**malignant** and located at the coordinates ({glue:}`1-neighbor_per`, +{glue:}`1-neighbor_con`). 
The idea here is that if a point is close to another +in the scatter plot, then the perimeter and concavity values are similar, +and so we may expect that they would have the same diagnosis. ```{code-cell} ipython3 :tags: [remove-cell] @@ -430,7 +433,9 @@ glue('fig:05-knn-3', (perim_concav_with_new_point + line), display=True) :::{glue:figure} fig:05-knn-3 :name: fig:05-knn-3 -Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a malignant label. +Scatter plot of concavity versus perimeter. The new observation is represented +as a red diamond with a line to the one nearest neighbor, which has a malignant +label. ::: ```{code-cell} ipython3 @@ -447,8 +452,8 @@ perim_concav_with_new_point_df2 = pd.concat((cancer, points_df2), ignore_index=T my_distances2 = euclidean_distances(perim_concav_with_new_point_df2.loc[:, attrs])[ len(cancer) ][:-1] -glue("2-new_point_0", new_point[0]) -glue("2-new_point_1", new_point[1]) +glue("new_point_2_0", new_point[0]) +glue("new_point_2_1", new_point[1]) ``` ```{code-cell} ipython3 @@ -457,7 +462,6 @@ glue("2-new_point_1", new_point[1]) perim_concav_with_new_point2 = ( alt.Chart( perim_concav_with_new_point_df2, - # title="Scatter plot of concavity versus perimeter with new observation represented as a red diamond.", ) .mark_point(opacity=0.6, filled=True, size=40) .encode( @@ -465,7 +469,6 @@ perim_concav_with_new_point2 = ( y=alt.Y("Concavity", title="Concavity (standardized)"), color=alt.Color( "Class", - scale=alt.Scale(range=["#86bfef", "#efb13f", "red"]), title="Diagnosis", ), shape=alt.Shape( @@ -493,9 +496,10 @@ glue("2-neighbor_con", round(near_neighbor_df2.iloc[0, :]['Concavity'], 1)) glue('fig:05-knn-4', (perim_concav_with_new_point2 + line2), display=True) ``` -Suppose we have another new observation with standardized perimeter {glue:}`2-new_point_0` and -concavity of {glue:}`2-new_point_1`. 
Looking at the scatter plot in {numref}`fig:05-knn-4`, how would you -classify this red, diamond observation? The nearest neighbor to this new point is a +Suppose we have another new observation with standardized perimeter +{glue:}`new_point_2_0` and concavity of {glue:}`new_point_2_1`. Looking at the +scatter plot in {numref}`fig:05-knn-4`, how would you classify this red, +diamond observation? The nearest neighbor to this new point is a **benign** observation at ({glue:}`2-neighbor_per`, {glue:}`2-neighbor_con`). Does this seem like the right prediction to make for this observation? Probably not, if you consider the other nearby points. @@ -505,7 +509,9 @@ not, if you consider the other nearby points. :::{glue:figure} fig:05-knn-4 :name: fig:05-knn-4 -Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a benign label. +Scatter plot of concavity versus perimeter. The new observation is represented +as a red diamond with a line to the one nearest neighbor, which has a benign +label. ::: ```{code-cell} ipython3 @@ -575,13 +581,13 @@ next chapter. ```{index} distance; K-nearest neighbors, straight line; distance ``` -We decide which points are the $K$ "nearest" to our new observation -using the *straight-line distance* (we will often just refer to this as *distance*). -Suppose we have two observations $a$ and $b$, each having two predictor variables, $x$ and $y$. -Denote $a_x$ and $a_y$ to be the values of variables $x$ and $y$ for observation $a$; -$b_x$ and $b_y$ have similar definitions for observation $b$. -Then the straight-line distance between observation $a$ and $b$ on the x-y plane can -be computed using the following formula: +We decide which points are the $K$ "nearest" to our new observation using the +*straight-line distance* (we will often just refer to this as *distance*). 
+Suppose we have two observations $a$ and $b$, each having two predictor +variables, $x$ and $y$. Denote $a_x$ and $a_y$ to be the values of variables +$x$ and $y$ for observation $a$; $b_x$ and $b_y$ have similar definitions for +observation $b$. Then the straight-line distance between observation $a$ and +$b$ on the x-y plane can be computed using the following formula: $$\mathrm{Distance} = \sqrt{(a_x -b_x)^2 + (a_y - b_y)^2}$$ diff --git a/source/classification2.md b/source/classification2.md index 8087cfb4..2eca1945 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -15,13 +15,13 @@ kernelspec: (classification2)= # Classification II: evaluation & tuning -```{code-cell} ipython3 -:tags: [remove-cell] -import warnings -def warn(*args, **kwargs): - pass -warnings.warn = warn -``` +%```{code-cell} ipython3 +%:tags: [remove-cell] +%import warnings +%def warn(*args, **kwargs): +% pass +%warnings.warn = warn +%``` ```{code-cell} ipython3 :tags: [remove-cell] From bd28be9eb67f3adbd1556cfb1d0a81b9fe05e9ae Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sat, 31 Dec 2022 16:02:18 -0800 Subject: [PATCH 05/45] minor bugfix --- source/classification1.md | 7 ++----- source/classification2.md | 4 ++-- 2 files changed, 4 insertions(+), 7 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index 063b0356..3e86ba0f 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -12,15 +12,12 @@ kernelspec: name: python3 --- -%```{code-cell} ipython3 -%:tags: [remove-cell] +```{code-cell} ipython3 +:tags: [remove-cell] %import warnings %def warn(*args, **kwargs): % pass %warnings.warn = warn -%``` - -```{code-cell} ipython3 from sklearn.metrics.pairwise import euclidean_distances ``` diff --git a/source/classification2.md b/source/classification2.md index 2eca1945..4a07e738 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -15,13 +15,13 @@ kernelspec: (classification2)= # Classification II: 
evaluation & tuning -%```{code-cell} ipython3 +```{code-cell} ipython3 %:tags: [remove-cell] %import warnings %def warn(*args, **kwargs): % pass %warnings.warn = warn -%``` +``` ```{code-cell} ipython3 :tags: [remove-cell] From 9499a732ba6b6d63e0fc8d1674ade53f5d474308 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sat, 31 Dec 2022 16:04:36 -0800 Subject: [PATCH 06/45] minor bugfix --- source/classification1.md | 8 ++++---- source/classification2.md | 10 +++++----- 2 files changed, 9 insertions(+), 9 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index 3e86ba0f..b37c4276 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -14,10 +14,10 @@ kernelspec: ```{code-cell} ipython3 :tags: [remove-cell] -%import warnings -%def warn(*args, **kwargs): -% pass -%warnings.warn = warn +#import warnings +#def warn(*args, **kwargs): +# pass +#warnings.warn = warn from sklearn.metrics.pairwise import euclidean_distances ``` diff --git a/source/classification2.md b/source/classification2.md index 4a07e738..0c3e53b4 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -16,11 +16,11 @@ kernelspec: # Classification II: evaluation & tuning ```{code-cell} ipython3 -%:tags: [remove-cell] -%import warnings -%def warn(*args, **kwargs): -% pass -%warnings.warn = warn +:tags: [remove-cell] +#import warnings +#def warn(*args, **kwargs): +# pass +#warnings.warn = warn ``` ```{code-cell} ipython3 From 294103af4ac8dc09ae3b8741200a30b92adcb194 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sat, 31 Dec 2022 16:08:57 -0800 Subject: [PATCH 07/45] fixed worksheets link at end of chp --- source/classification1.md | 2 +- source/classification2.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index b37c4276..f2f06af8 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -1995,7 +1995,7 @@ Scatter plot of smoothness versus 
area where background color indicates the deci Practice exercises for the material covered in this chapter can be found in the accompanying -[worksheets repository](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets#readme) +[worksheets repository](https://github.com/UBC-DSCI/data-science-a-first-intro-python-worksheets#readme) in the "Classification I: training and predicting" row. You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button. You can also preview a non-interactive version of the worksheet by clicking "view worksheet." diff --git a/source/classification2.md b/source/classification2.md index 0c3e53b4..17390aec 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -2127,7 +2127,7 @@ Estimated accuracy versus the number of predictors for the sequence of models bu Practice exercises for the material covered in this chapter can be found in the accompanying -[worksheets repository](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets#readme) +[worksheets repository](https://github.com/UBC-DSCI/data-science-a-first-intro-python-worksheets#readme) in the "Classification II: evaluation and tuning" row. You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button. You can also preview a non-interactive version of the worksheet by clicking "view worksheet." From 1ad6164c04754ec5d1e8f2c8200b7935f7ff3fc7 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sat, 31 Dec 2022 19:22:34 -0800 Subject: [PATCH 08/45] fix minor section heading wording in Ch1 --- source/intro.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/source/intro.md b/source/intro.md index 9683b4ef..f953319e 100644 --- a/source/intro.md +++ b/source/intro.md @@ -38,7 +38,7 @@ By the end of the chapter, readers will be able to do the following: - Read tabular data with `read_csv`. 
- Use `help()` to access help and documentation tools in Python. - Create new variables and objects in Python. -- Create and organize subsets of tabular data using `[]`, `loc[]`, and `sort_values` +- Create and organize subsets of tabular data using `[]`, `loc[]`, `sort_values`, and `head` - Visualize data with an `altair` bar plot. ## Canadian languages data set @@ -588,7 +588,7 @@ with multiple kinds of `category`. The data frame `aboriginal_lang` contains only 67 rows, and looks like it only contains Aboriginal languages. So it looks like the `loc[]` operation gave us the result we wanted! -### Using `sort_values` to order and `head` to select rows by value +### Using `sort_values` and `head` to select rows by ordered values ```{index} pandas.DataFrame; sort_values, pandas.DataFrame; head ``` From ee90b8e543d1f6d48e27ff523506ebffd74c872a Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sat, 31 Dec 2022 23:21:43 -0800 Subject: [PATCH 09/45] added nsmallest + note; better chaining for dist comps; removed comments; fixed colors (not working yet) --- source/classification1.md | 114 ++++++++++++++++---------------------- 1 file changed, 48 insertions(+), 66 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index f2f06af8..8b81f7a6 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -18,7 +18,15 @@ kernelspec: #def warn(*args, **kwargs): # pass #warnings.warn = warn + +from myst_nb import glue +import numpy as np from sklearn.metrics.pairwise import euclidean_distances +from IPython.display import HTML + +import plotly.express as px +import plotly.graph_objs as go +from plotly.offline import iplot, plot ``` (classification)= @@ -237,8 +245,6 @@ cancer['Class'].unique() ```{code-cell} ipython3 :tags: [remove-cell] -from myst_nb import glue -import numpy as np glue("benign_count", cancer['Class'].value_counts()['Benign']) glue("benign_pct", int(np.round(100*cancer['Class'].value_counts(normalize=True)['Benign']))) 
glue("malignant_count", cancer['Class'].value_counts()['Malignant']) @@ -600,6 +606,13 @@ the $K=5$ neighbors that are nearest to our new point. You will see in the code below, we compute the straight-line distance using the formula above: we square the differences between the two observations' perimeter and concavity coordinates, add the squared differences, and then take the square root. +In order to find the $K=5$ nearest neighbors, we will use the `nsmallest` function from `pandas`. + +> **Note:** Recall that in the {ref}`intro` chapter, we used `sort_values` followed by `head` to obtain +> the ten rows with the *largest* values of a variable. We could have instead used the `nlargest` function +> from `pandas` for this purpose. The `nsmallest` and `nlargest` functions achieve the same goal +> as `sort_values` followed by `head`, but are slightly more efficient because they are specialized for this purpose. +> In general, it is good to use more specialized functions when they are available! 
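The equivalence described in the note can be checked with a minimal sketch. The `toy` data frame and its values are hypothetical and are not part of the cancer data; they only illustrate that `nsmallest` and `sort_values` followed by `head` select the same rows when there are no ties.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "dist": [2.5, 0.3, 1.7, 0.9, 3.1],
})

# Approach 1: sort all rows in ascending order, then keep the first 3.
via_sort = toy.sort_values(by="dist").head(3)

# Approach 2: ask directly for the 3 rows with the smallest values.
via_nsmallest = toy.nsmallest(3, "dist")

# With no tied values, both approaches return the same rows in the same order.
same_rows = via_sort.equals(via_nsmallest)
print(via_nsmallest)
print(same_rows)
```

When values are tied, the two approaches may break ties differently, so the comparison here assumes distinct values in the `dist` column.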
```{code-cell} ipython3 :tags: [remove-cell] @@ -620,7 +633,6 @@ perim_concav_with_new_point3 = ( y=alt.Y("Concavity", title="Concavity (standardized)"), color=alt.Color( "Class", - scale=alt.Scale(range=["#86bfef", "#efb13f", "red"]), title="Diagnosis", ), shape=alt.Shape( @@ -647,13 +659,14 @@ Scatter plot of concavity versus perimeter with new observation represented as a ```{code-cell} ipython3 new_obs_Perimeter = 0 new_obs_Concavity = 3.5 -cancer_dist = cancer.loc[:, ["ID", "Perimeter", "Concavity", "Class"]] -cancer_dist = cancer_dist.assign(dist_from_new = np.sqrt( - (cancer_dist["Perimeter"] - new_obs_Perimeter) ** 2 - + (cancer_dist["Concavity"] - new_obs_Concavity) ** 2 -)) -# sort the rows in ascending order and take the first 5 rows -cancer_dist = cancer_dist.sort_values(by="dist_from_new").head(5) +cancer_dist = (cancer + .loc[:, ["Perimeter", "Concavity", "Class"]] + .assign(dist_from_new = ( + (cancer["Perimeter"] - new_obs_Perimeter) ** 2 + + (cancer["Concavity"] - new_obs_Concavity) ** 2 + )**(1/2)) + .nsmallest(5, "dist_from_new") + ) cancer_dist ``` @@ -662,36 +675,21 @@ we computed the `dist_from_new` variable (the distance to the new observation) for each of the 5 nearest neighbors in the training data. 
-```{code-cell} ipython3 -:tags: [remove-cell] - -## Couldn't find ways to have nice Latex equations in pandas dataframe - -# cancer_dist_eq = cancer_dist.copy() -# cancer_dist_eq['Perimeter'] = round(cancer_dist_eq['Perimeter'], 2) -# cancer_dist_eq['Concavity'] = round(cancer_dist_eq['Concavity'], 2) -# for i in list(cancer_dist_eq.index): -# cancer_dist_eq.loc[ -# i, "Distance" -# ] = f"[({new_obs_Perimeter} - {round(cancer_dist_eq.loc[i, 'Perimeter'], 2)})² + ({new_obs_Concavity} - {round(cancer_dist_eq.loc[i, 'Concavity'], 2)})²]¹/² = {round(cancer_dist_eq.loc[i, 'dist_from_new'], 2)}" -# cancer_dist_eq[["Perimeter", "Concavity", "Distance", "Class"]] -``` - ```{table} Evaluating the distances from the new observation to each of its 5 nearest neighbors :name: tab:05-multiknn-mathtable | Perimeter | Concavity | Distance | Class | |-----------|-----------|----------------------------------------|-------| -| 0.24 | 2.65 | $\sqrt{(0-0.24)^2+(3.5-2.65)^2}=0.88$| B | -| 0.75 | 2.87 | $\sqrt{(0-0.75)^2+(3.5-2.87)^2}=0.98$| M | -| 0.62 | 2.54 | $\sqrt{(0-0.62)^2+(3.5-2.54)^2}=1.14$| M | -| 0.42 | 2.31 | $\sqrt{(0-0.42)^2+(3.5-2.31)^2}=1.26$| M | -| -1.16 | 4.04 | $\sqrt{(0-(-1.16))^2+(3.5-4.04)^2}=1.28$| B | +| 0.24 | 2.65 | $\sqrt{(0-0.24)^2+(3.5-2.65)^2}=0.88$| Benign | +| 0.75 | 2.87 | $\sqrt{(0-0.75)^2+(3.5-2.87)^2}=0.98$| Malignant | +| 0.62 | 2.54 | $\sqrt{(0-0.62)^2+(3.5-2.54)^2}=1.14$| Malignant | +| 0.42 | 2.31 | $\sqrt{(0-0.42)^2+(3.5-2.31)^2}=1.26$| Malignant | +| -1.16 | 4.04 | $\sqrt{(0-(-1.16))^2+(3.5-4.04)^2}=1.28$| Benign | ``` +++ The result of this computation shows that 3 of the 5 nearest neighbors to our new observation are -malignant (`M`); since this is the majority, we classify our new observation as malignant. +malignant; since this is the majority, we classify our new observation as malignant. These 5 neighbors are circled in {numref}`fig:05-multiknn-3`. ```{code-cell} ipython3 @@ -758,18 +756,20 @@ three predictors. 
new_obs_Perimeter = 0 new_obs_Concavity = 3.5 new_obs_Symmetry = 1 -cancer_dist2 = cancer.loc[:, ["ID", "Perimeter", "Concavity", "Symmetry", "Class"]] -cancer_dist2["dist_from_new"] = np.sqrt( - (cancer_dist2["Perimeter"] - new_obs_Perimeter) ** 2 - + (cancer_dist2["Concavity"] - new_obs_Concavity) ** 2 - + (cancer_dist2["Symmetry"] - new_obs_Symmetry) ** 2 -) -# sort the rows in ascending order and take the first 5 rows -cancer_dist2 = cancer_dist2.sort_values(by="dist_from_new").head(5) +cancer_dist2 = (cancer + .loc[:, ["Perimeter", "Concavity", "Symmetry", "Class"]] + .assign(dist_from_new = ( + (cancer["Perimeter"] - new_obs_Perimeter) ** 2 + + (cancer["Concavity"] - new_obs_Concavity) ** 2 + + (cancer["Symmetry"] - new_obs_Symmetry) ** 2 + )**(1/2)) + .nsmallest(5, "dist_from_new") + ) cancer_dist2 ``` -Based on $K=5$ nearest neighbors with these three predictors we would classify the new observation as malignant since 4 out of 5 of the nearest neighbors are malignant class. +Based on $K=5$ nearest neighbors with these three predictors we would classify +the new observation as malignant since 4 out of 5 of the nearest neighbors are malignant class. {numref}`fig:05-more` shows what the data look like when we visualize them as a 3-dimensional scatter with lines from the new observation to its five nearest neighbors. 
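The hunk above replaces `np.sqrt(...)` followed by `sort_values(by="dist_from_new").head(5)` with `** (1/2)` followed by `nsmallest(5, "dist_from_new")`. The following is a minimal sketch, not part of the patch, suggesting that the two forms pick out the same neighbors. The `toy` data frame, its values, and the choice of 3 neighbors are assumptions made up for illustration; they are not rows from the `cancer` data set.

```python
import numpy as np
import pandas as pd

# Hypothetical standardized measurements (made up for this sketch).
toy = pd.DataFrame({
    "Perimeter": [0.2, -1.1, 0.7, 1.5, -0.3, 0.9],
    "Concavity": [2.6, 4.0, 2.9, 0.1, 3.1, -0.5],
    "Symmetry": [0.8, 1.4, 0.2, -0.6, 1.1, 0.0],
    "Class": ["Benign", "Benign", "Malignant", "Malignant", "Malignant", "Benign"],
})
new_obs = {"Perimeter": 0, "Concavity": 3.5, "Symmetry": 1}

# Sum of squared differences across the three predictors.
squared = sum((toy[col] - value) ** 2 for col, value in new_obs.items())

# Old style: square root via numpy, then sort and take the top rows.
old_style = (
    toy.assign(dist_from_new=np.sqrt(squared))
    .sort_values(by="dist_from_new")
    .head(3)
)

# New style: square root via exponentiation, then nsmallest.
new_style = (
    toy.assign(dist_from_new=squared ** (1 / 2))
    .nsmallest(3, "dist_from_new")
)

print(list(old_style.index), list(new_style.index))
```

When all distances are distinct, as in this toy example, both approaches should return the same rows in the same order; `nsmallest` simply avoids spelling out the sort and the slice separately.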
@@ -832,7 +832,7 @@ for i, d in enumerate(fig.data): fig.data[i].marker.symbol = symbols[fig.data[i].name] # specify trace names and colors in a dict -colors = {"Malignant": "#86bfef", "Benign": "#efb13f", "Unknown": "red"} +colors = {"Malignant": "#ff7f0e", "Benign": "#1f77b4", "Unknown": "red"} # set all colors in fig for i, d in enumerate(fig.data): @@ -861,7 +861,6 @@ for neighbor_df in neighbor_df_list: fig.update_layout(margin=dict(l=0, r=0, b=0, t=1), template="plotly_white") plot(fig, filename="img/classification1/fig05-more.html", auto_open=False) -# display(HTML("img/classification1/fig05-more.html")) ``` ```{code-cell} ipython3 @@ -874,7 +873,10 @@ display(HTML("img/classification1/fig05-more.html")) :name: fig:05-more :figclass: caption-hack -3D scatter plot of the standardized symmetry, concavity, and perimeter variables. Note that in general we recommend against using 3D visualizations; here we show the data in 3D only to illustrate what higher dimensions and nearest neighbors look like, for learning purposes. +3D scatter plot of the standardized symmetry, concavity, and perimeter +variables. Note that in general we recommend against using 3D visualizations; +here we show the data in 3D only to illustrate what higher dimensions and +nearest neighbors look like, for learning purposes. ``` +++ @@ -884,9 +886,8 @@ display(HTML("img/classification1/fig05-more.html")) In order to classify a new observation using a $K$-nearest neighbor classifier, we have to do the following: 1. Compute the distance between the new observation and each observation in the training set. -2. Sort the data table in ascending order according to the distances. -3. Choose the top $K$ rows of the sorted table. -4. Classify the new observation based on a majority vote of the neighbor classes. +2. Find the $K$ rows corresponding to the $K$ smallest distances. +3. Classify the new observation based on a majority vote of the neighbor classes. 
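The three steps listed above can be sketched as a small function. This is an illustrative sketch only, not the chapter's or `scikit-learn`'s implementation: the helper name `knn_classify` is hypothetical, the first five rows reuse the rounded values from the nearest-neighbor table earlier in the chapter, and the last two rows are made-up far-away points added so that the choice of $K$ matters. Ties in the vote are broken alphabetically here, which is an arbitrary assumption.

```python
import pandas as pd


def knn_classify(train, predictors, label, new_obs, k):
    """Classify one new observation by majority vote of its k nearest neighbors."""
    # Step 1: straight-line distance from the new observation to every training row.
    squared = sum((train[col] - new_obs[col]) ** 2 for col in predictors)
    distances = squared ** (1 / 2)
    # Step 2: find the k rows with the k smallest distances.
    nearest = distances.nsmallest(k).index
    # Step 3: majority vote among the neighbors' labels
    # (mode() sorts its result, so a tie resolves alphabetically).
    return train.loc[nearest, label].mode().iloc[0]


train = pd.DataFrame({
    # First five rows: rounded values from the table of 5 nearest neighbors.
    # Last two rows: hypothetical distant points, made up for this sketch.
    "Perimeter": [0.24, 0.75, 0.62, 0.42, -1.16, 3.0, -2.5],
    "Concavity": [2.65, 2.87, 2.54, 2.31, 4.04, 0.0, -1.0],
    "Class": ["Benign", "Malignant", "Malignant", "Malignant", "Benign",
              "Benign", "Benign"],
})

prediction = knn_classify(
    train,
    predictors=["Perimeter", "Concavity"],
    label="Class",
    new_obs={"Perimeter": 0, "Concavity": 3.5},
    k=5,
)
print(prediction)
```

With $K = 5$, the five closest rows are the ones from the table, 3 of which are malignant, so the sketch agrees with the manual classification described in the text.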
+++ @@ -901,29 +902,10 @@ or predict the class for multiple new observations. Thankfully, in Python, the $K$-nearest neighbors algorithm is implemented in [the `scikit-learn` Python package](https://scikit-learn.org/stable/index.html) {cite:p}`sklearn_api` along with many [other models](https://scikit-learn.org/stable/user_guide.html) that you will encounter in this and future chapters of the book. Using the functions -in the `scikit-learn` package will help keep our code simple, readable and accurate; the +in the `scikit-learn` package (named `sklearn` in Python) will help keep our code simple, readable and accurate; the less we have to code ourselves, the fewer mistakes we will likely make. We start by importing `KNeighborsClassifier` from the `sklearn.neighbors` module. -```{code-cell} ipython3 -:tags: [remove-cell] - -## The above was based on: - -# Coding the $K$-nearest neighbors algorithm in R ourselves can get complicated, -# especially if we want to handle multiple classes, more than two variables, -# or predict the class for multiple new observations. Thankfully, in R, -# the $K$-nearest neighbors algorithm is -# implemented in [the `parsnip` R package](https://parsnip.tidymodels.org/) [@parsnip] -# included in `tidymodels`, along with -# many [other models](https://www.tidymodels.org/find/parsnip/) \index{tidymodels}\index{parsnip} -# that you will encounter in this and future chapters of the book. The `tidymodels` collection -# provides tools to help make and use models, such as classifiers. Using the packages -# in this collection will help keep our code simple, readable and accurate; the -# less we have to code ourselves, the fewer mistakes we will likely make. We -# start by loading `tidymodels`. 
-``` - ```{code-cell} ipython3 from sklearn.neighbors import KNeighborsClassifier ``` From ece61a828cd5524e73c8fc3bf2c91440e7359229 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sat, 31 Dec 2022 23:49:23 -0800 Subject: [PATCH 10/45] initial fit and predict polished; model spec -> model object --- source/classification1.md | 102 ++++++++++++-------------------------- 1 file changed, 31 insertions(+), 71 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index 8b81f7a6..22368ada 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -915,103 +915,63 @@ We will use the `cancer` data set from above, with perimeter and concavity as predictors and $K = 5$ neighbors to build our classifier. Then we will use the classifier to predict the diagnosis label for a new observation with perimeter 0, concavity 3.5, and an unknown diagnosis label. Let's pick out our two desired -predictor variables and class label and store them as a new data set named `cancer_train`: +predictor variables and class label and store them with the name `cancer_train`: ```{code-cell} ipython3 -cancer_train = cancer.loc[:, ['Class', 'Perimeter', 'Concavity']] +cancer_train = cancer[['Class', 'Perimeter', 'Concavity']] cancer_train ``` -```{index} scikit-learn; model instance, scikit-learn; KNeighborsClassifier +```{index} scikit-learn; model object, scikit-learn; KNeighborsClassifier ``` -Next, we create a *model specification* for $K$-nearest neighbors classification -by creating a `KNeighborsClassifier` instance, specifying that we want to use $K = 5$ neighbors -(we will discuss how to choose $K$ in the next chapter) and the straight-line -distance (`weights="uniform"`). The `weights` argument controls -how neighbors vote when classifying a new observation; by setting it to `"uniform"`, -each of the $K$ nearest neighbors gets exactly 1 vote as described above. 
Other choices, -which weigh each neighbor's vote differently, can be found on -[the `scikit-learn` website](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html?highlight=kneighborsclassifier#sklearn.neighbors.KNeighborsClassifier). +Next, we create a *model object* for $K$-nearest neighbors classification +by creating a `KNeighborsClassifier` instance, specifying that we want to use $K = 5$ neighbors; +we will discuss how to choose $K$ in the next chapter. -```{code-cell} ipython3 -:tags: [remove-cell] +> **Note:** You can specify the `weights` argument in order to control +> how neighbors vote when classifying a new observation. The default is `"uniform"`, where +> each of the $K$ nearest neighbors gets exactly 1 vote as described above. Other choices, +> which weigh each neighbor's vote differently, can be found on +> [the `scikit-learn` website](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html?highlight=kneighborsclassifier#sklearn.neighbors.KNeighborsClassifier). -## The above was based on: - -# Next, we create a *model specification* for \index{tidymodels!model specification} $K$-nearest neighbors classification -# by calling the `nearest_neighbor` function, specifying that we want to use $K = 5$ neighbors -# (we will discuss how to choose $K$ in the next chapter) and the straight-line -# distance (`weight_func = "rectangular"`). The `weight_func` argument controls -# how neighbors vote when classifying a new observation; by setting it to `"rectangular"`, -# each of the $K$ nearest neighbors gets exactly 1 vote as described above. Other choices, -# which weigh each neighbor's vote differently, can be found on -# [the `parsnip` website](https://parsnip.tidymodels.org/reference/nearest_neighbor.html). -# In the `set_engine` \index{tidymodels!engine} argument, we specify which package or system will be used for training -# the model. 
Here `kknn` is the R package we will use for performing $K$-nearest neighbors classification. -# Finally, we specify that this is a classification problem with the `set_mode` function. -``` ```{code-cell} ipython3 -knn_spec = KNeighborsClassifier(n_neighbors=5) -knn_spec +knn = KNeighborsClassifier(n_neighbors=5) +knn ``` ```{index} scikit-learn; X & y ``` -In order to fit the model on the breast cancer data, we need to call `fit` on the classifier object and pass the data in the argument. We also need to specify what variables to use as predictors and what variable to use as the target. Below, the `X=cancer_train[["Perimeter", "Concavity"]]` and the `y=cancer_train['Class']` argument specifies -that `Class` is the target variable (the one we want to predict), -and both `Perimeter` and `Concavity` are to be used as the predictors. +In order to fit the model on the breast cancer data, we need to call `fit` on +the model object. The `X` argument is used to specify the data for the predictor +variables, while the `y` argument is used to specify the data for the response variable. +So below, we set `X=cancer_train[["Perimeter", "Concavity"]]` and +`y=cancer_train['Class']` to specify that `Class` is the target +variable (the one we want to predict), and both `Perimeter` and `Concavity` are +to be used as the predictors. Note that the `fit` function might look like it does not +do much from the outside, but it is actually doing all the heavy lifting to train +the K-nearest neighbors model, and modifies the `knn` model object. ```{code-cell} ipython3 -:tags: [remove-cell] - -## The above was based on: - -# In order to fit the model on the breast cancer data, we need to pass the model specification -# and the data set to the `fit` function. We also need to specify what variables to use as predictors -# and what variable to use as the target. 
Below, the `Class ~ Perimeter + Concavity` argument specifies -# that `Class` is the target variable (the one we want to predict), -# and both `Perimeter` and `Concavity` are to be used as the predictors. - - -# We can also use a convenient shorthand syntax using a period, `Class ~ .`, to indicate -# that we want to use every variable *except* `Class` \index{tidymodels!model formula} as a predictor in the model. -# In this particular setup, since `Concavity` and `Perimeter` are the only two predictors in the `cancer_train` -# data frame, `Class ~ Perimeter + Concavity` and `Class ~ .` are equivalent. -# In general, you can choose individual predictors using the `+` symbol, or you can specify to -# use *all* predictors using the `.` symbol. -``` - -```{code-cell} ipython3 -knn_spec.fit(X=cancer_train[["Perimeter", "Concavity"]], y=cancer_train["Class"]); -``` - -```{code-cell} ipython3 -:tags: [remove-cell] - -# Here you can see the final trained model summary. It confirms that the computational engine used -# to train the model was `kknn::train.kknn`. It also shows the fraction of errors made by -# the nearest neighbor model, but we will ignore this for now and discuss it in more detail -# in the next chapter. -# Finally, it shows (somewhat confusingly) that the "best" weight function -# was "rectangular" and "best" setting of $K$ was 5; but since we specified these earlier, -# R is just repeating those settings to us here. In the next chapter, we will actually -# let R find the value of $K$ for us. +knn.fit(X=cancer_train[["Perimeter", "Concavity"]], y=cancer_train["Class"]); ``` ```{index} scikit-learn; predict ``` -Finally, we make the prediction on the new observation by calling `predict` on the classifier object, -passing the new observation itself. As above, -when we ran the $K$-nearest neighbors -classification algorithm manually, the `knn_fit` object classifies the new observation as "Malignant". 
Note that the `predict` function outputs a `numpy` array with the model's prediction. +After using the `fit` function, we can make a prediction on a new observation +by calling `predict` on the classifier object, passing the new observation +itself. As above, when we ran the $K$-nearest neighbors classification +algorithm manually, the `knn` model object classifies the new observation as +"Malignant". Note that the `predict` function outputs an `array` with the +model's prediction; you can actually make multiple predictions at the same +time using the `predict` function, which is why the output is stored as an `array`. ```{code-cell} ipython3 new_obs = pd.DataFrame({'Perimeter': [0], 'Concavity': [3.5]}) -knn_spec.predict(new_obs) +knn.predict(new_obs) ``` Is this predicted malignant label the true class for this observation? From e874666eb164375f1537fde3d62373b80f7651dc Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sun, 1 Jan 2023 21:38:29 -0800 Subject: [PATCH 11/45] polishing preprocessing --- source/classification1.md | 240 ++++++++++++++++++-------------------- 1 file changed, 112 insertions(+), 128 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index 22368ada..fc9084f1 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -903,8 +903,16 @@ the $K$-nearest neighbors algorithm is implemented in [the `scikit-learn` Python package](https://scikit-learn.org/stable/index.html) {cite:p}`sklearn_api` along with many [other models](https://scikit-learn.org/stable/user_guide.html) that you will encounter in this and future chapters of the book. Using the functions in the `scikit-learn` package (named `sklearn` in Python) will help keep our code simple, readable and accurate; the -less we have to code ourselves, the fewer mistakes we will likely make. We -start by importing `KNeighborsClassifier` from the `sklearn.neighbors` module. +less we have to code ourselves, the fewer mistakes we will likely make. 
+Before getting started with $K$-nearest neighbors, we need to tell the `sklearn` package +that we prefer using `pandas` data frames over regular arrays via the `set_config` function. +```{code-cell} ipython3 +from sklearn import set_config +set_config(transform_output="pandas") +``` + +We can now get started with $K$-nearest neighbors. The first step is to + import the `KNeighborsClassifier` from the `sklearn.neighbors` module. ```{code-cell} ipython3 from sklearn.neighbors import KNeighborsClassifier @@ -1030,19 +1038,26 @@ is said to be *standardized*, and all variables in a data set will have a mean o and a standard deviation of 1. To illustrate the effect that standardization can have on the $K$-nearest neighbor algorithm, we will read in the original, unstandardized Wisconsin breast cancer data set; we have been using a standardized version of the data set up -until now. To keep things simple, we will just use the `Area`, `Smoothness`, and `Class` +until now. We will apply the same initial wrangling steps as we did earlier, +and to keep things simple we will just use the `Area`, `Smoothness`, and `Class` variables: ```{code-cell} ipython3 -unscaled_cancer = pd.read_csv("data/unscaled_wdbc.csv") -unscaled_cancer = unscaled_cancer[['Class', 'Area', 'Smoothness']] +unscaled_cancer = ( + pd.read_csv("data/unscaled_wdbc.csv") + .loc[:, ['Class', 'Area', 'Smoothness']] + .replace({ + 'M' : 'Malignant', + 'B' : 'Benign' + }) + ) +unscaled_cancer['Class'] = unscaled_cancer['Class'].astype('category') unscaled_cancer ``` Looking at the unscaled and uncentered data above, you can see that the differences between the values for area measurements are much larger than those for -smoothness. Will this affect -predictions? In order to find out, we will create a scatter plot of these two +smoothness. Will this affect predictions? 
In order to find out, we will create a scatter plot of these two predictors (colored by diagnosis) for both the unstandardized data we just loaded, and the standardized version of that same data. But first, we need to standardize the `unscaled_cancer` data set with `scikit-learn`. @@ -1053,32 +1068,28 @@ standardize the `unscaled_cancer` data set with `scikit-learn`. ```{index} double: scikit-learn; pipeline ``` -In the `scikit-learn` framework, all data preprocessing and modeling can be built using a [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html?highlight=pipeline#sklearn.pipeline.Pipeline), or a more convenient function [`make_pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline) for simple pipeline construction. -Here we will initialize a preprocessor using `make_column_transformer` for -the `unscaled_cancer` data above, specifying -that we want to standardize the predictors `Area` and `Smoothness`: +The `scikit-learn` framework provides a collection of *preprocessors* used to manipulate +data in the [`preprocessing` module](https://scikit-learn.org/stable/modules/preprocessing.html). +Here we will use the `StandardScaler` transformer to standardize the predictor variables in +the `unscaled_cancer` data. In order to tell the `StandardScaler` which variables to standardize, +we wrap it in a +[`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer) object +using the [`make_column_transformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html#sklearn.compose.make_column_transformer) function. `ColumnTransformer` objects also enable the use of multiple preprocessors at +once, which is especially handy when you want to apply different preprocessing to each of the predictor variables. 
+The primary argument of the `make_column_transformer` function is a sequence of +pairs of (1) a preprocessor, and (2) the columns to which you want to apply that preprocessor. +In the present case, we just have the one `StandardScaler` preprocessor to apply to the `Area` and `Smoothness` columns. ```{code-cell} ipython3 -:tags: [remove-cell] - -## The above was based on: +from sklearn.preprocessing import StandardScaler +from sklearn.compose import make_column_transformer -# In the `tidymodels` framework, all data preprocessing happens -# using a `recipe` from [the `recipes` R package](https://recipes.tidymodels.org/) [@recipes] -# Here we will initialize a recipe \index{recipe} \index{tidymodels!recipe|see{recipe}} for -# the `unscaled_cancer` data above, specifying -# that the `Class` variable is the target, and all other variables are predictors: -``` - -```{code-cell} ipython3 preprocessor = make_column_transformer( (StandardScaler(), ["Area", "Smoothness"]), ) preprocessor ``` -So far, we have built a preprocessor so that each of the predictors have a mean of 0 and standard deviation of 1. - ```{index} scikit-learn; ColumnTransformer, scikit-learn; StandardScaler, scikit-learn; fit_transform ``` @@ -1088,68 +1099,74 @@ So far, we have built a preprocessor so that each of the predictors have a mean ```{index} scikit-learn; fit, scikit-learn; transform ``` -You can now see that the recipe includes a scaling and centering step for all predictor variables. -Note that when you add a step to a `ColumnTransformer`, you must specify what columns to apply the step to. -Here we specified that `StandardScaler` should be applied to -all predictor variables. - -```{index} see: fit, transform, fit_transform; scikit-learn -``` - -At this point, the data are not yet scaled and centered. To actually scale and center -the data, we need to call `fit` and `transform` on the unscaled data ( can be combined into `fit_transform`). 
+You can see that the preprocessor includes a single standardization step +that is applied to the `Area` and `Smoothness` columns. +Note that here we specified which columns to apply the preprocessing step to +by individual names; this approach can become quite difficult, e.g., when we have many +predictor variables. Rather than writing out the column names individually, +we can instead use the +[`make_column_selector`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html#sklearn.compose.make_column_selector) function. For example, if we wanted to standardize all *numerical* predictors, +we would use `make_column_selector` and specify the `dtype_include` argument to be `np.number` +(from the `numpy` package). This creates a preprocessor equivalent to the one we +created previously. ```{code-cell} ipython3 -:tags: [remove-cell] +import numpy as np +from sklearn.compose import make_column_selector -# So far, there is not much in the recipe; just a statement about the number of targets -# and predictors. Let's add -# scaling (`step_scale`) \index{recipe!step\_scale} and -# centering (`step_center`) \index{recipe!step\_center} steps for -# all of the predictors so that they each have a mean of 0 and standard deviation of 1. -# Note that `tidyverse` actually provides `step_normalize`, which does both centering and scaling in -# a single recipe step; in this book we will keep `step_scale` and `step_center` separate -# to emphasize conceptually that there are two steps happening. 
-# The `prep` function finalizes the recipe by using the data (here, `unscaled_cancer`) \index{tidymodels!prep}\index{prep|see{tidymodels}} -# to compute anything necessary to run the recipe (in this case, the column means and standard -# deviations): +preprocessor = make_column_transformer( + (StandardScaler(), make_column_selector(dtype_include=np.number)), + ) +preprocessor ``` -```{code-cell} ipython3 -:tags: [remove-cell] - -# You can now see that the recipe includes a scaling and centering step for all predictor variables. -# Note that when you add a step to a recipe, you must specify what columns to apply the step to. -# Here we used the `all_predictors()` \index{recipe!all\_predictors} function to specify that each step should be applied to -# all predictor variables. However, there are a number of different arguments one could use here, -# as well as naming particular columns with the same syntax as the `select` function. -# For example: - -# - `all_nominal()` and `all_numeric()`: specify all categorical or all numeric variables -# - `all_predictors()` and `all_outcomes()`: specify all predictor or all target variables -# - `Area, Smoothness`: specify both the `Area` and `Smoothness` variable -# - `-Class`: specify everything except the `Class` variable - -# You can find a full set of all the steps and variable selection functions -# on the [`recipes` reference page](https://recipes.tidymodels.org/reference/index.html). - -# At this point, we have calculated the required statistics based on the data input into the -# recipe, but the data are not yet scaled and centered. To actually scale and center -# the data, we need to apply the `bake` \index{tidymodels!bake} \index{bake|see{tidymodels}} function to the unscaled data. +```{index} see: fit, transform, fit_transform; scikit-learn ``` +We are now ready to standardize the numerical predictor columns in the `unscaled_cancer` data frame. +This happens in two steps. 
We first use the `fit` function to compute the values necessary to apply +the standardization (the mean and standard deviation of each variable), passing the `unscaled_cancer` data as an argument. +Then we use the `transform` function to actually apply the standardization. +It may seem a bit unnecessary to use two steps---`fit` *and* `transform`---to standardize the data. +However, we do this in two steps so that we can specify a different data set in the `transform` step if we want. +This enables us to compute the quantities needed to standardize using one data set, and then +apply that standardization to another data set. + ```{code-cell} ipython3 preprocessor.fit(unscaled_cancer) scaled_cancer = preprocessor.transform(unscaled_cancer) -# scaled_cancer = preprocessor.fit_transform(unscaled_cancer) -scaled_cancer = pd.DataFrame(scaled_cancer, columns=['Area', 'Smoothness']) -scaled_cancer['Class'] = unscaled_cancer['Class'] scaled_cancer ``` +```{code-cell} ipython3 + +``` +It looks like our `Smoothness` and `Area` variables have been standardized. Woohoo! +But there are two important things to notice about the new `scaled_cancer` data frame. First, it only keeps +the columns from the input to `transform` (here, `unscaled_cancer`) that had a preprocessing step applied +to them. The default behavior of the `ColumnTransformer` that we build using `make_column_transformer` +is to *drop* the remaining columns. This default behavior works well with the rest of `sklearn` (as we will see below +in the {ref}`08:puttingittogetherworkflow` section), but for visualizing the result of preprocessing it can be useful to keep the other columns +in our original data frame, such as the `Class` variable here. +To keep other columns, we need to set the `remainder` argument to `'passthrough'` in the `make_column_transformer` function. 
+ Furthermore, you can see that the new column names---{glue:}`scaled_cancer.columns[0]` +and {glue:}`scaled_cancer.columns[1]`---include the name +of the preprocessing step separated by underscores. This default behavior is useful in `sklearn` because we sometimes want to apply +multiple different preprocessing steps to the same columns; but again, for visualization it can be useful to preserve +the original column names. To keep original column names, we need to set the `verbose_feature_names_out` argument to `False`. + +> **Note:** Only specify the `remainder` and `verbose_feature_names_out` arguments when you want to examine the result +> of your preprocessing step. In most cases, you should leave these arguments at their default values. -It may seem redundant that we had to both `fit` *and* `transform` to scale and center the data. - However, we do this in two steps so we can specify a different data set in the `transform` step if we want. - For example, we may want to specify new data that were not part of the training set. +```{code-cell} ipython3 +preprocessor_keep_all = make_column_transformer( + (StandardScaler(), make_column_selector(dtype_include=np.number)), + remainder='passthrough', + verbose_feature_names_out=False + ) +preprocessor_keep_all.fit(unscaled_cancer) +scaled_cancer_all = preprocessor_keep_all.transform(unscaled_cancer) +scaled_cancer_all +``` You may wonder why we are doing so much work just to center and scale our variables. Can't we just manually scale and center the `Area` and @@ -1158,33 +1175,14 @@ technically *yes*; but doing so is error-prone. In particular, we might accidentally forget to apply the same centering / scaling when making predictions, or accidentally apply a *different* centering / scaling than what we used while training. Proper use of a `ColumnTransformer` helps keep our code simple, -readable, and error-free. 
Furthermore, note that using `fit` and `transform` on the preprocessor is required only when you want to inspect the result of the preprocessing steps -yourself. You will see further on in -Section {ref}`08:puttingittogetherworkflow` that `scikit-learn` provides tools to -automatically streamline the preprocesser and the model so that you can call`fit` +readable, and error-free. Furthermore, note that using `fit` and `transform` on +the preprocessor is required only when you want to inspect the result of the +preprocessing steps +yourself. You will see further on in the +{ref}`08:puttingittogetherworkflow` section that `scikit-learn` provides tools to +automatically streamline the preprocessor and the model so that you can call `fit` and `transform` on the `Pipeline` as necessary without additional coding effort. -```{code-cell} ipython3 -:tags: [remove-cell] - -# It may seem redundant that we had to both `bake` *and* `prep` to scale and center the data. -# However, we do this in two steps so we can specify a different data set in the `bake` step if we want. -# For example, we may want to specify new data that were not part of the training set. - -# You may wonder why we are doing so much work just to center and -# scale our variables. Can't we just manually scale and center the `Area` and -# `Smoothness` variables ourselves before building our $K$-nearest neighbor model? Well, -# technically *yes*; but doing so is error-prone. In particular, we might -# accidentally forget to apply the same centering / scaling when making -# predictions, or accidentally apply a *different* centering / scaling than what -# we used while training. Proper use of a `recipe` helps keep our code simple, -# readable, and error-free. Furthermore, note that using `prep` and `bake` is -# required only when you want to inspect the result of the preprocessing steps -# yourself. 
You will see further on in Section -# \@ref(puttingittogetherworkflow) that `tidymodels` provides tools to -# automatically apply `prep` and `bake` as necessary without additional coding effort. -``` - {numref}`fig:05-scaling-plt` shows the two scatter plots side-by-side—one for `unscaled_cancer` and one for `scaled_cancer`. Each has the same new observation annotated with its $K=3$ nearest neighbors. In the original unstandardized data plot, you can see some odd choices @@ -1214,7 +1212,7 @@ def class_dscp(x): attrs = ["Area", "Smoothness"] -new_obs = pd.DataFrame({"Class": ["Unknwon"], "Area": 400, "Smoothness": 0.135}) +new_obs = pd.DataFrame({"Class": ["Unknown"], "Area": 400, "Smoothness": 0.135}) unscaled_cancer["Class"] = unscaled_cancer["Class"].apply(class_dscp) area_smoothness_new_df = pd.concat((unscaled_cancer, new_obs), ignore_index=True) my_distances = euclidean_distances(area_smoothness_new_df.loc[:, attrs])[ @@ -1231,7 +1229,6 @@ area_smoothness_new_point = ( y=alt.Y("Smoothness"), color=alt.Color( "Class", - scale=alt.Scale(range=["#86bfef", "#efb13f", "red"]), title="Diagnosis", ), shape=alt.Shape( @@ -1288,13 +1285,13 @@ area_smoothness_new_point = area_smoothness_new_point + line1 + line2 + line3 :tags: [remove-cell] attrs = ["Area", "Smoothness"] -new_obs_scaled = pd.DataFrame({"Class": ["Unknwon"], "Area": -0.72, "Smoothness": 2.8}) -scaled_cancer["Class"] = scaled_cancer["Class"].apply(class_dscp) +new_obs_scaled = pd.DataFrame({"Class": ["Unknown"], "Area": -0.72, "Smoothness": 2.8}) +scaled_cancer_all["Class"] = scaled_cancer_all["Class"].apply(class_dscp) area_smoothness_new_df_scaled = pd.concat( - (scaled_cancer, new_obs_scaled), ignore_index=True + (scaled_cancer_all, new_obs_scaled), ignore_index=True ) my_distances_scaled = euclidean_distances(area_smoothness_new_df_scaled.loc[:, attrs])[ - len(scaled_cancer) + len(scaled_cancer_all) ][:-1] area_smoothness_new_point_scaled = ( alt.Chart( @@ -1307,7 +1304,6 @@ 
area_smoothness_new_point_scaled = ( y=alt.Y("Smoothness", title="Smoothness (standardized)"), color=alt.Color( "Class", - scale=alt.Scale(range=["#86bfef", "#efb13f", "red"]), title="Diagnosis", ), shape=alt.Shape( @@ -1319,21 +1315,21 @@ area_smoothness_new_point_scaled = ( min_3_idx_scaled = np.argpartition(my_distances_scaled, 3)[:3] neighbor1_scaled = pd.concat( ( - scaled_cancer.loc[min_3_idx_scaled[0], attrs], + scaled_cancer_all.loc[min_3_idx_scaled[0], attrs], new_obs_scaled[attrs].T, ), axis=1, ).T neighbor2_scaled = pd.concat( ( - scaled_cancer.loc[min_3_idx_scaled[1], attrs], + scaled_cancer_all.loc[min_3_idx_scaled[1], attrs], new_obs_scaled[attrs].T, ), axis=1, ).T neighbor3_scaled = pd.concat( ( - scaled_cancer.loc[min_3_idx_scaled[2], attrs], + scaled_cancer_all.loc[min_3_idx_scaled[2], attrs], new_obs_scaled[attrs].T, ), axis=1, @@ -1380,24 +1376,6 @@ Comparison of K = 3 nearest neighbors with standardized and unstandardized data. ```{code-cell} ipython3 :tags: [remove-cell] -# Could not find something mimicing `facet_zoom` in R, here are 2 plots trying to -# illustrate similar points -# 1. interactive plot which allows zooming in/out -glue('fig:05-scaling-plt-interactive', area_smoothness_new_point.interactive()) -``` - -+++ {"tags": ["remove-cell"]} - -:::{glue:figure} fig:05-scaling-plt-interactive -:name: fig:05-scaling-plt-interactive - -Close-up of three nearest neighbors for unstandardized data. -::: - -```{code-cell} ipython3 -:tags: [remove-cell] - -# 2. Static plot, Zoom-in zoom_area_smoothness_new_point = ( alt.Chart( area_smoothness_new_df, @@ -1409,7 +1387,6 @@ zoom_area_smoothness_new_point = ( y=alt.Y("Smoothness", scale=alt.Scale(domain=(0.08, 0.14))), color=alt.Color( "Class", - scale=alt.Scale(range=["#86bfef", "#efb13f", "red"]), title="Diagnosis", ), shape=alt.Shape( @@ -1514,7 +1491,7 @@ in the training data that were tagged as malignant. 
attrs = ["Perimeter", "Concavity"] new_point = [2, 2] new_point_df = pd.DataFrame( - {"Class": ["Unknwon"], "Perimeter": new_point[0], "Concavity": new_point[1]} + {"Class": ["Unknown"], "Perimeter": new_point[0], "Concavity": new_point[1]} ) rare_cancer["Class"] = rare_cancer["Class"].apply(class_dscp) rare_cancer_with_new_df = pd.concat((rare_cancer, new_point_df), ignore_index=True) @@ -1799,6 +1776,13 @@ placed in a `Pipeline`. We will now place these steps in a `Pipeline` using the `make_pipeline` function, and finally we will call `.fit()` to run the whole `Pipeline` on the `unscaled_cancer` data. +all data preprocessing and modeling can be +built using a +[`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html?highlight=pipeline#sklearn.pipeline.Pipeline), +or a more convenient function +[`make_pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline) +for simple pipeline construction. + ```{code-cell} ipython3 :tags: [remove-cell] From c5c8769f2375d29a2edb5f9001b577a65eb53dd6 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sun, 1 Jan 2023 22:19:23 -0800 Subject: [PATCH 12/45] balancing polished --- source/classification1.md | 119 ++++++++++++++++---------------------- 1 file changed, 50 insertions(+), 69 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index fc9084f1..414597fd 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -1429,39 +1429,25 @@ what the data would look like if the cancer was rare. We will do this by picking only 3 observations from the malignant group, and keeping all of the benign observations. We choose these 3 observations using the `.head()` method, which takes the number of rows to select from the top (`n`). -The new imbalanced data is shown in {numref}`fig:05-unbalanced`. 
+We use the [`concat`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) +function from `pandas` to glue the two resulting filtered +data frames back together by passing them together in a sequence. +The new imbalanced data is shown in {numref}`fig:05-unbalanced`, +and we print the counts of the classes using the `value_counts` function. ```{code-cell} ipython3 -:tags: [remove-cell] - -# To better illustrate the problem, let's revisit the scaled breast cancer data, -# `cancer`; except now we will remove many of the observations of malignant tumors, simulating -# what the data would look like if the cancer was rare. We will do this by -# picking only 3 observations from the malignant group, and keeping all -# of the benign observations. We choose these 3 observations using the `slice_head` -# function, which takes two arguments: a data frame-like object, -# and the number of rows to select from the top (`n`). -# The new imbalanced data is shown in Figure \@ref(fig:05-unbalanced). -``` - -```{code-cell} ipython3 -cancer = pd.read_csv("data/wdbc.csv") rare_cancer = pd.concat( - (cancer.query("Class == 'B'"), cancer.query("Class == 'M'").head(3)) -) -colors = ["#86bfef", "#efb13f"] -rare_cancer["Class"] = rare_cancer["Class"].apply( - lambda x: "Malignant" if (x == "M") else "Benign" -) + (cancer[cancer["Class"] == 'Benign'], + cancer[cancer["Class"] == 'Malignant'].head(3) + )) + rare_plot = ( - alt.Chart( - rare_cancer - ) + alt.Chart(rare_cancer) .mark_point(opacity=0.6, filled=True, size=40) .encode( x=alt.X("Perimeter", title="Perimeter (standardized)"), y=alt.Y("Concavity", title="Concavity (standardized)"), - color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"), + color=alt.Color("Class", title="Diagnosis"), ) ) rare_plot @@ -1474,6 +1460,10 @@ rare_plot Imbalanced data. ``` +```{code-cell} ipython3 +rare_cancer['Class'].value_counts() +``` + +++ Suppose we now decided to use $K = 7$ in $K$-nearest neighbor classification. 
@@ -1510,7 +1500,6 @@ rare_plot = ( y=alt.Y("Concavity", title="Concavity (standardized)"), color=alt.Color( "Class", - scale=alt.Scale(range=["#86bfef", "#efb13f", "red"]), title="Diagnosis", ), shape=alt.Shape( @@ -1525,9 +1514,9 @@ min_7_idx = np.argpartition(my_distances, 7)[:7] # For loop: each iteration adds a line segment of corresponding color for i in range(7): - clr = "#86bfef" + clr = "#1f77b4" if rare_cancer.iloc[min_7_idx[i], :]["Class"] == "Malignant": - clr = "#efb13f" + clr = "#ff7f0e" neighbor = pd.concat( ( rare_cancer.iloc[min_7_idx[i], :][attrs], @@ -1560,21 +1549,24 @@ always "benign," corresponding to the blue color. ```{code-cell} ipython3 :tags: [remove-cell] -knn_spec = KNeighborsClassifier(n_neighbors=7) -knn_spec.fit(X=rare_cancer.loc[:, ["Perimeter", "Concavity"]], y=rare_cancer["Class"]) +knn = KNeighborsClassifier(n_neighbors=7) +knn.fit(X=rare_cancer.loc[:, ["Perimeter", "Concavity"]], y=rare_cancer["Class"]) # create a prediction pt grid per_grid = np.linspace( - rare_cancer["Perimeter"].min(), rare_cancer["Perimeter"].max(), 100 + rare_cancer["Perimeter"].min(), rare_cancer["Perimeter"].max(), 50 ) con_grid = np.linspace( - rare_cancer["Concavity"].min(), rare_cancer["Concavity"].max(), 100 + rare_cancer["Concavity"].min(), rare_cancer["Concavity"].max(), 50 ) pcgrid = np.array(np.meshgrid(per_grid, con_grid)).reshape(2, -1).T pcgrid = pd.DataFrame(pcgrid, columns=["Perimeter", "Concavity"]) -knnPredGrid = knn_spec.predict(pcgrid) +pcgrid + +knnPredGrid = knn.predict(pcgrid) prediction_table = pcgrid.copy() prediction_table["Class"] = knnPredGrid +prediction_table # create the scatter plot rare_plot = ( @@ -1585,7 +1577,7 @@ rare_plot = ( .encode( x=alt.X("Perimeter", title="Perimeter (standardized)"), y=alt.Y("Concavity", title="Concavity (standardized)"), - color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"), + color=alt.Color("Class", title="Diagnosis"), ) ) @@ -1595,7 +1587,7 @@ prediction_plot = ( 
prediction_table, title="Imbalanced data", ) - .mark_point(opacity=0.02, filled=True, size=200) + .mark_point(opacity=0.05, filled=True, size=300) .encode( x=alt.X( "Perimeter", @@ -1611,10 +1603,10 @@ prediction_plot = ( domain=(rare_cancer["Concavity"].min(), rare_cancer["Concavity"].max()) ), ), - color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"), + color=alt.Color("Class", title="Diagnosis"), ) ) -rare_plot + prediction_plot +#rare_plot + prediction_plot glue("fig:05-upsample-2", (rare_plot + prediction_plot)) ``` @@ -1633,27 +1625,16 @@ Despite the simplicity of the problem, solving it in a statistically sound manne fairly nuanced, and a careful treatment would require a lot more detail and mathematics than we will cover in this textbook. For the present purposes, it will suffice to rebalance the data by *oversampling* the rare class. In other words, we will replicate rare observations multiple times in our data set to give them more -voting power in the $K$-nearest neighbor algorithm. In order to do this, we will need an oversampling -step with the `resample` function from the `sklearn` Python package. -We show below how to do this, and also -use the `.groupby()` and `.count()` methods to see that our classes are now balanced: - -```{code-cell} ipython3 -:tags: [remove-cell] - -# Despite the simplicity of the problem, solving it in a statistically sound manner is actually -# fairly nuanced, and a careful treatment would require a lot more detail and mathematics than we will cover in this textbook. -# For the present purposes, it will suffice to rebalance the data by *oversampling* the rare class. \index{oversampling} -# In other words, we will replicate rare observations multiple times in our data set to give them more -# voting power in the $K$-nearest neighbor algorithm. In order to do this, we will add an oversampling -# step to the earlier `uc_recipe` recipe with the `step_upsample` function from the `themis` R package. 
\index{recipe!step\_upsample} -# We show below how to do this, and also -# use the `group_by` and `summarize` functions to see that our classes are now balanced: -``` - -```{code-cell} ipython3 -rare_cancer['Class'].value_counts() -``` +voting power in the $K$-nearest neighbor algorithm. In order to do this, we will +first separate the classes out into their own data frames by filtering. +Then, we will +use the [`resample`](https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html) function +from the `sklearn` package to increase the number of `Malignant` observations to be the same as the number +of `Benign` observations. We set the `n_samples` argument to be the number of `Malignant` observations we want. +We also set the `random_state` to be some integer +so that our results are reproducible; if we do not set this argument, we will get a different upsampling each time +we run the code. Finally, we use the `value_counts` method + to see that our classes are now balanced. ```{code-cell} ipython3 from sklearn.utils import resample @@ -1664,7 +1645,7 @@ malignant_cancer_upsample = resample( malignant_cancer, n_samples=len(benign_cancer), random_state=100 ) upsampled_cancer = pd.concat((malignant_cancer_upsample, benign_cancer)) -upsampled_cancer.groupby(by='Class')['Class'].count() +upsampled_cancer['Class'].value_counts() ``` Now suppose we train our $K$-nearest neighbor classifier with $K=7$ on this *balanced* data. @@ -1677,13 +1658,13 @@ closer to the benign tumor observations. 
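Reviewer note on the oversampling step in this hunk: the following is a minimal, library-free sketch of what the upsampling amounts to, assuming (as `sklearn.utils.resample` does by default) that rows are drawn with replacement. The `oversample` helper, the toy rows, and the seed value are hypothetical illustrations and are not part of the patch.

```python
import random
from collections import Counter


def oversample(rows, label_key, seed):
    """Hypothetical helper: resample each minority class, with replacement,
    up to the size of the largest class."""
    rng = random.Random(seed)  # fixed seed so the resampling is reproducible
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for label, group in by_class.items():
        if len(group) < target:
            # draw `target` rows with replacement from the rare class
            group = rng.choices(group, k=target)
        balanced.extend(group)
    return balanced


# Toy data (made up): 10 benign rows and only 3 malignant rows.
toy = [{"Class": "Benign", "Perimeter": i * 0.1} for i in range(10)]
toy += [{"Class": "Malignant", "Perimeter": 2.0 + i} for i in range(3)]

balanced = oversample(toy, "Class", seed=100)
counts = Counter(row["Class"] for row in balanced)
print(counts)
```

The rare rows are repeated rather than invented, which is why they gain voting power in the nearest-neighbor vote without adding any new information to the data set.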
```{code-cell} ipython3 :tags: [remove-cell] -knn_spec = KNeighborsClassifier(n_neighbors=7) -knn_spec.fit( +knn = KNeighborsClassifier(n_neighbors=7) +knn.fit( X=upsampled_cancer.loc[:, ["Perimeter", "Concavity"]], y=upsampled_cancer["Class"] ) # create a prediction pt grid -knnPredGrid = knn_spec.predict(pcgrid) +knnPredGrid = knn.predict(pcgrid) prediction_table = pcgrid prediction_table["Class"] = knnPredGrid @@ -1706,21 +1687,21 @@ rare_plot = ( domain=(rare_cancer["Concavity"].min(), rare_cancer["Concavity"].max()) ), ), - color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"), + color=alt.Color("Class", title="Diagnosis"), ) ) # add a prediction layer, also scatter plot upsampled_plot = ( alt.Chart(prediction_table) - .mark_point(opacity=0.02, filled=True, size=200) + .mark_point(opacity=0.05, filled=True, size=300) .encode( x=alt.X("Perimeter", title="Perimeter (standardized)"), y=alt.Y("Concavity", title="Concavity (standardized)"), - color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"), + color=alt.Color("Class", title="Diagnosis"), ) ) -rare_plot + upsampled_plot +#rare_plot + upsampled_plot glue("fig:05-upsample-plot", (rare_plot + upsampled_plot)) ``` @@ -1759,7 +1740,7 @@ First we will load the data, create a model, and specify a preprocessor for how unscaled_cancer = pd.read_csv("data/unscaled_wdbc.csv") # create the KNN model -knn_spec = KNeighborsClassifier(n_neighbors=7) +knn = KNeighborsClassifier(n_neighbors=7) # create the centering / scaling preprocessor preprocessor = make_column_transformer( @@ -1800,7 +1781,7 @@ for simple pipeline construction. 
``` ```{code-cell} ipython3 -knn_fit = make_pipeline(preprocessor, knn_spec).fit( +knn_fit = make_pipeline(preprocessor, knn).fit( X=unscaled_cancer.loc[:, ["Area", "Smoothness"]], y=unscaled_cancer["Class"] ) @@ -1819,7 +1800,7 @@ one with `Area = 500` and `Smoothness = 0.075`, and one with `Area = 1500` and ` # As before, the fit object lists the function that trains the model as well as the "best" settings # for the number of neighbors and weight function (for now, these are just the values we chose -# manually when we created `knn_spec` above). But now the fit object also includes information about +# manually when we created `knn` above). But now the fit object also includes information about # the overall workflow, including the centering and scaling preprocessing steps. # In other words, when we use the `predict` function with the `knn_fit` object to make a prediction for a new # observation, it will first apply the same recipe steps to the new observation. From a9deb2efd25101455f46f27a44a0d0ddfc25a2e3 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sun, 1 Jan 2023 22:38:53 -0800 Subject: [PATCH 13/45] pipelines --- source/classification1.md | 110 +++++++++++++++----------------------- 1 file changed, 43 insertions(+), 67 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index 414597fd..d2564ca5 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -1714,30 +1714,28 @@ Upsampled data with background color indicating the decision of the classifier. +++ (08:puttingittogetherworkflow)= -## Putting it together in a `pipeline` +## Putting it together in a `Pipeline` ```{index} scikit-learn; pipeline ``` -The `scikit-learn` package collection also provides the `pipeline`, a way to chain together multiple data analysis steps without a lot of otherwise necessary code for intermediate steps. -To illustrate the whole pipeline, let's start from scratch with the `unscaled_wdbc.csv` data. 
-First we will load the data, create a model, and specify a preprocessor for how the data should be preprocessed: +The `scikit-learn` package collection also provides the [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html?highlight=pipeline#sklearn.pipeline.Pipeline), +a way to chain together multiple data analysis steps without a lot of otherwise necessary code for intermediate steps. +To illustrate the whole workflow, let's start from scratch with the `unscaled_wdbc.csv` data. +First we will load the data, create a model, and specify a preprocessor for the data. ```{code-cell} ipython3 -:tags: [remove-cell] - -# The `tidymodels` package collection also provides the `workflow`, -# a way to chain\index{tidymodels!workflow}\index{workflow|see{tidymodels}} -# together multiple data analysis steps without a lot of otherwise necessary code -# for intermediate steps. -# To illustrate the whole pipeline, let's start from scratch with the `unscaled_wdbc.csv` data. -# First we will load the data, create a model, -# and specify a recipe for how the data should be preprocessed: -``` +# load the unscaled cancer data, make Class readable +unscaled_cancer = ( + pd.read_csv("data/unscaled_wdbc.csv") + .replace({ + 'M' : 'Malignant', + 'B' : 'Benign' + }) + ) +# make Class a categorical type +unscaled_cancer['Class'] = unscaled_cancer['Class'].astype('category') -```{code-cell} ipython3 -# load the unscaled cancer data -unscaled_cancer = pd.read_csv("data/unscaled_wdbc.csv") # create the KNN model knn = KNeighborsClassifier(n_neighbors=7) @@ -1748,74 +1746,47 @@ preprocessor = make_column_transformer( ) ``` -You will also notice that we did not call `.fit()` on the preprocessor; this is unnecessary when it is -placed in a `Pipeline`. 
- ```{index} scikit-learn; make_pipeline, scikit-learn; fit ``` -We will now place these steps in a `Pipeline` using the `make_pipeline` function, -and finally we will call `.fit()` to run the whole `Pipeline` on the `unscaled_cancer` data. - -all data preprocessing and modeling can be -built using a -[`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html?highlight=pipeline#sklearn.pipeline.Pipeline), -or a more convenient function -[`make_pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline) -for simple pipeline construction. +Next we place these steps in a `Pipeline` using +the [`make_pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline) function. +The `make_pipeline` function takes the steps to apply in your data analysis as its arguments; in this +case, we just have the `preprocessor` and `knn` steps. +Finally, we call `fit` on the pipeline. +Notice that we do not need to separately call `fit` and `transform` on the `preprocessor`; the +pipeline handles doing this properly for us. +Also notice that when we call `fit` on the pipeline, we can pass +the whole `unscaled_cancer` data frame to the `X` argument, since the preprocessing +step drops all the variables except the two we listed: `Area` and `Smoothness`. +For the `y` response variable argument, we pass the `unscaled_cancer["Class"]` series as before. ```{code-cell} ipython3 -:tags: [remove-cell] - -# Note that each of these steps is exactly the same as earlier, except for one major difference: -# we did not use the `select` function to extract the relevant variables from the data frame, -# and instead simply specified the relevant variables to use via the -# formula `Class ~ Area + Smoothness` (instead of `Class ~ .`) in the recipe.
-# You will also notice that we did not call `prep()` on the recipe; this is unnecessary when it is -# placed in a workflow. +from sklearn.pipeline import make_pipeline -# We will now place these steps in a `workflow` using the `add_recipe` and `add_model` functions, \index{tidymodels!add\_recipe}\index{tidymodels!add\_model} -# and finally we will use the `fit` function to run the whole workflow on the `unscaled_cancer` data. -# Note another difference from earlier here: we do not include a formula in the `fit` function. This \index{tidymodels!fit} -# is again because we included the formula in the recipe, so there is no need to respecify it: -``` - -```{code-cell} ipython3 knn_fit = make_pipeline(preprocessor, knn).fit( - X=unscaled_cancer.loc[:, ["Area", "Smoothness"]], y=unscaled_cancer["Class"] + X=unscaled_cancer, + y=unscaled_cancer["Class"] ) knn_fit ``` As before, the fit object lists the function that trains the model. But now the fit object also includes information about -the overall workflow, including the standardizing preprocessing step. +the overall workflow, including the standardization preprocessing step. In other words, when we use the `predict` function with the `knn_fit` object to make a prediction for a new observation, it will first apply the same preprocessing steps to the new observation. As an example, we will predict the class label of two new observations: one with `Area = 500` and `Smoothness = 0.075`, and one with `Area = 1500` and `Smoothness = 0.1`. -```{code-cell} ipython3 -:tags: [remove-cell] - -# As before, the fit object lists the function that trains the model as well as the "best" settings -# for the number of neighbors and weight function (for now, these are just the values we chose -# manually when we created `knn` above). But now the fit object also includes information about -# the overall workflow, including the centering and scaling preprocessing steps. 
-# In other words, when we use the `predict` function with the `knn_fit` object to make a prediction for a new -# observation, it will first apply the same recipe steps to the new observation. -# As an example, we will predict the class label of two new observations: -# one with `Area = 500` and `Smoothness = 0.075`, and one with `Area = 1500` and `Smoothness = 0.1`. -``` - ```{code-cell} ipython3 new_observation = pd.DataFrame({"Area": [500, 1500], "Smoothness": [0.075, 0.1]}) prediction = knn_fit.predict(new_observation) prediction ``` -The classifier predicts that the first observation is benign ("B"), while the second is -malignant ("M"). {numref}`fig:05-workflow-plot-show` visualizes the predictions that this +The classifier predicts that the first observation is benign, while the second is +malignant. {numref}`fig:05-workflow-plot-show` visualizes the predictions that this trained $K$-nearest neighbor model will make on a large range of new observations. Although you have seen colored prediction map visualizations like this a few times now, we have not included the code to generate them, as it is a little bit complicated. @@ -1829,12 +1800,13 @@ predict the label of each, and visualize the predictions with a colored scatter > visualizations in their own data analyses. 
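Reviewer note on the pipeline text in this hunk: the claim that `predict` first applies the same preprocessing to new observations can be illustrated with a small stand-in written in plain Python. This is only a sketch under stated assumptions: `TinyScaledKNN` and the toy measurements are hypothetical, and it mimics the idea of `make_pipeline(preprocessor, knn)` rather than scikit-learn's implementation.

```python
import math
from collections import Counter
from statistics import mean, pstdev


class TinyScaledKNN:
    """Hypothetical stand-in for a standardize-then-KNN pipeline."""

    def __init__(self, n_neighbors):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        columns = list(zip(*X))
        # centering and scaling are learned from the training data only
        self.means = [mean(col) for col in columns]
        self.stds = [pstdev(col) for col in columns]
        self.X = [self._scale(row) for row in X]
        self.y = list(y)
        return self

    def _scale(self, row):
        return [(v - m) / s for v, m, s in zip(row, self.means, self.stds)]

    def predict(self, X_new):
        predictions = []
        for row in X_new:
            z = self._scale(row)  # reuse the training means and deviations
            order = sorted(range(len(self.X)), key=lambda i: math.dist(z, self.X[i]))
            votes = Counter(self.y[i] for i in order[: self.n_neighbors])
            predictions.append(votes.most_common(1)[0][0])
        return predictions


# Toy (Area, Smoothness) measurements, made up for illustration.
toy_X = [[400, 0.07], [450, 0.08], [500, 0.075],
         [1400, 0.11], [1500, 0.10], [1600, 0.12]]
toy_y = ["Benign"] * 3 + ["Malignant"] * 3

model = TinyScaledKNN(n_neighbors=3).fit(toy_X, toy_y)
predictions = model.predict([[500, 0.075], [1500, 0.1]])
print(predictions)  # ['Benign', 'Malignant']
```

Because the scaling parameters live inside the fitted object, callers pass raw, unscaled values to `predict`, which is the convenience the pipeline text describes.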
```{code-cell} ipython3 +:tags: [remove-output] # create the grid of area/smoothness vals, and arrange in a data frame are_grid = np.linspace( - unscaled_cancer["Area"].min(), unscaled_cancer["Area"].max(), 100 + unscaled_cancer["Area"].min(), unscaled_cancer["Area"].max(), 50 ) smo_grid = np.linspace( - unscaled_cancer["Smoothness"].min(), unscaled_cancer["Smoothness"].max(), 100 + unscaled_cancer["Smoothness"].min(), unscaled_cancer["Smoothness"].max(), 50 ) asgrid = np.array(np.meshgrid(are_grid, smo_grid)).reshape(2, -1).T asgrid = pd.DataFrame(asgrid, columns=["Area", "Smoothness"]) @@ -1871,7 +1843,7 @@ unscaled_plot = ( ) ), ), - color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"), + color=alt.Color("Class", title="Diagnosis"), ) ) @@ -1882,13 +1854,17 @@ prediction_plot = ( .encode( x=alt.X("Area"), y=alt.Y("Smoothness"), - color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"), + color=alt.Color("Class", title="Diagnosis"), ) ) - unscaled_plot + prediction_plot ``` +```{code-cell} ipython3 +:tags: [remove-input] +glue("fig:05-upsample-2", (unscaled_plot + prediction_plot)) +``` + ```{figure}  :name: fig:05-workflow-plot-show :figclass: caption-hack @@ -1908,7 +1884,7 @@ You can launch an interactive version of the worksheet in your browser by clicki You can also preview a non-interactive version of the worksheet by clicking "view worksheet." If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup -found in Chapter {ref}`move-to-your-own-machine`. This will ensure that the automated feedback +found in the {ref}`move-to-your-own-machine` chapter. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. 
+++ From d5b8af33bc8ff4eca0008dec66efac3827935745 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sun, 1 Jan 2023 22:39:58 -0800 Subject: [PATCH 14/45] learning objs --- source/classification1.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/source/classification1.md b/source/classification1.md index d2564ca5..5f6a2c9a 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -55,7 +55,8 @@ By the end of the chapter, readers will be able to do the following: - Compute, by hand, the straight-line (Euclidean) distance between points on a graph when there are two predictor variables. - Explain the $K$-nearest neighbor classification algorithm. - Perform $K$-nearest neighbor classification in Python using `scikit-learn`. -- Use `StandardScaler` to preprocess data to be centered, scaled, and balanced. +- Use `StandardScaler` and `make_column_transformer` to preprocess data to be centered and scaled. +- Use `resample` to preprocess data to be balanced. - Combine preprocessing and model training using `make_pipeline`. 
+++ From c1c8151358526ede0146225cc7077bed22143bb6 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sun, 1 Jan 2023 22:42:09 -0800 Subject: [PATCH 15/45] mute warnings in ch5 --- source/classification1.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index 5f6a2c9a..9642f217 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -14,10 +14,10 @@ kernelspec: ```{code-cell} ipython3 :tags: [remove-cell] -#import warnings -#def warn(*args, **kwargs): -# pass -#warnings.warn = warn +import warnings +def warn(*args, **kwargs): + pass +warnings.warn = warn from myst_nb import glue import numpy as np From 863ca91eadf862a8d54b3fe6fe3c1ba2ee3a3847 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sun, 1 Jan 2023 22:46:12 -0800 Subject: [PATCH 16/45] warn mute code; fixed links at end --- source/classification2.md | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/source/classification2.md b/source/classification2.md index 8daf6c38..ef6b2d21 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -15,6 +15,14 @@ kernelspec: (classification2)= # Classification II: evaluation & tuning +```{code-cell} ipython3 +:tags: [remove-cell] +#import warnings +#def warn(*args, **kwargs): +# pass +#warnings.warn = warn +``` + ```{code-cell} ipython3 :tags: [remove-cell] @@ -2119,13 +2127,13 @@ Estimated accuracy versus the number of predictors for the sequence of models bu Practice exercises for the material covered in this chapter can be found in the accompanying -[worksheets repository](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets#readme) +[worksheets repository](https://github.com/UBC-DSCI/data-science-a-first-intro-python-worksheets#readme) in the "Classification II: evaluation and tuning" row. You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button. 
You can also preview a non-interactive version of the worksheet by clicking "view worksheet." If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup -found in Chapter {ref}`move-to-your-own-machine`. This will ensure that the automated feedback +found in the {ref}`move-to-your-own-machine` chapter. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. +++ From b2df7420388764648886fcdda7376150fe94e674 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sun, 1 Jan 2023 22:46:43 -0800 Subject: [PATCH 17/45] restore cls2 to main branch --- source/classification2.md | 10 +--------- 1 file changed, 1 insertion(+), 9 deletions(-) diff --git a/source/classification2.md b/source/classification2.md index 17390aec..8daf6c38 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -15,14 +15,6 @@ kernelspec: (classification2)= # Classification II: evaluation & tuning -```{code-cell} ipython3 -:tags: [remove-cell] -#import warnings -#def warn(*args, **kwargs): -# pass -#warnings.warn = warn -``` - ```{code-cell} ipython3 :tags: [remove-cell] @@ -2127,7 +2119,7 @@ Estimated accuracy versus the number of predictors for the sequence of models bu Practice exercises for the material covered in this chapter can be found in the accompanying -[worksheets repository](https://github.com/UBC-DSCI/data-science-a-first-intro-python-worksheets#readme) +[worksheets repository](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets#readme) in the "Classification II: evaluation and tuning" row. You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button. You can also preview a non-interactive version of the worksheet by clicking "view worksheet." 
From c8e3a40ede8e6397cfea8ef6e7b7cd407459c16a Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sat, 7 Jan 2023 19:45:03 -0800 Subject: [PATCH 18/45] remove caption hack; minor fix to learning objs --- source/classification2.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/source/classification2.md b/source/classification2.md index ef6b2d21..ab3b26f5 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -64,7 +64,7 @@ By the end of the chapter, readers will be able to do the following: - Describe what training, validation, and test data sets are and how they are used in classification. - Split data into training, validation, and test data sets. - Describe what a random seed is and its importance in reproducible data analysis. -- Set the random seed in Python using the `numpy.random.seed` function or `random_state` argument in some of the `scikit-learn` functions. +- Set the random seed in Python using either the `numpy.random.seed` function or `random_state` argument in `scikit-learn` functions. - Evaluate classification accuracy in Python using a validation data set and appropriate metrics. - Execute cross-validation in Python to choose the number of neighbors in a $K$-nearest neighbors classifier. - Describe the advantages and disadvantages of the $K$-nearest neighbors classification algorithm. @@ -146,7 +146,6 @@ books on this topic. ```{figure} img/ML-paradigm-test.png :name: fig:06-ML-paradigm-test -:figclass: caption-hack Process for splitting the data and finding the prediction accuracy. 
``` From 384ac141d6b6738d038d913b77cf9868b9431600 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sat, 7 Jan 2023 19:45:25 -0800 Subject: [PATCH 19/45] Remove caption hack --- source/classification2.md | 4 ---- 1 file changed, 4 deletions(-) diff --git a/source/classification2.md b/source/classification2.md index ab3b26f5..e3e83714 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -113,7 +113,6 @@ labels for new observations without known class labels. ```{figure} img/training_test.jpeg :name: fig:06-training-test -:figclass: caption-hack Splitting the data into training and testing sets. ``` @@ -349,7 +348,6 @@ perim_concav ```{figure}  :name: fig:06-precode -:figclass: caption-hack Scatter plot of tumor cell concavity versus smoothness colored by diagnosis label. ``` @@ -937,7 +935,6 @@ resulting in 5 different choices for the **validation set**; we call this ```{figure} img/cv.png :name: fig:06-cv-image -:figclass: caption-hack 5-fold cross-validation. ``` @@ -1509,7 +1506,6 @@ process is summarized in {numref}`fig:06-overview`. ```{figure} img/train-test-overview.jpeg :name: fig:06-overview -:figclass: caption-hack Overview of KNN classification. ``` From d80c8c3c81eb340dba856bd3a6f4801e66dffdd4 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sat, 7 Jan 2023 20:09:52 -0800 Subject: [PATCH 20/45] initial improved seed explanation --- source/classification2.md | 126 ++++++++++++++++---------------------- 1 file changed, 54 insertions(+), 72 deletions(-) diff --git a/source/classification2.md b/source/classification2.md index e3e83714..58755c5f 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -186,115 +186,97 @@ The trick is that in Python—and other programming languages—randomne is not actually random! Instead, Python uses a *random number generator* that produces a sequence of numbers that are completely determined by a - *seed value*. 
Once you set the seed value -using the `np.random.seed` function or the `random_state` argument, everything after that point may *look* random, + *seed value*. Once you set the seed value, everything after that point may *look* random, but is actually totally reproducible. As long as you pick the same seed value, you get the same result! ```{index} sample; numpy.random.choice ``` -Let's use an example to investigate how seeds work in Python. Say we want -to randomly pick 10 numbers from 0 to 9 in Python using the `np.random.choice` function, -but we want it to be reproducible. Before using the sample function, -we call `np.random.seed`, and pass it any integer as an argument. -Here, we pass in the number `1`. +Let's use an example to investigate how randomness works in Python. Say we +have a series object containing the integers from 0 to 9. We want +to randomly pick 10 numbers from that list, but we want it to be reproducible. +We construct a `RandomState` object from the `numpy` package (with the short name `np`), +and pass it any integer as the `seed` argument. Below we use the seed number `1`. At +that point, the `RandomState` object can be used to keep track of the randomness that Python +uses by passing it to functions that need randomness. For example, we can call the `sample` method +on the series of numbers, passing it the `RandomState` object that we created +as the `random_state` argument, and `n = 10` to indicate that we want 10 samples. ```{code-cell} ipython3 import numpy as np -np.random.seed(1) -random_numbers = np.random.choice(range(10), size=10, replace=True) -random_numbers -``` +rnd = np.random.RandomState(seed = 1) +nums_0_to_9 = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) -You can see that `random_numbers` is a list of 10 numbers +random_numbers1 = nums_0_to_9.sample(n = 10, random_state = rnd).values +random_numbers1 +``` +You can see that `random_numbers1` is a list of 10 numbers from 0 to 9 that, from all appearances, looks random. 
If -we run the `np.random.choice` function again, we will -get a fresh batch of 10 numbers that also look random. +we run the `sample` method again, passing it the `rnd` object as the `random_state`, +we will get a fresh batch of 10 numbers that also look random. ```{code-cell} ipython3 -random_numbers = np.random.choice(range(10), size=10, replace=True) -random_numbers +random_numbers2 = nums_0_to_9.sample(n = 10, random_state = rnd).values +random_numbers2 ``` If we want to force Python to produce the same sequences of random numbers, -we can simply call the `np.random.seed` function again with the same argument -value. +we can simply re-initialize the random state with the seed value `1` and then +call the `sample` method again. ```{code-cell} ipython3 -np.random.seed(1) -random_numbers = np.random.choice(range(10), size=10, replace=True) -random_numbers +rnd = np.random.RandomState(seed = 1) +random_numbers1_again = nums_0_to_9.sample(n = 10, random_state = rnd).values +random_numbers1_again ``` ```{code-cell} ipython3 -random_numbers = np.random.choice(range(10), size=10, replace=True) -random_numbers +random_numbers2_again = nums_0_to_9.sample(n = 10, random_state = rnd).values +random_numbers2_again ``` -And if we choose -a different value for the seed—say, 4235—we +Notice that after re-initializing the `RandomState` object, we get the same +two sequences of numbers in the same order. `random_numbers1` and `random_numbers1_again` +produce the same sequence of numbers, and the same can be said about `random_numbers2` and +`random_numbers2_again`. And if we choose a different value for the seed---say, 4235---we obtain a different sequence of random numbers. 
```{code-cell} ipython3 -np.random.seed(4235) -random_numbers = np.random.choice(range(10), size=10, replace=True) +rnd = np.random.RandomState(seed = 4235) +random_numbers = nums_0_to_9.sample(n = 10, random_state = rnd).values random_numbers ``` ```{code-cell} ipython3 -random_numbers = np.random.choice(range(10), size=10, replace=True) +random_numbers = nums_0_to_9.sample(n = 10, random_state = rnd).values random_numbers ``` In other words, even though the sequences of numbers that Python is generating *look* random, they are totally determined when we set a seed value! -So what does this mean for data analysis? Well, `np.random.choice` is certainly -not the only function that uses randomness in R. Many of the functions -that we use in `scikit-learn`, `numpy`, and beyond use randomness—many of them -without even telling you about it. -Also note that when Python starts up, it creates its own seed to use. So if you do not -explicitly call the `np.random.seed` function in your code or specify the `random_state` -argument in `scikit-learn` functions (where it is available), your results will -likely not be reproducible. -And finally, be careful to set the seed *only once* at the beginning of a data -analysis. Each time you set the seed, you are inserting your own human input, -thereby influencing the analysis. If you use `np.random.choice` many times -throughout your analysis, the randomness that Python uses will not look -as random as it should. - -Different argument values in `np.random.seed` lead to different patterns of randomness, but as long as -you pick the same argument value your result will be the same. - -```{code-cell} ipython3 -:tags: [remove-cell] - -# In other words, even though the sequences of numbers that R is generating *look* -# random, they are totally determined when we set a seed value! - -# So what does this mean for data analysis? Well, `sample` is certainly -# not the only function that uses randomness in R. 
Many of the functions -# that we use in `tidymodels`, `tidyverse`, and beyond use randomness—many of them -# without even telling you about it. So at the beginning of every data analysis you -# do, right after loading packages, you should call the `set.seed` function and -# pass it an integer that you pick. -# Also note that when R starts up, it creates its own seed to use. So if you do not -# explicitly call the `set.seed` function in your code, your results will -# likely not be reproducible. -# And finally, be careful to set the seed *only once* at the beginning of a data -# analysis. Each time you set the seed, you are inserting your own human input, -# thereby influencing the analysis. If you use `set.seed` many times -# throughout your analysis, the randomness that R uses will not look -# as random as it should. - -# In summary: if you want your analysis to be reproducible, i.e., produce *the same result* each time you -# run it, make sure to use `set.seed` exactly once at the beginning of the analysis. -# Different argument values in `set.seed` lead to different patterns of randomness, but as long as -# you pick the same argument value your result will be the same. -# In the remainder of the textbook, we will set the seed once at the beginning of each chapter. -``` +So what does this mean for data analysis? Well, `sample` is certainly not the +only data frame method that uses randomness in Python. Many of the functions +that we use in `scikit-learn`, `pandas`, and beyond use randomness—many +of them without even telling you about it. Also note that when Python starts +up, it creates its own seed to use. So if you do not explicitly call the +`np.random.seed` function in your code or specify the `random_state` argument +in `scikit-learn` functions (where it is available), your results will likely +not be reproducible. And finally, be careful to set the seed *only once* at +the beginning of a data analysis. 
Each time you set the seed, you are inserting +your own human input, thereby influencing the analysis. If you use +`np.random.choice` many times throughout your analysis, the randomness that +Python uses will not look as random as it should. + +In summary: if you want your analysis to be reproducible, i.e., produce *the same result* +each time you run it, make sure to create a `RandomState` object exactly once +at the beginning of the analysis, and pass that object to each function that +uses randomness via the `random_state` argument. Different `seed` argument values +will lead to different patterns of randomness, but as long as you pick the same +argument value your result will be the same. In the remainder of the textbook, +we will set the seed once at the beginning of each chapter. ## Evaluating accuracy with `scikit-learn` From 14d88257a58f96f269792995baf831cc20d78620 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sat, 7 Jan 2023 21:22:44 -0800 Subject: [PATCH 21/45] random seed section polish done --- source/classification2.md | 132 ++++++++++++++++++++++++-------------- 1 file changed, 84 insertions(+), 48 deletions(-) diff --git a/source/classification2.md b/source/classification2.md index 58755c5f..e51f9b59 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -29,26 +29,26 @@ kernelspec: import altair as alt import numpy as np import pandas as pd -import sklearn -from sklearn.compose import make_column_transformer -from sklearn.metrics import confusion_matrix, plot_confusion_matrix -from sklearn.metrics.pairwise import euclidean_distances -from sklearn.model_selection import ( - GridSearchCV, - RandomizedSearchCV, - cross_validate, - train_test_split, -) -from sklearn.neighbors import KNeighborsClassifier -from sklearn.pipeline import Pipeline, make_pipeline -from sklearn.preprocessing import OneHotEncoder, StandardScaler - -alt.data_transformers.disable_max_rows() -# alt.renderers.enable("mimetype") +# import sklearn +# from 
sklearn.compose import make_column_transformer +# from sklearn.metrics import confusion_matrix, plot_confusion_matrix +# from sklearn.metrics.pairwise import euclidean_distances +# from sklearn.model_selection import ( +# GridSearchCV, +# RandomizedSearchCV, +# cross_validate, +# train_test_split, +# ) +# from sklearn.neighbors import KNeighborsClassifier +# from sklearn.pipeline import Pipeline, make_pipeline +# from sklearn.preprocessing import OneHotEncoder, StandardScaler +# +# alt.data_transformers.disable_max_rows() +# alt.renderers.enable("mimetype") from myst_nb import glue -pd.options.display.max_colwidth = 100 +#pd.options.display.max_colwidth = 100 ``` ## Overview @@ -196,61 +196,62 @@ value, you get the same result! Let's use an example to investigate how randomness works in Python. Say we have a series object containing the integers from 0 to 9. We want to randomly pick 10 numbers from that list, but we want it to be reproducible. -We construct a `RandomState` object from the `numpy` package (with the short name `np`), -and pass it any integer as the `seed` argument. Below we use the seed number `1`. At -that point, the `RandomState` object can be used to keep track of the randomness that Python -uses by passing it to functions that need randomness. For example, we can call the `sample` method -on the series of numbers, passing it the `RandomState` object that we created -as the `random_state` argument, and `n = 10` to indicate that we want 10 samples. +Before randomly picking the 10 numbers, +we call the `seed` function from the `numpy` package, and pass it any integer as the argument. +Below we use the seed number `1`. At +that point, Python will keep track of the randomness that occurs throughout the code. +For example, we can call the `sample` method +on the series of numbers, passing the argument `n = 10` to indicate that we want 10 samples. 
```{code-cell} ipython3 import numpy as np -rnd = np.random.RandomState(seed = 1) +np.random.seed(1) + nums_0_to_9 = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) -random_numbers1 = nums_0_to_9.sample(n = 10, random_state = rnd).values +random_numbers1 = nums_0_to_9.sample(n = 10).values random_numbers1 ``` You can see that `random_numbers1` is a list of 10 numbers from 0 to 9 that, from all appearances, looks random. If -we run the `sample` method again, passing it the `rnd` object as the `random_state`, +we run the `sample` method again, we will get a fresh batch of 10 numbers that also look random. ```{code-cell} ipython3 -random_numbers2 = nums_0_to_9.sample(n = 10, random_state = rnd).values +random_numbers2 = nums_0_to_9.sample(n = 10).values random_numbers2 ``` If we want to force Python to produce the same sequences of random numbers, -we can simply re-initialize the random state with the seed value `1` and then -call the `sample` method again. +we can simply call the `np.random.seed` function with the seed value `1`---the same +as before---and then call the `sample` method again. ```{code-cell} ipython3 -rnd = np.random.RandomState(seed = 1) -random_numbers1_again = nums_0_to_9.sample(n = 10, random_state = rnd).values +np.random.seed(1) +random_numbers1_again = nums_0_to_9.sample(n = 10).values random_numbers1_again ``` ```{code-cell} ipython3 -random_numbers2_again = nums_0_to_9.sample(n = 10, random_state = rnd).values +random_numbers2_again = nums_0_to_9.sample(n = 10).values random_numbers2_again ``` -Notice that after re-initializing the `RandomState` object, we get the same +Notice that after calling `np.random.seed`, we get the same two sequences of numbers in the same order. `random_numbers1` and `random_numbers1_again` produce the same sequence of numbers, and the same can be said about `random_numbers2` and `random_numbers2_again`. And if we choose a different value for the seed---say, 4235---we obtain a different sequence of random numbers. 
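```{code-cell} ipython3 -rnd = np.random.RandomState(seed = 1) -random_numbers1_again = nums_0_to_9.sample(n = 10, random_state = rnd).values +np.random.seed(1) +random_numbers1_again = nums_0_to_9.sample(n = 10).values random_numbers1_again ``` ```{code-cell} ipython3 -random_numbers2_again = nums_0_to_9.sample(n = 10, random_state = rnd).values +random_numbers2_again = nums_0_to_9.sample(n = 10).values random_numbers2_again ``` -Notice that after re-initializing the `RandomState` object, we get the same +Notice that after calling `np.random.seed`, we get the same two sequences of numbers in the same order. `random_numbers1` and `random_numbers1_again` produce the same sequence of numbers, and the same can be said about `random_numbers2` and `random_numbers2_again`. And if we choose a different value for the seed---say, 4235---we obtain a different sequence of random numbers. ```{code-cell} ipython3 -rnd = np.random.RandomState(seed = 4235) -random_numbers = nums_0_to_9.sample(n = 10, random_state = rnd).values +np.random.seed(4235) +random_numbers = nums_0_to_9.sample(n = 10).values random_numbers ``` ```{code-cell} ipython3 -random_numbers = nums_0_to_9.sample(n = 10, random_state = rnd).values +random_numbers = nums_0_to_9.sample(n = 10).values random_numbers ``` @@ -261,23 +262,57 @@ So what does this mean for data analysis? Well, `sample` is certainly not the only data frame method that uses randomness in Python. Many of the functions that we use in `scikit-learn`, `pandas`, and beyond use randomness—many of them without even telling you about it. Also note that when Python starts -up, it creates its own seed to use. So if you do not explicitly call the -`np.random.seed` function in your code or specify the `random_state` argument -in `scikit-learn` functions (where it is available), your results will likely -not be reproducible. And finally, be careful to set the seed *only once* at +up, it creates its own seed to use. So if you do not explicitly +call the `np.random.seed` function, your results +will likely not be reproducible. Finally, be careful to set the seed *only once* at the beginning of a data analysis. Each time you set the seed, you are inserting -your own human input, thereby influencing the analysis. If you use -`np.random.choice` many times throughout your analysis, the randomness that -Python uses will not look as random as it should. +your own human input, thereby influencing the analysis. For example, if you use +the `sample` method many times throughout your analysis but set the seed each time, the +randomness that Python uses will not look as random as it should. 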
In summary: if you want your analysis to be reproducible, i.e., produce *the same result* -each time you run it, make sure to create a `RandomState` object exactly once -at the beginning of the analysis, and pass that object to each function that -uses randomness via the `random_state` argument. Different `seed` argument values -will lead to different patterns of randomness, but as long as you pick the same -argument value your result will be the same. In the remainder of the textbook, +each time you run it, make sure to use `np.random.seed` exactly once +at the beginning of the analysis. Different argument values +in `np.random.seed` will lead to different patterns of randomness, but as long as you pick the same +value your analysis results will be the same. In the remainder of the textbook, we will set the seed once at the beginning of each chapter. +````{note} +When you use `np.random.seed`, you are really setting the seed for the `numpy` +package's *default random number generator*. Using the global default random +number generator is easier than other methods, but has some potential drawbacks. For example, +other code that you may not notice (e.g., code buried inside some +other package) could potentially *also* call `np.random.seed`, thus modifying +your analysis in an undesirable way. Furthermore, not *all* functions use +`numpy`'s random number generator; some may use another one entirely. +In that case, setting `np.random.seed` may not actually make your whole analysis +reproducible. + +In this book, we will generally only use packages that play nicely with `numpy`'s +default random number generator, so we will stick with `np.random.seed`. 
+You can achieve more careful control over randomness in your analysis +by creating a `numpy` [`RandomState` object](https://numpy.org/doc/1.16/reference/generated/numpy.random.RandomState.html) once at the beginning of your analysis, and passing it to +the `random_state` argument that is available in many `pandas` and `scikit-learn` +functions. For example, we can reproduce our earlier example by using a `RandomState` +object with the `seed` value set to 1; we get the same lists of numbers once again. +```{code} +rnd = np.random.RandomState(seed = 1) +random_numbers1_third = nums_0_to_9.sample(n = 10, random_state = rnd).values +random_numbers1_third +``` +```{code} +array([2, 9, 6, 4, 0, 3, 1, 7, 8, 5]) +``` +```{code} +random_numbers2_third = nums_0_to_9.sample(n = 10, random_state = rnd).values +random_numbers2_third +``` +```{code} +array([9, 5, 3, 0, 8, 4, 2, 1, 6, 7]) +``` + +```` + ## Evaluating accuracy with `scikit-learn` ```{index} scikit-learn, visualization; scatter @@ -2140,6 +2175,7 @@ and guidance that the worksheets provide will function as intended. variables. Note that while this book is still a very accessible introductory text, it requires a bit more mathematical background than we require. + ```{code-cell} ipython3 :tags: [remove-cell] From acda50dac88c39700a6aa8ccc1480cacca88ec96 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sun, 8 Jan 2023 11:40:12 -0800 Subject: [PATCH 22/45] polished ch6 up to tuning --- source/classification2.md | 286 +++++++++++++------------------------- 1 file changed, 95 insertions(+), 191 deletions(-) diff --git a/source/classification2.md b/source/classification2.md index e51f9b59..e906a927 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -64,7 +64,7 @@ By the end of the chapter, readers will be able to do the following: - Describe what training, validation, and test data sets are and how they are used in classification. - Split data into training, validation, and test data sets. 
- Describe what a random seed is and its importance in reproducible data analysis. -- Set the random seed in Python using either the `numpy.random.seed` function or `random_state` argument in `scikit-learn` functions. +- Set the random seed in Python using the `numpy.random.seed` function. - Evaluate classification accuracy in Python using a validation data set and appropriate metrics. - Execute cross-validation in Python to choose the number of neighbors in a $K$-nearest neighbors classifier. - Describe the advantages and disadvantages of the $K$-nearest neighbors classification algorithm. @@ -291,9 +291,11 @@ reproducible. In this book, we will generally only use packages that play nicely with `numpy`'s default random number generator, so we will stick with `np.random.seed`. You can achieve more careful control over randomness in your analysis -by creating a `numpy` [`RandomState` object](https://numpy.org/doc/1.16/reference/generated/numpy.random.RandomState.html) once at the beginning of your analysis, and passing it to +by creating a `numpy` [`RandomState` object](https://numpy.org/doc/1.16/reference/generated/numpy.random.RandomState.html) +once at the beginning of your analysis, and passing it to the `random_state` argument that is available in many `pandas` and `scikit-learn` -functions. For example, we can reproduce our earlier example by using a `RandomState` +functions. Those functions will then use your `RandomState` to generate random numbers instead of +`numpy`'s default generator. For example, we can reproduce our earlier example by using a `RandomState` object with the `seed` value set to 1; we get the same lists of numbers once again. ```{code} rnd = np.random.RandomState(seed = 1) @@ -327,8 +329,8 @@ We begin the analysis by loading the packages we require, reading in the breast cancer data, and then making a quick scatter plot visualization of tumor cell concavity versus smoothness colored by diagnosis in {numref}`fig:06-precode`. 
-You will also notice that we set the random seed using either the `np.random.seed` function -or `random_state` argument, as described in Section {ref}`randomseeds`. +You will also notice that we set the random seed using the `np.random.seed` function, +as described in the {ref}`randomseeds` section. ```{code-cell} ipython3 # load packages @@ -340,15 +342,16 @@ np.random.seed(1) # load data cancer = pd.read_csv("data/unscaled_wdbc.csv") -## re-label Class 'M' as 'Malignant', and Class 'B' as 'Benign' -cancer["Class"] = cancer["Class"].apply( - lambda x: "Malignant" if (x == "M") else "Benign" -) +# re-label Class 'M' as 'Malignant', and Class 'B' as 'Benign', +# and change the Class variable to have a category type +cancer['Class'] = cancer['Class'].replace({ + 'M' : 'Malignant', + 'B' : 'Benign' + }) +cancer['Class'] = cancer['Class'].astype('category') # create scatter plot of tumor cell concavity versus smoothness, # labeling the points be diagnosis class -## create a list of colors that will be used to customize the color of points -colors = ["#86bfef", "#efb13f"] perim_concav = ( alt.Chart(cancer) @@ -356,7 +359,7 @@ perim_concav = ( .encode( x="Smoothness", y="Concavity", - color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"), + color=alt.Color("Class", title="Diagnosis"), ) ) @@ -389,7 +392,9 @@ and 25% for testing. The `train_test_split` function from `scikit-learn` handles the procedure of splitting the data for us. We can specify two very important parameters when using `train_test_split` to ensure -that the accuracy estimates from the test data are reasonable. First, `shuffle=True` (default) means the data will be shuffled before splitting, which ensures that any ordering present +that the accuracy estimates from the test data are reasonable. 
First, +setting `shuffle=True` (which is the default) means the data will be shuffled before splitting, +which ensures that any ordering present in the data does not influence the data that ends up in the training and testing sets. Second, by specifying the `stratify` parameter to be the target column of the training set, it **stratifies** the data by the class label, to ensure that roughly @@ -401,7 +406,8 @@ so specifying `stratify` as the class column ensures that roughly 63% of the tra and the same proportions exist in the testing data. Let's use the `train_test_split` function to create the training and testing sets. -We will specify that `train_size=0.75` so that 75% of our original data set ends up +We first need to import the function from the `sklearn` package. Then +we will specify that `train_size=0.75` so that 75% of our original data set ends up in the training set. We will also set the `stratify` argument to the categorical label variable (here, `cancer['Class']`) to ensure that the training and testing subsets contain the right proportions of each category of observation. @@ -409,35 +415,10 @@ Note that the `train_test_split` function uses randomness, so we shall set `rand the split reproducible. ```{code-cell} ipython3 -:tags: [remove-cell] +from sklearn.model_selection import train_test_split -# The `initial_split` function \index{tidymodels!initial\_split} from `tidymodels` handles the procedure of splitting -# the data for us. It also applies two very important steps when splitting to ensure -# that the accuracy estimates from the test data are reasonable. First, it -# **shuffles** the \index{shuffling} data before splitting, which ensures that any ordering present -# in the data does not influence the data that ends up in the training and testing sets. -# Second, it **stratifies** the \index{stratification} data by the class label, to ensure that roughly -# the same proportion of each class ends up in both the training and testing sets. 
For example, -# in our data set, roughly 63% of the -# observations are from the benign class (`B`), and 37% are from the malignant class (`M`), -# so `initial_split` ensures that roughly 63% of the training data are benign, -# 37% of the training data are malignant, -# and the same proportions exist in the testing data. - -# Let's use the `initial_split` function to create the training and testing sets. -# We will specify that `prop = 0.75` so that 75% of our original data set ends up -# in the training set. We will also set the `strata` argument to the categorical label variable -# (here, `Class`) to ensure that the training and testing subsets contain the -# right proportions of each category of observation. -# The `training` and `testing` functions then extract the training and testing -# data sets into two separate data frames. -# Note that the `initial_split` function uses randomness, but since we set the -# seed earlier in the chapter, the split will be reproducible. -``` - -```{code-cell} ipython3 cancer_train, cancer_test = train_test_split( - cancer, train_size=0.75, stratify=cancer["Class"], random_state=1 + cancer, train_size=0.75, stratify=cancer["Class"] ) cancer_train.info() ``` @@ -458,40 +439,28 @@ glue("cancer_test_nrow", len(cancer_test)) We can see from `.info()` in the code above that the training set contains {glue:}`cancer_train_nrow` observations, while the test set contains {glue:}`cancer_test_nrow` observations. This corresponds to -a train / test split of 75% / 25%, as desired. Recall from Chapter {ref}`classification` -that we use the `.info()` method to view data with a large number of columns, -as it prints the data such that the columns go down the page (instead of across). - -```{code-cell} ipython3 -:tags: [remove-cell] - -# We can see from `glimpse` in \index{glimpse} the code above that the training set contains `r nrow(cancer_train)` -# observations, while the test set contains `r nrow(cancer_test)` observations. 
This corresponds to -# a train / test split of 75% / 25%, as desired. Recall from Chapter \@ref(classification) -# that we use the `glimpse` function to view data with a large number of columns, -# as it prints the data such that the columns go down the page (instead of across). -``` +a train / test split of 75% / 25%, as desired. Recall from the {ref}`classification` chapter +that we use the `.info()` method to preview the number of rows, the variable names, their data types, and +missing entries of a data frame. ```{index} groupby, count ``` -We can use `.groupby()` and `.count()` to find the percentage of malignant and benign classes -in `cancer_train` and we see about {glue:}`cancer_train_b_prop`% of the training +We can use the `value_counts` method with the `normalize` argument set to `True` +to find the percentage of malignant and benign classes +in `cancer_train`. We see about {glue:}`cancer_train_b_prop`% of the training data are benign and {glue:}`cancer_train_m_prop`% are malignant, indicating that our class proportions were roughly preserved when we split the data. ```{code-cell} ipython3 -cancer_proportions = pd.DataFrame() -cancer_proportions['n'] = cancer_train.groupby('Class')['ID'].count() -cancer_proportions['percent'] = 100 * cancer_proportions['n'] / len(cancer_train) -cancer_proportions +cancer_train["Class"].value_counts(normalize = True) ``` ```{code-cell} ipython3 :tags: [remove-cell] -glue("cancer_train_b_prop", round(cancer_proportions.iloc[0, 1])) -glue("cancer_train_m_prop", round(cancer_proportions.iloc[1, 1])) +glue("cancer_train_b_prop", round(cancer_train["Class"].value_counts(normalize = True)["Benign"]*100)) +glue("cancer_train_m_prop", round(cancer_train["Class"].value_counts(normalize = True)["Malignant"]*100)) ``` ### Preprocess the data @@ -509,17 +478,15 @@ training and test data sets. 
```{index} pipeline, pipeline; make_column_transformer, pipeline; StandardScaler ``` -Fortunately, the `Pipeline` framework (together with column transformer) from `scikit-learn` helps us handle this properly. Below we construct and prepare the preprocessor using `make_column_transformer`. Later after we construct a full `Pipeline`, we will only fit it with the training data. +Fortunately, `scikit-learn` helps us handle this properly as long as we wrap our +analysis steps in a `Pipeline`, as in the {ref}`classification1` chapter. +So below we construct and prepare +the preprocessor using `make_column_transformer` just as before. ```{code-cell} ipython3 -:tags: [remove-cell] - -# Fortunately, the `recipe` framework from `tidymodels` helps us handle \index{recipe}\index{recipe!step\_scale}\index{recipe!step\_center} -# this properly. Below we construct and prepare the recipe using only the training -# data (due to `data = cancer_train` in the first line). -``` +from sklearn.preprocessing import StandardScaler +from sklearn.compose import make_column_transformer -```{code-cell} ipython3 cancer_preprocessor = make_column_transformer( (StandardScaler(), ["Smoothness", "Concavity"]), ) @@ -530,34 +497,23 @@ cancer_preprocessor = make_column_transformer( Now that we have split our original data set into training and test sets, we can create our $K$-nearest neighbors classifier with only the training set using the technique we learned in the previous chapter. For now, we will just choose -the number $K$ of neighbors to be 3. To fit the model with only concavity and smoothness as the -predictors, we need to explicitly create `X` (predictors) and `y` (target) based on `cancer_train`. -As before we need to create a model specification, combine -the model specification and preprocessor into a workflow, and then finally -use `fit` with `X` and `y` to build the classifier. 
+the number $K$ of neighbors to be 3, and use only the concavity and smoothness predictors by +selecting them from the `cancer_train` data frame. +We will first import the `KNeighborsClassifier` model and `make_pipeline` from `sklearn`. +Then as before we will create a model object, combine +the model object and preprocessor into a `Pipeline` using the `make_pipeline` function, and then finally +use the `fit` method to build the classifier. ```{code-cell} ipython3 -:tags: [remove-cell] - -# Now that we have split our original data set into training and test sets, we -# can create our $K$-nearest neighbors classifier with only the training set using -# the technique we learned in the previous chapter. For now, we will just choose -# the number $K$ of neighbors to be 3, and use concavity and smoothness as the -# predictors. As before we need to create a model specification, combine -# the model specification and recipe into a workflow, and then finally -# use `fit` with the training data `cancer_train` to build the classifier. -``` - -```{code-cell} ipython3 -# hidden seed -# np.random.seed(1) +from sklearn.neighbors import KNeighborsClassifier +from sklearn.pipeline import make_pipeline -knn_spec = KNeighborsClassifier(n_neighbors=3) ## weights="uniform" +knn = KNeighborsClassifier(n_neighbors=3) X = cancer_train.loc[:, ["Smoothness", "Concavity"]] y = cancer_train["Class"] -knn_fit = make_pipeline(cancer_preprocessor, knn_spec).fit(X, y) +knn_fit = make_pipeline(cancer_preprocessor, knn).fit(X, y) knn_fit ``` @@ -568,53 +524,19 @@ knn_fit ``` Now that we have a $K$-nearest neighbors classifier object, we can use it to -predict the class labels for our test set. We use the `pandas.concat()` to add the -column of predictions to the original test data, creating the +predict the class labels for our test set. We will use the `assign` method to +augment the original test data with a column of predictions, creating the `cancer_test_predictions` data frame. 
The `Class` variable contains the true diagnoses, while the `predicted` contains the predicted diagnoses from the -classifier. - -```{code-cell} ipython3 -:tags: [remove-cell] - -# Now that we have a $K$-nearest neighbors classifier object, we can use it to -# predict the class labels for our test set. We use the `bind_cols` \index{bind\_cols} to add the -# column of predictions to the original test data, creating the -# `cancer_test_predictions` data frame. The `Class` variable contains the true -# diagnoses, while the `.pred_class` contains the predicted diagnoses from the -# classifier. -``` - -```{code-cell} ipython3 -cancer_test_predictions = knn_fit.predict( - cancer_test.loc[:, ["Smoothness", "Concavity"]] -) - -cancer_test_predictions = pd.concat( - [ - pd.DataFrame(cancer_test_predictions, columns=["predicted"]), - cancer_test.reset_index(drop=True), - ], - axis=1, -) # add the predictions column to the original test data - -cancer_test_predictions -``` +classifier. Note that below we print out just the `ID`, `Class`, and `predicted` +variables in the output data frame. ```{code-cell} ipython3 -:tags: [remove-cell] - -## alternative way to add a column - -# # add the predictions column to the original test data -# cancer_test_predictions = cancer_test.reset_index(drop=True).assign( -# predicted=cancer_test_predictions -# ) - -# # move the `predicted` column to the first column for easy visualization -# col_order = cancer_test_predictions.columns.tolist() -# col_order = col_order[-1:] + col_order[:-1] -# cancer_test_predictions[col_order] +cancer_test_predictions = cancer_test.assign( + predicted = knn_fit.predict( + cancer_test.loc[:, ["Smoothness", "Concavity"]] + )) +cancer_test_predictions[['ID', 'Class', 'predicted']] ``` ### Compute the accuracy @@ -622,26 +544,30 @@ cancer_test_predictions ```{index} scikit-learn; score ``` -Finally, we can assess our classifier's accuracy. 
To do this we use the `score` method -from `scikit-learn` to get the statistics about the quality of our model, specifying -the `X` and `y` arguments based on `cancer_test`. - +Finally, we can assess our classifier's accuracy. We could compute the accuracy manually +by using our earlier formula: the number of correct predictions divided by the total +number of predictions. First we filter the rows to find the number of correct predictions, +and then divide the number of rows with correct predictions by the total number of rows +using the `len` function. ```{code-cell} ipython3 -:tags: [remove-cell] +correct_preds = cancer_test_predictions[ + cancer_test_predictions['Class'] == cancer_test_predictions['predicted'] + ] -# Finally, we can assess our classifier's accuracy. To do this we use the `metrics` function \index{tidymodels!metrics} -# from `tidymodels` to get the statistics about the quality of our model, specifying -# the `truth` and `estimate` arguments: +len(correct_preds) / len(cancer_test_predictions) ``` -```{code-cell} ipython3 -# np.random.seed(1) - -X_test = cancer_test.loc[:, ["Smoothness", "Concavity"]] -y_test = cancer_test["Class"] - -cancer_acc_1 = knn_fit.score(X_test, y_test) +The `scikit-learn` package also provides a more convenient way to do this using +the `score` method. To use the `score` method, we need to specify two arguments: +predictors and true labels. We pass the same test data +for the predictors that we originally passed into `predict` when making predictions, +and we provide the true labels via the `cancer_test["Class"]` series. +```{code-cell} ipython3 +cancer_acc_1 = knn_fit.score( + cancer_test.loc[:, ["Smoothness", "Concavity"]], + cancer_test["Class"] + ) cancer_acc_1 ``` @@ -651,56 +577,33 @@ cancer_acc_1 glue("cancer_acc_1", round(100*cancer_acc_1)) ``` -```{code-cell} ipython3 -:tags: [remove-cell] - -# In the metrics data frame, we filtered the `.metric` column since we are -# interested in the `accuracy` row. 
Other entries involve more advanced metrics that -# are beyond the scope of this book. Looking at the value of the `.estimate` variable -# shows that the estimated accuracy of the classifier on the test data -# was `r round(100*cancer_acc_1$.estimate, 0)`%. -``` ++++ The output shows that the estimated accuracy of the classifier on the test data was {glue:}`cancer_acc_1`%. - -+++ - -We can also look at the *confusion matrix* for the classifier as a `numpy` array using the `confusion_matrix` function: - -```{code-cell} ipython3 -# np.random.seed(1) - -confusion = confusion_matrix( - cancer_test_predictions["Class"], - cancer_test_predictions["predicted"], - labels=knn_fit.classes_, -) - -confusion -``` - -It is hard for us to interpret the confusion matrix as shown above. We could use the `ConfusionMatrixDisplay` function of the `scikit-learn` package to plot the confusion matrix. +We can also look at the *confusion matrix* for the classifier +using the `crosstab` function from `pandas`. The `crosstab` function +takes two arguments: the true labels first, then the predicted labels second. 
```{code-cell} ipython3 -from sklearn.metrics import ConfusionMatrixDisplay - -confusion_display = ConfusionMatrixDisplay( - confusion_matrix=confusion, display_labels=knn_fit.classes_ -) -confusion_display.plot(); +pd.crosstab(cancer_test_predictions["Class"], + cancer_test_predictions["predicted"] + ) ``` ```{code-cell} ipython3 :tags: [remove-cell] +_ctab = pd.crosstab(cancer_test_predictions["Class"], + cancer_test_predictions["predicted"] + ) -glue("confu11", confusion[1, 1]) -glue("confu00", confusion[0, 0]) -glue("confu10", confusion[1, 0]) -glue("confu01", confusion[0, 1]) -glue("confu11_00", confusion[1, 1] + confusion[0, 0]) -glue("confu10_11", confusion[1, 0] + confusion[1, 1]) -glue("confu_fal_neg", round(100 * confusion[1, 0] / (confusion[1, 0] + confusion[1, 1]))) +glue("confu11", _ctab["Malignant"]["Malignant"]) +glue("confu00", _ctab["Benign"]["Benign"]) +glue("confu10", _ctab["Benign"]["Malignant"]) +glue("confu01", _ctab["Malignant"]["Benign"]) +glue("confu11_00", _ctab["Malignant"]["Malignant"] + _ctab["Benign"]["Benign"]) +glue("confu10_11", _ctab["Benign"]["Malignant"] + _ctab["Malignant"]["Malignant"]) +glue("confu_fal_neg", round(100 * _ctab["Benign"]["Malignant"] / (_ctab["Benign"]["Malignant"] + _ctab["Malignant"]["Malignant"]))) ``` The confusion matrix shows {glue:}`confu11` observations were correctly predicted @@ -754,7 +657,7 @@ As an example, in the breast cancer data, recall the proportions of benign and m observations in the training data are as follows: ```{code-cell} ipython3 -cancer_proportions +cancer_train["Class"].value_counts(normalize = True) ``` Since the benign class represents the majority of the training data, @@ -770,7 +673,8 @@ the $K$-nearest neighbors classifier improved quite a bit on the basic majority classifier. Hooray! But we still need to be cautious; in this application, it is likely very important not to misdiagnose any malignant tumors to avoid missing patients who actually need medical care. 
The confusion matrix above shows -that the classifier does, indeed, misdiagnose a significant number of malignant tumors as benign ({glue:}`confu10` out of {glue:}`confu10_11` malignant tumors, or {glue:}`confu_fal_neg`%!). +that the classifier does, indeed, misdiagnose a significant number of +malignant tumors as benign ({glue:}`confu10` out of {glue:}`confu10_11` malignant tumors, or {glue:}`confu_fal_neg`%!). Therefore, even though the accuracy improved upon the majority classifier, our critical analysis suggests that this classifier may not have appropriate performance for the application. From c135649ead03df4dce33260849a8a247f2d07ad0 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sun, 8 Jan 2023 13:08:50 -0800 Subject: [PATCH 23/45] initial cross val example done --- source/classification2.md | 93 +++++++++++++++------------------------ 1 file changed, 35 insertions(+), 58 deletions(-) diff --git a/source/classification2.md b/source/classification2.md index e906a927..78d0cf44 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -582,7 +582,9 @@ glue("cancer_acc_1", round(100*cancer_acc_1)) The output shows that the estimated accuracy of the classifier on the test data was {glue:}`cancer_acc_1`%. We can also look at the *confusion matrix* for the classifier -using the `crosstab` function from `pandas`. The `crosstab` function +using the `crosstab` function from `pandas`. A confusion matrix shows how many +observations of each (true) label were classified as each (predicted) label. +The `crosstab` function takes two arguments: the true labels first, then the predicted labels second. ```{code-cell} ipython3 @@ -744,92 +746,67 @@ models, and evaluate their accuracy. We will start with just a single split. 
```{code-cell} ipython3 -# create the 25/75 split of the training data into training and validation +# create the 25/75 split of the *training data* into sub-training and validation cancer_subtrain, cancer_validation = train_test_split( - cancer_train, test_size=0.25, random_state=1 + cancer_train, test_size=0.25 ) -# could reuse the standardization preprocessor from before -# (but now we want to fit with the cancer_subtrain) +# fit the model on the sub-training data +knn = KNeighborsClassifier(n_neighbors=3) X = cancer_subtrain.loc[:, ["Smoothness", "Concavity"]] y = cancer_subtrain["Class"] -knn_fit = make_pipeline(cancer_preprocessor, knn_spec).fit(X, y) - -# get predictions on the validation data -validation_predicted = knn_fit.predict( - cancer_validation.loc[:, ["Smoothness", "Concavity"]] -) -validation_predicted = pd.concat( - [ - pd.DataFrame(validation_predicted, columns=["predicted"]), - cancer_validation.reset_index(drop=True), - ], - axis=1, -) # to add the predictions column to the original test data - -# compute the accuracy -X_valid = cancer_validation.loc[:, ["Smoothness", "Concavity"]] -y_valid = cancer_validation["Class"] -acc = knn_fit.score(X_valid, y_valid) +knn_fit = make_pipeline(cancer_preprocessor, knn).fit(X, y) +# compute the score on validation data +acc = knn_fit.score( + cancer_validation.loc[:, ["Smoothness", "Concavity"]], + cancer_validation["Class"] + ) acc ``` ```{code-cell} ipython3 :tags: [remove-cell] -glue(f"acc_seed1", round(100 * acc, 1)) -``` - -```{code-cell} ipython3 -:tags: [remove-cell] - -accuracies = [] -for i in range(1, 6): +accuracies = [acc] +for i in range(1, 5): # create the 25/75 split of the training data into training and validation cancer_subtrain, cancer_validation = train_test_split( - cancer_train, test_size=0.25, random_state=i + cancer_train, test_size=0.25 ) - # could reuse the standardization preprocessor from before - # (but now we want to fit with the cancer_subtrain) + # fit the model on the 
sub-training data + knn = KNeighborsClassifier(n_neighbors=3) X = cancer_subtrain.loc[:, ["Smoothness", "Concavity"]] y = cancer_subtrain["Class"] - knn_fit = make_pipeline(cancer_preprocessor, knn_spec).fit(X, y) + knn_fit = make_pipeline(cancer_preprocessor, knn).fit(X, y) - # get predictions on the validation data - validation_predicted = knn_fit.predict( - cancer_validation.loc[:, ["Smoothness", "Concavity"]] - ) - validation_predicted = pd.concat( - [ - pd.DataFrame(validation_predicted, columns=["predicted"]), - cancer_validation.reset_index(drop=True), - ], - axis=1, - ) # to add the predictions column to the original test data - - # compute the accuracy - X_valid = cancer_validation.loc[:, ["Smoothness", "Concavity"]] - y_valid = cancer_validation["Class"] - acc_ = knn_fit.score(X_valid, y_valid) - accuracies.append(acc_) -accuracies + # compute the score on validation data + accuracies.append(knn_fit.score( + cancer_validation.loc[:, ["Smoothness", "Concavity"]], + cancer_validation["Class"] + )) +avg_accuracy = np.round(np.array(accuracies).mean()*100,1) +accuracies = list(np.round(np.array(accuracies)*100, 1)) ``` +```{code-cell} ipython3 +:tags: [remove-cell] +glue(f"acc_seed1", np.round(100 * acc,1)) +glue("avg_5_splits", avg_accuracy) +glue("accuracies", accuracies) +``` ```{code-cell} ipython3 :tags: [remove-cell] -for i in range(1, 6): - glue(f"acc_split{i}", round(100 * accuracies[i-1], 1)) -glue("avg_5_splits", round(100 * sum(accuracies) / len(accuracies))) ``` + + The accuracy estimate using this split is {glue:}`acc_seed1`%. Now we repeat the above code 4 more times, which generates 4 more splits. Therefore we get five different shuffles of the data, and therefore five different values for -accuracy: {glue:}`acc_split1`%, {glue:}`acc_split2`%, {glue:}`acc_split3`%, -{glue:}`acc_split4`%, {glue:}`acc_split5`%. None of these values are +accuracy: {glue:}`accuracies` (each a percentage). 
None of these values are necessarily "more correct" than any other; they're just five estimates of the true, underlying accuracy of our classifier built using our overall training data. We can combine the estimates by taking their From 0e3b7330a6343bfe1225a6d1ca39069651a57406 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sun, 8 Jan 2023 13:14:39 -0800 Subject: [PATCH 24/45] in python -> in scikit --- source/classification2.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/classification2.md b/source/classification2.md index 78d0cf44..a4b86cb1 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -1771,7 +1771,7 @@ training over 1000 candidate models with $m=10$ predictors, forward selection re +++ -### Forward selection in Python +### Forward selection in `scikit-learn` We now turn to implementing forward selection in Python. The function [`SequentialFeatureSelector`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html) From 170e267088173cd37e162f4e2226a4c3618b7cc5 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sun, 8 Jan 2023 20:10:15 -0800 Subject: [PATCH 25/45] working on cross-val --- source/classification2.md | 170 ++++++++++++++++---------------------- 1 file changed, 69 insertions(+), 101 deletions(-) diff --git a/source/classification2.md b/source/classification2.md index a4b86cb1..983bda7c 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -414,6 +414,13 @@ right proportions of each category of observation. Note that the `train_test_split` function uses randomness, so we shall set `random_state` to make the split reproducible. 
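To see what reproducibility means here, the following is a minimal sketch on a made-up data frame (the `toy` data and all variable names are illustrative assumptions, not part of the cancer analysis). It shows two ways to obtain the same split on repeated runs: passing `random_state` directly to `train_test_split`, or seeding NumPy's global random number generator before each call.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical toy data: 12 benign ("B") and 8 malignant ("M") observations
toy = pd.DataFrame({"x": range(20), "Class": ["B"] * 12 + ["M"] * 8})

# option 1: fix the split with random_state
train_a, test_a = train_test_split(
    toy, test_size=0.25, stratify=toy["Class"], random_state=1
)
train_b, test_b = train_test_split(
    toy, test_size=0.25, stratify=toy["Class"], random_state=1
)
same_with_random_state = train_a.index.equals(train_b.index)

# option 2: seed the global NumPy generator before each call
np.random.seed(5)
train_c, test_c = train_test_split(toy, test_size=0.25, stratify=toy["Class"])
np.random.seed(5)
train_d, test_d = train_test_split(toy, test_size=0.25, stratify=toy["Class"])
same_with_global_seed = train_c.index.equals(train_d.index)

print(same_with_random_state, same_with_global_seed)
```

Either approach makes the split repeatable. The global seed also affects any later code that draws random numbers, so it is worth setting it only once, near the top of an analysis.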
+```{code-cell} ipython3 +# seed hacking to get a split that makes 10-fold have a lower std error than 5-fold +np.random.seed(5) +``` + + + ```{code-cell} ipython3 from sklearn.model_selection import train_test_split @@ -837,88 +844,54 @@ resulting in 5 different choices for the **validation set**; we call this 5-fold cross-validation. ``` -+++ - -To perform 5-fold cross-validation in Python with `scikit-learn`, we use another -function: `cross_validate`. This function splits our training data into `cv` folds -automatically. -According to its [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html), the parameter `cv`: - -> For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, [`StratifiedKFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold) is used. - -This means `cross_validate` will ensure that the training and validation subsets contain the -right proportions of each category of observation. +++ ```{index} cross-validation; cross_validate, scikit-learn; cross_validate ``` -When we run the `cross_validate` function, cross-validation is carried out on each -train/validation split. We can set `return_train_score=True` to obtain the training scores as well as the validation scores. The `cross_validate` function outputs a dictionary, and we use `pd.DataFrame` to convert it to a `pandas` dataframe for better visualization. (Noteworthy, the `test_score` column is actually the validation scores that we are interested in.) +To perform 5-fold cross-validation in Python with `scikit-learn`, we use another +function: `cross_validate`. This function requires that we specify +a modelling `Pipeline` as the `estimator` argument, +the number of folds as the `cv` argument, +and the training data predictors and labels as the `X` and `y` arguments. 
+Since the `cross_validate` function outputs a dictionary, we use `pd.DataFrame` to convert it to a `pandas` +dataframe for better visualization. +Note that the `cross_validate` function handles stratifying the classes in +each train and validate fold automatically. +We begin by importing the `cross_validate` function from `sklearn`. ```{code-cell} ipython3 -:tags: [remove-cell] - -# To perform 5-fold cross-validation in R with `tidymodels`, we use another -# function: `vfold_cv`. \index{tidymodels!vfold\_cv}\index{cross-validation!vfold\_cv} This function splits our training data into `v` folds -# automatically. We set the `strata` argument to the categorical label variable -# (here, `Class`) to ensure that the training and validation subsets contain the -# right proportions of each category of observation. -``` +from sklearn.model_selection import cross_validate -```{code-cell} ipython3 -cancer_pipe = make_pipeline(cancer_preprocessor, knn_spec) -X = cancer_subtrain.loc[:, ["Smoothness", "Concavity"]] -y = cancer_subtrain["Class"] -cv_5 = cross_validate( - estimator=cancer_pipe, - X=X, - y=y, - cv=5, - return_train_score=True, +knn = KNeighborsClassifier(n_neighbors=3) +cancer_pipe = make_pipeline(cancer_preprocessor, knn) +X = cancer_train.loc[:, ["Smoothness", "Concavity"]] +y = cancer_train["Class"] +cv_5_df = pd.DataFrame( + cross_validate( + estimator=cancer_pipe, + cv=5, + X=X, + y=y + ) ) -cv_5_df = pd.DataFrame(cv_5) cv_5_df ``` -```{code-cell} ipython3 -:tags: [remove-cell] - -# Then, when we create our data analysis workflow, we use the `fit_resamples` function \index{cross-validation!fit\_resamples}\index{tidymodels!fit\_resamples} -# instead of the `fit` function for training. This runs cross-validation on each -# train/validation split. -``` - +The validation scores we are interested in are contained in the `test_score` column. We can then aggregate the *mean* and *standard error* of the classifier's validation accuracy across the folds. 
You should consider the mean (`mean`) to be the estimated accuracy, while the standard -error (`std`) is a measure of how uncertain we are in the mean value. A detailed treatment of this +error (`sem`) is a measure of how uncertain we are in that mean value. A detailed treatment of this is beyond the scope of this chapter; but roughly, if your estimated mean is {glue:}`cv_5_mean` and standard error is {glue:}`cv_5_std`, you can expect the *true* average accuracy of the classifier to be somewhere roughly between {glue:}`cv_5_lower`% and {glue:}`cv_5_upper`% (although it may fall outside this range). You may ignore the other columns in the metrics data frame. ```{code-cell} ipython3 -:tags: [remove-cell] - -# The `collect_metrics` \index{tidymodels!collect\_metrics}\index{cross-validation!collect\_metrics} function is used to aggregate the *mean* and *standard error* -# of the classifier's validation accuracy across the folds. You will find results -# related to the accuracy in the row with `accuracy` listed under the `.metric` column. -# You should consider the mean (`mean`) to be the estimated accuracy, while the standard -# error (`std_err`) is a measure of how uncertain we are in the mean value. A detailed treatment of this -# is beyond the scope of this chapter; but roughly, if your estimated mean is `r round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2)` and standard -# error is `r round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2)`, you can expect the *true* average accuracy of the -# classifier to be somewhere roughly between `r (round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2) - round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2))*100`% and `r (round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2) + round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2))*100`% (although it may -# fall outside this range). 
You may ignore the other columns in the metrics data frame, -# as they do not provide any additional insight. -# You can also ignore the entire second row with `roc_auc` in the `.metric` column, -# as it is beyond the scope of this book. -``` - -```{code-cell} ipython3 -cv_5_metrics = cv_5_df.aggregate(func=['mean', 'std']) +cv_5_metrics = cv_5_df.agg(['mean', 'sem']) cv_5_metrics ``` @@ -926,14 +899,14 @@ cv_5_metrics :tags: [remove-cell] glue("cv_5_mean", round(cv_5_metrics.loc["mean", "test_score"], 2)) -glue("cv_5_std", round(cv_5_metrics.loc["std", "test_score"], 2)) +glue("cv_5_std", round(cv_5_metrics.loc["sem", "test_score"], 2)) glue( "cv_5_upper", round( 100 * ( round(cv_5_metrics.loc["mean", "test_score"], 2) - + round(cv_5_metrics.loc["std", "test_score"], 2) + + round(cv_5_metrics.loc["sem", "test_score"], 2) ) ), ) @@ -943,7 +916,7 @@ glue( 100 * ( round(cv_5_metrics.loc["mean", "test_score"], 2) - - round(cv_5_metrics.loc["std", "test_score"], 2) + - round(cv_5_metrics.loc["sem", "test_score"], 2) ) ), ) @@ -957,52 +930,41 @@ it takes to run the analysis. So when you do cross-validation, you need to consider the size of the data, the speed of the algorithm (e.g., $K$-nearest neighbors), and the speed of your computer. In practice, this is a trial-and-error process, but typically $C$ is chosen to be either 5 or 10. Here -we will try 10-fold cross-validation to see if we get a lower standard error: +we will try 10-fold cross-validation to see if we get a lower standard error. ```{code-cell} ipython3 -cv_10 = cross_validate( - estimator=cancer_pipe, - X=X, - y=y, - cv=10, - return_train_score=True, +cv_10 = pd.DataFrame( + cross_validate( + estimator=cancer_pipe, + cv=10, + X=X, + y=y + ) ) cv_10_df = pd.DataFrame(cv_10) -cv_10_metrics = cv_10_df.aggregate(func=['mean', 'std']) +cv_10_metrics = cv_10_df.agg(['mean', 'sem']) cv_10_metrics ``` -In this case, using 10-fold instead of 5-fold cross validation did increase the standard error. 
In fact, due to the randomness in how the data are split, sometimes -you might even end up with a *lower* standard error when increasing the number of folds! -The increase in standard error can become more dramatic by increasing the number of folds +In this case, using 10-fold instead of 5-fold cross validation did +reduce the standard error very slightly. In fact, due to the randomness in how the data are split, sometimes +you might even end up with a *higher* standard error when increasing the number of folds! +We can make the reduction in standard error more dramatic by increasing the number of folds by a large amount. In the following code we show the result when $C = 50$; -picking such a large number of folds often takes a long time to run in practice, +picking such a large number of folds can take a long time to run in practice, so we usually stick to 5 or 10. ```{code-cell} ipython3 -:tags: [remove-cell] - -# In this case, using 10-fold instead of 5-fold cross validation did reduce the standard error, although -# by only an insignificant amount. In fact, due to the randomness in how the data are split, sometimes -# you might even end up with a *higher* standard error when increasing the number of folds! -# We can make the reduction in standard error more dramatic by increasing the number of folds -# by a large amount. In the following code we show the result when $C = 50$; -# picking such a large number of folds often takes a long time to run in practice, -# so we usually stick to 5 or 10. 
-``` - -```{code-cell} ipython3 -cv_50 = cross_validate( - estimator=cancer_pipe, - X=X, - y=y, - cv=50, - return_train_score=True, +cv_50_df = pd.DataFrame( + cross_validate( + estimator=cancer_pipe, + cv=50, + X=X, + y=y + ) ) - -cv_50_df = pd.DataFrame(cv_50) -cv_50_metrics = cv_50_df.aggregate(func=['mean', 'std']) +cv_50_metrics = cv_50_df.agg(['mean', 'sem']) cv_50_metrics ``` @@ -1027,7 +989,10 @@ In order to improve our classifier, we have one choice of parameter: the number neighbors, $K$. Since cross-validation helps us evaluate the accuracy of our classifier, we can use cross-validation to calculate an accuracy for each value of $K$ in a reasonable range, and then pick the value of $K$ that gives us the -best accuracy. The `scikit-learn` package collection provides 2 build-in methods for tuning parameters. Each parameter in the model can be adjusted rather than given a specific value. We can define a set of values for each hyperparameters and find the best parameters in this set. +best accuracy. The `scikit-learn` package collection provides two built-in methods +for tuning parameters. Each parameter in the model can be adjusted rather +than given a specific value. We can define a set of values for each hyperparameters +and find the best parameters in this set. - Exhaustive grid search - [`sklearn.model_selection.GridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) @@ -1201,7 +1166,7 @@ $K =$ {glue:}`best_k_unique` for the classifier. ### Under/Overfitting To build a bit more intuition, what happens if we keep increasing the number of -neighbors $K$? In fact, the accuracy actually starts to decrease! +neighbors $K$? In fact, the cross-validation accuracy estimate actually starts to decrease! Let's specify a much larger range of values of $K$ to try in the `param_grid` argument of `GridSearchCV`. 
{numref}`fig:06-lots-of-ks` shows a plot of estimated accuracy as we vary $K$ from 1 to almost the number of observations in the data set. @@ -1420,7 +1385,7 @@ The overall workflow for performing $K$-nearest neighbors classification using ` 3. Create a `Pipeline` that specifies the preprocessing steps and the classifier. 4. Use the `GridSearchCV` function (or `RandomizedSearchCV`) to estimate the classifier accuracy for a range of $K$ values. Pass the parameter grid and the pipeline defined in step 2 and step 3 as the `param_grid` argument and the `estimator` argument, respectively. 5. Call `fit` on the `GridSearchCV` instance created in step 4, passing the training data. -6. Pick a value of $K$ that yields a high accuracy estimate that doesn't change much if you change $K$ to a nearby value. +6. Pick a value of $K$ that yields a high cross-validation accuracy estimate that doesn't change much if you change $K$ to a nearby value. 7. Make a new model specification for the best parameter value (i.e., $K$), and retrain the classifier by calling the `fit` method. 8. Evaluate the estimated accuracy of the classifier on the test set using the `score` method. @@ -1435,7 +1400,7 @@ The overall workflow for performing $K$-nearest neighbors classification using ` # 3. Create a `recipe` that specifies the class label and predictors, as well as preprocessing steps for all variables. Pass the training data as the `data` argument of the recipe. # 4. Create a `nearest_neighbors` model specification, with `neighbors = tune()`. # 5. Add the recipe and model specification to a `workflow()`, and use the `tune_grid` function on the train/validation splits to estimate the classifier accuracy for a range of $K$ values. -# 6. Pick a value of $K$ that yields a high accuracy estimate that doesn't change much if you change $K$ to a nearby value. +# 6. Pick a value of $K$ that yields a high cross-validation accuracy estimate that doesn't change much if you change $K$ to a nearby value. # 7. 
Make a new model specification for the best parameter value (i.e., $K$), and retrain the classifier using the `fit` function. # 8. Evaluate the estimated accuracy of the classifier on the test set using the `predict` function. ``` @@ -1582,7 +1547,7 @@ for i in range(len(ks)): ) cv_5 = cross_validate(estimator=cancer_fixed_pipe, X=X, y=y, cv=5) - cv_5_metrics = pd.DataFrame(cv_5).aggregate(func=["mean", "std"]) + cv_5_metrics = pd.DataFrame(cv_5).agg(["mean", "sem"]) fixedaccs.append(cv_5_metrics.loc["mean", "test_score"]) ``` @@ -1859,7 +1824,10 @@ This means that {glue:}`sequentialfeatureselector_n_features` features were sele +++ -Now, let's code the actual algorithm by ourselves. The key idea of the forward selection code is to properly extract each subset of predictors for which we want to build a model, pass them to the preprocessor and fit the pipeline with them. +Now, let's code the actual algorithm by ourselves. The key idea of the forward +selection code is to properly extract each subset of predictors for which we +want to build a model, pass them to the preprocessor and fit the pipeline with +them. ```{code-cell} ipython3 :tags: [remove-cell] From 3d72b3353f682242111895d857c90000e3fe7a0b Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sun, 8 Jan 2023 23:10:03 -0800 Subject: [PATCH 26/45] polished ch6 up to predictor selection --- source/classification2.md | 260 ++++++++++++++++++-------------------- 1 file changed, 123 insertions(+), 137 deletions(-) diff --git a/source/classification2.md b/source/classification2.md index 983bda7c..9abc8635 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -989,113 +989,116 @@ In order to improve our classifier, we have one choice of parameter: the number neighbors, $K$. 
Since cross-validation helps us evaluate the accuracy of our classifier, we can use cross-validation to calculate an accuracy for each value of $K$ in a reasonable range, and then pick the value of $K$ that gives us the -best accuracy. The `scikit-learn` package collection provides two built-in methods -for tuning parameters. Each parameter in the model can be adjusted rather -than given a specific value. We can define a set of values for each hyperparameters -and find the best parameters in this set. - -- Exhaustive grid search - - [`sklearn.model_selection.GridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) - - A user specifies a set of values for each hyperparameter. - - The method considers product of the sets and then evaluates each combination one by one. - -- Randomized hyperparameter optimization - - [`sklearn.model_selection.RandomizedSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) - - Samples configurations at random until certain budget (e.g., time) is exhausted - -+++ - -Let us walk through how to use `GridSearchCV` to tune the model. `RandomizedSearchCV` follows a similar workflow, and you will get to practice both of them in the worksheet. +best accuracy. The `scikit-learn` package collection provides built-in +functionality, named `GridSearchCV`, to automatically handle the details for us. +Before we use `GridSearchCV`, we need to create a new pipeline +with a `KNeighborsClassifier` that has the number of neighbors left unspecified. ```{code-cell} ipython3 -:tags: [remove-cell] - -# In order to improve our classifier, we have one choice of parameter: the number of -# neighbors, $K$. Since cross-validation helps us evaluate the accuracy of our -# classifier, we can use cross-validation to calculate an accuracy for each value -# of $K$ in a reasonable range, and then pick the value of $K$ that gives us the -# best accuracy. 
The `tidymodels` package collection provides a very simple -# syntax for tuning models: each parameter in the model to be tuned should be specified -# as `tune()` in the model specification rather than given a particular value. +knn = KNeighborsClassifier() +cancer_tune_pipe = make_pipeline(cancer_preprocessor, knn) ``` -Before we use `GridSearchCV` (or `RandomizedSearchCV`), we should define the parameter grid by passing the set of values for each parameters that you would like to tune in a Python dictionary; below we create the `param_grid` dictionary with `kneighborsclassifier__n_neighbors` as the key and pair it with the values we would like to tune from 1 to 100 (stepping by 5) using the `range` function. We would also need to redefine the pipeline to use default values for parameters. ++++ -```{code-cell} ipython3 -param_grid = { +Next we specify the grid of parameter values that we want to try for +each tunable parameter. We do this in a Python dictionary: the key is +the identifier of the parameter to tune, and the value is a list of parameter values +to try when tuning. We can find the "identifier" of a parameter by using +the `get_params` method on the pipeline. +```{code-cell} ipython3 +cancer_tune_pipe.get_params() +``` +Wow, there's quite a bit of *stuff* there! If you sift through the muck +a little bit, you will see one parameter identifier that stands out: +`"kneighborsclassifier__n_neighbors"`. This identifier combines the name +of the K nearest neighbors classification step in our pipeline, `kneighborsclassifier`, +with the name of the parameter, `n_neighbors`. +We now construct the `parameter_grid` dictionary that will tell `GridSearchCV` +what parameter values to try. +Note that you can specify multiple tunable parameters +by creating a dictionary with multiple key-value pairs, but +here we just have to tune the number of neighbors. 
+```{code-cell} ipython3 +parameter_grid = { "kneighborsclassifier__n_neighbors": range(1, 100, 5), } -cancer_tune_pipe = make_pipeline(cancer_preprocessor, KNeighborsClassifier()) ``` +The `range` function in Python that we used above allows us to specify a sequence of values. +The first argument is the starting number (here, `1`), +the second argument is *one greater than* the largest number the sequence may contain (here, `100`), +and the third argument is the step size between consecutive numbers in the sequence (here, `5`). +So in this case we generate the sequence 1, 6, 11, 16, ..., 96. +If we instead specified `range(0, 100, 5)`, we would get the sequence 0, 5, 10, 15, ..., 90, 95. +The number 100 is not included because the second argument is *one greater than* the final possible +number in the sequence. There are two additional useful ways to employ `range`. +If we call `range` with just one argument, Python counts +up to (but not including) that number starting at 0. So `range(4)` is the same as `range(0, 4, 1)` and generates the sequence 0, 1, 2, 3. +If we call `range` with two arguments, Python counts starting at the first number up to (but not including) the second number. +So `range(1, 4)` is the same as `range(1, 4, 1)` and generates the sequence 1, 2, 3. -```{index} cross-validation; GridSearchCV, cross-validation; RandomizedSearchCV, scikit-learn; GridSearchCV, scikit-learn; RandomizedSearchCV +```{index} cross-validation; GridSearchCV, scikit-learn; GridSearchCV, scikit-learn; RandomizedSearchCV ``` -```{code-cell} ipython3 -:tags: [remove-cell] - -# Then instead of using `fit` or `fit_resamples`, we will use the `tune_grid` function \index{cross-validation!tune\_grid}\index{tidymodels!tune\_grid} -# to fit the model for each value in a range of parameter values. 
-# In particular, we first create a data frame with a `neighbors` -# variable that contains the sequence of values of $K$ to try; below we create the `k_vals` -# data frame with the `neighbors` variable containing values from 1 to 100 (stepping by 5) using -# the `seq` function. -# Then we pass that data frame to the `grid` argument of `tune_grid`. -``` - -Now, let us create the `GridSearchCV` object and the `RandomizedSearchCV` object by passing the new pipeline `cancer_tune_pipe` and the `param_grid` dictionary to the respective functions. `n_jobs=-1` means using all the available processors. +Okay! We are finally ready to create the `GridSearchCV` object. +First we import it from the `sklearn` package. +Then we pass it the `cancer_tune_pipe` pipeline in the `estimator` argument, +the `parameter_grid` in the `param_grid` argument, +and specify `cv=10` folds. Note that this does not actually run +the tuning yet; just as before, we will have to use the `fit` method. ```{code-cell} ipython3 +from sklearn.model_selection import GridSearchCV + cancer_tune_grid = GridSearchCV( estimator=cancer_tune_pipe, - param_grid=param_grid, - cv=10, - n_jobs=-1, - return_train_score=True, + param_grid=parameter_grid, + cv=10 ) ``` -Now, let us fit the model to the training data. The attribute `cv_results_` of the fitted model is a dictionary of `numpy` arrays containing all cross-validation results from different choices of parameters. We can visualize them more clearly through a dataframe. +Now we use the `fit` method on the `GridSearchCV` object to begin the tuning process. +We pass the training data predictors and labels as the two arguments to `fit` as usual. +The `cv_results_` attribute of the output contains the resulting cross-validation +accuracy estimate for each choice of `n_neighbors`, but it isn't in an easily used +format. We will wrap it in a `pd.DataFrame` to make it easier to understand, +and print the `info` of the result. 
```{code-cell} ipython3 -X_tune = cancer_train.loc[:, ["Smoothness", "Concavity"]] -y_tune = cancer_train["Class"] - -cancer_model_grid = cancer_tune_grid.fit(X_tune, y_tune) - -accuracies_grid = pd.DataFrame(cancer_model_grid.cv_results_) +accuracies_grid = pd.DataFrame( + cancer_tune_grid + .fit(cancer_train.loc[:, ["Smoothness", "Concavity"]], + cancer_train["Class"] + ).cv_results_) ``` ```{code-cell} ipython3 accuracies_grid.info() ``` - -`cv_results_` gives abundant information, but for our purpose, we only focus on `param_kneighborsclassifier__n_neighbors` (the $K$, number of neighbors), `mean_test_score` (the mean validation score across all folds), and `std_test_score` (the standard deviation of the validation scores). - -```{code-cell} ipython3 -accuracies_grid[ - ["param_kneighborsclassifier__n_neighbors", "mean_test_score", "std_test_score"] -] -``` - -```{code-cell} ipython3 -:tags: [remove-cell] - -sorted_accuracies = accuracies_grid.sort_values(by='mean_test_score', ascending=False) -best_k_list = sorted_accuracies[ - sorted_accuracies["mean_test_score"] - == sorted_accuracies.iloc[0, :]["mean_test_score"] -]["param_kneighborsclassifier__n_neighbors"].tolist() - -# If there are more than 1 hyperparameter yielding the highest validation score -if len(best_k_list) > 1: - i = 1 - for k in best_k_list: - glue(f"best_k_{i}", k) - i += 1 -else: - glue("best_k_unique", best_k_list[0]) -glue("best_acc", round(sorted_accuracies.iloc[0]["mean_test_score"] * 100, 2)) +There is a lot of information to look at here, but we are most interested +in three quantities: the number of neighbors (`param_kneighborsclassifier__n_neighbors`), +the cross-validation accuracy estimate (`mean_test_score`), +and the standard error of the accuracy estimate. Unfortunately, `GridSearchCV` does +not directly output the standard error for each cross-validation accuracy, but +it *does* output the standard *deviation* (`std_test_score`).
We can compute +the standard error from the standard deviation by dividing it by the square +root of the number of folds, i.e., + +$$\text{Standard Error} = \frac{1}{\sqrt{\text{# Folds}}}\text{Standard Deviation}.$$ + +We will also rename the parameter name column to be a bit more readable, +and drop the now unused `std_test_score` column. + +```{code-cell} ipython3 +accuracies_grid = accuracies_grid[["param_kneighborsclassifier__n_neighbors", "mean_test_score", "std_test_score"] + ].assign( + sem_test_score = accuracies_grid["std_test_score"] / 10**(1/2) + ).rename( + columns = {"param_kneighborsclassifier__n_neighbors" : "n_neighbors"} + ).drop( + columns = ["std_test_score"] + ) +accuracies_grid ``` We can decide which number of neighbors is best by plotting the accuracy versus $K$, @@ -1109,7 +1112,7 @@ accuracy_vs_k = ( .mark_line(point=True) .encode( x=alt.X( - "param_kneighborsclassifier__n_neighbors", + "n_neighbors", title="Neighbors", ), y=alt.Y( @@ -1140,7 +1143,7 @@ Plot of estimated accuracy versus the number of neighbors. Setting the number of neighbors to $K =$ {glue:}`best_k_unique` provides the highest accuracy ({glue:}`best_acc`%). But there is no exact or perfect answer here; -any selection from $K = 20$ and $55$ would be reasonably justified, as all +any selection from $K = 30$ to $80$ or so would be reasonably justified, as all of these differ in classifier accuracy by a small amount. Remember: the values you see on this plot are *estimates* of the true accuracy of our classifier. Although the @@ -1166,7 +1169,7 @@ $K =$ {glue:}`best_k_unique` for the classifier. ### Under/Overfitting To build a bit more intuition, what happens if we keep increasing the number of -neighbors $K$? In fact, the cross-validation accuracy actually estimate starts to decrease! +neighbors $K$? In fact, the cross-validation accuracy estimate actually starts to decrease! 
Let's specify a much larger range of values of $K$ to try in the `param_grid` argument of `GridSearchCV`. {numref}`fig:06-lots-of-ks` shows a plot of estimated accuracy as we vary $K$ from 1 to almost the number of observations in the data set. @@ -1174,23 +1177,25 @@ we vary $K$ from 1 to almost the number of observations in the data set. ```{code-cell} ipython3 :tags: [remove-output] -param_grid_lots = { +large_param_grid = { "kneighborsclassifier__n_neighbors": range(1, 385, 10), } -cancer_tune_grid_lots = GridSearchCV( +large_cancer_tune_grid = GridSearchCV( estimator=cancer_tune_pipe, - param_grid=param_grid_lots, - cv=10, - n_jobs=-1, - return_train_score=True, + param_grid=large_param_grid, + cv=10 ) -cancer_model_grid_lots = cancer_tune_grid_lots.fit(X_tune, y_tune) -accuracies_grid_lots = pd.DataFrame(cancer_model_grid_lots.cv_results_) +large_accuracies_grid = pd.DataFrame( + large_cancer_tune_grid.fit( + cancer_train.loc[:, ["Smoothness", "Concavity"]], + cancer_train["Class"] + ).cv_results_ + ) -accuracy_vs_k_lots = ( - alt.Chart(accuracies_grid_lots) +large_accuracy_vs_k = ( + alt.Chart(large_accuracies_grid) .mark_line(point=True) .encode( x=alt.X( @@ -1205,13 +1210,13 @@ accuracy_vs_k_lots = ( ) ) -accuracy_vs_k_lots +large_accuracy_vs_k ``` ```{code-cell} ipython3 :tags: [remove-cell] -glue("fig:06-lots-of-ks", accuracy_vs_k_lots) +glue("fig:06-lots-of-ks", large_accuracy_vs_k) ``` :::{glue:figure} fig:06-lots-of-ks @@ -1254,9 +1259,6 @@ training data, it is said to **overfit** the data. 
```{code-cell} ipython3 :tags: [remove-cell] -# create the scatter plot -colors = ["#86bfef", "#efb13f"] - cancer_plot = ( alt.Chart( cancer_train, @@ -1281,7 +1283,7 @@ cancer_plot = ( ) ), ), - color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"), + color=alt.Color("Class", title="Diagnosis"), ) ) @@ -1317,12 +1319,10 @@ for k in [1, 7, 20, 300]: .encode( x=alt.X("Smoothness"), y=alt.Y("Concavity"), - color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"), + color=alt.Color("Class", title="Diagnosis"), ) ) plot_list.append(cancer_plot + prediction_plot) - -# (plot_list[0] | plot_list[1]) & (plot_list[2] | plot_list[3]) ``` ```{code-cell} ipython3 @@ -1345,25 +1345,27 @@ Effect of K in overfitting and underfitting. +++ -Both overfitting and underfitting are problematic and will lead to a model -that does not generalize well to new data. When fitting a model, we need to strike -a balance between the two. You can see these two effects in {numref}`fig:06-decision-grid-K`, which shows how the classifier changes as -we set the number of neighbors $K$ to 1, 7, 20, and 300. +Both overfitting and underfitting are problematic and will lead to a model that +does not generalize well to new data. When fitting a model, we need to strike a +balance between the two. You can see these two effects in +{numref}`fig:06-decision-grid-K`, which shows how the classifier changes as we +set the number of neighbors $K$ to 1, 7, 20, and 300. +++ ## Summary Classification algorithms use one or more quantitative variables to predict the -value of another categorical variable. In particular, the $K$-nearest neighbors algorithm -does this by first finding the $K$ points in the training data nearest -to the new observation, and then returning the majority class vote from those -training observations. 
We can evaluate a classifier by splitting the data -randomly into a training and test data set, using the training set to build the -classifier, and using the test set to estimate its accuracy. Finally, we -can tune the classifier (e.g., select the number of neighbors $K$ in $K$-NN) -by maximizing estimated accuracy via cross-validation. The overall -process is summarized in {numref}`fig:06-overview`. +value of another categorical variable. In particular, the $K$-nearest neighbors +algorithm does this by first finding the $K$ points in the training data +nearest to the new observation, and then returning the majority class vote from +those training observations. We can tune and evaluate a classifier by splitting +the data randomly into a training and test data set. The training set is used +to build the classifier and we can tune the classifier (e.g., select the number +of neighbors in $K$-nearest neighbors) by maximizing estimated accuracy via +cross-validation. After we have tuned the model we can use the test set to +estimate its accuracy. The overall process is summarized in +{numref}`fig:06-overview`. +++ @@ -1381,30 +1383,14 @@ Overview of KNN classification. The overall workflow for performing $K$-nearest neighbors classification using `scikit-learn` is as follows: 1. Use the `train_test_split` function to split the data into a training and test set. Set the `stratify` argument to the class label column of the dataframe. Put the test set aside for now. -2. Define the parameter grid by passing the set of $K$ values that you would like to tune. -3. Create a `Pipeline` that specifies the preprocessing steps and the classifier. -4. Use the `GridSearchCV` function (or `RandomizedSearchCV`) to estimate the classifier accuracy for a range of $K$ values. Pass the parameter grid and the pipeline defined in step 2 and step 3 as the `param_grid` argument and the `estimator` argument, respectively. -5. 
Call `fit` on the `GridSearchCV` instance created in step 4, passing the training data. +2. Create a `Pipeline` that specifies the preprocessing steps and the classifier. +3. Define the parameter grid as a dictionary containing the set of $K$ values that you would like to try. +4. Use `GridSearchCV` to estimate the classifier accuracy for a range of $K$ values. Pass the pipeline and parameter grid defined in steps 2 and 3 as the `estimator` argument and the `param_grid` argument, respectively. +5. Execute the grid search by passing the training data to the `fit` method on the `GridSearchCV` instance created in step 4. 6. Pick a value of $K$ that yields a high cross-validation accuracy estimate that doesn't change much if you change $K$ to a nearby value. 7. Make a new model specification for the best parameter value (i.e., $K$), and retrain the classifier by calling the `fit` method. 8. Evaluate the estimated accuracy of the classifier on the test set using the `score` method. -```{code-cell} ipython3 -:tags: [remove-cell] - -# The overall workflow for performing $K$-nearest neighbors classification using `tidymodels` is as follows: -# \index{tidymodels}\index{recipe}\index{cross-validation}\index{K-nearest neighbors!classification}\index{classification} - -# 1. Use the `initial_split` function to split the data into a training and test set. Set the `strata` argument to the class label variable. Put the test set aside for now. -# 2. Use the `vfold_cv` function to split up the training data for cross-validation. -# 3. Create a `recipe` that specifies the class label and predictors, as well as preprocessing steps for all variables. Pass the training data as the `data` argument of the recipe. -# 4. Create a `nearest_neighbors` model specification, with `neighbors = tune()`. -# 5. Add the recipe and model specification to a `workflow()`, and use the `tune_grid` function on the train/validation splits to estimate the classifier accuracy for a range of $K$ values. -# 6.
Pick a value of $K$ that yields a high cross-validation accuracy estimate that doesn't change much if you change $K$ to a nearby value. -# 7. Make a new model specification for the best parameter value (i.e., $K$), and retrain the classifier using the `fit` function. -# 8. Evaluate the estimated accuracy of the classifier on the test set using the `predict` function. -``` - In these last two chapters, we focused on the $K$-nearest neighbor algorithm, but there are many other methods we could have used to predict a categorical label. All algorithms have their strengths and weaknesses, and we summarize these for From e05b5a854fdb134731df0f7e05997168fc8a9e64 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sun, 8 Jan 2023 23:16:48 -0800 Subject: [PATCH 27/45] commented out predictor selection --- source/classification2.md | 31 +++++-------------------------- 1 file changed, 5 insertions(+), 26 deletions(-) diff --git a/source/classification2.md b/source/classification2.md index 9abc8635..5faa55d5 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -1130,6 +1130,8 @@ accuracy_vs_k :tags: [remove-cell] glue("fig:06-find-k", accuracy_vs_k) +glue("best_k_unique", accuracies_grid["n_neighbors"][accuracies_grid["mean_test_score"].idxmax()]) +glue("best_acc", np.round(accuracies_grid["mean_test_score"].max()*100,1)]) ``` :::{glue:figure} fig:06-find-k @@ -1410,6 +1412,7 @@ the $K$-NN here. +++ + + ## Exercises Practice exercises for the material covered in this chapter @@ -2011,32 +2016,6 @@ and guidance that the worksheets provide will function as intended. text, it requires a bit more mathematical background than we require. -```{code-cell} ipython3 -:tags: [remove-cell] - -# - The [`tidymodels` website](https://tidymodels.org/packages) is an excellent -# reference for more details on, and advanced usage of, the functions and -# packages in the past two chapters. 
Aside from that, it also has a [nice -# beginner's tutorial](https://www.tidymodels.org/start/) and [an extensive list -# of more advanced examples](https://www.tidymodels.org/learn/) that you can use -# to continue learning beyond the scope of this book. It's worth noting that the -# `tidymodels` package does a lot more than just classification, and so the -# examples on the website similarly go beyond classification as well. In the next -# two chapters, you'll learn about another kind of predictive modeling setting, -# so it might be worth visiting the website only after reading through those -# chapters. -# - *An Introduction to Statistical Learning* [@james2013introduction] provides -# a great next stop in the process of -# learning about classification. Chapter 4 discusses additional basic techniques -# for classification that we do not cover, such as logistic regression, linear -# discriminant analysis, and naive Bayes. Chapter 5 goes into much more detail -# about cross-validation. Chapters 8 and 9 cover decision trees and support -# vector machines, two very popular but more advanced classification methods. -# Finally, Chapter 6 covers a number of methods for selecting predictor -# variables. Note that while this book is still a very accessible introductory -# text, it requires a bit more mathematical background than we require. 
-``` - ## References +++ From ee813305e48a903d6a008838e2a61e884ee2b9b4 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sun, 8 Jan 2023 23:25:23 -0800 Subject: [PATCH 28/45] done ch6 except final under/overfit plot --- source/classification2.md | 26 +++++--------------------- 1 file changed, 5 insertions(+), 21 deletions(-) diff --git a/source/classification2.md b/source/classification2.md index 5faa55d5..e6a5fbe1 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -29,26 +29,8 @@ kernelspec: import altair as alt import numpy as np import pandas as pd -# import sklearn -# from sklearn.compose import make_column_transformer -# from sklearn.metrics import confusion_matrix, plot_confusion_matrix -# from sklearn.metrics.pairwise import euclidean_distances -# from sklearn.model_selection import ( -# GridSearchCV, -# RandomizedSearchCV, -# cross_validate, -# train_test_split, -# ) -# from sklearn.neighbors import KNeighborsClassifier -# from sklearn.pipeline import Pipeline, make_pipeline -# from sklearn.preprocessing import OneHotEncoder, StandardScaler -# -# alt.data_transformers.disable_max_rows() -# alt.renderers.enable("mimetype") - + from myst_nb import glue - -#pd.options.display.max_colwidth = 100 ``` ## Overview @@ -1131,7 +1113,7 @@ accuracy_vs_k glue("fig:06-find-k", accuracy_vs_k) glue("best_k_unique", accuracies_grid["n_neighbors"][accuracies_grid["mean_test_score"].idxmax()]) -glue("best_acc", np.round(accuracies_grid["mean_test_score"].max()*100,1)]) +glue("best_acc", np.round(accuracies_grid["mean_test_score"].max()*100,1)) ``` :::{glue:figure} fig:06-find-k @@ -1259,7 +1241,9 @@ completely different. In general, if the model *is influenced too much* by the training data, it is said to **overfit** the data. 
```{code-cell} ipython3 -:tags: [remove-cell] + +alt.data_transformers.disable_max_rows() +alt.renderers.enable("mimetype") cancer_plot = ( alt.Chart( From 46c487a8cfb405ede17906f0e422f110400eced0 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 11 Jan 2023 15:40:11 -0800 Subject: [PATCH 29/45] warnings filter in ch6; remove seed hack cell --- source/classification2.md | 11 ++++------- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git a/source/classification2.md b/source/classification2.md index e6a5fbe1..60e90bb6 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -17,14 +17,10 @@ kernelspec: ```{code-cell} ipython3 :tags: [remove-cell] -#import warnings -#def warn(*args, **kwargs): -# pass -#warnings.warn = warn -``` -```{code-cell} ipython3 -:tags: [remove-cell] +import warnings +warnings.filterwarnings("ignore", category=DeprecationWarning) +warnings.filterwarnings("ignore", category=FutureWarning) import altair as alt import numpy as np @@ -397,6 +393,7 @@ Note that the `train_test_split` function uses randomness, so we shall set `rand the split reproducible. ```{code-cell} ipython3 +:tags: [remove-cell] # seed hacking to get a split that makes 10-fold have a lower std error than 5-fold np.random.seed(5) ``` From e191e98eea09a84361ff593468c61e5b4443da90 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 11 Jan 2023 15:42:38 -0800 Subject: [PATCH 30/45] remove reference to random state in train/test split --- source/classification2.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/source/classification2.md b/source/classification2.md index 60e90bb6..1d38cabf 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -389,8 +389,6 @@ we will specify that `train_size=0.75` so that 75% of our original data set ends in the training set. 
We will also set the `stratify` argument to the categorical label variable (here, `cancer['Class']`) to ensure that the training and testing subsets contain the right proportions of each category of observation. -Note that the `train_test_split` function uses randomness, so we shall set `random_state` to make -the split reproducible. ```{code-cell} ipython3 :tags: [remove-cell] From 6ca5c56154d0e31f88400c32806815eb20da66a4 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 11 Jan 2023 15:43:47 -0800 Subject: [PATCH 31/45] minor typesetting .method() vs method --- source/classification2.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/source/classification2.md b/source/classification2.md index 1d38cabf..f6b747f4 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -396,8 +396,6 @@ right proportions of each category of observation. np.random.seed(5) ``` - - ```{code-cell} ipython3 from sklearn.model_selection import train_test_split @@ -421,10 +419,10 @@ glue("cancer_test_nrow", len(cancer_test)) ```{index} info ``` -We can see from `.info()` in the code above that the training set contains {glue:}`cancer_train_nrow` observations, +We can see from the `info` method above that the training set contains {glue:}`cancer_train_nrow` observations, while the test set contains {glue:}`cancer_test_nrow` observations. This corresponds to a train / test split of 75% / 25%, as desired. Recall from the {ref}`classification` chapter -that we use the `.info()` method to preview the number of rows, the variable names, their data types, and +that we use the `info` method to preview the number of rows, the variable names, their data types, and missing entries of a data frame. 
```{index} groupby, count From 70dc2b568aa97063839bb11411f9aa7fbad7e2c1 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 11 Jan 2023 15:52:35 -0800 Subject: [PATCH 32/45] put setup.md back in to fix broken links --- source/_toc.yml | 2 +- source/setup.md | 11 +++++++++-- 2 files changed, 10 insertions(+), 3 deletions(-) diff --git a/source/_toc.yml b/source/_toc.yml index a36c7b04..e1cac4c9 100644 --- a/source/_toc.yml +++ b/source/_toc.yml @@ -9,7 +9,7 @@ parts: - file: acknowledgements-python.md - file: authors.md - file: editors.md - #- file: setup.md + - file: setup.md - caption: Chapters numbered: 3 chapters: diff --git a/source/setup.md b/source/setup.md index 5efe5697..9f7614e1 100644 --- a/source/setup.md +++ b/source/setup.md @@ -14,7 +14,7 @@ kernelspec: --- (move-to-your-own-machine)= -# Setting up your computer -- TBD +# Setting up your computer ## Overview @@ -26,9 +26,14 @@ needed to do the data science covered in this book on your own computer. By the end of the chapter, readers will be able to do the following: - Install the Git version control software. -- Install and launch a local instance of JupyterLab with the R kernel. +- Install and launch a local instance of JupyterLab with the Python kernel. - Download the worksheets that accompany the chapters of this book from GitHub. +```{note} +This chapter is not available in the Python version of the textbook yet. 
+``` + + From 26155a0c7464393661663f85066c0334679ca121 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 18 Jan 2023 10:49:09 -0800 Subject: [PATCH 33/45] Update source/classification2.md Co-authored-by: Joel Ostblom --- source/classification2.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/source/classification2.md b/source/classification2.md index f6b747f4..2881bb28 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -323,10 +323,9 @@ cancer = pd.read_csv("data/unscaled_wdbc.csv") # re-label Class 'M' as 'Malignant', and Class 'B' as 'Benign', # and change the Class variable to have a category type cancer['Class'] = cancer['Class'].replace({ - 'M' : 'Malignant', - 'B' : 'Benign' - }) -cancer['Class'] = cancer['Class'].astype('category') + 'M' : 'Malignant', + 'B' : 'Benign' +}).astype('category') # create scatter plot of tumor cell concavity versus smoothness, # labeling the points be diagnosis class From 71d52b508cb72e1c8b1ba5dc5d03203bd4ed264e Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 18 Jan 2023 10:50:10 -0800 Subject: [PATCH 34/45] Update source/classification2.md Co-authored-by: Joel Ostblom --- source/classification2.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/classification2.md b/source/classification2.md index 2881bb28..2f57607f 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -533,7 +533,7 @@ using the `len` function. 
```{code-cell} ipython3 correct_preds = cancer_test_predictions[ cancer_test_predictions['Class'] == cancer_test_predictions['predicted'] - ] +] len(correct_preds) / len(cancer_test_predictions) ``` From 36e725f3c2f08cf4ade4031b7bccd4fe947373d3 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 18 Jan 2023 10:50:24 -0800 Subject: [PATCH 35/45] Update source/classification2.md Co-authored-by: Joel Ostblom --- source/classification2.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/source/classification2.md b/source/classification2.md index 2f57607f..6dc13048 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -546,9 +546,9 @@ and we provide the true labels via the `cancer_test["Class"]` series. ```{code-cell} ipython3 cancer_acc_1 = knn_fit.score( - cancer_test.loc[:, ["Smoothness", "Concavity"]], - cancer_test["Class"] - ) + cancer_test.loc[:, ["Smoothness", "Concavity"]], + cancer_test["Class"] +) cancer_acc_1 ``` From 0b7ecc41c2126a95b6723109208b3ae388029269 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 18 Jan 2023 10:50:32 -0800 Subject: [PATCH 36/45] Update source/classification2.md Co-authored-by: Joel Ostblom --- source/classification2.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/source/classification2.md b/source/classification2.md index 6dc13048..1157ff21 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -569,9 +569,10 @@ The `crosstab` function takes two arguments: the true labels first, then the predicted labels second. 
```{code-cell} ipython3 -pd.crosstab(cancer_test_predictions["Class"], - cancer_test_predictions["predicted"] - ) +pd.crosstab( + cancer_test_predictions["Class"], + cancer_test_predictions["predicted"] +) ``` ```{code-cell} ipython3 From 96493e42a0b39170de8582cdf7d6b91e1e7f590d Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 18 Jan 2023 10:50:41 -0800 Subject: [PATCH 37/45] Update source/classification2.md Co-authored-by: Joel Ostblom --- source/classification2.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/classification2.md b/source/classification2.md index 1157ff21..720f2fe8 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -641,7 +641,7 @@ As an example, in the breast cancer data, recall the proportions of benign and m observations in the training data are as follows: ```{code-cell} ipython3 -cancer_train["Class"].value_counts(normalize = True) +cancer_train["Class"].value_counts(normalize=True) ``` Since the benign class represents the majority of the training data, From 1bc8c363d7a76e7901aa6d688072caa5bee63720 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 18 Jan 2023 10:50:54 -0800 Subject: [PATCH 38/45] Update source/classification2.md Co-authored-by: Joel Ostblom --- source/classification2.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/classification2.md b/source/classification2.md index 720f2fe8..761a9c79 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -434,7 +434,7 @@ data are benign and {glue:}`cancer_train_m_prop`% are malignant, indicating that our class proportions were roughly preserved when we split the data. 
```{code-cell} ipython3 -cancer_train["Class"].value_counts(normalize = True) +cancer_train["Class"].value_counts(normalize=True) ``` ```{code-cell} ipython3 From d533ac27fa8bc2ee3d0bee75198e96205b401b70 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 18 Jan 2023 10:51:06 -0800 Subject: [PATCH 39/45] Update source/classification2.md Co-authored-by: Joel Ostblom --- source/classification2.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/source/classification2.md b/source/classification2.md index 761a9c79..630327e4 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -741,9 +741,9 @@ knn_fit = make_pipeline(cancer_preprocessor, knn).fit(X, y) # compute the score on validation data acc = knn_fit.score( - cancer_validation.loc[:, ["Smoothness", "Concavity"]], - cancer_validation["Class"] - ) + cancer_validation.loc[:, ["Smoothness", "Concavity"]], + cancer_validation["Class"] +) acc ``` From 28ceac80180a6be2b764172f40a96a42acc99779 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 18 Jan 2023 10:51:53 -0800 Subject: [PATCH 40/45] Update source/classification2.md Co-authored-by: Joel Ostblom --- source/classification2.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/source/classification2.md b/source/classification2.md index 630327e4..ac6567f5 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -514,9 +514,8 @@ variables in the output data frame. 
```{code-cell} ipython3 cancer_test_predictions = cancer_test.assign( - predicted = knn_fit.predict( - cancer_test.loc[:, ["Smoothness", "Concavity"]] - )) + predicted = knn_fit.predict(cancer_test.loc[:, ["Smoothness", "Concavity"]]) +) cancer_test_predictions[['ID', 'Class', 'predicted']] ``` From 2ef3707567282b8d5a94f3af6b506563ad0239e4 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 18 Jan 2023 10:52:13 -0800 Subject: [PATCH 41/45] Update source/classification2.md Co-authored-by: Joel Ostblom --- source/classification2.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/classification2.md b/source/classification2.md index ac6567f5..dad74263 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -534,7 +534,7 @@ correct_preds = cancer_test_predictions[ cancer_test_predictions['Class'] == cancer_test_predictions['predicted'] ] -len(correct_preds) / len(cancer_test_predictions) +correct_preds.shape[0] / cancer_test_predictions.shape[0] ``` The `scitkit-learn` package also provides a more convenient way to do this using From bae52da56d1fd0798aaccf05b7be9671f819c9a0 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 18 Jan 2023 10:54:11 -0800 Subject: [PATCH 42/45] values -> to_numpy in randomness section --- source/classification2.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/source/classification2.md b/source/classification2.md index dad74263..e3d83196 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -188,7 +188,7 @@ np.random.seed(1) nums_0_to_9 = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) -random_numbers1 = nums_0_to_9.sample(n = 10).values +random_numbers1 = nums_0_to_9.sample(n = 10).to_numpy() random_numbers1 ``` You can see that `random_numbers1` is a list of 10 numbers @@ -197,7 +197,7 @@ we run the `sample` method again, we will get a fresh batch of 10 numbers that also look random. 
```{code-cell} ipython3 -random_numbers2 = nums_0_to_9.sample(n = 10).values +random_numbers2 = nums_0_to_9.sample(n = 10).to_numpy() random_numbers2 ``` @@ -207,12 +207,12 @@ as before---and then call the `sample` method again. ```{code-cell} ipython3 np.random.seed(1) -random_numbers1_again = nums_0_to_9.sample(n = 10).values +random_numbers1_again = nums_0_to_9.sample(n = 10).to_numpy() random_numbers1_again ``` ```{code-cell} ipython3 -random_numbers2_again = nums_0_to_9.sample(n = 10).values +random_numbers2_again = nums_0_to_9.sample(n = 10).to_numpy() random_numbers2_again ``` @@ -224,12 +224,12 @@ obtain a different sequence of random numbers. ```{code-cell} ipython3 np.random.seed(4235) -random_numbers = nums_0_to_9.sample(n = 10).values +random_numbers = nums_0_to_9.sample(n = 10).to_numpy() random_numbers ``` ```{code-cell} ipython3 -random_numbers = nums_0_to_9.sample(n = 10).values +random_numbers = nums_0_to_9.sample(n = 10).to_numpy() random_numbers ``` @@ -277,14 +277,14 @@ functions. Those functions will then use your `RandomState` to generate random n object with the `seed` value set to 1; we get the same lists of numbers once again. 
 ```{code}
 rnd = np.random.RandomState(seed = 1)
-random_numbers1_third = nums_0_to_9.sample(n = 10, random_state = rnd).values
+random_numbers1_third = nums_0_to_9.sample(n = 10, random_state = rnd).to_numpy()
 random_numbers1_third
 ```
 ```{code}
 array([2, 9, 6, 4, 0, 3, 1, 7, 8, 5])
 ```
 ```{code}
-random_numbers2_third = nums_0_to_9.sample(n = 10, random_state = rnd).values
+random_numbers2_third = nums_0_to_9.sample(n = 10, random_state = rnd).to_numpy()
 random_numbers2_third
 ```
 ```{code}

From 66cf370505d0ddc0c71e28663fd4155a180e36f2 Mon Sep 17 00:00:00 2001
From: Trevor Campbell
Date: Wed, 18 Jan 2023 10:57:57 -0800
Subject: [PATCH 43/45] Update source/classification2.md

Co-authored-by: Joel Ostblom
---
 source/classification2.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/source/classification2.md b/source/classification2.md
index e3d83196..9780cd45 100644
--- a/source/classification2.md
+++ b/source/classification2.md
@@ -332,7 +332,7 @@ cancer['Class'] = cancer['Class'].replace({

 perim_concav = (
     alt.Chart(cancer)
-    .mark_point(opacity=0.6, filled=True, size=40)
+    .mark_circle()
     .encode(
         x="Smoothness",
         y="Concavity",

From 4a88ec3575c603ebbe12027bf50b216fc3f78025 Mon Sep 17 00:00:00 2001
From: Trevor Campbell
Date: Wed, 18 Jan 2023 10:58:29 -0800
Subject: [PATCH 44/45] Update source/classification2.md

Co-authored-by: Joel Ostblom
---
 source/classification2.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/source/classification2.md b/source/classification2.md
index 9780cd45..25a5787e 100644
--- a/source/classification2.md
+++ b/source/classification2.md
@@ -1235,7 +1235,6 @@ training data, it is said to **overfit** the data.
 ```{code-cell} ipython3

 alt.data_transformers.disable_max_rows()
-alt.renderers.enable("mimetype")

 cancer_plot = (
     alt.Chart(

From a1f945494cd0ec3815741288f606c4cab680bb4a Mon Sep 17 00:00:00 2001
From: Trevor Campbell
Date: Wed, 18 Jan 2023 11:00:59 -0800
Subject: [PATCH 45/45] remove code for area plot at the end of ch6

---
 source/classification2.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/source/classification2.md b/source/classification2.md
index 25a5787e..e365ff99 100644
--- a/source/classification2.md
+++ b/source/classification2.md
@@ -1233,7 +1233,7 @@ completely different. In general, if the model *is influenced too much* by the
 training data, it is said to **overfit** the data.

 ```{code-cell} ipython3
-
+:tags: [remove-cell]
 alt.data_transformers.disable_max_rows()

 cancer_plot = (