From feb3a44c23bf4d8493525bc7067b5b75df9c327f Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sat, 31 Dec 2022 14:20:28 -0800 Subject: [PATCH 01/37] starting work on ch5+6; categorical type change; remove commented out R code --- source/classification1.md | 67 ++++++++++----------------------------- 1 file changed, 17 insertions(+), 50 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index db09a47a..fe974ac1 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -83,7 +83,7 @@ By the end of the chapter, readers will be able to do the following: ```{index} see: feature ; predictor ``` -In many situations, we want to make `predictions` based on the current situation +In many situations, we want to make predictions based on the current situation as well as past experiences. For instance, a doctor may want to diagnose a patient as either diseased or healthy based on their symptoms and the doctor's past experience with patients; an email provider might want to tag a given @@ -206,65 +206,32 @@ total set of variables per image in this data set is: ```{index} info ``` -Below we use `.info()` to preview the data frame. This method can -make it easier to inspect the data when we have a lot of columns, -as it prints the data such that the columns go down -the page (instead of across). +Below we use the `info` method to preview the data frame. This method can +make it easier to inspect the data when we have a lot of columns: +it prints only the column names down the page (instead of across), +as well as their data types and the number of non-missing entries. ```{code-cell} ipython3 cancer.info() ``` -From the summary of the data above, we can see that `Class` is of type object. - -+++ - -Given that we only have two different values in our `Class` column (B for benign and M -for malignant), we only expect to get two names back. +From the summary of the data above, we can see that `Class` is of type `object`. 
+We can use the `unique` method on the `Class` column to see all unique values +present in that column. We see that there are two diagnoses: +benign, represented by `'B'`, and malignant, represented by `'M'`. ```{code-cell} ipython3 cancer['Class'].unique() ``` -```{code-cell} ipython3 -:tags: [remove-cell] +Since we will be working with `Class` as a categorical statistical variable, +it is a good idea to convert it to the `category` type using the `astype` method +on the `cancer` data frame. We will verify the result using the `info` method +again. -## The above was based on the following text and code in R textbook. ## -####################################################################### -# Below we use `glimpse` \index{glimpse} to preview the data frame. This function can -# make it easier to inspect the data when we have a lot of columns, -# as it prints the data such that the columns go down -# the page (instead of across). - -# ```{r 05-glimpse} -# glimpse(cancer) -# ``` - -# From the summary of the data above, we can see that `Class` is of type character -# (denoted by ``). Since we will be working with `Class` as a -# categorical statistical variable, we will convert it to a factor using the -# function `as_factor`. \index{factor!as\_factor} - -# ```{r 05-class} -# cancer <- cancer |> -# mutate(Class = as_factor(Class)) -# glimpse(cancer) -# ``` - -# Recall that factors have what are called "levels", which you can think of as categories. We -# can verify the levels of the `Class` column by using the `levels` \index{levels}\index{factor!levels} function. -# This function should return the name of each category in that column. Given -# that we only have two different values in our `Class` column (B for benign and M -# for malignant), we only expect to get two names back. 
Note that the `levels` function requires a *vector* argument; -# so we use the `pull` function to extract a single column (`Class`) and -# pass that into the `levels` function to see the categories -# in the `Class` column. - -# ```{r 05-levels} -# cancer |> -# pull(Class) |> -# levels() -# ``` +```{code-cell} ipython3 +cancer['Class'] = cancer['Class'].astype('category') +cancer.info() ``` ### Exploring the cancer data @@ -273,7 +240,7 @@ cancer['Class'].unique() ``` Before we start doing any modeling, let's explore our data set. Below we use -the `.groupby()`, `.count()` methods to find the number and percentage +the `groupby` and `count` methods to find the number and percentage of benign and malignant tumor observations in our data set. When paired with `.groupby()`, `.count()` counts the number of observations in each `Class` group. Then we calculate the percentage in each group by dividing by the total number of observations. We have 357 (63\%) benign and 212 (37\%) malignant tumor observations. 
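The count-and-percentage step described above can be sketched on its own. This is a minimal sketch, assuming a made-up five-row stand-in (`toy`) for the `cancer` data frame and a hypothetical helper named `class_summary`; only the `groupby`/`count` pattern and the `ID` and `Class` column names come from the chapter text.

```python
import pandas as pd

# Hypothetical stand-in for the `cancer` data frame; the rows are invented.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Class": ["B", "B", "M", "B", "M"],
})


def class_summary(df):
    """Count the observations in each Class and express the counts as percentages."""
    # groupby + count gives one row per distinct Class value
    counts = df.groupby("Class")["ID"].count()
    # dividing by the total number of rows turns counts into percentages
    return pd.DataFrame({"count": counts, "percentage": 100 * counts / len(df)})


summary = class_summary(toy)
print(summary)
```

Dividing by `len(df)` rather than a separately stored total keeps the percentages tied to whatever data frame is passed in.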
From a507994626778752ce61d8906858aa95ceb52165 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sat, 31 Dec 2022 15:21:22 -0800 Subject: [PATCH 02/37] value counts, class name remap, replace in ch5 --- source/classification1.md | 119 ++++++++++++++++++++------------------ 1 file changed, 64 insertions(+), 55 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index fe974ac1..c8818220 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -15,38 +15,6 @@ kernelspec: (classification)= # Classification I: training & predicting -```{code-cell} ipython3 -:tags: [remove-cell] - -import random - -import altair as alt -from altair_saver import save -import numpy as np -import pandas as pd -import sklearn -from sklearn.compose import make_column_transformer -from sklearn.neighbors import KNeighborsClassifier -from sklearn.pipeline import Pipeline, make_pipeline -from sklearn.metrics.pairwise import euclidean_distances -from sklearn.preprocessing import StandardScaler -import plotly.express as px -import plotly.graph_objs as go -from plotly.offline import iplot, plot -from IPython.display import HTML -from myst_nb import glue - -alt.data_transformers.disable_max_rows() - -# alt.renderers.enable('altair_saver', fmts=['vega-lite', 'png']) - -# # Handle large data sets by not embedding them in the notebook -# alt.data_transformers.enable('data_server') - -# # Save a PNG blob as a backup for when the Altair plots do not render -# alt.renderers.enable('mimetype') -``` - ## Overview In previous chapters, we focused solely on descriptive and exploratory data analysis questions. @@ -155,10 +123,11 @@ guide patient treatment. Our first step is to load, wrangle, and explore the data using visualizations in order to better understand the data we are working with. We start by -loading the `pandas` package needed for our analysis. +loading the `pandas` and `altair` packages needed for our analysis. 
```{code-cell} ipython3 import pandas as pd +import altair as alt ``` In this case, the file containing the breast cancer data set is a `.csv` @@ -215,6 +184,9 @@ as well as their data types and the number of non-missing entries. cancer.info() ``` +```{index} unique +``` + From the summary of the data above, we can see that `Class` is of type `object`. We can use the `unique` method on the `Class` column to see all unique values present in that column. We see that there are two diagnoses: @@ -224,59 +196,95 @@ benign, represented by `'B'`, and malignant, represented by `'M'`. cancer['Class'].unique() ``` -Since we will be working with `Class` as a categorical statistical variable, +We will also improve the readability of our analysis +by renaming `'M'` to `'Malignant'` and `'B'` to `'Benign'` using the `replace` +method. The `replace` method takes one argument: a dictionary that maps +previous values to desired new values. +Furthermore, since we will be working with `Class` as a categorical statistical variable, it is a good idea to convert it to the `category` type using the `astype` method -on the `cancer` data frame. We will verify the result using the `info` method -again. +on the `cancer` data frame. We will verify the result using the `info` +and `unique` methods again. 
+ +```{index} replace +``` ```{code-cell} ipython3 +cancer['Class'] = cancer['Class'].replace({ + 'M' : 'Malignant', + 'B' : 'Benign' + }) cancer['Class'] = cancer['Class'].astype('category') cancer.info() ``` +```{code-cell} ipython3 +cancer['Class'].unique() +``` + ### Exploring the cancer data ```{index} groupby, count ``` +```{code-cell} ipython3 +:tags: [remove-cell] +from myst_nb import glue +import numpy as np +glue("benign_count", cancer['Class'].value_counts()['Benign']) +glue("benign_pct", int(np.round(100*cancer['Class'].value_counts(normalize=True)['Benign']))) +glue("malignant_count", cancer['Class'].value_counts()['Malignant']) +glue("malignant_pct", int(np.round(100*cancer['Class'].value_counts(normalize=True)['Malignant']))) +``` + Before we start doing any modeling, let's explore our data set. Below we use the `groupby` and `count` methods to find the number and percentage -of benign and malignant tumor observations in our data set. When paired with `.groupby()`, `.count()` counts the number of observations in each `Class` group. -Then we calculate the percentage in each group by dividing by the total number of observations. We have 357 (63\%) benign and 212 (37\%) malignant tumor observations. +of benign and malignant tumor observations in our data set. When paired with +`groupby`, `count` counts the number of observations for each value of the `Class` +variable. Then we calculate the percentage in each group by dividing by the total +number of observations and multiplying by 100. We have +{glue:}`benign_count` ({glue:}`benign_pct`\%) benign and +{glue:}`malignant_count` ({glue:}`malignant_pct`\%) malignant +tumor observations. 
```{code-cell} ipython3 -num_obs = len(cancer) explore_cancer = pd.DataFrame() explore_cancer['count'] = cancer.groupby('Class')['ID'].count() -explore_cancer['percentage'] = explore_cancer['count'] / num_obs * 100 +explore_cancer['percentage'] = 100 * explore_cancer['count']/len(cancer) explore_cancer ``` -```{index} visualization; scatter +```{index} value_counts ``` -Next, let's draw a scatter plot to visualize the relationship between the -perimeter and concavity variables. Rather than use `altair's` default palette, -we select our own colorblind-friendly colors—`"#efb13f"` -for light orange and `"#86bfef"` for light blue—and - pass them as the `scale` argument in the `color` argument. -We also make the category labels ("B" and "M") more readable by -changing them to "Benign" and "Malignant" using `.apply()` method on the dataframe. +The `pandas` package also has a more convenient specialized `value_counts` method for +counting the number of occurrences of each value in a column. If we pass no arguments +to the method, it outputs a series containing the number of occurrences +of each value. If we instead pass the argument `normalize=True`, it prints the fraction +of occurrences of each value. ```{code-cell} ipython3 -:tags: [] +cancer['Class'].value_counts() +``` -colors = ["#86bfef", "#efb13f"] -cancer["Class"] = cancer["Class"].apply( - lambda x: "Malignant" if (x == "M") else "Benign" -) +```{code-cell} ipython3 +cancer['Class'].value_counts(normalize=True) +``` + +```{index} visualization; scatter +``` + +Next, let's draw a colored scatter plot to visualize the relationship between the +perimeter and concavity variables. Recall that `altair's` default palette +is colorblind-friendly, so we can stick with that here. 
+ +```{code-cell} ipython3 perim_concav = ( alt.Chart(cancer) .mark_point(opacity=0.6, filled=True, size=40) .encode( x=alt.X("Perimeter", title="Perimeter (standardized)"), y=alt.Y("Concavity", title="Concavity (standardized)"), - color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"), + color=alt.Color("Class", title="Diagnosis"), ) ) perim_concav @@ -305,7 +313,8 @@ you classify that new observation? If the standardized concavity and perimeter values are 1 and 1 respectively, the point would lie in the middle of the orange cloud of malignant points and thus we could probably classify it as malignant. Based on our visualization, it seems like -the *prediction of an unobserved label* might be possible. +it may be possible to make accurate predictions of the `Class` variable (i.e., a diagnosis) for +tumor images with unknown diagnoses. +++ From fd9a88212cffc27f4b6a36089a47dc872105f9ad Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sat, 31 Dec 2022 15:28:42 -0800 Subject: [PATCH 03/37] remove warnings --- source/classification1.md | 8 ++++++++ source/classification2.md | 8 ++++++++ 2 files changed, 16 insertions(+) diff --git a/source/classification1.md b/source/classification1.md index c8818220..c64de132 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -12,6 +12,14 @@ kernelspec: name: python3 --- +```{code-cell} ipython3 +:tags: [remove-cell] +import warnings +def warn(*args, **kwargs): + pass +warnings.warn = warn +``` + (classification)= # Classification I: training & predicting diff --git a/source/classification2.md b/source/classification2.md index 8daf6c38..8087cfb4 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -15,6 +15,14 @@ kernelspec: (classification2)= # Classification II: evaluation & tuning +```{code-cell} ipython3 +:tags: [remove-cell] +import warnings +def warn(*args, **kwargs): + pass +warnings.warn = warn +``` + ```{code-cell} ipython3 :tags: [remove-cell] From 
8b20e7f19c58a2a42964310242c30909ce09ced9 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sat, 31 Dec 2022 15:42:46 -0800 Subject: [PATCH 04/37] polished ch5+6 up to euclidean dist --- source/classification1.md | 78 +++++++++++++++++++++------------------ source/classification2.md | 14 +++---- 2 files changed, 49 insertions(+), 43 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index c64de132..063b0356 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -12,12 +12,16 @@ kernelspec: name: python3 --- +%```{code-cell} ipython3 +%:tags: [remove-cell] +%import warnings +%def warn(*args, **kwargs): +% pass +%warnings.warn = warn +%``` + ```{code-cell} ipython3 -:tags: [remove-cell] -import warnings -def warn(*args, **kwargs): - pass -warnings.warn = warn +from sklearn.metrics.pairwise import euclidean_distances ``` (classification)= @@ -332,6 +336,8 @@ tumor images with unknown diagnoses. :tags: [remove-cell] new_point = [2, 4] +glue("new_point_1_0", new_point[0]) +glue("new_point_1_1", new_point[1]) attrs = ["Perimeter", "Concavity"] points_df = pd.DataFrame( {"Perimeter": new_point[0], "Concavity": new_point[1], "Class": ["Unknown"]} @@ -342,8 +348,6 @@ perim_concav_with_new_point_df = pd.concat((cancer, points_df), ignore_index=Tru my_distances = euclidean_distances(perim_concav_with_new_point_df.loc[:, attrs])[ len(cancer) ][:-1] -glue("1-new_point_0", new_point[0]) -glue("1-new_point_1", new_point[1]) ``` ```{index} K-nearest neighbors; classification @@ -361,8 +365,11 @@ $K$ for us. We will cover how to choose $K$ ourselves in the next chapter. To illustrate the concept of $K$-nearest neighbors classification, we will walk through an example. Suppose we have a -new observation, with standardized perimeter of {glue:}`1-new_point_0` and standardized concavity of {glue:}`1-new_point_1`, whose -diagnosis "Class" is unknown. This new observation is depicted by the red, diamond point in {numref}`fig:05-knn-2`. 
+new observation, with standardized perimeter +of {glue:}`new_point_1_0` and standardized concavity +of {glue:}`new_point_1_1`, whose +diagnosis "Class" is unknown. This new observation is +depicted by the red, diamond point in {numref}`fig:05-knn-2`. ```{code-cell} ipython3 :tags: [remove-cell] @@ -370,21 +377,16 @@ diagnosis "Class" is unknown. This new observation is depicted by the red, diamo perim_concav_with_new_point = ( alt.Chart( perim_concav_with_new_point_df, - # title="Scatter plot of concavity versus perimeter with new observation represented as a red diamond.", ) .mark_point(opacity=0.6, filled=True, size=40) .encode( x=alt.X("Perimeter", title="Perimeter (standardized)"), y=alt.Y("Concavity", title="Concavity (standardized)"), - color=alt.Color( - "Class", - scale=alt.Scale(range=["#86bfef", "#efb13f", "red"]), - title="Diagnosis", - ), + color=alt.Color("Class", title="Diagnosis"), shape=alt.Shape( "Class", scale=alt.Scale(range=["circle", "circle", "diamond"]) ), - size=alt.condition("datum.Class == 'Unknown'", alt.value(80), alt.value(30)) + size=alt.condition("datum.Class == 'Unknown'", alt.value(80), alt.value(30)), ) ) glue('fig:05-knn-2', perim_concav_with_new_point, display=True) @@ -410,10 +412,11 @@ glue("1-neighbor_per", round(near_neighbor_df.iloc[0, :]['Perimeter'], 1)) glue("1-neighbor_con", round(near_neighbor_df.iloc[0, :]['Concavity'], 1)) ``` -{numref}`fig:05-knn-3` shows that the nearest point to this new observation is **malignant** and -located at the coordinates ({glue:}`1-neighbor_per`, {glue:}`1-neighbor_con`). The idea here is that if a point is close to another in the scatter plot, -then the perimeter and concavity values are similar, and so we may expect that -they would have the same diagnosis. +{numref}`fig:05-knn-3` shows that the nearest point to this new observation is +**malignant** and located at the coordinates ({glue:}`1-neighbor_per`, +{glue:}`1-neighbor_con`). 
The idea here is that if a point is close to another +in the scatter plot, then the perimeter and concavity values are similar, +and so we may expect that they would have the same diagnosis. ```{code-cell} ipython3 :tags: [remove-cell] @@ -430,7 +433,9 @@ glue('fig:05-knn-3', (perim_concav_with_new_point + line), display=True) :::{glue:figure} fig:05-knn-3 :name: fig:05-knn-3 -Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a malignant label. +Scatter plot of concavity versus perimeter. The new observation is represented +as a red diamond with a line to the one nearest neighbor, which has a malignant +label. ::: ```{code-cell} ipython3 @@ -447,8 +452,8 @@ perim_concav_with_new_point_df2 = pd.concat((cancer, points_df2), ignore_index=T my_distances2 = euclidean_distances(perim_concav_with_new_point_df2.loc[:, attrs])[ len(cancer) ][:-1] -glue("2-new_point_0", new_point[0]) -glue("2-new_point_1", new_point[1]) +glue("new_point_2_0", new_point[0]) +glue("new_point_2_1", new_point[1]) ``` ```{code-cell} ipython3 @@ -457,7 +462,6 @@ glue("2-new_point_1", new_point[1]) perim_concav_with_new_point2 = ( alt.Chart( perim_concav_with_new_point_df2, - # title="Scatter plot of concavity versus perimeter with new observation represented as a red diamond.", ) .mark_point(opacity=0.6, filled=True, size=40) .encode( @@ -465,7 +469,6 @@ perim_concav_with_new_point2 = ( y=alt.Y("Concavity", title="Concavity (standardized)"), color=alt.Color( "Class", - scale=alt.Scale(range=["#86bfef", "#efb13f", "red"]), title="Diagnosis", ), shape=alt.Shape( @@ -493,9 +496,10 @@ glue("2-neighbor_con", round(near_neighbor_df2.iloc[0, :]['Concavity'], 1)) glue('fig:05-knn-4', (perim_concav_with_new_point2 + line2), display=True) ``` -Suppose we have another new observation with standardized perimeter {glue:}`2-new_point_0` and -concavity of {glue:}`2-new_point_1`. 
Looking at the scatter plot in {numref}`fig:05-knn-4`, how would you -classify this red, diamond observation? The nearest neighbor to this new point is a +Suppose we have another new observation with standardized perimeter +{glue:}`new_point_2_0` and concavity of {glue:}`new_point_2_1`. Looking at the +scatter plot in {numref}`fig:05-knn-4`, how would you classify this red, +diamond observation? The nearest neighbor to this new point is a **benign** observation at ({glue:}`2-neighbor_per`, {glue:}`2-neighbor_con`). Does this seem like the right prediction to make for this observation? Probably not, if you consider the other nearby points. @@ -505,7 +509,9 @@ not, if you consider the other nearby points. :::{glue:figure} fig:05-knn-4 :name: fig:05-knn-4 -Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a benign label. +Scatter plot of concavity versus perimeter. The new observation is represented +as a red diamond with a line to the one nearest neighbor, which has a benign +label. ::: ```{code-cell} ipython3 @@ -575,13 +581,13 @@ next chapter. ```{index} distance; K-nearest neighbors, straight line; distance ``` -We decide which points are the $K$ "nearest" to our new observation -using the *straight-line distance* (we will often just refer to this as *distance*). -Suppose we have two observations $a$ and $b$, each having two predictor variables, $x$ and $y$. -Denote $a_x$ and $a_y$ to be the values of variables $x$ and $y$ for observation $a$; -$b_x$ and $b_y$ have similar definitions for observation $b$. -Then the straight-line distance between observation $a$ and $b$ on the x-y plane can -be computed using the following formula: +We decide which points are the $K$ "nearest" to our new observation using the +*straight-line distance* (we will often just refer to this as *distance*). 
+Suppose we have two observations $a$ and $b$, each having two predictor +variables, $x$ and $y$. Denote $a_x$ and $a_y$ to be the values of variables +$x$ and $y$ for observation $a$; $b_x$ and $b_y$ have similar definitions for +observation $b$. Then the straight-line distance between observation $a$ and +$b$ on the x-y plane can be computed using the following formula: $$\mathrm{Distance} = \sqrt{(a_x -b_x)^2 + (a_y - b_y)^2}$$ diff --git a/source/classification2.md b/source/classification2.md index 8087cfb4..2eca1945 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -15,13 +15,13 @@ kernelspec: (classification2)= # Classification II: evaluation & tuning -```{code-cell} ipython3 -:tags: [remove-cell] -import warnings -def warn(*args, **kwargs): - pass -warnings.warn = warn -``` +%```{code-cell} ipython3 +%:tags: [remove-cell] +%import warnings +%def warn(*args, **kwargs): +% pass +%warnings.warn = warn +%``` ```{code-cell} ipython3 :tags: [remove-cell] From bd28be9eb67f3adbd1556cfb1d0a81b9fe05e9ae Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sat, 31 Dec 2022 16:02:18 -0800 Subject: [PATCH 05/37] minor bugfix --- source/classification1.md | 7 ++----- source/classification2.md | 4 ++-- 2 files changed, 4 insertions(+), 7 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index 063b0356..3e86ba0f 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -12,15 +12,12 @@ kernelspec: name: python3 --- -%```{code-cell} ipython3 -%:tags: [remove-cell] +```{code-cell} ipython3 +:tags: [remove-cell] %import warnings %def warn(*args, **kwargs): % pass %warnings.warn = warn -%``` - -```{code-cell} ipython3 from sklearn.metrics.pairwise import euclidean_distances ``` diff --git a/source/classification2.md b/source/classification2.md index 2eca1945..4a07e738 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -15,13 +15,13 @@ kernelspec: (classification2)= # Classification II: 
evaluation & tuning -%```{code-cell} ipython3 +```{code-cell} ipython3 %:tags: [remove-cell] %import warnings %def warn(*args, **kwargs): % pass %warnings.warn = warn -%``` +``` ```{code-cell} ipython3 :tags: [remove-cell] From 9499a732ba6b6d63e0fc8d1674ade53f5d474308 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sat, 31 Dec 2022 16:04:36 -0800 Subject: [PATCH 06/37] minor bugfix --- source/classification1.md | 8 ++++---- source/classification2.md | 10 +++++----- 2 files changed, 9 insertions(+), 9 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index 3e86ba0f..b37c4276 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -14,10 +14,10 @@ kernelspec: ```{code-cell} ipython3 :tags: [remove-cell] -%import warnings -%def warn(*args, **kwargs): -% pass -%warnings.warn = warn +#import warnings +#def warn(*args, **kwargs): +# pass +#warnings.warn = warn from sklearn.metrics.pairwise import euclidean_distances ``` diff --git a/source/classification2.md b/source/classification2.md index 4a07e738..0c3e53b4 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -16,11 +16,11 @@ kernelspec: # Classification II: evaluation & tuning ```{code-cell} ipython3 -%:tags: [remove-cell] -%import warnings -%def warn(*args, **kwargs): -% pass -%warnings.warn = warn +:tags: [remove-cell] +#import warnings +#def warn(*args, **kwargs): +# pass +#warnings.warn = warn ``` ```{code-cell} ipython3 From 294103af4ac8dc09ae3b8741200a30b92adcb194 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sat, 31 Dec 2022 16:08:57 -0800 Subject: [PATCH 07/37] fixed worksheets link at end of chp --- source/classification1.md | 2 +- source/classification2.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index b37c4276..f2f06af8 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -1995,7 +1995,7 @@ Scatter plot of smoothness versus 
area where background color indicates the deci Practice exercises for the material covered in this chapter can be found in the accompanying -[worksheets repository](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets#readme) +[worksheets repository](https://github.com/UBC-DSCI/data-science-a-first-intro-python-worksheets#readme) in the "Classification I: training and predicting" row. You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button. You can also preview a non-interactive version of the worksheet by clicking "view worksheet." diff --git a/source/classification2.md b/source/classification2.md index 0c3e53b4..17390aec 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -2127,7 +2127,7 @@ Estimated accuracy versus the number of predictors for the sequence of models bu Practice exercises for the material covered in this chapter can be found in the accompanying -[worksheets repository](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets#readme) +[worksheets repository](https://github.com/UBC-DSCI/data-science-a-first-intro-python-worksheets#readme) in the "Classification II: evaluation and tuning" row. You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button. You can also preview a non-interactive version of the worksheet by clicking "view worksheet." From 1ad6164c04754ec5d1e8f2c8200b7935f7ff3fc7 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sat, 31 Dec 2022 19:22:34 -0800 Subject: [PATCH 08/37] fix minor section heading wording in Ch1 --- source/intro.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/source/intro.md b/source/intro.md index 9683b4ef..f953319e 100644 --- a/source/intro.md +++ b/source/intro.md @@ -38,7 +38,7 @@ By the end of the chapter, readers will be able to do the following: - Read tabular data with `read_csv`. 
- Use `help()` to access help and documentation tools in Python. - Create new variables and objects in Python. -- Create and organize subsets of tabular data using `[]`, `loc[]`, and `sort_values` +- Create and organize subsets of tabular data using `[]`, `loc[]`, `sort_values`, and `head` - Visualize data with an `altair` bar plot. ## Canadian languages data set @@ -588,7 +588,7 @@ with multiple kinds of `category`. The data frame `aboriginal_lang` contains only 67 rows, and looks like it only contains Aboriginal languages. So it looks like the `loc[]` operation gave us the result we wanted! -### Using `sort_values` to order and `head` to select rows by value +### Using `sort_values` and `head` to select rows by ordered values ```{index} pandas.DataFrame; sort_values, pandas.DataFrame; head ``` From ee90b8e543d1f6d48e27ff523506ebffd74c872a Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sat, 31 Dec 2022 23:21:43 -0800 Subject: [PATCH 09/37] added nsmallest + note; better chaining for dist comps; removed comments; fixed colors (not working yet) --- source/classification1.md | 114 ++++++++++++++++---------------------- 1 file changed, 48 insertions(+), 66 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index f2f06af8..8b81f7a6 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -18,7 +18,15 @@ kernelspec: #def warn(*args, **kwargs): # pass #warnings.warn = warn + +from myst_nb import glue +import numpy as np from sklearn.metrics.pairwise import euclidean_distances +from IPython.display import HTML + +import plotly.express as px +import plotly.graph_objs as go +from plotly.offline import iplot, plot ``` (classification)= @@ -237,8 +245,6 @@ cancer['Class'].unique() ```{code-cell} ipython3 :tags: [remove-cell] -from myst_nb import glue -import numpy as np glue("benign_count", cancer['Class'].value_counts()['Benign']) glue("benign_pct", int(np.round(100*cancer['Class'].value_counts(normalize=True)['Benign']))) 
glue("malignant_count", cancer['Class'].value_counts()['Malignant']) @@ -600,6 +606,13 @@ the $K=5$ neighbors that are nearest to our new point. You will see in the code below, we compute the straight-line distance using the formula above: we square the differences between the two observations' perimeter and concavity coordinates, add the squared differences, and then take the square root. +In order to find the $K=5$ nearest neighbors, we will use the `nsmallest` function from `pandas`. + +> **Note:** Recall that in the {ref}`intro` chapter, we used `sort_values` followed by `head` to obtain +> the ten rows with the *largest* values of a variable. We could have instead used the `nlargest` function +> from `pandas` for this purpose. The `nsmallest` and `nlargest` functions achieve the same goal +> as `sort_values` followed by `head`, but are slightly more efficient because they are specialized for this purpose. +> In general, it is good to use more specialized functions when they are available! 
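The equivalence claimed in the note can be checked on a small example. This is a sketch under stated assumptions: the data frame `toy` and its distance values are made up for illustration, and only the column name `dist_from_new` is borrowed from the chapter.

```python
import pandas as pd

# Hypothetical distances; the values are invented and contain no ties.
toy = pd.DataFrame({
    "Class": ["Benign", "Malignant", "Malignant", "Benign", "Malignant", "Benign"],
    "dist_from_new": [1.28, 0.98, 1.14, 0.88, 1.26, 3.40],
})

# General-purpose route: sort every row, then keep the first three.
via_sort = toy.sort_values(by="dist_from_new").head(3)

# Specialized route: ask directly for the three smallest distances.
via_nsmallest = toy.nsmallest(3, "dist_from_new")

print(via_nsmallest)
print(via_sort.equals(via_nsmallest))
```

When the column contains tied values, the two routes may order the tied rows differently, so the comparison is only expected to hold for data without ties, as here.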
```{code-cell} ipython3 :tags: [remove-cell] @@ -620,7 +633,6 @@ perim_concav_with_new_point3 = ( y=alt.Y("Concavity", title="Concavity (standardized)"), color=alt.Color( "Class", - scale=alt.Scale(range=["#86bfef", "#efb13f", "red"]), title="Diagnosis", ), shape=alt.Shape( @@ -647,13 +659,14 @@ Scatter plot of concavity versus perimeter with new observation represented as a ```{code-cell} ipython3 new_obs_Perimeter = 0 new_obs_Concavity = 3.5 -cancer_dist = cancer.loc[:, ["ID", "Perimeter", "Concavity", "Class"]] -cancer_dist = cancer_dist.assign(dist_from_new = np.sqrt( - (cancer_dist["Perimeter"] - new_obs_Perimeter) ** 2 - + (cancer_dist["Concavity"] - new_obs_Concavity) ** 2 -)) -# sort the rows in ascending order and take the first 5 rows -cancer_dist = cancer_dist.sort_values(by="dist_from_new").head(5) +cancer_dist = (cancer + .loc[:, ["Perimeter", "Concavity", "Class"]] + .assign(dist_from_new = ( + (cancer["Perimeter"] - new_obs_Perimeter) ** 2 + + (cancer["Concavity"] - new_obs_Concavity) ** 2 + )**(1/2)) + .nsmallest(5, "dist_from_new") + ) cancer_dist ``` @@ -662,36 +675,21 @@ we computed the `dist_from_new` variable (the distance to the new observation) for each of the 5 nearest neighbors in the training data. 
-```{code-cell} ipython3 -:tags: [remove-cell] - -## Couldn't find ways to have nice Latex equations in pandas dataframe - -# cancer_dist_eq = cancer_dist.copy() -# cancer_dist_eq['Perimeter'] = round(cancer_dist_eq['Perimeter'], 2) -# cancer_dist_eq['Concavity'] = round(cancer_dist_eq['Concavity'], 2) -# for i in list(cancer_dist_eq.index): -# cancer_dist_eq.loc[ -# i, "Distance" -# ] = f"[({new_obs_Perimeter} - {round(cancer_dist_eq.loc[i, 'Perimeter'], 2)})² + ({new_obs_Concavity} - {round(cancer_dist_eq.loc[i, 'Concavity'], 2)})²]¹/² = {round(cancer_dist_eq.loc[i, 'dist_from_new'], 2)}" -# cancer_dist_eq[["Perimeter", "Concavity", "Distance", "Class"]] -``` - ```{table} Evaluating the distances from the new observation to each of its 5 nearest neighbors :name: tab:05-multiknn-mathtable | Perimeter | Concavity | Distance | Class | |-----------|-----------|----------------------------------------|-------| -| 0.24 | 2.65 | $\sqrt{(0-0.24)^2+(3.5-2.65)^2}=0.88$| B | -| 0.75 | 2.87 | $\sqrt{(0-0.75)^2+(3.5-2.87)^2}=0.98$| M | -| 0.62 | 2.54 | $\sqrt{(0-0.62)^2+(3.5-2.54)^2}=1.14$| M | -| 0.42 | 2.31 | $\sqrt{(0-0.42)^2+(3.5-2.31)^2}=1.26$| M | -| -1.16 | 4.04 | $\sqrt{(0-(-1.16))^2+(3.5-4.04)^2}=1.28$| B | +| 0.24 | 2.65 | $\sqrt{(0-0.24)^2+(3.5-2.65)^2}=0.88$| Benign | +| 0.75 | 2.87 | $\sqrt{(0-0.75)^2+(3.5-2.87)^2}=0.98$| Malignant | +| 0.62 | 2.54 | $\sqrt{(0-0.62)^2+(3.5-2.54)^2}=1.14$| Malignant | +| 0.42 | 2.31 | $\sqrt{(0-0.42)^2+(3.5-2.31)^2}=1.26$| Malignant | +| -1.16 | 4.04 | $\sqrt{(0-(-1.16))^2+(3.5-4.04)^2}=1.28$| Benign | ``` +++ The result of this computation shows that 3 of the 5 nearest neighbors to our new observation are -malignant (`M`); since this is the majority, we classify our new observation as malignant. +malignant; since this is the majority, we classify our new observation as malignant. These 5 neighbors are circled in {numref}`fig:05-multiknn-3`. ```{code-cell} ipython3 @@ -758,18 +756,20 @@ three predictors. 
new_obs_Perimeter = 0 new_obs_Concavity = 3.5 new_obs_Symmetry = 1 -cancer_dist2 = cancer.loc[:, ["ID", "Perimeter", "Concavity", "Symmetry", "Class"]] -cancer_dist2["dist_from_new"] = np.sqrt( - (cancer_dist2["Perimeter"] - new_obs_Perimeter) ** 2 - + (cancer_dist2["Concavity"] - new_obs_Concavity) ** 2 - + (cancer_dist2["Symmetry"] - new_obs_Symmetry) ** 2 -) -# sort the rows in ascending order and take the first 5 rows -cancer_dist2 = cancer_dist2.sort_values(by="dist_from_new").head(5) +cancer_dist2 = (cancer + .loc[:, ["Perimeter", "Concavity", "Symmetry", "Class"]] + .assign(dist_from_new = ( + (cancer["Perimeter"] - new_obs_Perimeter) ** 2 + + (cancer["Concavity"] - new_obs_Concavity) ** 2 + + (cancer["Symmetry"] - new_obs_Symmetry) ** 2 + )**(1/2)) + .nsmallest(5, "dist_from_new") + ) cancer_dist2 ``` -Based on $K=5$ nearest neighbors with these three predictors we would classify the new observation as malignant since 4 out of 5 of the nearest neighbors are malignant class. +Based on $K=5$ nearest neighbors with these three predictors we would classify +the new observation as malignant since 4 out of 5 of the nearest neighbors are malignant class. {numref}`fig:05-more` shows what the data look like when we visualize them as a 3-dimensional scatter with lines from the new observation to its five nearest neighbors. 
@@ -832,7 +832,7 @@ for i, d in enumerate(fig.data): fig.data[i].marker.symbol = symbols[fig.data[i].name] # specify trace names and colors in a dict -colors = {"Malignant": "#86bfef", "Benign": "#efb13f", "Unknown": "red"} +colors = {"Malignant": "#ff7f0e", "Benign": "#1f77b4", "Unknown": "red"} # set all colors in fig for i, d in enumerate(fig.data): @@ -861,7 +861,6 @@ for neighbor_df in neighbor_df_list: fig.update_layout(margin=dict(l=0, r=0, b=0, t=1), template="plotly_white") plot(fig, filename="img/classification1/fig05-more.html", auto_open=False) -# display(HTML("img/classification1/fig05-more.html")) ``` ```{code-cell} ipython3 @@ -874,7 +873,10 @@ display(HTML("img/classification1/fig05-more.html")) :name: fig:05-more :figclass: caption-hack -3D scatter plot of the standardized symmetry, concavity, and perimeter variables. Note that in general we recommend against using 3D visualizations; here we show the data in 3D only to illustrate what higher dimensions and nearest neighbors look like, for learning purposes. +3D scatter plot of the standardized symmetry, concavity, and perimeter +variables. Note that in general we recommend against using 3D visualizations; +here we show the data in 3D only to illustrate what higher dimensions and +nearest neighbors look like, for learning purposes. ``` +++ @@ -884,9 +886,8 @@ display(HTML("img/classification1/fig05-more.html")) In order to classify a new observation using a $K$-nearest neighbor classifier, we have to do the following: 1. Compute the distance between the new observation and each observation in the training set. -2. Sort the data table in ascending order according to the distances. -3. Choose the top $K$ rows of the sorted table. -4. Classify the new observation based on a majority vote of the neighbor classes. +2. Find the $K$ rows corresponding to the $K$ smallest distances. +3. Classify the new observation based on a majority vote of the neighbor classes. 
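The three steps above can be sketched in a few lines. This is a minimal illustration that uses only the Python standard library rather than the `pandas` code in the chapter. The function name `knn_classify`, the `train` list, and the three far-away points are hypothetical and added only for illustration. The first five points are the rounded neighbor values from the distance table earlier in the chapter.

```python
import heapq
import math
from collections import Counter


def knn_classify(train, new_obs, k):
    """Classify new_obs by majority vote among its k nearest training points.

    train is a list of (features, label) pairs; features and new_obs are
    tuples of the same length (any number of predictors).
    """
    # Step 1: straight-line distance from the new observation to every training row
    distances = [(math.dist(features, new_obs), label) for features, label in train]
    # Step 2: keep the k rows with the smallest distances (no full sort needed)
    nearest = heapq.nsmallest(k, distances, key=lambda pair: pair[0])
    # Step 3: majority vote over the neighbor labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0], nearest


# (Perimeter, Concavity) pairs: the first five rows come from the distance table,
# the last three are made-up points far from the new observation.
train = [
    ((0.24, 2.65), "Benign"),
    ((0.75, 2.87), "Malignant"),
    ((0.62, 2.54), "Malignant"),
    ((0.42, 2.31), "Malignant"),
    ((-1.16, 4.04), "Benign"),
    ((2.0, 0.5), "Malignant"),    # hypothetical
    ((-0.5, -0.8), "Benign"),     # hypothetical
    ((-1.0, -1.0), "Benign"),     # hypothetical
]

label, nearest = knn_classify(train, new_obs=(0, 3.5), k=5)
print(label)  # Malignant (3 of the 5 nearest neighbors)
```

Because `math.dist` accepts points of any dimension, the same sketch works unchanged with three predictors, as long as every tuple has the same length. When the vote is tied, `Counter.most_common` returns whichever label was counted first, so an odd $K$ is the simpler choice for two classes.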
+++ @@ -901,29 +902,10 @@ or predict the class for multiple new observations. Thankfully, in Python, the $K$-nearest neighbors algorithm is implemented in [the `scikit-learn` Python package](https://scikit-learn.org/stable/index.html) {cite:p}`sklearn_api` along with many [other models](https://scikit-learn.org/stable/user_guide.html) that you will encounter in this and future chapters of the book. Using the functions -in the `scikit-learn` package will help keep our code simple, readable and accurate; the +in the `scikit-learn` package (named `sklearn` in Python) will help keep our code simple, readable and accurate; the less we have to code ourselves, the fewer mistakes we will likely make. We start by importing `KNeighborsClassifier` from the `sklearn.neighbors` module. -```{code-cell} ipython3 -:tags: [remove-cell] - -## The above was based on: - -# Coding the $K$-nearest neighbors algorithm in R ourselves can get complicated, -# especially if we want to handle multiple classes, more than two variables, -# or predict the class for multiple new observations. Thankfully, in R, -# the $K$-nearest neighbors algorithm is -# implemented in [the `parsnip` R package](https://parsnip.tidymodels.org/) [@parsnip] -# included in `tidymodels`, along with -# many [other models](https://www.tidymodels.org/find/parsnip/) \index{tidymodels}\index{parsnip} -# that you will encounter in this and future chapters of the book. The `tidymodels` collection -# provides tools to help make and use models, such as classifiers. Using the packages -# in this collection will help keep our code simple, readable and accurate; the -# less we have to code ourselves, the fewer mistakes we will likely make. We -# start by loading `tidymodels`. 
-``` - ```{code-cell} ipython3 from sklearn.neighbors import KNeighborsClassifier ``` From ece61a828cd5524e73c8fc3bf2c91440e7359229 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sat, 31 Dec 2022 23:49:23 -0800 Subject: [PATCH 10/37] initial fit and predict polished; model spec -> model object --- source/classification1.md | 102 ++++++++++++-------------------------- 1 file changed, 31 insertions(+), 71 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index 8b81f7a6..22368ada 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -915,103 +915,63 @@ We will use the `cancer` data set from above, with perimeter and concavity as predictors and $K = 5$ neighbors to build our classifier. Then we will use the classifier to predict the diagnosis label for a new observation with perimeter 0, concavity 3.5, and an unknown diagnosis label. Let's pick out our two desired -predictor variables and class label and store them as a new data set named `cancer_train`: +predictor variables and class label and store them with the name `cancer_train`: ```{code-cell} ipython3 -cancer_train = cancer.loc[:, ['Class', 'Perimeter', 'Concavity']] +cancer_train = cancer[['Class', 'Perimeter', 'Concavity']] cancer_train ``` -```{index} scikit-learn; model instance, scikit-learn; KNeighborsClassifier +```{index} scikit-learn; model object, scikit-learn; KNeighborsClassifier ``` -Next, we create a *model specification* for $K$-nearest neighbors classification -by creating a `KNeighborsClassifier` instance, specifying that we want to use $K = 5$ neighbors -(we will discuss how to choose $K$ in the next chapter) and the straight-line -distance (`weights="uniform"`). The `weights` argument controls -how neighbors vote when classifying a new observation; by setting it to `"uniform"`, -each of the $K$ nearest neighbors gets exactly 1 vote as described above. 
Other choices, -which weigh each neighbor's vote differently, can be found on -[the `scikit-learn` website](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html?highlight=kneighborsclassifier#sklearn.neighbors.KNeighborsClassifier). +Next, we create a *model object* for $K$-nearest neighbors classification +by creating a `KNeighborsClassifier` instance, specifying that we want to use $K = 5$ neighbors; +we will discuss how to choose $K$ in the next chapter. -```{code-cell} ipython3 -:tags: [remove-cell] +> **Note:** You can specify the `weights` argument in order to control +> how neighbors vote when classifying a new observation. The default is `"uniform"`, where +> each of the $K$ nearest neighbors gets exactly 1 vote as described above. Other choices, +> which weigh each neighbor's vote differently, can be found on +> [the `scikit-learn` website](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html?highlight=kneighborsclassifier#sklearn.neighbors.KNeighborsClassifier). -## The above was based on: - -# Next, we create a *model specification* for \index{tidymodels!model specification} $K$-nearest neighbors classification -# by calling the `nearest_neighbor` function, specifying that we want to use $K = 5$ neighbors -# (we will discuss how to choose $K$ in the next chapter) and the straight-line -# distance (`weight_func = "rectangular"`). The `weight_func` argument controls -# how neighbors vote when classifying a new observation; by setting it to `"rectangular"`, -# each of the $K$ nearest neighbors gets exactly 1 vote as described above. Other choices, -# which weigh each neighbor's vote differently, can be found on -# [the `parsnip` website](https://parsnip.tidymodels.org/reference/nearest_neighbor.html). -# In the `set_engine` \index{tidymodels!engine} argument, we specify which package or system will be used for training -# the model. 
Here `kknn` is the R package we will use for performing $K$-nearest neighbors classification. -# Finally, we specify that this is a classification problem with the `set_mode` function. -``` ```{code-cell} ipython3 -knn_spec = KNeighborsClassifier(n_neighbors=5) -knn_spec +knn = KNeighborsClassifier(n_neighbors=5) +knn ``` ```{index} scikit-learn; X & y ``` -In order to fit the model on the breast cancer data, we need to call `fit` on the classifier object and pass the data in the argument. We also need to specify what variables to use as predictors and what variable to use as the target. Below, the `X=cancer_train[["Perimeter", "Concavity"]]` and the `y=cancer_train['Class']` argument specifies -that `Class` is the target variable (the one we want to predict), -and both `Perimeter` and `Concavity` are to be used as the predictors. +In order to fit the model on the breast cancer data, we need to call `fit` on +the model object. The `X` argument is used to specify the data for the predictor +variables, while the `y` argument is used to specify the data for the response variable. +So below, we set `X=cancer_train[["Perimeter", "Concavity"]]` and +`y=cancer_train['Class']` to specify that `Class` is the target +variable (the one we want to predict), and both `Perimeter` and `Concavity` are +to be used as the predictors. Note that the `fit` function might look like it does not +do much from the outside, but it is actually doing all the heavy lifting to train +the K-nearest neighbors model, and modifies the `knn` model object. ```{code-cell} ipython3 -:tags: [remove-cell] - -## The above was based on: - -# In order to fit the model on the breast cancer data, we need to pass the model specification -# and the data set to the `fit` function. We also need to specify what variables to use as predictors -# and what variable to use as the target. 
Below, the `Class ~ Perimeter + Concavity` argument specifies -# that `Class` is the target variable (the one we want to predict), -# and both `Perimeter` and `Concavity` are to be used as the predictors. - - -# We can also use a convenient shorthand syntax using a period, `Class ~ .`, to indicate -# that we want to use every variable *except* `Class` \index{tidymodels!model formula} as a predictor in the model. -# In this particular setup, since `Concavity` and `Perimeter` are the only two predictors in the `cancer_train` -# data frame, `Class ~ Perimeter + Concavity` and `Class ~ .` are equivalent. -# In general, you can choose individual predictors using the `+` symbol, or you can specify to -# use *all* predictors using the `.` symbol. -``` - -```{code-cell} ipython3 -knn_spec.fit(X=cancer_train[["Perimeter", "Concavity"]], y=cancer_train["Class"]); -``` - -```{code-cell} ipython3 -:tags: [remove-cell] - -# Here you can see the final trained model summary. It confirms that the computational engine used -# to train the model was `kknn::train.kknn`. It also shows the fraction of errors made by -# the nearest neighbor model, but we will ignore this for now and discuss it in more detail -# in the next chapter. -# Finally, it shows (somewhat confusingly) that the "best" weight function -# was "rectangular" and "best" setting of $K$ was 5; but since we specified these earlier, -# R is just repeating those settings to us here. In the next chapter, we will actually -# let R find the value of $K$ for us. +knn.fit(X=cancer_train[["Perimeter", "Concavity"]], y=cancer_train["Class"]); ``` ```{index} scikit-learn; predict ``` -Finally, we make the prediction on the new observation by calling `predict` on the classifier object, -passing the new observation itself. As above, -when we ran the $K$-nearest neighbors -classification algorithm manually, the `knn_fit` object classifies the new observation as "Malignant". 
Note that the `predict` function outputs a `numpy` array with the model's prediction. +After using the `fit` function, we can make a prediction on a new observation +by calling `predict` on the classifier object, passing the new observation +itself. As above, when we ran the $K$-nearest neighbors classification +algorithm manually, the `knn` model object classifies the new observation as +"Malignant". Note that the `predict` function outputs an `array` with the +model's prediction; you can actually make multiple predictions at the same +time using the `predict` function, which is why the output is stored as an `array`. ```{code-cell} ipython3 new_obs = pd.DataFrame({'Perimeter': [0], 'Concavity': [3.5]}) -knn_spec.predict(new_obs) +knn.predict(new_obs) ``` Is this predicted malignant label the true class for this observation? From e874666eb164375f1537fde3d62373b80f7651dc Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sun, 1 Jan 2023 21:38:29 -0800 Subject: [PATCH 11/37] polishing preprocessing --- source/classification1.md | 240 ++++++++++++++++++-------------------- 1 file changed, 112 insertions(+), 128 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index 22368ada..fc9084f1 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -903,8 +903,16 @@ the $K$-nearest neighbors algorithm is implemented in [the `scikit-learn` Python package](https://scikit-learn.org/stable/index.html) {cite:p}`sklearn_api` along with many [other models](https://scikit-learn.org/stable/user_guide.html) that you will encounter in this and future chapters of the book. Using the functions in the `scikit-learn` package (named `sklearn` in Python) will help keep our code simple, readable and accurate; the -less we have to code ourselves, the fewer mistakes we will likely make. We -start by importing `KNeighborsClassifier` from the `sklearn.neighbors` module. +less we have to code ourselves, the fewer mistakes we will likely make. 
+Before getting started with $K$-nearest neighbors, we need to tell the `sklearn` package +that we prefer using `pandas` data frames over regular arrays via the `set_config` function. +```{code-cell} ipython3 +from sklearn import set_config +set_config(transform_output="pandas") +``` + +We can now get started with $K$-nearest neighbors. The first step is to + import the `KNeighborsClassifier` from the `sklearn.neighbors` module. ```{code-cell} ipython3 from sklearn.neighbors import KNeighborsClassifier @@ -1030,19 +1038,26 @@ is said to be *standardized*, and all variables in a data set will have a mean o and a standard deviation of 1. To illustrate the effect that standardization can have on the $K$-nearest neighbor algorithm, we will read in the original, unstandardized Wisconsin breast cancer data set; we have been using a standardized version of the data set up -until now. To keep things simple, we will just use the `Area`, `Smoothness`, and `Class` +until now. We will apply the same initial wrangling steps as we did earlier, +and to keep things simple we will just use the `Area`, `Smoothness`, and `Class` variables: ```{code-cell} ipython3 -unscaled_cancer = pd.read_csv("data/unscaled_wdbc.csv") -unscaled_cancer = unscaled_cancer[['Class', 'Area', 'Smoothness']] +unscaled_cancer = ( + pd.read_csv("data/unscaled_wdbc.csv") + .loc[:, ['Class', 'Area', 'Smoothness']] + .replace({ + 'M' : 'Malignant', + 'B' : 'Benign' + }) + ) +unscaled_cancer['Class'] = unscaled_cancer['Class'].astype('category') unscaled_cancer ``` Looking at the unscaled and uncentered data above, you can see that the differences between the values for area measurements are much larger than those for -smoothness. Will this affect -predictions? In order to find out, we will create a scatter plot of these two +smoothness. Will this affect predictions? 
In order to find out, we will create a scatter plot of these two predictors (colored by diagnosis) for both the unstandardized data we just loaded, and the standardized version of that same data. But first, we need to standardize the `unscaled_cancer` data set with `scikit-learn`. @@ -1053,32 +1068,28 @@ standardize the `unscaled_cancer` data set with `scikit-learn`. ```{index} double: scikit-learn; pipeline ``` -In the `scikit-learn` framework, all data preprocessing and modeling can be built using a [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html?highlight=pipeline#sklearn.pipeline.Pipeline), or a more convenient function [`make_pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline) for simple pipeline construction. -Here we will initialize a preprocessor using `make_column_transformer` for -the `unscaled_cancer` data above, specifying -that we want to standardize the predictors `Area` and `Smoothness`: +The `scikit-learn` framework provides a collection of *preprocessors* used to manipulate +data in the [`preprocessing` module](https://scikit-learn.org/stable/modules/preprocessing.html). +Here we will use the `StandardScaler` transformer to standardize the predictor variables in +the `unscaled_cancer` data. In order to tell the `StandardScaler` which variables to standardize, +we wrap it in a +[`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer) object +using the [`make_column_transformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html#sklearn.compose.make_column_transformer) function. `ColumnTransformer` objects also enable the use of multiple preprocessors at +once, which is especially handy when you want to apply different preprocessing to each of the predictor variables. 
+The primary argument of the `make_column_transformer` function is a sequence of +pairs of (1) a preprocessor, and (2) the columns to which you want to apply that preprocessor. +In the present case, we just have the one `StandardScaler` preprocessor to apply to the `Area` and `Smoothness` columns. ```{code-cell} ipython3 -:tags: [remove-cell] - -## The above was based on: +from sklearn.preprocessing import StandardScaler +from sklearn.compose import make_column_transformer -# In the `tidymodels` framework, all data preprocessing happens -# using a `recipe` from [the `recipes` R package](https://recipes.tidymodels.org/) [@recipes] -# Here we will initialize a recipe \index{recipe} \index{tidymodels!recipe|see{recipe}} for -# the `unscaled_cancer` data above, specifying -# that the `Class` variable is the target, and all other variables are predictors: -``` - -```{code-cell} ipython3 preprocessor = make_column_transformer( (StandardScaler(), ["Area", "Smoothness"]), ) preprocessor ``` -So far, we have built a preprocessor so that each of the predictors have a mean of 0 and standard deviation of 1. - ```{index} scikit-learn; ColumnTransformer, scikit-learn; StandardScaler, scikit-learn; fit_transform ``` @@ -1088,68 +1099,74 @@ So far, we have built a preprocessor so that each of the predictors have a mean ```{index} scikit-learn; fit, scikit-learn; transform ``` -You can now see that the recipe includes a scaling and centering step for all predictor variables. -Note that when you add a step to a `ColumnTransformer`, you must specify what columns to apply the step to. -Here we specified that `StandardScaler` should be applied to -all predictor variables. - -```{index} see: fit, transform, fit_transform; scikit-learn -``` - -At this point, the data are not yet scaled and centered. To actually scale and center -the data, we need to call `fit` and `transform` on the unscaled data ( can be combined into `fit_transform`). 
+You can see that the preprocessor includes a single standardization step +that is applied to the `Area` and `Smoothness` columns. +Note that here we specified which columns to apply the preprocessing step to +by individual names; this approach can become quite difficult, e.g., when we have many +predictor variables. Rather than writing out the column names individually, +we can instead use the +[`make_column_selector`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html#sklearn.compose.make_column_selector) function. For example, if we wanted to standardize all *numerical* predictors, +we would use `make_column_selector` and specify the `dtype_include` argument to be `np.number` +(from the `numpy` package). This creates a preprocessor equivalent to the one we +created previously. ```{code-cell} ipython3 -:tags: [remove-cell] +import numpy as np +from sklearn.compose import make_column_selector -# So far, there is not much in the recipe; just a statement about the number of targets -# and predictors. Let's add -# scaling (`step_scale`) \index{recipe!step\_scale} and -# centering (`step_center`) \index{recipe!step\_center} steps for -# all of the predictors so that they each have a mean of 0 and standard deviation of 1. -# Note that `tidyverse` actually provides `step_normalize`, which does both centering and scaling in -# a single recipe step; in this book we will keep `step_scale` and `step_center` separate -# to emphasize conceptually that there are two steps happening. 
-# The `prep` function finalizes the recipe by using the data (here, `unscaled_cancer`) \index{tidymodels!prep}\index{prep|see{tidymodels}} -# to compute anything necessary to run the recipe (in this case, the column means and standard -# deviations): +preprocessor = make_column_transformer( + (StandardScaler(), make_column_selector(dtype_include=np.number)), + ) +preprocessor ``` -```{code-cell} ipython3 -:tags: [remove-cell] - -# You can now see that the recipe includes a scaling and centering step for all predictor variables. -# Note that when you add a step to a recipe, you must specify what columns to apply the step to. -# Here we used the `all_predictors()` \index{recipe!all\_predictors} function to specify that each step should be applied to -# all predictor variables. However, there are a number of different arguments one could use here, -# as well as naming particular columns with the same syntax as the `select` function. -# For example: - -# - `all_nominal()` and `all_numeric()`: specify all categorical or all numeric variables -# - `all_predictors()` and `all_outcomes()`: specify all predictor or all target variables -# - `Area, Smoothness`: specify both the `Area` and `Smoothness` variable -# - `-Class`: specify everything except the `Class` variable - -# You can find a full set of all the steps and variable selection functions -# on the [`recipes` reference page](https://recipes.tidymodels.org/reference/index.html). - -# At this point, we have calculated the required statistics based on the data input into the -# recipe, but the data are not yet scaled and centered. To actually scale and center -# the data, we need to apply the `bake` \index{tidymodels!bake} \index{bake|see{tidymodels}} function to the unscaled data. +```{index} see: fit, transform, fit_transform; scikit-learn ``` +We are now ready to standardize the numerical predictor columns in the `unscaled_cancer` data frame. +This happens in two steps. 
We first use the `fit` function to compute the values necessary to apply +the standardization (the mean and standard deviation of each variable), passing the `unscaled_cancer` data as an argument. +Then we use the `transform` function to actually apply the standardization. +It may seem a bit unnecessary to use two steps---`fit` *and* `transform`---to standardize the data. +However, we do this in two steps so that we can specify a different data set in the `transform` step if we want. +This enables us to compute the quantities needed to standardize using one data set, and then +apply that standardization to another data set. + ```{code-cell} ipython3 preprocessor.fit(unscaled_cancer) scaled_cancer = preprocessor.transform(unscaled_cancer) -# scaled_cancer = preprocessor.fit_transform(unscaled_cancer) -scaled_cancer = pd.DataFrame(scaled_cancer, columns=['Area', 'Smoothness']) -scaled_cancer['Class'] = unscaled_cancer['Class'] scaled_cancer ``` +```{code-cell} ipython3 + +``` +It looks like our `Smoothness` and `Area` variables have been standardized. Woohoo! +But there are two important things to notice about the new `scaled_cancer` data frame. First, it only keeps +the columns from the input to `transform` (here, `unscaled_cancer`) that had a preprocessing step applied +to them. The default behavior of the `ColumnTransformer` that we build using `make_column_transformer` +is to *drop* the remaining columns. This default behavior works well with the rest of `sklearn` (as we will see below +in the {ref}`08:puttingittogetherworkflow` section), but for visualizing the result of preprocessing it can be useful to keep the other columns +in our original data frame, such as the `Class` variable here. +To keep other columns, we need to set the `remainder` argument to `'passthrough'` in the `make_column_transformer` function. 
+ Furthermore, you can see that the new column names---{glue:}`scaled_cancer.columns[0]` +and {glue:}`scaled_cancer.columns[1]`---include the name +of the preprocessing step separated by underscores. This default behavior is useful in `sklearn` because we sometimes want to apply +multiple different preprocessing steps to the same columns; but again, for visualization it can be useful to preserve +the original column names. To keep original column names, we need to set the `verbose_feature_names_out` argument to `False`. + +> **Note:** Only specify the `remainder` and `verbose_feature_names_out` arguments when you want to examine the result +> of your preprocessing step. In most cases, you should leave these arguments at their default values. -It may seem redundant that we had to both `fit` *and* `transform` to scale and center the data. - However, we do this in two steps so we can specify a different data set in the `transform` step if we want. - For example, we may want to specify new data that were not part of the training set. +```{code-cell} ipython3 +preprocessor_keep_all = make_column_transformer( + (StandardScaler(), make_column_selector(dtype_include=np.number)), + remainder='passthrough', + verbose_feature_names_out=False + ) +preprocessor_keep_all.fit(unscaled_cancer) +scaled_cancer_all = preprocessor_keep_all.transform(unscaled_cancer) +scaled_cancer_all +``` You may wonder why we are doing so much work just to center and scale our variables. Can't we just manually scale and center the `Area` and @@ -1158,33 +1175,14 @@ technically *yes*; but doing so is error-prone. In particular, we might accidentally forget to apply the same centering / scaling when making predictions, or accidentally apply a *different* centering / scaling than what we used while training. Proper use of a `ColumnTransformer` helps keep our code simple, -readable, and error-free. 
Furthermore, note that using `fit` and `transform` on the preprocessor is required only when you want to inspect the result of the preprocessing steps -yourself. You will see further on in -Section {ref}`08:puttingittogetherworkflow` that `scikit-learn` provides tools to -automatically streamline the preprocesser and the model so that you can call`fit` +readable, and error-free. Furthermore, note that using `fit` and `transform` on +the preprocessor is required only when you want to inspect the result of the +preprocessing steps +yourself. You will see further on in the +{ref}`08:puttingittogetherworkflow` section that `scikit-learn` provides tools to +automatically streamline the preprocesser and the model so that you can call `fit` and `transform` on the `Pipeline` as necessary without additional coding effort. -```{code-cell} ipython3 -:tags: [remove-cell] - -# It may seem redundant that we had to both `bake` *and* `prep` to scale and center the data. -# However, we do this in two steps so we can specify a different data set in the `bake` step if we want. -# For example, we may want to specify new data that were not part of the training set. - -# You may wonder why we are doing so much work just to center and -# scale our variables. Can't we just manually scale and center the `Area` and -# `Smoothness` variables ourselves before building our $K$-nearest neighbor model? Well, -# technically *yes*; but doing so is error-prone. In particular, we might -# accidentally forget to apply the same centering / scaling when making -# predictions, or accidentally apply a *different* centering / scaling than what -# we used while training. Proper use of a `recipe` helps keep our code simple, -# readable, and error-free. Furthermore, note that using `prep` and `bake` is -# required only when you want to inspect the result of the preprocessing steps -# yourself. 
You will see further on in Section -# \@ref(puttingittogetherworkflow) that `tidymodels` provides tools to -# automatically apply `prep` and `bake` as necessary without additional coding effort. -``` - {numref}`fig:05-scaling-plt` shows the two scatter plots side-by-side—one for `unscaled_cancer` and one for `scaled_cancer`. Each has the same new observation annotated with its $K=3$ nearest neighbors. In the original unstandardized data plot, you can see some odd choices @@ -1214,7 +1212,7 @@ def class_dscp(x): attrs = ["Area", "Smoothness"] -new_obs = pd.DataFrame({"Class": ["Unknwon"], "Area": 400, "Smoothness": 0.135}) +new_obs = pd.DataFrame({"Class": ["Unknown"], "Area": 400, "Smoothness": 0.135}) unscaled_cancer["Class"] = unscaled_cancer["Class"].apply(class_dscp) area_smoothness_new_df = pd.concat((unscaled_cancer, new_obs), ignore_index=True) my_distances = euclidean_distances(area_smoothness_new_df.loc[:, attrs])[ @@ -1231,7 +1229,6 @@ area_smoothness_new_point = ( y=alt.Y("Smoothness"), color=alt.Color( "Class", - scale=alt.Scale(range=["#86bfef", "#efb13f", "red"]), title="Diagnosis", ), shape=alt.Shape( @@ -1288,13 +1285,13 @@ area_smoothness_new_point = area_smoothness_new_point + line1 + line2 + line3 :tags: [remove-cell] attrs = ["Area", "Smoothness"] -new_obs_scaled = pd.DataFrame({"Class": ["Unknwon"], "Area": -0.72, "Smoothness": 2.8}) -scaled_cancer["Class"] = scaled_cancer["Class"].apply(class_dscp) +new_obs_scaled = pd.DataFrame({"Class": ["Unknown"], "Area": -0.72, "Smoothness": 2.8}) +scaled_cancer_all["Class"] = scaled_cancer_all["Class"].apply(class_dscp) area_smoothness_new_df_scaled = pd.concat( - (scaled_cancer, new_obs_scaled), ignore_index=True + (scaled_cancer_all, new_obs_scaled), ignore_index=True ) my_distances_scaled = euclidean_distances(area_smoothness_new_df_scaled.loc[:, attrs])[ - len(scaled_cancer) + len(scaled_cancer_all) ][:-1] area_smoothness_new_point_scaled = ( alt.Chart( @@ -1307,7 +1304,6 @@ 
area_smoothness_new_point_scaled = ( y=alt.Y("Smoothness", title="Smoothness (standardized)"), color=alt.Color( "Class", - scale=alt.Scale(range=["#86bfef", "#efb13f", "red"]), title="Diagnosis", ), shape=alt.Shape( @@ -1319,21 +1315,21 @@ area_smoothness_new_point_scaled = ( min_3_idx_scaled = np.argpartition(my_distances_scaled, 3)[:3] neighbor1_scaled = pd.concat( ( - scaled_cancer.loc[min_3_idx_scaled[0], attrs], + scaled_cancer_all.loc[min_3_idx_scaled[0], attrs], new_obs_scaled[attrs].T, ), axis=1, ).T neighbor2_scaled = pd.concat( ( - scaled_cancer.loc[min_3_idx_scaled[1], attrs], + scaled_cancer_all.loc[min_3_idx_scaled[1], attrs], new_obs_scaled[attrs].T, ), axis=1, ).T neighbor3_scaled = pd.concat( ( - scaled_cancer.loc[min_3_idx_scaled[2], attrs], + scaled_cancer_all.loc[min_3_idx_scaled[2], attrs], new_obs_scaled[attrs].T, ), axis=1, @@ -1380,24 +1376,6 @@ Comparison of K = 3 nearest neighbors with standardized and unstandardized data. ```{code-cell} ipython3 :tags: [remove-cell] -# Could not find something mimicing `facet_zoom` in R, here are 2 plots trying to -# illustrate similar points -# 1. interactive plot which allows zooming in/out -glue('fig:05-scaling-plt-interactive', area_smoothness_new_point.interactive()) -``` - -+++ {"tags": ["remove-cell"]} - -:::{glue:figure} fig:05-scaling-plt-interactive -:name: fig:05-scaling-plt-interactive - -Close-up of three nearest neighbors for unstandardized data. -::: - -```{code-cell} ipython3 -:tags: [remove-cell] - -# 2. Static plot, Zoom-in zoom_area_smoothness_new_point = ( alt.Chart( area_smoothness_new_df, @@ -1409,7 +1387,6 @@ zoom_area_smoothness_new_point = ( y=alt.Y("Smoothness", scale=alt.Scale(domain=(0.08, 0.14))), color=alt.Color( "Class", - scale=alt.Scale(range=["#86bfef", "#efb13f", "red"]), title="Diagnosis", ), shape=alt.Shape( @@ -1514,7 +1491,7 @@ in the training data that were tagged as malignant. 
attrs = ["Perimeter", "Concavity"] new_point = [2, 2] new_point_df = pd.DataFrame( - {"Class": ["Unknwon"], "Perimeter": new_point[0], "Concavity": new_point[1]} + {"Class": ["Unknown"], "Perimeter": new_point[0], "Concavity": new_point[1]} ) rare_cancer["Class"] = rare_cancer["Class"].apply(class_dscp) rare_cancer_with_new_df = pd.concat((rare_cancer, new_point_df), ignore_index=True) @@ -1799,6 +1776,13 @@ placed in a `Pipeline`. We will now place these steps in a `Pipeline` using the `make_pipeline` function, and finally we will call `.fit()` to run the whole `Pipeline` on the `unscaled_cancer` data. +all data preprocessing and modeling can be +built using a +[`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html?highlight=pipeline#sklearn.pipeline.Pipeline), +or a more convenient function +[`make_pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline) +for simple pipeline construction. + ```{code-cell} ipython3 :tags: [remove-cell] From c5c8769f2375d29a2edb5f9001b577a65eb53dd6 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sun, 1 Jan 2023 22:19:23 -0800 Subject: [PATCH 12/37] balancing polished --- source/classification1.md | 119 ++++++++++++++++---------------------- 1 file changed, 50 insertions(+), 69 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index fc9084f1..414597fd 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -1429,39 +1429,25 @@ what the data would look like if the cancer was rare. We will do this by picking only 3 observations from the malignant group, and keeping all of the benign observations. We choose these 3 observations using the `.head()` method, which takes the number of rows to select from the top (`n`). -The new imbalanced data is shown in {numref}`fig:05-unbalanced`. 
+We use the [`concat`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) +function from `pandas` to glue the two resulting filtered +data frames back together by passing them together in a sequence. +The new imbalanced data is shown in {numref}`fig:05-unbalanced`, +and we print the counts of the classes using the `value_counts` function. ```{code-cell} ipython3 -:tags: [remove-cell] - -# To better illustrate the problem, let's revisit the scaled breast cancer data, -# `cancer`; except now we will remove many of the observations of malignant tumors, simulating -# what the data would look like if the cancer was rare. We will do this by -# picking only 3 observations from the malignant group, and keeping all -# of the benign observations. We choose these 3 observations using the `slice_head` -# function, which takes two arguments: a data frame-like object, -# and the number of rows to select from the top (`n`). -# The new imbalanced data is shown in Figure \@ref(fig:05-unbalanced). -``` - -```{code-cell} ipython3 -cancer = pd.read_csv("data/wdbc.csv") rare_cancer = pd.concat( - (cancer.query("Class == 'B'"), cancer.query("Class == 'M'").head(3)) -) -colors = ["#86bfef", "#efb13f"] -rare_cancer["Class"] = rare_cancer["Class"].apply( - lambda x: "Malignant" if (x == "M") else "Benign" -) + (cancer[cancer["Class"] == 'Benign'], + cancer[cancer["Class"] == 'Malignant'].head(3) + )) + rare_plot = ( - alt.Chart( - rare_cancer - ) + alt.Chart(rare_cancer) .mark_point(opacity=0.6, filled=True, size=40) .encode( x=alt.X("Perimeter", title="Perimeter (standardized)"), y=alt.Y("Concavity", title="Concavity (standardized)"), - color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"), + color=alt.Color("Class", title="Diagnosis"), ) ) rare_plot @@ -1474,6 +1460,10 @@ rare_plot Imbalanced data. ``` +```{code-cell} ipython3 +rare_cancer['Class'].value_counts() +``` + +++ Suppose we now decided to use $K = 7$ in $K$-nearest neighbor classification. 
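The voting problem that this passage sets up can be checked with a small sketch. The toy coordinates, the labels, and the helper name `knn_vote` are hypothetical and are not taken from the chapter's data. The point is only that with 3 malignant observations and $K = 7$, at most 3 of the 7 votes can be malignant, so the majority is always benign.

```python
import math
from collections import Counter


def knn_vote(train, query, k):
    """Return the majority label among the k training points nearest to query."""
    # Sort training rows by straight-line (Euclidean) distance to the query.
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: 3 malignant points clustered near (2, 2)
# and 20 benign points spread along a line further away.
malignant = [
    ((2.0, 2.0), "Malignant"),
    ((2.1, 1.9), "Malignant"),
    ((1.9, 2.2), "Malignant"),
]
benign = [((-1.0 + 0.1 * i, -1.0 + 0.05 * i), "Benign") for i in range(20)]
train = malignant + benign

# A query sitting right on top of the malignant cluster is still outvoted
# when K = 7, because at most 3 of the 7 neighbors can be malignant.
prediction_k7 = knn_vote(train, (2.0, 2.0), k=7)

# With K = 3 the three malignant points are the only voters.
prediction_k3 = knn_vote(train, (2.0, 2.0), k=3)

print(prediction_k7, prediction_k3)
```

In this sketch the only thing that changes between the two predictions is $K$; the data and the query point are identical.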
@@ -1510,7 +1500,6 @@ rare_plot = ( y=alt.Y("Concavity", title="Concavity (standardized)"), color=alt.Color( "Class", - scale=alt.Scale(range=["#86bfef", "#efb13f", "red"]), title="Diagnosis", ), shape=alt.Shape( @@ -1525,9 +1514,9 @@ min_7_idx = np.argpartition(my_distances, 7)[:7] # For loop: each iteration adds a line segment of corresponding color for i in range(7): - clr = "#86bfef" + clr = "#1f77b4" if rare_cancer.iloc[min_7_idx[i], :]["Class"] == "Malignant": - clr = "#efb13f" + clr = "#ff7f0e" neighbor = pd.concat( ( rare_cancer.iloc[min_7_idx[i], :][attrs], @@ -1560,21 +1549,24 @@ always "benign," corresponding to the blue color. ```{code-cell} ipython3 :tags: [remove-cell] -knn_spec = KNeighborsClassifier(n_neighbors=7) -knn_spec.fit(X=rare_cancer.loc[:, ["Perimeter", "Concavity"]], y=rare_cancer["Class"]) +knn = KNeighborsClassifier(n_neighbors=7) +knn.fit(X=rare_cancer.loc[:, ["Perimeter", "Concavity"]], y=rare_cancer["Class"]) # create a prediction pt grid per_grid = np.linspace( - rare_cancer["Perimeter"].min(), rare_cancer["Perimeter"].max(), 100 + rare_cancer["Perimeter"].min(), rare_cancer["Perimeter"].max(), 50 ) con_grid = np.linspace( - rare_cancer["Concavity"].min(), rare_cancer["Concavity"].max(), 100 + rare_cancer["Concavity"].min(), rare_cancer["Concavity"].max(), 50 ) pcgrid = np.array(np.meshgrid(per_grid, con_grid)).reshape(2, -1).T pcgrid = pd.DataFrame(pcgrid, columns=["Perimeter", "Concavity"]) -knnPredGrid = knn_spec.predict(pcgrid) +pcgrid + +knnPredGrid = knn.predict(pcgrid) prediction_table = pcgrid.copy() prediction_table["Class"] = knnPredGrid +prediction_table # create the scatter plot rare_plot = ( @@ -1585,7 +1577,7 @@ rare_plot = ( .encode( x=alt.X("Perimeter", title="Perimeter (standardized)"), y=alt.Y("Concavity", title="Concavity (standardized)"), - color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"), + color=alt.Color("Class", title="Diagnosis"), ) ) @@ -1595,7 +1587,7 @@ prediction_plot = ( 
prediction_table, title="Imbalanced data", ) - .mark_point(opacity=0.02, filled=True, size=200) + .mark_point(opacity=0.05, filled=True, size=300) .encode( x=alt.X( "Perimeter", @@ -1611,10 +1603,10 @@ prediction_plot = ( domain=(rare_cancer["Concavity"].min(), rare_cancer["Concavity"].max()) ), ), - color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"), + color=alt.Color("Class", title="Diagnosis"), ) ) -rare_plot + prediction_plot +#rare_plot + prediction_plot glue("fig:05-upsample-2", (rare_plot + prediction_plot)) ``` @@ -1633,27 +1625,16 @@ Despite the simplicity of the problem, solving it in a statistically sound manne fairly nuanced, and a careful treatment would require a lot more detail and mathematics than we will cover in this textbook. For the present purposes, it will suffice to rebalance the data by *oversampling* the rare class. In other words, we will replicate rare observations multiple times in our data set to give them more -voting power in the $K$-nearest neighbor algorithm. In order to do this, we will need an oversampling -step with the `resample` function from the `sklearn` Python package. -We show below how to do this, and also -use the `.groupby()` and `.count()` methods to see that our classes are now balanced: - -```{code-cell} ipython3 -:tags: [remove-cell] - -# Despite the simplicity of the problem, solving it in a statistically sound manner is actually -# fairly nuanced, and a careful treatment would require a lot more detail and mathematics than we will cover in this textbook. -# For the present purposes, it will suffice to rebalance the data by *oversampling* the rare class. \index{oversampling} -# In other words, we will replicate rare observations multiple times in our data set to give them more -# voting power in the $K$-nearest neighbor algorithm. In order to do this, we will add an oversampling -# step to the earlier `uc_recipe` recipe with the `step_upsample` function from the `themis` R package. 
\index{recipe!step\_upsample} -# We show below how to do this, and also -# use the `group_by` and `summarize` functions to see that our classes are now balanced: -``` - -```{code-cell} ipython3 -rare_cancer['Class'].value_counts() -``` +voting power in the $K$-nearest neighbor algorithm. In order to do this, we will +first separate the classes out into their own data frames by filtering. +Then, we will +use the [`resample`](https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html) function +from the `sklearn` package to increase the number of `Malignant` observations to be the same as the number +of `Benign` observations. We set the `n_samples` argument to be the number of `Malignant` observations we want. +We also set the `random_state` to be some integer +so that our results are reproducible; if we do not set this argument, we will get a different upsampling each time +we run the code. Finally, we use the `value_counts` method + to see that our classes are now balanced. ```{code-cell} ipython3 from sklearn.utils import resample @@ -1664,7 +1645,7 @@ malignant_cancer_upsample = resample( malignant_cancer, n_samples=len(benign_cancer), random_state=100 ) upsampled_cancer = pd.concat((malignant_cancer_upsample, benign_cancer)) -upsampled_cancer.groupby(by='Class')['Class'].count() +upsampled_cancer['Class'].value_counts() ``` Now suppose we train our $K$-nearest neighbor classifier with $K=7$ on this *balanced* data. @@ -1677,13 +1658,13 @@ closer to the benign tumor observations. 
```{code-cell} ipython3 :tags: [remove-cell] -knn_spec = KNeighborsClassifier(n_neighbors=7) -knn_spec.fit( +knn = KNeighborsClassifier(n_neighbors=7) +knn.fit( X=upsampled_cancer.loc[:, ["Perimeter", "Concavity"]], y=upsampled_cancer["Class"] ) # create a prediction pt grid -knnPredGrid = knn_spec.predict(pcgrid) +knnPredGrid = knn.predict(pcgrid) prediction_table = pcgrid prediction_table["Class"] = knnPredGrid @@ -1706,21 +1687,21 @@ rare_plot = ( domain=(rare_cancer["Concavity"].min(), rare_cancer["Concavity"].max()) ), ), - color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"), + color=alt.Color("Class", title="Diagnosis"), ) ) # add a prediction layer, also scatter plot upsampled_plot = ( alt.Chart(prediction_table) - .mark_point(opacity=0.02, filled=True, size=200) + .mark_point(opacity=0.05, filled=True, size=300) .encode( x=alt.X("Perimeter", title="Perimeter (standardized)"), y=alt.Y("Concavity", title="Concavity (standardized)"), - color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"), + color=alt.Color("Class", title="Diagnosis"), ) ) -rare_plot + upsampled_plot +#rare_plot + upsampled_plot glue("fig:05-upsample-plot", (rare_plot + upsampled_plot)) ``` @@ -1759,7 +1740,7 @@ First we will load the data, create a model, and specify a preprocessor for how unscaled_cancer = pd.read_csv("data/unscaled_wdbc.csv") # create the KNN model -knn_spec = KNeighborsClassifier(n_neighbors=7) +knn = KNeighborsClassifier(n_neighbors=7) # create the centering / scaling preprocessor preprocessor = make_column_transformer( @@ -1800,7 +1781,7 @@ for simple pipeline construction. 
``` ```{code-cell} ipython3 -knn_fit = make_pipeline(preprocessor, knn_spec).fit( +knn_fit = make_pipeline(preprocessor, knn).fit( X=unscaled_cancer.loc[:, ["Area", "Smoothness"]], y=unscaled_cancer["Class"] ) @@ -1819,7 +1800,7 @@ one with `Area = 500` and `Smoothness = 0.075`, and one with `Area = 1500` and ` # As before, the fit object lists the function that trains the model as well as the "best" settings # for the number of neighbors and weight function (for now, these are just the values we chose -# manually when we created `knn_spec` above). But now the fit object also includes information about +# manually when we created `knn` above). But now the fit object also includes information about # the overall workflow, including the centering and scaling preprocessing steps. # In other words, when we use the `predict` function with the `knn_fit` object to make a prediction for a new # observation, it will first apply the same recipe steps to the new observation. From a9deb2efd25101455f46f27a44a0d0ddfc25a2e3 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sun, 1 Jan 2023 22:38:53 -0800 Subject: [PATCH 13/37] pipelines --- source/classification1.md | 110 +++++++++++++++----------------------- 1 file changed, 43 insertions(+), 67 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index 414597fd..d2564ca5 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -1714,30 +1714,28 @@ Upsampled data with background color indicating the decision of the classifier. +++ (08:puttingittogetherworkflow)= -## Putting it together in a `pipeline` +## Putting it together in a `Pipeline` ```{index} scikit-learn; pipeline ``` -The `scikit-learn` package collection also provides the `pipeline`, a way to chain together multiple data analysis steps without a lot of otherwise necessary code for intermediate steps. -To illustrate the whole pipeline, let's start from scratch with the `unscaled_wdbc.csv` data. 
-First we will load the data, create a model, and specify a preprocessor for how the data should be preprocessed: +The `scikit-learn` package collection also provides the [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html?highlight=pipeline#sklearn.pipeline.Pipeline), +a way to chain together multiple data analysis steps without a lot of otherwise necessary code for intermediate steps. +To illustrate the whole workflow, let's start from scratch with the `unscaled_wdbc.csv` data. +First we will load the data, create a model, and specify a preprocessor for the data. ```{code-cell} ipython3 -:tags: [remove-cell] - -# The `tidymodels` package collection also provides the `workflow`, -# a way to chain\index{tidymodels!workflow}\index{workflow|see{tidymodels}} -# together multiple data analysis steps without a lot of otherwise necessary code -# for intermediate steps. -# To illustrate the whole pipeline, let's start from scratch with the `unscaled_wdbc.csv` data. -# First we will load the data, create a model, -# and specify a recipe for how the data should be preprocessed: -``` +# load the unscaled cancer data, make Class readable +unscaled_cancer = ( + pd.read_csv("data/unscaled_wdbc.csv") + .replace({ + 'M' : 'Malignant', + 'B' : 'Benign' + }) + ) +# make Class a categorical type +unscaled_cancer['Class'] = unscaled_cancer['Class'].astype('category') -```{code-cell} ipython3 -# load the unscaled cancer data -unscaled_cancer = pd.read_csv("data/unscaled_wdbc.csv") # create the KNN model knn = KNeighborsClassifier(n_neighbors=7) @@ -1748,74 +1746,47 @@ preprocessor = make_column_transformer( ) ``` -You will also notice that we did not call `.fit()` on the preprocessor; this is unnecessary when it is -placed in a `Pipeline`. 
- ```{index} scikit-learn; make_pipeline, scikit-learn; fit ``` -We will now place these steps in a `Pipeline` using the `make_pipeline` function, -and finally we will call `.fit()` to run the whole `Pipeline` on the `unscaled_cancer` data. - -all data preprocessing and modeling can be -built using a -[`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html?highlight=pipeline#sklearn.pipeline.Pipeline), -or a more convenient function -[`make_pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline) -for simple pipeline construction. +Next we place these steps in a `Pipeline` using +the [`make_pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline) function. +The `make_pipeline` function takes a list of steps to apply in your data analysis; in this +case, we just have the `preprocessor` and `knn` steps. +Finally, we call `fit` on the pipeline. +Notice that we do not need to separately call `fit` and `transform` on the `preprocessor`; the +pipeline handles doing this properly for us. +Also notice that when we call `fit` on the pipeline, we can pass +the whole `unscaled_cancer` data frame to the `X` argument, since the preprocessing +step drops all the variables except the two we listed: `Area` and `Smoothness`. +For the `y` response variable argument, we pass the `unscaled_cancer["Class"]` series as before. ```{code-cell} ipython3 -:tags: [remove-cell] - -# Note that each of these steps is exactly the same as earlier, except for one major difference: -# we did not use the `select` function to extract the relevant variables from the data frame, -# and instead simply specified the relevant variables to use via the -# formula `Class ~ Area + Smoothness` (instead of `Class ~ .`) in the recipe. 
-# You will also notice that we did not call `prep()` on the recipe; this is unnecessary when it is -# placed in a workflow. +from sklearn.pipeline import make_pipeline -# We will now place these steps in a `workflow` using the `add_recipe` and `add_model` functions, \index{tidymodels!add\_recipe}\index{tidymodels!add\_model} -# and finally we will use the `fit` function to run the whole workflow on the `unscaled_cancer` data. -# Note another difference from earlier here: we do not include a formula in the `fit` function. This \index{tidymodels!fit} -# is again because we included the formula in the recipe, so there is no need to respecify it: -``` - -```{code-cell} ipython3 knn_fit = make_pipeline(preprocessor, knn).fit( - X=unscaled_cancer.loc[:, ["Area", "Smoothness"]], y=unscaled_cancer["Class"] + X=unscaled_cancer, + y=unscaled_cancer["Class"] ) knn_fit ``` As before, the fit object lists the function that trains the model. But now the fit object also includes information about -the overall workflow, including the standardizing preprocessing step. +the overall workflow, including the standardization preprocessing step. In other words, when we use the `predict` function with the `knn_fit` object to make a prediction for a new observation, it will first apply the same preprocessing steps to the new observation. As an example, we will predict the class label of two new observations: one with `Area = 500` and `Smoothness = 0.075`, and one with `Area = 1500` and `Smoothness = 0.1`. -```{code-cell} ipython3 -:tags: [remove-cell] - -# As before, the fit object lists the function that trains the model as well as the "best" settings -# for the number of neighbors and weight function (for now, these are just the values we chose -# manually when we created `knn` above). But now the fit object also includes information about -# the overall workflow, including the centering and scaling preprocessing steps. 
-# In other words, when we use the `predict` function with the `knn_fit` object to make a prediction for a new -# observation, it will first apply the same recipe steps to the new observation. -# As an example, we will predict the class label of two new observations: -# one with `Area = 500` and `Smoothness = 0.075`, and one with `Area = 1500` and `Smoothness = 0.1`. -``` - ```{code-cell} ipython3 new_observation = pd.DataFrame({"Area": [500, 1500], "Smoothness": [0.075, 0.1]}) prediction = knn_fit.predict(new_observation) prediction ``` -The classifier predicts that the first observation is benign ("B"), while the second is -malignant ("M"). {numref}`fig:05-workflow-plot-show` visualizes the predictions that this +The classifier predicts that the first observation is benign, while the second is +malignant. {numref}`fig:05-workflow-plot-show` visualizes the predictions that this trained $K$-nearest neighbor model will make on a large range of new observations. Although you have seen colored prediction map visualizations like this a few times now, we have not included the code to generate them, as it is a little bit complicated. @@ -1829,12 +1800,13 @@ predict the label of each, and visualize the predictions with a colored scatter > visualizations in their own data analyses. 
```{code-cell} ipython3 +:tags: [remove-output] # create the grid of area/smoothness vals, and arrange in a data frame are_grid = np.linspace( - unscaled_cancer["Area"].min(), unscaled_cancer["Area"].max(), 100 + unscaled_cancer["Area"].min(), unscaled_cancer["Area"].max(), 50 ) smo_grid = np.linspace( - unscaled_cancer["Smoothness"].min(), unscaled_cancer["Smoothness"].max(), 100 + unscaled_cancer["Smoothness"].min(), unscaled_cancer["Smoothness"].max(), 50 ) asgrid = np.array(np.meshgrid(are_grid, smo_grid)).reshape(2, -1).T asgrid = pd.DataFrame(asgrid, columns=["Area", "Smoothness"]) @@ -1871,7 +1843,7 @@ unscaled_plot = ( ) ), ), - color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"), + color=alt.Color("Class", title="Diagnosis"), ) ) @@ -1882,13 +1854,17 @@ prediction_plot = ( .encode( x=alt.X("Area"), y=alt.Y("Smoothness"), - color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"), + color=alt.Color("Class", title="Diagnosis"), ) ) - unscaled_plot + prediction_plot ``` +```{code-cell} ipython3 +:tags: [remove-input] +glue("fig:05-upsample-2", (unscaled_plot + prediction_plot)) +``` + ```{figure}  :name: fig:05-workflow-plot-show :figclass: caption-hack @@ -1908,7 +1884,7 @@ You can launch an interactive version of the worksheet in your browser by clicki You can also preview a non-interactive version of the worksheet by clicking "view worksheet." If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup -found in Chapter {ref}`move-to-your-own-machine`. This will ensure that the automated feedback +found in the {ref}`move-to-your-own-machine` chapter. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. 
+++ From d5b8af33bc8ff4eca0008dec66efac3827935745 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sun, 1 Jan 2023 22:39:58 -0800 Subject: [PATCH 14/37] learning objs --- source/classification1.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/source/classification1.md b/source/classification1.md index d2564ca5..5f6a2c9a 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -55,7 +55,8 @@ By the end of the chapter, readers will be able to do the following: - Compute, by hand, the straight-line (Euclidean) distance between points on a graph when there are two predictor variables. - Explain the $K$-nearest neighbor classification algorithm. - Perform $K$-nearest neighbor classification in Python using `scikit-learn`. -- Use `StandardScaler` to preprocess data to be centered, scaled, and balanced. +- Use `StandardScaler` and `make_column_transformer` to preprocess data to be centered and scaled. +- Use `resample` to preprocess data to be balanced. - Combine preprocessing and model training using `make_pipeline`. 
+++ From c1c8151358526ede0146225cc7077bed22143bb6 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sun, 1 Jan 2023 22:42:09 -0800 Subject: [PATCH 15/37] mute warnings in ch5 --- source/classification1.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index 5f6a2c9a..9642f217 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -14,10 +14,10 @@ kernelspec: ```{code-cell} ipython3 :tags: [remove-cell] -#import warnings -#def warn(*args, **kwargs): -# pass -#warnings.warn = warn +import warnings +def warn(*args, **kwargs): + pass +warnings.warn = warn from myst_nb import glue import numpy as np From b2df7420388764648886fcdda7376150fe94e674 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Sun, 1 Jan 2023 22:46:43 -0800 Subject: [PATCH 16/37] restore cls2 to main branch --- source/classification2.md | 10 +--------- 1 file changed, 1 insertion(+), 9 deletions(-) diff --git a/source/classification2.md b/source/classification2.md index 17390aec..8daf6c38 100644 --- a/source/classification2.md +++ b/source/classification2.md @@ -15,14 +15,6 @@ kernelspec: (classification2)= # Classification II: evaluation & tuning -```{code-cell} ipython3 -:tags: [remove-cell] -#import warnings -#def warn(*args, **kwargs): -# pass -#warnings.warn = warn -``` - ```{code-cell} ipython3 :tags: [remove-cell] @@ -2127,7 +2119,7 @@ Estimated accuracy versus the number of predictors for the sequence of models bu Practice exercises for the material covered in this chapter can be found in the accompanying -[worksheets repository](https://github.com/UBC-DSCI/data-science-a-first-intro-python-worksheets#readme) +[worksheets repository](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets#readme) in the "Classification II: evaluation and tuning" row. You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button. 
You can also preview a non-interactive version of the worksheet by clicking "view worksheet." From ae42ad23b79ef04337a30440d512846d1a02dee7 Mon Sep 17 00:00:00 2001 From: Lindsey Heagy Date: Wed, 11 Jan 2023 08:59:11 -0800 Subject: [PATCH 17/37] Update classification1.md add output scroll for large table --- source/classification1.md | 1 + 1 file changed, 1 insertion(+) diff --git a/source/classification1.md b/source/classification1.md index 9642f217..946de16e 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -156,6 +156,7 @@ arguments, and then inspect its contents: ``` ```{code-cell} ipython3 +:tags: ["output_scroll"] cancer = pd.read_csv("data/wdbc.csv") cancer ``` From f9fb1c04c2325ac8eed362dda99e09c3e71c9705 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 11 Jan 2023 10:24:11 -0800 Subject: [PATCH 18/37] Update source/classification1.md Co-authored-by: Joel Ostblom --- source/classification1.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index 946de16e..2313c2db 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -229,10 +229,9 @@ and `unique` methods again. 
```{code-cell} ipython3 cancer['Class'] = cancer['Class'].replace({ - 'M' : 'Malignant', - 'B' : 'Benign' - }) -cancer['Class'] = cancer['Class'].astype('category') + 'M' : 'Malignant', + 'B' : 'Benign' +}).astype('category') cancer.info() ``` From c30425cb5e0d6a398d26cfb1854457d4d6cd8f86 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 11 Jan 2023 10:25:13 -0800 Subject: [PATCH 19/37] Update source/classification1.md Co-authored-by: Joel Ostblom --- source/classification1.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/classification1.md b/source/classification1.md index 2313c2db..e0103e60 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -296,7 +296,7 @@ is colorblind-friendly, so we can stick with that here. ```{code-cell} ipython3 perim_concav = ( alt.Chart(cancer) - .mark_point(opacity=0.6, filled=True, size=40) + .mark_circle() .encode( x=alt.X("Perimeter", title="Perimeter (standardized)"), y=alt.Y("Concavity", title="Concavity (standardized)"), From 045c25615d1b728698ba3b699b2c790e084e2c1c Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 11 Jan 2023 10:25:33 -0800 Subject: [PATCH 20/37] Update source/classification1.md Co-authored-by: Joel Ostblom --- source/classification1.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index e0103e60..ebd8c094 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -1160,10 +1160,10 @@ the original column names. 
To keep original column names, we need to set the `ve ```{code-cell} ipython3 preprocessor_keep_all = make_column_transformer( - (StandardScaler(), make_column_selector(dtype_include=np.number)), - remainder='passthrough', - verbose_feature_names_out=False - ) + (StandardScaler(), make_column_selector(dtype_include=np.number)), + remainder='passthrough', + verbose_feature_names_out=False +) preprocessor_keep_all.fit(unscaled_cancer) scaled_cancer_all = preprocessor_keep_all.transform(unscaled_cancer) scaled_cancer_all From 4d8f104f40a304f2df36845c0aa0b2daafa6e845 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 11 Jan 2023 10:30:48 -0800 Subject: [PATCH 21/37] remove random state specification from ch5 --- source/classification1.md | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index ebd8c094..a5990fb2 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -1632,10 +1632,17 @@ Then, we will use the [`resample`](https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html) function from the `sklearn` package to increase the number of `Malignant` observations to be the same as the number of `Benign` observations. We set the `n_samples` argument to be the number of `Malignant` observations we want. -We also set the `random_state` to be some integer -so that our results are reproducible; if we do not set this argument, we will get a different upsampling each time -we run the code. Finally, we use the `value_counts` method - to see that our classes are now balanced. +Finally, we use the `value_counts` method to see that our classes are now balanced. +Note that `resample` picks which data to replicate *randomly*; we will learn more about properly handling randomness +in data analysis in the {ref}`classification2` chapter.
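As a rough illustration of what oversampling with replacement does, the sketch below uses only the Python standard library instead of `resample`. The row labels and the helper name `oversample` are hypothetical, and this is meant to show the idea rather than how `sklearn` implements it.

```python
import random
from collections import Counter


def oversample(rare_rows, n_samples):
    """Draw n_samples rows from rare_rows at random, with replacement."""
    return random.choices(rare_rows, k=n_samples)


# Hypothetical stand-ins for the rows of the two filtered data frames.
benign = [f"benign_{i}" for i in range(10)]
malignant = [f"malignant_{i}" for i in range(3)]

# Replicate the rare class until it matches the size of the common class.
malignant_upsampled = oversample(malignant, n_samples=len(benign))
balanced = benign + malignant_upsampled

# Count rows per class; the prefix before "_" plays the role of the Class column.
counts = Counter(row.split("_")[0] for row in balanced)
print(counts)
```

Which rows get repeated changes from run to run, but the class counts come out the same every time.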
+ +```{code-cell} ipython3 +:tags: [remove-cell] +# hidden seed call to make the below resample reproducible +# we haven't taught students about seeds / prngs yet, so +# for now just hide this. +np.random.seed(1) +``` ```{code-cell} ipython3 from sklearn.utils import resample @@ -1643,7 +1650,7 @@ from sklearn.utils import resample malignant_cancer = rare_cancer[rare_cancer["Class"] == "Malignant"] benign_cancer = rare_cancer[rare_cancer["Class"] == "Benign"] malignant_cancer_upsample = resample( - malignant_cancer, n_samples=len(benign_cancer), random_state=100 + malignant_cancer, n_samples=len(benign_cancer) ) upsampled_cancer = pd.concat((malignant_cancer_upsample, benign_cancer)) upsampled_cancer['Class'].value_counts() From 0d66ef0a85362d2700e8bb3a83355ab148949bb5 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 11 Jan 2023 10:44:16 -0800 Subject: [PATCH 22/37] fixed fill plots --- source/classification1.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index a5990fb2..78f1d8f5 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -1794,7 +1794,7 @@ prediction ``` The classifier predicts that the first observation is benign, while the second is -malignant. {numref}`fig:05-workflow-plot-show` visualizes the predictions that this +malignant. {numref}`fig:05-workflow-plot` visualizes the predictions that this trained $K$-nearest neighbor model will make on a large range of new observations. Although you have seen colored prediction map visualizations like this a few times now, we have not included the code to generate them, as it is a little bit complicated. @@ -1858,7 +1858,7 @@ unscaled_plot = ( # 2. 
the faded colored scatter for the grid points prediction_plot = ( alt.Chart(prediction_table) - .mark_point(opacity=0.02, filled=True, size=200) + .mark_point(opacity=0.05, filled=True, size=300) .encode( x=alt.X("Area"), y=alt.Y("Smoothness"), @@ -1870,11 +1870,11 @@ unscaled_plot + prediction_plot ```{code-cell} ipython3 :tags: [remove-input] -glue("fig:05-upsample-2", (unscaled_plot + prediction_plot)) +glue("fig:05-workflow-plot", (unscaled_plot + prediction_plot)) ``` ```{figure}  -:name: fig:05-workflow-plot-show +:name: fig:05-workflow-plot :figclass: caption-hack Scatter plot of smoothness versus area where background color indicates the decision of the classifier. From 0abae4160296e92a2a4a91e16c08db8206d7f3e2 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 11 Jan 2023 10:44:56 -0800 Subject: [PATCH 23/37] Update source/classification1.md Co-authored-by: Joel Ostblom --- source/classification1.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/classification1.md b/source/classification1.md index 78f1d8f5..90f41450 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -1116,7 +1116,7 @@ import numpy as np from sklearn.compose import make_column_selector preprocessor = make_column_transformer( - (StandardScaler(), make_column_selector(dtype_include=np.number)), + (StandardScaler(), make_column_selector(dtype_include=np.number)), ) preprocessor ``` From 424252e47b8ef0191ac835bc43a8495eae5605ac Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 11 Jan 2023 10:45:38 -0800 Subject: [PATCH 24/37] Update source/classification1.md Co-authored-by: Joel Ostblom --- source/classification1.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/classification1.md b/source/classification1.md index 90f41450..159d0915 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -1117,7 +1117,7 @@ from sklearn.compose import make_column_selector preprocessor = 
make_column_transformer( (StandardScaler(), make_column_selector(dtype_include=np.number)), - ) +) preprocessor ``` From 781c5bcdb45dc73c03cb562a3ad56eb9135e375f Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 11 Jan 2023 10:47:30 -0800 Subject: [PATCH 25/37] Update source/classification1.md Co-authored-by: Joel Ostblom --- source/classification1.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/classification1.md b/source/classification1.md index 159d0915..8d9fd6a3 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -382,7 +382,7 @@ perim_concav_with_new_point = ( alt.Chart( perim_concav_with_new_point_df, ) - .mark_point(opacity=0.6, filled=True, size=40) + .mark_circle() .encode( x=alt.X("Perimeter", title="Perimeter (standardized)"), y=alt.Y("Concavity", title="Concavity (standardized)"), From 3188e41a43b4293b2db414e204f719fdaaf8a1f7 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 11 Jan 2023 10:48:21 -0800 Subject: [PATCH 26/37] Update source/classification1.md Co-authored-by: Joel Ostblom --- source/classification1.md | 15 ++++++--------- 1 file changed, 6 insertions(+), 9 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index 8d9fd6a3..9af565bc 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -1734,15 +1734,12 @@ First we will load the data, create a model, and specify a preprocessor for the ```{code-cell} ipython3 # load the unscaled cancer data, make Class readable -unscaled_cancer = ( - pd.read_csv("data/unscaled_wdbc.csv") - .replace({ - 'M' : 'Malignant', - 'B' : 'Benign' - }) - ) -# make Class a categorical type -unscaled_cancer['Class'] = unscaled_cancer['Class'].astype('category') +unscaled_cancer = pd.read_csv("data/unscaled_wdbc.csv") +unscaled_cancer['Class'] = unscaled_cancer['Class'].replace({ + 'M' : 'Malignant', + 'B' : 'Benign' +}).astype('category') +unscaled_cancer # create the KNN model From 
30cf65a87a631387efc897f4457372bc93456dc8 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 11 Jan 2023 10:49:08 -0800 Subject: [PATCH 27/37] Update source/classification1.md Co-authored-by: Joel Ostblom --- source/classification1.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/source/classification1.md b/source/classification1.md index 9af565bc..f71749bb 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -390,7 +390,8 @@ perim_concav_with_new_point = ( shape=alt.Shape( "Class", scale=alt.Scale(range=["circle", "circle", "diamond"]) ), - size=alt.condition("datum.Class == 'Unknown'", alt.value(80), alt.value(30)), + size=alt.condition("datum.Class == 'Unknown'", alt.value(100), alt.value(30)), + stroke=alt.condition("datum.Class == 'Unknown'", alt.value('black'), alt.value(None)), ) ) glue('fig:05-knn-2', perim_concav_with_new_point, display=True) From c9d967e9a097e5d0b2d483a6adf9938cf5377874 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 11 Jan 2023 10:52:15 -0800 Subject: [PATCH 28/37] better intro of meshgrid --- source/classification1.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/classification1.md b/source/classification1.md index f71749bb..0218a0e5 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -1797,7 +1797,7 @@ trained $K$-nearest neighbor model will make on a large range of new observation Although you have seen colored prediction map visualizations like this a few times now, we have not included the code to generate them, as it is a little bit complicated. For the interested reader who wants a learning challenge, we now include it below. 
-The basic idea is to create a grid of synthetic new observations using the `numpy.meshgrid` function, +The basic idea is to create a grid of synthetic new observations using the `meshgrid` function from `numpy`, predict the label of each, and visualize the predictions with a colored scatter having a very high transparency (low `opacity` value) and large point radius. See if you can figure out what each line is doing! From fece23a5f71cdef3bc87daf5fb016a5ea0bad370 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 11 Jan 2023 10:57:51 -0800 Subject: [PATCH 29/37] better warning filtering in chapters 4 and 5 --- source/classification1.md | 9 ++------- source/viz.md | 6 ++---- 2 files changed, 4 insertions(+), 11 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index 0218a0e5..48ea9cdc 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -15,18 +15,13 @@ kernelspec: ```{code-cell} ipython3 :tags: [remove-cell] import warnings -def warn(*args, **kwargs): - pass -warnings.warn = warn +warnings.filterwarnings("ignore", category=DeprecationWarning) +warnings.filterwarnings("ignore", category=FutureWarning) from myst_nb import glue import numpy as np from sklearn.metrics.pairwise import euclidean_distances from IPython.display import HTML - -import plotly.express as px -import plotly.graph_objs as go -from plotly.offline import iplot, plot ``` (classification)= diff --git a/source/viz.md b/source/viz.md index 5522124e..d43fe8e0 100644 --- a/source/viz.md +++ b/source/viz.md @@ -16,11 +16,9 @@ kernelspec: :tags: [remove-cell] # ignore warnings from altair - import warnings -def warn(*args, **kwargs): - pass -warnings.warn = warn +warnings.filterwarnings("ignore", category=DeprecationWarning) +warnings.filterwarnings("ignore", category=FutureWarning) ``` From 6411b4c032a1bd50c662f69e7933dcf69cc217c8 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 11 Jan 2023 12:32:23 -0800 Subject: [PATCH 30/37] Update 
source/classification1.md Co-authored-by: Joel Ostblom --- source/classification1.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/classification1.md b/source/classification1.md index 48ea9cdc..4d3d19a7 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -1440,7 +1440,7 @@ rare_cancer = pd.concat( rare_plot = ( alt.Chart(rare_cancer) - .mark_point(opacity=0.6, filled=True, size=40) + .mark_circle() .encode( x=alt.X("Perimeter", title="Perimeter (standardized)"), y=alt.Y("Concavity", title="Concavity (standardized)"), From 734f5c62a1b8ae6d3c1483e83a0b87a1536687ff Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 11 Jan 2023 12:32:47 -0800 Subject: [PATCH 31/37] Update source/classification1.md Co-authored-by: Joel Ostblom --- source/classification1.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index 4d3d19a7..4fabdfc7 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -656,15 +656,15 @@ Scatter plot of concavity versus perimeter with new observation represented as a ```{code-cell} ipython3 new_obs_Perimeter = 0 new_obs_Concavity = 3.5 -cancer_dist = (cancer - .loc[:, ["Perimeter", "Concavity", "Class"]] - .assign(dist_from_new = ( - (cancer["Perimeter"] - new_obs_Perimeter) ** 2 - + (cancer["Concavity"] - new_obs_Concavity) ** 2 - )**(1/2)) - .nsmallest(5, "dist_from_new") - ) -cancer_dist +( + cancer + [["Perimeter", "Concavity", "Class"]] + .assign(dist_from_new = ( + (cancer["Perimeter"] - new_obs_Perimeter) ** 2 + + (cancer["Concavity"] - new_obs_Concavity) ** 2 + )**(1/2)) + .nsmallest(5, "dist_from_new") +) ``` In {numref}`tab:05-multiknn-mathtable` we show in mathematical detail how From ca9b41ebc72676b1cb1a2082f2667908fc70a313 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 11 Jan 2023 12:33:10 -0800 Subject: [PATCH 32/37] Update source/classification1.md Co-authored-by: Joel 
Ostblom --- source/classification1.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index 4fabdfc7..9e05fd3f 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -753,16 +753,16 @@ three predictors. new_obs_Perimeter = 0 new_obs_Concavity = 3.5 new_obs_Symmetry = 1 -cancer_dist2 = (cancer - .loc[:, ["Perimeter", "Concavity", "Symmetry", "Class"]] - .assign(dist_from_new = ( - (cancer["Perimeter"] - new_obs_Perimeter) ** 2 - + (cancer["Concavity"] - new_obs_Concavity) ** 2 - + (cancer["Symmetry"] - new_obs_Symmetry) ** 2 - )**(1/2)) - .nsmallest(5, "dist_from_new") - ) -cancer_dist2 +( + cancer + [["Perimeter", "Concavity", "Symmetry", "Class"]] + .assign(dist_from_new = ( + (cancer["Perimeter"] - new_obs_Perimeter) ** 2 + + (cancer["Concavity"] - new_obs_Concavity) ** 2 + + (cancer["Symmetry"] - new_obs_Symmetry) ** 2 + )**(1/2)) + .nsmallest(5, "dist_from_new") +) ``` Based on $K=5$ nearest neighbors with these three predictors we would classify From 7e682047293c24d4b01e2d52e4c852b4b3414289 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 11 Jan 2023 12:33:36 -0800 Subject: [PATCH 33/37] Update source/classification1.md Co-authored-by: Joel Ostblom --- source/classification1.md | 15 ++++++--------- 1 file changed, 6 insertions(+), 9 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index 9e05fd3f..a40e5014 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -1040,15 +1040,12 @@ and to keep things simple we will just use the `Area`, `Smoothness`, and `Class` variables: ```{code-cell} ipython3 -unscaled_cancer = ( - pd.read_csv("data/unscaled_wdbc.csv") - .loc[:, ['Class', 'Area', 'Smoothness']] - .replace({ - 'M' : 'Malignant', - 'B' : 'Benign' - }) - ) -unscaled_cancer['Class'] = unscaled_cancer['Class'].astype('category') +unscaled_cancer = pd.read_csv("data/unscaled_wdbc.csv")[['Class', 
'Area', 'Smoothness']] +unscaled_cancer['Class'] = unscaled_cancer['Class'].replace({ + 'M' : 'Malignant', + 'B' : 'Benign' +}).astype('category') +unscaled_cancer unscaled_cancer ``` From f9b4ed96e03424c8b147fc47c015022486e85740 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 11 Jan 2023 12:38:12 -0800 Subject: [PATCH 34/37] remove np.number and replace with just 'number' in dtype selection --- source/classification1.md | 18 ++++++++++-------- 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index a40e5014..0ff82f39 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -1068,7 +1068,8 @@ Here we will use the `StandardScaler` transformer to standardize the predictor v the `unscaled_cancer` data. In order to tell the `StandardScaler` which variables to standardize, we wrap it in a [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer) object -using the [`make_column_transformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html#sklearn.compose.make_column_transformer) function. `ColumnTransformer` objects also enable the use of multiple preprocessors at +using the [`make_column_transformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html#sklearn.compose.make_column_transformer) function. +`ColumnTransformer` objects also enable the use of multiple preprocessors at once, which is especially handy when you want to apply different preprocessing to each of the predictor variables. The primary argument of the `make_column_transformer` function is a sequence of pairs of (1) a preprocessor, and (2) the columns to which you want to apply that preprocessor. 
@@ -1099,17 +1100,16 @@ Note that here we specified which columns to apply the preprocessing step to by individual names; this approach can become quite difficult, e.g., when we have many predictor variables. Rather than writing out the column names individually, we can instead use the -[`make_column_selector`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html#sklearn.compose.make_column_selector) function. For example, if we wanted to standardize all *numerical* predictors, -we would use `make_column_selector` and specify the `dtype_include` argument to be `np.number` -(from the `numpy` package). This creates a preprocessor equivalent to the one we -created previously. +[`make_column_selector`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html#sklearn.compose.make_column_selector) function. For +example, if we wanted to standardize all *numerical* predictors, +we would use `make_column_selector` and specify the `dtype_include` argument to be `'number'`. +This creates a preprocessor equivalent to the one we created previously. ```{code-cell} ipython3 -import numpy as np from sklearn.compose import make_column_selector preprocessor = make_column_transformer( - (StandardScaler(), make_column_selector(dtype_include=np.number)), + (StandardScaler(), make_column_selector(dtype_include='number')), ) preprocessor ``` @@ -1153,7 +1153,7 @@ the original column names. 
To keep original column names, we need to set the `ve ```{code-cell} ipython3 preprocessor_keep_all = make_column_transformer( - (StandardScaler(), make_column_selector(dtype_include=np.number)), + (StandardScaler(), make_column_selector(dtype_include='number')), remainder='passthrough', verbose_feature_names_out=False ) @@ -1799,6 +1799,8 @@ predict the label of each, and visualize the predictions with a colored scatter ```{code-cell} ipython3 :tags: [remove-output] +import numpy as np + # create the grid of area/smoothness vals, and arrange in a data frame are_grid = np.linspace( unscaled_cancer["Area"].min(), unscaled_cancer["Area"].max(), 50 From deb46263b92c2e34b89a70f63487c2abdd0a7085 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 11 Jan 2023 12:39:45 -0800 Subject: [PATCH 35/37] Update source/classification1.md Co-authored-by: Joel Ostblom --- source/classification1.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index 0ff82f39..483f8b38 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -1431,9 +1431,9 @@ and we print the counts of the classes using the `value_counts` function. 
```{code-cell} ipython3 rare_cancer = pd.concat( - (cancer[cancer["Class"] == 'Benign'], - cancer[cancer["Class"] == 'Malignant'].head(3) - )) + cancer[cancer["Class"] == 'Benign'], + cancer[cancer["Class"] == 'Malignant'].head(3) +)) rare_plot = ( alt.Chart(rare_cancer) From 98c5654ea2eecae51849d34fed02ca70d770f06f Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 11 Jan 2023 12:39:58 -0800 Subject: [PATCH 36/37] Update source/classification1.md Co-authored-by: Joel Ostblom --- source/classification1.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/classification1.md b/source/classification1.md index 483f8b38..4c75c852 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -1430,7 +1430,7 @@ The new imbalanced data is shown in {numref}`fig:05-unbalanced`, and we print the counts of the classes using the `value_counts` function. ```{code-cell} ipython3 -rare_cancer = pd.concat( +rare_cancer = pd.concat(( cancer[cancer["Class"] == 'Benign'], cancer[cancer["Class"] == 'Malignant'].head(3) )) From 242d05df89ca787632960b395cddf19f1643e33f Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 11 Jan 2023 12:45:42 -0800 Subject: [PATCH 37/37] properly gluing column names --- source/classification1.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/source/classification1.md b/source/classification1.md index 4c75c852..0e2016b9 100644 --- a/source/classification1.md +++ b/source/classification1.md @@ -1132,7 +1132,9 @@ scaled_cancer = preprocessor.transform(unscaled_cancer) scaled_cancer ``` ```{code-cell} ipython3 - +:tags: [remove-cell] +glue('scaled-cancer-column-0', scaled_cancer.columns[0]) +glue('scaled-cancer-column-1', scaled_cancer.columns[1]) ``` It looks like our `Smoothness` and `Area` variables have been standardized. Woohoo! But there are two important things to notice about the new `scaled_cancer` data frame. 
First, it only keeps @@ -1142,8 +1144,8 @@ is to *drop* the remaining columns. This default behavior works well with the re in the {ref}`08:puttingittogetherworkflow` section), but for visualizing the result of preprocessing it can be useful to keep the other columns in our original data frame, such as the `Class` variable here. To keep other columns, we need to set the `remainder` argument to `'passthrough'` in the `make_column_transformer` function. - Furthermore, you can see that the new column names---{glue:}`scaled_cancer.columns[0]` -and {glue:}`scaled_cancer.columns[1]`---include the name + Furthermore, you can see that the new column names---{glue:}`scaled-cancer-column-0` +and {glue:}`scaled-cancer-column-1`---include the name of the preprocessing step separated by underscores. This default behavior is useful in `sklearn` because we sometimes want to apply multiple different preprocessing steps to the same columns; but again, for visualization it can be useful to preserve the original column names. To keep original column names, we need to set the `verbose_feature_names_out` argument to `False`.