wrote about regularization
jpivarski committed Nov 21, 2024
1 parent 17a401f commit ee95b0f
Showing 10 changed files with 576 additions and 67 deletions.
2 changes: 0 additions & 2 deletions deep-learning-intro-for-hep/_toc.yml
@@ -21,8 +21,6 @@ chapters:
 - file: under-overfitting
 - file: regularization
 - file: exercise-5
-- file: exercise-6
 - file: hyperparameters
-- file: partitioning
 - file: goodness-of-fit
 - file: main-project
17 changes: 10 additions & 7 deletions deep-learning-intro-for-hep/basic-fitting.md
@@ -403,18 +403,21 @@ If you _do_ know enough to write a (correct) functional form and seed the fit wi

 _However_, most scientific problems beyond physics don't have this much prior information. This is especially true in sciences that study the behavior of human beings. What is the underlying theory for a kid preferring chocolate ice cream over vanilla? What are the variables, and what's the functional form? Even if you think that human behavior is determined by underlying chemistry and physics, it would be horrendously complex.
 
-Here's an example: the [Boston Housing Prices](https://www.kaggle.com/datasets/vikrishnan/boston-house-prices) is a classic dataset for regression. The goal is to predict median housing prices in areas around Boston using features like
+Here's an example: the [Boston Housing Prices](https://www.kaggle.com/datasets/vikrishnan/boston-house-prices) is a classic dataset for regression. The goal is to predict median housing prices in towns around Boston using features like
 
 * per capita crime rate per town
 * proportion of residential land zoned for lots over 25,000 square feet
 * proportion of non-retail business acres per town
-* whether the area is adjacent to the Charles river (a boolean variable)
-* nitric oxides concentration
+* adjacency to the Charles River (a boolean variable)
+* nitric oxides concentration (parts per 10 million)
 * average number of rooms per dwelling
-* proportion of owner-occupied lots built before 1940
+* proportion of owner-occupied units built before 1940
 * weighted distances to 5 Boston employment centers
-* accessibility to radial highways
-* full-value property tax rate
-* pupil-teacher ratio in schools
+* index of accessibility to radial highways
+* full-value property-tax rate per \$10,000
+* pupil-teacher ratio by town
+* $1000(b - 0.63)^2$ where $b$ is the proportion of Black residents
+* % lower status by population
 
 All of these seem like they would have an effect on housing prices, but it's almost impossible to guess which would be more important. Problems like these are usually solved by a generic linear fit of many variables. Unimportant features would have a best-fit slope near zero, and if our goal is to find out which features are most important, we can force unimportant features toward zero with "regularization" (to be discussed in a later section). The idea of ML as "throw everything into a big fit" is close to what you have to do if you have no ansatz, and neural networks are a natural generalization of high-dimensional linear fitting.
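
As an editorial illustration (not part of this commit) of forcing unimportant features toward zero, here is a sketch of an L1-regularized ("lasso") linear fit. The file name, column order, and whitespace-separated format are assumptions based on the Kaggle version of the dataset:

```python
# A sketch of "throw everything into a big fit" with L1 regularization,
# which pushes the slopes of unimportant features to exactly zero.
# Assumed layout: 13 whitespace-separated feature columns, then MEDV, no header.
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
         "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]
df = pd.read_csv("boston-house-prices.csv", sep=r"\s+", header=None, names=names)

X = StandardScaler().fit_transform(df[names[:-1]])  # one scale for all slopes
y = df["MEDV"].values

fit = Lasso(alpha=0.5).fit(X, y)  # larger alpha forces more slopes to zero
for name, slope in zip(names, fit.coef_):
    print(f"{name:>8s} {slope:8.3f}")  # unimportant features come out as 0.000
```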

31 changes: 17 additions & 14 deletions deep-learning-intro-for-hep/exercise-2.md
@@ -29,20 +29,23 @@ import matplotlib.pyplot as plt

 +++
 
-I [previously mentioned](basic-fitting.md) the Boston House Prices dataset. It contains descriptive details about areas around Boston:
-
-* proportion of residential land zoned for lots over 25,000 square feet
-* proportion of non-retail business acres per town
-* whether the area is adjacent to the Charles river (a boolean variable)
-* nitric oxides concentration
-* average number of rooms per dwelling
-* proportion of owner-occupied lots built before 1940
-* weighted distances to 5 Boston employment centers
-* accessibility to radial highways
-* full-value property tax rate
-* pupil-teacher ratio in schools
-
-as well as the median prices of owner-occupied homes. Your job is to predict the prices, given all of the other data as features. You will do this with both a linear fit and a neural network with 5 hidden sigmoid components.
+I [previously mentioned](basic-fitting.md) the Boston House Prices dataset. It contains descriptive details about towns around Boston:
+
+* CRIM: per capita crime rate per town
+* ZN: proportion of residential land zoned for lots over 25,000 square feet
+* INDUS: proportion of non-retail business acres per town
+* CHAS: adjacency to the Charles River (a boolean variable)
+* NOX: nitric oxides concentration (parts per 10 million)
+* RM: average number of rooms per dwelling
+* AGE: proportion of owner-occupied units built before 1940
+* DIS: weighted distances to 5 Boston employment centers
+* RAD: index of accessibility to radial highways
+* TAX: full-value property-tax rate per \$10,000
+* PTRATIO: pupil-teacher ratio by town
+* B: $1000(b - 0.63)^2$ where $b$ is the proportion of Black residents
+* LSTAT: % lower status by population
+
+as well as MEDV, the median prices of owner-occupied homes. Your job is to predict the prices, given all of the other data as features. You will do this with both a linear fit and a neural network with 5 hidden sigmoid components.
 
 You can get the dataset from [the original source](https://www.kaggle.com/datasets/vikrishnan/boston-house-prices) or from this project's GitHub: [deep-learning-intro-for-hep/data/boston-house-prices.csv](https://github.com/hsf-training/deep-learning-intro-for-hep/blob/main/deep-learning-intro-for-hep/data/boston-house-prices.csv).
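
As a starting point (an editorial sketch, not part of this commit; PyTorch and the variable names here are assumptions), the two models could be declared like this:

```python
# A sketch of the two models the exercise asks for: 13 input features
# (CRIM through LSTAT) predicting one output (MEDV).
from torch import nn

n_features = 13

linear_fit = nn.Linear(n_features, 1)   # plain linear regression

neural_network = nn.Sequential(         # 5 hidden sigmoid components
    nn.Linear(n_features, 5),
    nn.Sigmoid(),
    nn.Linear(5, 1),
)
```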

16 changes: 0 additions & 16 deletions deep-learning-intro-for-hep/exercise-6.md

This file was deleted.

2 changes: 1 addition & 1 deletion deep-learning-intro-for-hep/hyperparameters.md
@@ -13,4 +13,4 @@ kernelspec:
   name: python3
 ---
 
-# Parameters and hyperparameters
+# Hyperparameters and validation
2 changes: 1 addition & 1 deletion deep-learning-intro-for-hep/kernel-trick.md
@@ -274,7 +274,7 @@ But we _can't_ do that with the adaptive basis functions.

 +++
 
-This process of choosing relevant combinations of input variables and adding them to a model as features is sometimes called "feature engineering." If you know what features are relevant, as in the circle problem, then computing them explicitly as inputs will improve the fit and its generalization. If you're not sure, including them in the mix with the basic features can only help the model find a better optimum (possibly by overfitting; see the next section).
+This process of choosing relevant combinations of input variables and adding them to a model as features is sometimes called "feature engineering." If you know what features are relevant, as in the circle problem, then computing them explicitly as inputs will improve the fit and its generalization. If you're not sure, including them in the mix with the basic features can only help the model find a better optimum (possibly by overfitting; see the next section). The laundry list of features in the Boston House Prices dataset is an example of feature engineering.
 
 Sometimes, feature engineering is presented as a bad thing, since you, as data analyst, are required to be more knowledgeable about the details of the problem. The ML algorithm isn't "figuring it out for you." That's a reasonable perspective if you're developing ML algorithms and you want them to apply to any problem, regardless of how deeply understood those problems are. It's certainly impressive when an ML algorithm rediscovers features we knew were important or discovers new ones. However, if you're a data analyst, trying to solve one particular problem, and you happen to know about some relevant features, it's in your interest to include them in the mix!
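
To make the circle example concrete (an editorial sketch, not from this commit), the engineered feature $x^2 + y^2$ turns a circular decision boundary into one a linear model can find:

```python
# A sketch of feature engineering for the circle problem: a linear classifier
# on (x, y) alone cannot separate inside from outside of a circle, but one
# that also sees the engineered feature x**2 + y**2 can.
import numpy as np

rng = np.random.default_rng(12345)
xy = rng.uniform(-1, 1, size=(1000, 2))        # basic features: x and y
r2 = (xy**2).sum(axis=1, keepdims=True)        # engineered feature: x² + y²
labels = (r2[:, 0] < 0.5).astype(int)          # 1 = inside the circle

features = np.concatenate([xy, r2], axis=1)    # include it in the mix
# in this 3D feature space, the class boundary is the flat plane r2 = 0.5
```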

14 changes: 10 additions & 4 deletions deep-learning-intro-for-hep/minimizers.md
@@ -13,7 +13,7 @@ kernelspec:
   name: python3
 ---
 
-# Minimizing the objective function
+# Minimization algorithms
 
 +++

@@ -496,7 +496,7 @@ x[500]
 x.grad[500]
 ```
 
-The autograd computation is not sensitive to a neighborhood of points the way that a real derivative is. When $x = 0$, the derivative is calculated using the same code path as the primary function. For instance,
+The autograd computation is not sensitive to a neighborhood of points the way that a real derivative is. To see this, let's calculate a ReLU function by hand at $x = 0$, using the seemingly equivalent predicates `x > 0` and `x >= 0`:
 
 ```{code-cell} ipython3
 x = torch.tensor(0, dtype=torch.float32, requires_grad=True)

@@ -518,11 +518,17 @@ y.backward()
 x.grad
 ```

The computation of `y` is unaffected by `>` versus `>=` because this function is continuous, but the autograd computation _does_ depend on this implementation detail. Finally,
At $x = 0$, the first calculation fails `x > 0` so `y` is set to `0` and its derivative is the derivative of the expression `0`, which is `0`.

At $x = 0$, the second calculation passes `x >= 0` so `y` is set to `x` (which is `0`) and its derivative is the derivative of the expression `x`, which is `1`.

This autograd procedure is not sensitive to a local neighborhood around the point that is being differentiated! That's why it doesn't see that the derivative has a left-limit that is different from its right-limit, and doesn't recognize that a true derivative is not defined. The value that this procedure returns depends on what seems like an implementation detail: `x > 0` versus `x >= 0` to decide between `x` and `0`.

But, as a matter of convention adopted by the ML community,

$$\frac{d}{dx} \mbox{ReLU}(0) = 0$$

is a convention adopted by the community. It could have been defined to be $0$ or $1$, but $0$ is more popular.
It could have been defined to be $0$ or $1$, but $0$ is more popular.
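
A quick check of that convention (an editorial sketch, not part of the commit): `torch.relu` itself returns the popular choice of $0$ at $x = 0$, while the hand-rolled `x >= 0` version returns $1$:

```python
import torch

def relu_gt(x):  # ReLU via the predicate `x > 0`
    return torch.where(x > 0, x, torch.zeros_like(x))

def relu_ge(x):  # ReLU via the predicate `x >= 0`
    return torch.where(x >= 0, x, torch.zeros_like(x))

for f in (relu_gt, relu_ge, torch.relu):
    x = torch.tensor(0.0, requires_grad=True)
    f(x).backward()
    print(f.__name__, x.grad)  # relu_gt: 0., relu_ge: 1., relu: 0.
```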

+++

16 changes: 0 additions & 16 deletions deep-learning-intro-for-hep/partitioning.md

This file was deleted.

