- Crux: how to make machine learning algorithms perform well not just on the data they have been trained on, but also on inputs they have never seen before
- Regularization is defined as anything done to decrease an algorithm's test (generalization) error, possibly at the expense of its training error.
- Most deep learning regularization strategies try to trade increased bias for reduced variance (recall that MSE can be decomposed into a bias term and a variance term).
- Put constraints on the model: adding restrictions on parameter values (such as a max norm)
- Add extra terms to the objective function, which can be thought of as a soft constraint on the model parameters
- A model that has overfit is said to have learned not only the data-generating process but also many other generating processes, i.e., a model that has low bias but high variance. With regularization, we aim to take this model and regularize it into one that matches the data-generating process.
- We can try to limit the capacity of models by adding a penalty term $\Omega(\theta)$ to the objective function: $$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha\Omega(\theta)$$ The hyperparameter $\alpha \geq 0$ weighs the contribution of the norm penalty relative to the standard objective; larger $\alpha$ means more regularization.
- One thing to note is that we usually apply the norm penalty only to the weights and leave the biases unregularized: each bias controls a single variable, so regularizing it can introduce significant underfitting.
- For example, for a linear model $y = w^T x + b$, the norm penalty is applied to $w$ but not to the bias $b$.
- We can also consider using a different penalty coefficient $\alpha$ for each layer of the neural network, but due to the additional complexity this introduces in searching for the optimal hyperparameters, the norm penalty is generally the same for each layer.
- One of the simplest regularization techniques is weight decay, where $\Omega(w) = \frac{1}{2}\Vert w \Vert_2^2$, giving the regularized objective $$\tilde{J}(w; X, y) = \frac{\alpha}{2} w^T w + J(w; X, y)$$ Taking the gradient of this cost function yields $$\nabla_w \tilde{J}(w; X, y) = \alpha w + \nabla_w J(w; X, y)$$ So for a single gradient step with learning rate $\epsilon$, the update is $$w \leftarrow (1 - \epsilon\alpha)w - \epsilon\nabla_w J(w; X, y)$$ i.e., the weights are multiplicatively shrunk toward zero before each ordinary gradient update.
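As a quick illustration, here is a minimal numpy sketch of this update rule (the function and the toy objective are illustrative, not from the text):

```python
import numpy as np

def sgd_step_weight_decay(w, grad_J, alpha=0.01, eps=0.1):
    """One gradient step on the L2-regularized objective.

    w <- w - eps * (alpha * w + grad_J(w))
      == (1 - eps * alpha) * w - eps * grad_J(w)
    """
    return (1.0 - eps * alpha) * w - eps * grad_J(w)

# Toy objective J(w) = 0.5 * ||w - 1||^2, so grad_J(w) = w - 1.
w = np.zeros(3)
for _ in range(500):
    w = sgd_step_weight_decay(w, lambda w: w - 1.0)
print(w)  # ~0.99: pulled slightly below the unregularized optimum of 1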
- We can use a quadratic approximation of our cost function $J(w)$. This is the second-order Taylor series expansion. In a single dimension, this is something like $$f(x) \approx f(a) + f'(a)(x-a) + \frac{1}{2}f''(a)(x-a)^2$$
- To understand the effect of $L^2$ regularization on the parameters learned, we can approximate $J$ around the optimal weights $w^*$: $$\hat{J}(w) = J(w^*) + \nabla J(w^*)^T(w - w^*) + \frac{1}{2}(w - w^*)^T\mathbf{H}(w - w^*)$$
- Since $w^*$ are our optimal weights that minimize the cost function, the second term is eliminated, as $\nabla J(w^*) = 0$.
- Therefore, the minimum of our approximation $\hat{J}$ occurs where $\nabla_w \hat{J}(w) = H(w - w^*) = 0$. If we now consider a regularized version of the approximation, we have to add the gradient of the regularization penalty $\frac{\alpha}{2}w^Tw$ to the minimization objective: $$\alpha\tilde{w} + H(\tilde{w} - w^*) = 0$$ giving us $$\tilde{w} = (H + \alpha I)^{-1}Hw^*$$
- If we perform an eigendecomposition on $H$, letting $H = Q\Lambda Q^T$, where $\Lambda$ is a diagonal matrix whose diagonal entries are the eigenvalues of $H$ and $Q$ is a matrix whose columns are eigenvectors of $H$ that form an orthonormal basis, we obtain $$\tilde{w} = Q(\Lambda + \alpha I)^{-1}\Lambda Q^T w^*$$
- This means that the effect of $L^2$ weight decay is to rescale $w^*$ along the axes defined by the eigenvectors of the Hessian $H$ of the cost function: the component aligned with the $i$-th eigenvector is rescaled by a factor of $\frac{\lambda_i}{\lambda_i + \alpha}$.
- Specifically, when $\lambda_i \gg \alpha$, the regularization effect is small: in directions where the second derivative of the cost function $J$ is large, meaning the cost function has high curvature there, regularization has little effect, if any. On the other hand, in directions where the eigenvalues of $H$ are small, the regularization effect is large and the corresponding components are shrunk toward zero.
- This intuitively makes sense: weights in directions that do not have high curvature do not contribute significantly to reducing the objective, so they receive a strong effective regularization penalty. On the other hand, weights in directions of high curvature contribute significantly to reducing the overall cost function, so they are regularized much less.
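We can verify this rescaling numerically. A small sketch (with a synthetic positive definite Hessian; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
H = A @ A.T + 4.0 * np.eye(4)   # symmetric positive definite "Hessian"
w_star = rng.normal(size=4)     # pretend unregularized optimum
alpha = 2.0

# Solve alpha * w + H (w - w*) = 0 directly:
w_tilde = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)

# Same solution via the eigendecomposition H = Q diag(lam) Q^T:
lam, Q = np.linalg.eigh(H)
w_eig = Q @ np.diag(lam / (lam + alpha)) @ Q.T @ w_star

# Each eigen-component of w* is scaled by lam_i / (lam_i + alpha).
print(np.allclose(w_tilde, w_eig))  # True
```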
- For linear regression, adding $L^2$ regularization alters the normal-equation solution for $w$ from $$w = (X^TX)^{-1}X^Ty$$ to $$w = (X^TX + \alpha I)^{-1}X^Ty$$ This makes linear regression shrink the weights on features whose covariance with the output is low compared to the added variance $\alpha I$.
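A minimal sketch of the regularized normal equations (function name and toy data are mine):

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge solution: w = (X^T X + alpha I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
print(ridge_fit(X, y, alpha=1.0))  # weights shrunk relative to OLS
```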
- L1 regularization places a penalty on the absolute values of the weights rather than on their squared norm as L2 regularization does. It is defined as $$\Omega(w) = \Vert w \Vert_1 = \sum_i \vert w_i \vert$$
- The regularized objective is $\tilde{J}(w; X, y) = \alpha \Vert w \Vert_1 + J(X, y; w)$
- This leads to a (sub)gradient $$\nabla_w \tilde{J}(w; X, y) = \alpha\,\text{sign}(w) + \nabla_w J(X, y; w)$$ We can see that the regularization contribution does not scale linearly with each $w_i$; instead, it is a constant contribution whose sign matches $\text{sign}(w_i)$.
- To solve for the minimum, we again use a quadratic approximation of the cost function around $w^*$, with the further simplifying assumption that the Hessian is diagonal, $H = \text{diag}([H_{11}, \dots, H_{nn}])$ with each $H_{ii} > 0$, so that the approximation decomposes into a sum over the parameters: $$\hat{J}(w; X, y) = J(w^*; X, y) + \sum_i \left[\frac{1}{2}H_{ii}(w_i - w_i^*)^2 + \alpha \vert w_i \vert\right]$$
- Solving for each $w_i$ separately: since the penalty only pulls each weight toward zero and never past it, the solution has the same sign as $w_i^*$. Using the property $w_i = \text{sign}(w_i^*)\vert w_i \vert$, this gives us the following expression for each weight: $$w_i = \text{sign}(w_i^*)\left(\vert w_i^*\vert - \frac{\alpha}{H_{ii}}\right)$$
- However, this is not yet complete: if $\vert w_i^* \vert \leq \frac{\alpha}{H_{ii}}$, the penalty dominates and the optimal value is simply $w_i = 0$. Combining both cases: $$w_i = \text{sign}(w_i^*)\max\left(0, \vert w_i^*\vert - \frac{\alpha}{H_{ii}}\right)$$
- This means that in the case where $\vert w_i^* \vert \leq \frac{\alpha}{H_{ii}}$, the regularized learner sets $w_i = 0$; otherwise $w_i$ is shifted toward zero by $\frac{\alpha}{H_{ii}}$.
- L1 regularization results in sparse weights being learned, meaning that some of the parameters have their optimal value set to 0.
- Therefore, L1 regularization can be considered as doing some sort of feature selection: the nonzero parameters indicate what features should be used.
- L1 regularization is equivalent to doing MAP estimation (basically MLE estimation with a prior on your weights) using a Laplacian prior, while L2 regularization is equivalent to imposing a Gaussian prior on your weights.
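The closed-form solution above is exactly soft thresholding; a small sketch makes the resulting sparsity visible (the unit-curvature Hessian is an assumption of the toy example):

```python
import numpy as np

def l1_shrink(w_star, alpha, H_diag):
    """w_i = sign(w_i*) * max(0, |w_i*| - alpha / H_ii)
    under the diagonal-Hessian quadratic approximation."""
    return np.sign(w_star) * np.maximum(0.0, np.abs(w_star) - alpha / H_diag)

w_star = np.array([0.3, -0.05, 1.2, -0.8, 0.01])
H_diag = np.ones_like(w_star)   # assume unit curvature in every direction
print(l1_shrink(w_star, alpha=0.1, H_diag=H_diag))
# -> [ 0.2 -0.   1.1 -0.7  0. ]   (small weights are driven exactly to zero)
```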
The generic parameter-norm-regularized cost function is of the form $$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha\Omega(\theta)$$
We can minimize a function subject to constraints by constructing a generalized Lagrange function, which consists of the original objective function plus a set of penalties.
Each penalty is a product between a KKT multiplier and a function representing whether the constraint is satisfied.
Here, if we want to constrain $\Omega(\theta)$ to be smaller than some constant $k$, the generalized Lagrange function is $$\mathcal{L}(\theta, \alpha; X, y) = J(\theta; X, y) + \alpha(\Omega(\theta) - k)$$
The solution to this constrained optimization problem is given by $$\theta^* = \underset{\theta}{\operatorname{argmin}}\, \underset{\alpha \geq 0}{\max}\, \mathcal{L}(\theta, \alpha)$$
Whenever $\Omega(\theta) > k$, the optimal $\alpha$ increases, and whenever $\Omega(\theta) < k$, it decreases, so at the solution $\alpha^*$ takes exactly the value that enforces the constraint.
If we fix $\alpha$ at its optimal value $\alpha^*$, the problem reduces to an unconstrained minimization over $\theta$: $$\theta^* = \underset{\theta}{\operatorname{argmin}}\, \mathcal{L}(\theta, \alpha^*) = \underset{\theta}{\operatorname{argmin}}\, J(\theta; X, y) + \alpha^*\Omega(\theta)$$
This is the exact same cost function described earlier. Thus we can view norm penalties as constrained optimization.
The exact size of the constraint region depends on $\alpha$: a larger $\alpha$ corresponds to a smaller region, and a smaller $\alpha$ to a larger one, though we never know the region's size explicitly.
We may want to use explicit constraints instead of penalties: we can take an ordinary gradient step and then project $\theta$ back to the nearest point satisfying $\Omega(\theta) < k$. This is useful when we have an idea of what value of $k$ is appropriate and do not want to spend time searching for the $\alpha$ that corresponds to it.
In addition, we may get stuck in local minima while training with norm penalties, which usually correspond to small $\theta$; explicit constraints implemented by re-projection avoid this, since they do not push the weights toward the origin and only take effect when the weights grow large.
Furthermore, explicit constraints with re-projection let us avoid the large weight oscillations associated with a high learning rate. It is recommended to constrain the norm of each column of the weight matrix, which prevents any single hidden unit from having very large weights, as in the sketch below.
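A sketch of such an explicit constraint implemented by re-projection, capping the norm of each column of a weight matrix (helper names are mine):

```python
import numpy as np

def project_columns(W, k):
    """Rescale any column of W whose L2 norm exceeds k back onto the ball."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    return W * np.minimum(1.0, k / np.maximum(norms, 1e-12))

def constrained_step(W, grad_W, eps=0.1, k=1.0):
    """Projected gradient descent: ordinary step, then re-project."""
    return project_columns(W - eps * grad_W, k)

W = np.array([[3.0, 0.1],
              [4.0, 0.2]])
print(project_columns(W, k=1.0))  # first column rescaled from norm 5 to 1;
                                  # second column (norm < 1) left unchanged
```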
Many machine learning methods require inverting the matrix $X^TX$, which is not possible when $X^TX$ is singular, e.g., when some direction of the input has no variance or when there are fewer examples (rows of $X$) than features (columns).
In this case we can add regularization, which corresponds to inverting the matrix $X^TX + \alpha I$ instead; for $\alpha > 0$ this matrix is guaranteed to be invertible.
This concept of using regularization to solve underdetermined linear equations extends beyond machine learning. For example, the Moore-Penrose pseudoinverse described in Chapter 2 can be written as: $$X^+ = \lim_{\alpha \to 0}\,(X^TX + \alpha I)^{-1}X^T$$ This is just the limit of linear regression with weight decay as the regularization coefficient shrinks to zero.
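A quick numerical check of this limit on a synthetic under-determined problem:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))   # fewer examples than features: X^T X is singular

for alpha in [1.0, 1e-3, 1e-8]:
    W_reg = np.linalg.solve(X.T @ X + alpha * np.eye(5), X.T)
    print(alpha, np.abs(W_reg - np.linalg.pinv(X)).max())
# The difference shrinks toward 0: the regularized inverse approaches
# the Moore-Penrose pseudoinverse as alpha -> 0.
```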
One particular instance of an under-constrained problem is linear regression on a one-hot encoded categorical feature vector. In this case, without regularization, $X^TX$ is singular, so we must normalize the encoding by fixing one of the classes to be the all-zeros vector.
The best way to improve model performance is with more data. Sometimes, we can create fake data and add it to the dataset to make our model better. This has been particularly effective for object recognition.
We can modify the base images with small translations or rotations. This approach makes the network learn weights that are robust to these transformations. On occasion, we can also inject a bit of noise into the inputs, or even into the hidden layers. When comparing two different algorithms, we must compare their performance under the same data augmentation scheme, and deciding what counts as augmentation versus part of the algorithm can be a subjective matter.
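As a toy sketch of such augmentation (note that np.roll wraps pixels around the border, so real pipelines pad and crop instead; the noise scale is an arbitrary choice):

```python
import numpy as np

def augment(image, rng, max_shift=2, noise_std=0.01):
    """Random small translation plus Gaussian input noise."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    shifted = np.roll(image, shift=(dy, dx), axis=(0, 1))
    return shifted + rng.normal(0.0, noise_std, size=image.shape)

rng = np.random.default_rng(0)
image = rng.random((28, 28))     # e.g. an MNIST-sized grayscale image
augmented = augment(image, rng)
```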
For some models, adding noise with infinitesimal variance to the input is equivalent to imposing a norm penalty on the weights. Injecting noise can be especially effective when it's applied to the hidden units.
This is a central concept of the denoising autoencoder. Occasionally, we can also add some small random noise to the weights, which can be interpreted as a stochastic implementation of Bayesian inference over the weights. This also encourages the model to converge not merely to minima, but to minima surrounded by flat regions.
For small $\eta$, injecting weight noise $\epsilon_W \sim \mathcal{N}(0, \eta I)$ is equivalent to adding a regularization term $\eta\,\mathbb{E}_{p(x,y)}\left[\Vert \nabla_W \hat{y}(x) \Vert^2\right]$ to the cost function, which pushes the parameters toward regions where small perturbations of the weights have little influence on the output.
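A minimal sketch of injecting noise into the hidden units or the weights during a forward pass (the architecture and noise scales are arbitrary choices for illustration):

```python
import numpy as np

def noisy_forward(x, W1, W2, rng, hidden_std=0.1, weight_std=0.0):
    """Two-layer forward pass with optional hidden-unit and weight noise."""
    W1_noisy = W1 + rng.normal(0.0, weight_std, size=W1.shape)
    h = np.tanh(W1_noisy @ x)
    h = h + rng.normal(0.0, hidden_std, size=h.shape)  # hidden-unit noise
    return W2 @ h

rng = np.random.default_rng(0)
x = rng.normal(size=8)
W1 = rng.normal(size=(16, 8))
W2 = rng.normal(size=(4, 16))
y_hat = noisy_forward(x, W1, W2, rng, weight_std=0.05)
```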
In most datasets there is some set of mislabeled examples, which can make MLE training harmful. We can explicitly model this by using noisy labels. Furthermore, we can use label smoothing, which replaces the hard 0 and 1 targets of a softmax over $k$ classes with $\frac{\epsilon}{k-1}$ and $1 - \epsilon$ respectively.
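A one-line sketch of label smoothing over $k$ classes:

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Replace hard 1s with 1 - eps and hard 0s with eps / (k - 1)."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + (1.0 - one_hot) * eps / (k - 1)

y = np.array([[0.0, 0.0, 1.0]])   # k = 3 classes
print(smooth_labels(y, eps=0.1))  # -> [[0.05 0.05 0.9 ]]
```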
The goal of semi-supervised learning is to use data from both $P(x)$ (unlabeled examples) and $P(x, y)$ (labeled examples) to estimate $P(y \mid x)$.
We can construct a model that shares parameters between a generative model of $P(x)$ (or $P(x, y)$) and a discriminative model of $P(y \mid x)$, trading off the supervised criterion $-\log P(y \mid x)$ against the generative one, $-\log P(x)$ or $-\log P(x, y)$.
The generative criterion thus expresses a prior belief about the solution to the supervised learning problem.