In general, an optimization problem is the problem of finding a value $x^*$ that minimizes (or maximizes) an objective function $f(x)$.
Optimization algorithms used for training deep models differ from traditional optimization algorithms in several ways. Machine learning methods usually act indirectly. In most scenarios the goal is to optimize some performance measure $P$, defined with respect to the test set, which we can often only influence indirectly by reducing a cost function

$$J(\theta) = \mathbb{E}_{(x, y) \sim \hat{p}_{\text{data}}} L(f(x; \theta), y),$$

where $L$ is the per-example loss, $f(x; \theta)$ the predicted output for input $x$, and $\hat{p}_{\text{data}}$ the empirical distribution, in the hope that reducing $J$ improves $P$.
The goal of machine learning is to minimize the expected generalization error

$$J^*(\theta) = \mathbb{E}_{(x, y) \sim p_{\text{data}}} L(f(x; \theta), y),$$

taken over the true data-generating distribution $p_{\text{data}}$. Since $p_{\text{data}}$ is unknown, we minimize the average training error instead:

$$\mathbb{E}_{(x, y) \sim \hat{p}_{\text{data}}} [L(f(x; \theta), y)] = \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \theta), y^{(i)}),$$

where $m$ is the number of training examples. The training process based on minimizing the average training error is called empirical risk minimization. In the context of deep learning, empirical risk minimization is rarely used. The first reason for this is that empirical risk minimization is prone to overfitting. The second reason is that many loss functions do not have useful derivatives, while the most effective modern optimization methods are based on gradient descent, which involves the derivative of the loss function. Instead of reducing the empirical risk, in deep learning we therefore optimize a quantity that differs from the one we actually want to optimize.
Instead of the actual loss function, we often minimize a surrogate loss function, which acts as a proxy for the loss and has more suitable properties for optimization. Minimizing the surrogate loss function halts when an early stopping criterion is met. In particular, this means that training often halts while the surrogate loss function still has large derivatives. This is a second difference from pure optimization, where we require the gradient to be zero at convergence. The early stopping criterion is based on the true underlying loss function measured on the validation set.
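As an illustration, here is a minimal, self-contained sketch (the 1-D linear classifier and all concrete values are assumed for illustration, not taken from the source) contrasting the 0-1 loss, whose derivative is zero almost everywhere, with the logistic negative log-likelihood surrogate, which provides a useful gradient:

```python
import numpy as np

def zero_one_loss(w, x, y):
    """0-1 loss for a 1-D linear classifier; y in {-1, +1}.
    Piecewise constant, so its derivative is 0 almost everywhere."""
    return float(np.sign(w * x) != y)

def surrogate_loss(w, x, y):
    """Logistic loss log(1 + exp(-y*w*x)): a smooth surrogate."""
    return np.log1p(np.exp(-y * w * x))

def surrogate_grad(w, x, y):
    """Derivative of the logistic loss with respect to w."""
    return -y * x / (1.0 + np.exp(y * w * x))

w, x, y = 0.1, 2.0, 1
print(zero_one_loss(w, x, y))   # 0.0, and it stays 0.0 for any nearby w
print(surrogate_loss(w, x, y))  # smooth value
print(surrogate_grad(w, x, y))  # nonzero gradient we can descend along
```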
In machine learning, the objective function usually decomposes as a sum over training examples. We compute each update to the parameters based on an expected value of the cost function estimated on only a small subset of the terms of the full cost function, since computing the expectation over the whole dataset is very expensive. In practice, the expectations are computed by randomly sampling a small number of examples and taking the average over only those examples.
In deep learning, the optimization algorithms we use are usually so-called minibatch or stochastic algorithms: we train our model on batches of examples that contain more than one but fewer than all of the training examples.
When we pick the minibatches, we have to consider the following points:
- The minibatches have to be selected randomly and subsequent minibatches should be independent of each other in order to get unbiased estimates that are independent of each other.
- Shuffle examples if ordering is significant.
- In the special case of very large datasets the minibatches are constructed from shuffled examples rather than selected randomly.
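The following sketch (function and variable names are illustrative, not from the source) shows one common way to implement these points: shuffle the examples once to break any significant ordering, then yield contiguous blocks as minibatches:

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Shuffle once, then yield contiguous blocks as minibatches,
    so consecutive batches are (approximately) independent even if
    the original ordering of the dataset was significant."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)
for Xb, yb in minibatches(X, y, batch_size=100, rng=rng):
    pass  # compute the gradient estimate on (Xb, yb) here
```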
Factors influencing the minibatch size are:
- How accurate we want the estimate to be (larger batches yield more accurate estimates),
- Trade-off between regularization and optimization,
- Hardware and memory limitations
- Multicore architectures are underutilized by very small batches, so it might make sense to define a minimum batch size.
Different kinds of algorithms use different kinds of information from the minibatch in various ways. Some algorithms are more sensitive to sampling error than others, either because they use information that cannot be estimated accurately using few samples or because they use information in a way that amplifies errors. Gradient-based methods are usually relatively robust and can handle smaller batch sizes, like 100.
One motivation for Stochastic Gradient Descent is that it follows the gradient of the true generalization error, as long as no examples are repeated. This can be seen if we consider online learning, i.e. when minibatches are drawn from a stream of data such that every example is a fair sample from the data-generating distribution $p_{\text{data}}$. Hence the minibatch gradient $\hat{g}$ is an unbiased estimate of the gradient of the true generalization error $\nabla_{\theta} J^{*}(\theta)$.
Several challenges arise when optimizing neural networks. Some of them are related to critical points on the surface of the underlying loss function, others are caused by the architecture of the deep neural network. In this section, some challenges and their mitigation techniques are presented.
Let $f: \mathbb{R} \rightarrow \mathbb{R}$ be a loss function with derivative $f'(x)$. The goal is to reach a point $x^*$ at which $f(x^*)$ is minimal. Starting from an arbitrary point $x_0$, gradient descent moves in small steps in the direction of the negative derivative until it reaches a critical point, i.e. a point with $f'(x) = 0$. In general, the critical point at $x$ can be a local minimum, a local maximum, or a saddle point; the first derivative alone cannot distinguish these cases. The second derivative $f''(x)$ determines the curvature of the loss function: $f''(x) > 0$ indicates a local minimum, $f''(x) < 0$ a local maximum, and $f''(x) = 0$ is inconclusive.
When the input is high dimensional, there exist several first and second order derivatives for a function. Let $f: \mathbb{R}^m \rightarrow \mathbb{R}^n$; the Jacobian matrix is defined as the matrix $J \in \mathbb{R}^{n \times m}$ with entries $J_{i, j} = \frac{\partial}{\partial x_j} f(x)_i$.
First order optimization methods use the Jacobian matrix, i.e. the gradient $\nabla_x f(x)$ in the scalar-valued case, to decide in which direction to move.
The Hessian matrix encompasses many second order derivatives: each entry $H(f)(x)_{i, j} = \frac{\partial^2}{\partial x_i \partial x_j} f(x)$ describes how the gradient changes along a possible direction around the point $x$.
The Hessian matrix can be ill-conditioned, i.e. the ratio of its largest to its smallest eigenvalue (the condition number) can be very large. In this case gradient descent performs poorly, because a step size small enough for the direction of strongest curvature makes barely any progress in the directions of weak curvature.
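A small numerical sketch (the matrix values are assumed for illustration) shows the effect on a quadratic $f(x) = \frac{1}{2} x^{T} H x$ with condition number 1000: the stable step size is dictated by the largest eigenvalue, so the weakly curved direction barely moves:

```python
import numpy as np

H = np.diag([1000.0, 1.0])       # eigenvalues 1000 and 1: condition number 1000
grad = lambda x: H @ x           # gradient of f(x) = 0.5 * x^T H x

x = np.array([1.0, 1.0])
eps = 1.9 / 1000.0               # step size limited by the LARGEST eigenvalue
for _ in range(100):
    x = x - eps * grad(x)
print(x)  # first coordinate ~0, second still ~0.83: almost no progress
```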
Mitigation Techniques
To address the challenges caused by ill-conditioned Hessian matrices, Newton's method (4.3) is used after modification.
The modification of Newton's method introduced in 4.3 is necessary because the method computes the inverse of the Hessian matrix to arrive at an optimum.
If the Hessian is strongly ill-conditioned, then its inverse is ill-conditioned too. Section 8.6 motivates Newton's method and explains in detail how it can be modified (Conjugate Gradient).
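As a sketch of one such modification (a regularized Newton step; this is not the Conjugate Gradient variant of section 8.6, and the function names are illustrative), one can add $\alpha I$ to the Hessian before solving, which keeps the step well-defined when the Hessian is nearly singular:

```python
import numpy as np

def damped_newton_step(theta, grad, hess, alpha=1e-2):
    """One regularized Newton step: solve (H + alpha*I) s = g
    instead of inverting a possibly ill-conditioned H directly."""
    H = hess(theta)
    g = grad(theta)
    step = np.linalg.solve(H + alpha * np.eye(len(theta)), g)
    return theta - step
```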
A second mitigation technique for the problems of an ill-conditioned Hessian is the Momentum algorithm. It allows the gradient to traverse regions of strong curvature caused by an ill-conditioned Hessian smoothly, using the momentum parameter. More information on how this algorithm works can be found in section 8.3.
The cost function of a neural network is nonconvex: it has a single global minimum and a large number of local minima.
The proliferation of local minima is not problematic when it is due to the non-identifiability of neural networks.
The non-identifiability property states that altering a model's parameters by scaling or permuting them yields an equivalent neural model.
Therefore, equivalent models produce local minima that have equivalent cost values on the loss function.
However, it becomes challenging when the cost values of a large number of local minima deviate strongly from the cost value of the global minimum.
In this case, the learning process cannot generalize well to new data that was not involved in training.
Mitigation Techniques. Fortunately, choosing a model architecture appropriate for the learning task ensures that the majority of local minima have a low loss value. In addition, instead of finding the global minimum, it is sufficient to find a convenient local minimum that generalizes well on the task.
Along with local minima, saddle points are widespread on the surface of the cost functions of deep networks due to their nonconvex shape.
A saddle point looks like a local minimum along some cross-sections of the loss surface and like a local maximum along others.
First and second order optimization methods deal differently with saddle points. When a first order optimization method, e.g. the Gradient Descent algorithm (4.3), approaches a saddle point, the gradient often becomes small, and the method moves downhill in small steps and eventually escapes this critical point.
However, second order optimization methods, e.g. Newton's method (4.6), face challenges when dealing with saddle points on the surface of loss functions. They recognize the saddle point as a critical point with a zero gradient ($\nabla_{\theta} J(\theta) = 0$) and jump directly to it, getting stuck there.
Mitigation Technique. To mitigate this problem, the saddle-free Newton method was proposed to help second order optimizers quickly escape saddle points. This method replaces the eigenvalues of the Hessian matrix with their absolute values. Consequently, directions of negative curvature are treated like descent directions, and the method continues descending the loss function instead of being attracted to the saddle point.
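A minimal sketch of this eigenvalue trick (names and the stability floor `eps` are assumptions, not from the source): the update rescales the gradient by $|H|^{-1}$, where $|H|$ has the same eigenvectors as $H$ but absolute eigenvalues:

```python
import numpy as np

def saddle_free_step(theta, g, H, eps=1e-6):
    """Saddle-free Newton idea: step = |H|^{-1} g, with the
    eigenvalues of H replaced by their absolute values (floored
    at eps for numerical stability)."""
    lam, V = np.linalg.eigh(H)                # H symmetric: real eigenvalues
    lam_abs = np.maximum(np.abs(lam), eps)    # |lambda_i|
    step = V @ ((V.T @ g) / lam_abs)          # V diag(1/|lambda|) V^T g
    return theta - step
```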
In addition to saddle points, plateaus and flat regions on the surface of the loss function cause problems during optimization because they are treated like critical points (the gradient is zero or close to zero, $\nabla_{\theta} J(\theta) \approx 0$).
Unfortunately, nonconvex loss functions contain a large number of flat regions with high cost values, and there are no mitigation techniques to solve this problem.
A cliff is a region that undergoes a sharp fall or a sharp rise, depending on the point of view with respect to this region.
In both cases, it is dangerous to slide down or to climb the cliff, and it is especially challenging to calculate the derivatives at such a point, because a gradient step may jump over the cliff region and land at a point far away.
The reason for this behavior is that the gradient only provides the direction of steepest descent and disregards the optimal step size.
Mitigation Technique. A mitigation technique for this unwanted behavior is the gradient clipping heuristic, which reduces the step size so that the update cannot jump over the cliff region.
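A minimal sketch of norm-based gradient clipping (the threshold and names are illustrative): if the gradient norm exceeds a threshold, the gradient is rescaled so that its direction is kept but the step size is bounded:

```python
import numpy as np

def clip_gradient(g, threshold):
    """Rescale g so that its norm never exceeds the threshold;
    the direction of steepest descent is preserved."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g
```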
Neural networks process the input vector through operations over multiple layers, forward and backward.
When the network is very deep, e.g. a recurrent neural network (10), the computation results in a deep computational graph.
During the computation, vanishing and exploding gradients appear.
In the vanishing gradient situation, the gradients become so small that it is unclear in which direction to move to reach a conveniently low cost value on the loss function.
On the other hand, exploding gradients make the learning process unstable.
Mitigation Technique.
A commonly used mitigation technique for both challenges is to drop uninteresting features in the input vector using the power method.
Example. Suppose that a path of the computational graph applies a repeated multiplication with a matrix $W$.
After $t$ steps, this is equivalent to multiplying by $W^t$. If $W$ has the eigendecomposition $W = V \operatorname{diag}(\lambda) V^{-1}$, then $W^t = V \operatorname{diag}(\lambda)^t V^{-1}$.
The vanishing and exploding gradient problem arises from scaling by $\operatorname{diag}(\lambda)^t$: eigenvalues $\lambda_i$ with $|\lambda_i| > 1$ explode, while eigenvalues with $|\lambda_i| < 1$ vanish.
In this example, the power method identifies the largest eigenvalue $\lambda_1$ of $W$ and its eigenvector; components not aligned with this eigenvector are eventually discarded.
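The effect can be reproduced numerically; in this toy sketch (the eigenvalues 1.1 and 0.9 are assumed for illustration), one component of a repeatedly multiplied vector explodes while the other vanishes:

```python
import numpy as np

W = np.diag([1.1, 0.9])        # eigenvalues: 1.1 (explodes), 0.9 (vanishes)
h = np.array([1.0, 1.0])
for t in (10, 50, 100):
    ht = np.linalg.matrix_power(W, t) @ h
    print(t, ht)               # first component grows as 1.1**t, second shrinks as 0.9**t
```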
The mitigation techniques presented so far address the optimization problem locally, at a single point on the loss function.
Although these methods improve the optimization process, it remains questionable whether the reached low cost value is sufficiently low compared to other low cost values, and whether the current low cost value leads the gradient toward a much lower cost value or not.
Mitigation Techniques. To answer these questions, experts advise employing some heuristics. One heuristic is to initialize the parameters at good points on the loss function and thus ensure that the gradient converges quickly to a convenient minimum. Another heuristic is to seek a low cost value that generalizes well on the given task instead of seeking the global minimum of the loss function. In fact, the surrogate loss function presented in section 8.1 is usually used to calculate the gradient instead of the true loss function, which amplifies the poor correspondence between local and global structure during optimization.
A neural network adapts its parameters to reduce the loss arising from the difference between the estimated and true outputs. This optimization is carried out by optimization algorithms. In this section, three first order optimization algorithms are presented.
The most popular optimization algorithm is the Stochastic Gradient Descent (SGD).
SGD requires initial parameters $\theta$ and a learning rate schedule $\epsilon_{k}$. It then repeats the following steps:
- First, SGD picks a minibatch consisting of $m$ examples $\left(\left(x^{(1)}, y^{(1)}\right), \dots, \left(x^{(m)}, y^{(m)}\right)\right)$ from the training set.
- On this minibatch, it computes a gradient estimate as the normalized sum of the per-example gradients: $\hat{g} = \frac{1}{m} \nabla_{\theta} \sum_{i=1}^{m} L\left(f(x^{(i)}; \theta), y^{(i)}\right)$.
- Then, it applies the update step to the model's parameters: $\theta \leftarrow \theta - \epsilon_{k} \hat{g}$.
- At each iteration $k$, a new $\epsilon_{k}$ is calculated, and the algorithm runs until the convergence criterion is met, i.e. until a conveniently low cost value is reached.
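The steps above can be summarized in a short sketch (the gradient function `grad_fn` and all names are placeholders the reader must supply; the learning rate schedule `lr_schedule` is discussed next):

```python
import numpy as np

def sgd(theta, grad_fn, X, y, lr_schedule, n_iters, batch_size, rng):
    """Minimal SGD loop following the steps above."""
    for k in range(n_iters):
        idx = rng.choice(len(X), size=batch_size, replace=False)  # pick a minibatch
        g_hat = grad_fn(theta, X[idx], y[idx])   # averaged minibatch gradient estimate
        theta = theta - lr_schedule(k) * g_hat   # update step with learning rate eps_k
    return theta
```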
In the SGD algorithm, the learning rate is decayed linearly until iteration $\tau$, i.e. $\epsilon_{k} = (1 - \alpha) \epsilon_{0} + \alpha \epsilon_{\tau}$ with $\alpha = \frac{k}{\tau}$, and kept constant afterwards, where

- $\tau$ is the number of iterations needed to make a few hundred passes through the training set,
- $\epsilon_{\tau} = \frac{\epsilon_{0}}{100}$, and
- $\epsilon_{0}$ is the best performing $\epsilon_{k}$ in the first iterations.
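A sketch of this schedule (the concrete values of $\epsilon_{0}$ and $\tau$ are assumed for illustration):

```python
def lr_schedule(k, eps0=0.1, tau=1000):
    """Linear decay from eps0 to eps_tau = eps0/100 over tau
    iterations; constant afterwards."""
    eps_tau = eps0 / 100.0
    alpha = min(k / tau, 1.0)
    return (1.0 - alpha) * eps0 + alpha * eps_tau
```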
The popularity of SGD is also related to the fact that it allows convergence even with a huge number of training examples: the computation time per update does not grow with the number of training examples, and SGD may converge before even processing the whole training set. If SGD is applied to a convex problem, the excess error is $O\left(\frac{1}{\sqrt{k}}\right)$ after $k$ iterations, and $O\left(\frac{1}{k}\right)$ in the strongly convex case.
The momentum algorithm is another optimization algorithm used during neural network training.
It adds to the gradient descent method a velocity parameter, also called momentum, to control the speed of the descent on the surface of the cost function.
The velocity parameter improves the performance of gradient descent compared to SGD. In a region of the cost function with high curvature, or when the gradient is small or noisy, SGD takes very small steps and learning becomes slow. The momentum algorithm recognizes such regions and applies an additional force to the gradient descent, accelerating learning along the cost function. While descending, the velocity increases due to the applied force and the descent becomes faster. As a result of the increased velocity, the gradient may overshoot a region or oscillate between the two sides of a local minimum on the loss function. Therefore, it is important to have another force that helps the gradient stop at that local minimum. This force is called viscous drag; it acts proportionally to the negative of the velocity, resisting the force that accelerates the gradient along the loss function.
The momentum algorithm requires a learning rate $\epsilon$, a momentum parameter $\alpha$, initial parameters $\theta$, and an initial velocity $v$. The velocity $v$ accumulates an exponentially decaying moving average of the past gradients. The algorithm then repeats the following steps:
- First, the momentum algorithm picks a minibatch consisting of $m$ examples $\left(\left(x^{(1)}, y^{(1)}\right), \dots, \left(x^{(m)}, y^{(m)}\right)\right)$ from the training set.
- On this minibatch, it computes a gradient estimate as the normalized sum of the per-example gradients: $\hat{g} = \frac{1}{m} \nabla_{\theta} \sum_{i=1}^{m} L\left(f(x^{(i)}; \theta), y^{(i)}\right)$.
- Then, it computes the velocity update step $v \leftarrow \alpha v - \epsilon \hat{g}$.
- After the velocity update, it applies the update step to the model's parameters: $\theta \leftarrow \theta + v$.
- At each iteration $k$, a new $\epsilon_{k}$ is calculated, and the algorithm runs until the convergence criterion is met, i.e. until a conveniently low cost value is reached.
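A sketch of the two update steps (names are illustrative; $\hat{g}$ is the minibatch gradient estimate computed as above):

```python
def momentum_step(theta, v, g_hat, eps, alpha=0.9):
    """One momentum update: decay the velocity, add the new
    gradient force, then move the parameters along the velocity."""
    v = alpha * v - eps * g_hat   # velocity update
    theta = theta + v             # parameter update
    return theta, v
```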
The Nesterov momentum is a third algorithm that can be used to optimize the training of neural networks. It is an extension of the momentum algorithm: it adds a correction factor to the momentum. This correction is applied in the gradient estimation step by evaluating the gradient at the interim point $\theta + \alpha v$ instead of at $\theta$: $\hat{g} = \frac{1}{m} \nabla_{\theta} \sum_{i=1}^{m} L\left(f(x^{(i)}; \theta + \alpha v), y^{(i)}\right)$.
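A sketch of the modified step (again with placeholder names; only the point at which the gradient is evaluated changes relative to standard momentum):

```python
def nesterov_step(theta, v, grad_fn, Xb, yb, eps, alpha=0.9):
    """Nesterov momentum: evaluate the gradient at the look-ahead
    point theta + alpha*v, then update velocity and parameters."""
    g_hat = grad_fn(theta + alpha * v, Xb, yb)  # correction: look-ahead gradient
    v = alpha * v - eps * g_hat
    return theta + v, v
```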
Training algorithms for deep learning are usually iterative, which means the user has to specify an initial point. The choice of the initial point affects whether the algorithm converges at all, how quickly it converges, and whether it converges to a point with high or low cost. This last aspect is important, as points of comparable cost can have different generalization error, and the goal of training neural networks is to minimize the generalization error.
Most initialization strategies are based on achieving good properties when the network is initialized. There is no good understanding of how these properties are preserved during training. Certainly known is only that the initial parameters need to break symmetry between different units: hidden units with the same activation function and connections to the same inputs must have different initial parameters. This motivates random initialization.
More specifically, the weights are initialized randomly, with values drawn either from a Gaussian or from a uniform distribution. We denote the uniform distribution on the interval $(a, b)$ by $U(a, b)$. Common strategies are:
- Sample each weight from $U(-\frac{1}{\sqrt{m}}, \frac{1}{\sqrt{m}})$, where $m$ is the number of inputs of the layer.
- Normalized initialization: $W_{i, j} \sim U(-\sqrt{\frac{6}{m+n}}, \sqrt{\frac{6}{m+n}})$, where $m$ and $n$ are the number of inputs and outputs respectively, assuming fully connected layers.
- Initialize to random orthogonal matrices with gain factor g that needs to be carefully chosen.
- Use sparse initialization: each unit is initialized to have exactly k nonzero weights.
The second approach compromises between having the same activation variance and same gradient variance among all layers.
An advantage of sparse initialization over approaches 1 and 2 is that the weight magnitudes do not shrink with the number of inputs or outputs. A disadvantage is that it imposes a strong prior on the weights that receive large values.
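The following sketch (assumed names; strategy 3 with orthogonal matrices is omitted) illustrates strategies 1, 2 and 4 for a layer with $m$ inputs and $n$ outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_scaled(m, n):
    """Strategy 1: U(-1/sqrt(m), 1/sqrt(m)), m = number of inputs."""
    return rng.uniform(-1 / np.sqrt(m), 1 / np.sqrt(m), size=(m, n))

def init_normalized(m, n):
    """Strategy 2 (normalized initialization)."""
    bound = np.sqrt(6.0 / (m + n))
    return rng.uniform(-bound, bound, size=(m, n))

def init_sparse(m, n, k=25):
    """Strategy 4: exactly k nonzero weights per unit, independent
    of the layer's fan-in."""
    W = np.zeros((m, n))
    for j in range(n):
        idx = rng.choice(m, size=min(k, m), replace=False)
        W[idx, j] = rng.normal(size=len(idx))
    return W
```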
Optimal criteria for the initial weights do not necessarily lead to optimal performance. This is why, in practice, it is useful to treat the initial scale of the weights, and the choice between sparse and dense initialization, as hyperparameters to be searched over, if doing so is not too costly.
The approach for setting the biases must be coordinated with the approach for setting the weights. Setting the biases to zero is compatible with most weight initialization schemes. However, it is important to set the biases to nonzero values, for example, if a unit controls whether other units are able to participate in a function, or to avoid too much saturation at initialization time.
It is also possible to initialize the model parameters using machine learning. This is not covered here but in part III of our source, "Deep Learning".
Q: Can you give an example for how networks can be nonidentifiable?
Answer: For example, by swapping the incoming and outgoing weight vectors of two units in the same layer of a neural net, we obtain an equivalent network. This kind of nonidentifiability is called weight space symmetry.
Q: Can you explain the concept of viscous drag?
Answer: Viscous drag is the force in the opposite direction of the velocity that slows down a moving particle.
Q: Can you name advantages/disadvantages of a big/small momentum parameter for SGD?
Answer: The momentum parameter $\alpha$ determines how strongly previous gradients contribute to the current update. A large $\alpha$ accelerates learning in directions of low curvature and dampens the effect of noisy gradients, but if it is too large the updates can overshoot and oscillate around a minimum. A small $\alpha$ makes learning more stable but slower.
Q: Can you explain the Saddle-free Newton Method?
Answer: The Newton Method tends to get stuck in saddle points. This can be fixed by employing a so-called trust region approach. This approach consists of adding a constant value $\alpha$ to the diagonal of the Hessian before inverting it; $\alpha$ must be chosen large enough to offset the most negative eigenvalue of the Hessian, so that the update keeps pointing downhill near a saddle point.
Q: Why do we gradually decrease the learning rate over time in SGD?
Answer: Because the minibatch gradient estimator introduces a constant source of noise, the gradients do not vanish entirely even when we arrive at a minimum; decreasing the learning rate over time is therefore necessary for convergence.
Q: Can you give an example for exploding/vanishing gradients?
Answer: Repeated multiplication with the same weight matrix $W$, e.g. in recurrent networks, corresponds to multiplying by $W^t$ after $t$ steps. Eigenvalues of $W$ with magnitude greater than one are amplified exponentially (exploding gradients), while eigenvalues with magnitude less than one decay to zero (vanishing gradients).
Q: How does the use of Nesterov Momentum improve the error rates?
Answer: For batch gradient descent on a convex function, it can be shown that the use of Nesterov momentum improves the convergence of the excess error from $O\left(\frac{1}{k}\right)$ to $O\left(\frac{1}{k^{2}}\right)$ after $k$ iterations. In the stochastic gradient case, however, it does not improve the rate of convergence.
Q: Can you come up with an example where shuffling the data is important before choosing the minibatches?
Answer: Some datasets are arranged in a way such that subsequent examples are highly correlated, which causes the gradient estimates not to be independent. For example, a list of blood sample test results might contain blood samples from single patients at different points in time in consecutive blocks.