A PyTorch implementation of the Levenberg-Marquardt (LM) optimization algorithm, supporting mini-batch training for both regression and classification problems. It leverages GPU acceleration and offers an extensible framework with support for diverse loss functions and customizable damping strategies.
A TensorFlow implementation is also available: tf-levenberg-marquardt
For more information on the theory behind the Levenberg-Marquardt and Gauss-Newton algorithms, refer to the following resources: Levenberg-Marquardt, Gauss-Newton.
First-order methods like SGD and Adam dominate large-scale neural network training due to their efficiency and scalability, making them the only viable option for models with millions or billions of parameters. However, for smaller models, second-order methods can offer faster convergence and sometimes succeed where first-order methods fail.
The Levenberg-Marquardt algorithm strikes a balance:
- It builds on the Gauss-Newton method, using second-order information for faster convergence.
- Adaptive damping enhances stability, mitigating issues that can arise in the standard Gauss-Newton algorithm.
This makes it a strong choice for problems with manageable model sizes.
- Versatile Loss Support: Leverage the square root trick to apply LM with any non-negative PyTorch loss function.
- Mini-batch Training: Scale LM to large datasets for both regression and classification tasks.
- Custom Damping Strategies: Adapt the damping factor dynamically for stable optimization.
- Split Jacobian Matrix Computation: Split the computation of the Jacobian and the Hessian matrix approximation into chunks to reduce memory usage.
- Custom Param Selection Strategies: Select a subset of model parameters to update during the training step.
The following loss functions are supported out of the box:
- `MSELoss`
- `L1Loss`
- `HuberLoss`
- `CrossEntropyLoss`
- `BCELoss`
- `BCEWithLogitsLoss`
Additional loss functions can be added by implementing custom residual definitions.
- Standard: $\large J^T J + \lambda I$
- Fletcher: $\large J^T J + \lambda ~ \text{diag}(J^T J\hspace{0.1em})$
- Custom: Support for defining custom damping strategies.
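For concreteness, the two built-in formulas can be written out with plain tensors; this snippet is purely illustrative and does not use the library's damping API:

```python
import torch

# Illustrative only: form the two built-in damping variants from a stand-in
# Gauss-Newton matrix J^T J before solving for the update step.
G = torch.randn(8, 5)
JTJ = G.T @ G                 # stand-in for J^T J (P x P, symmetric PSD)
lam = 1e-3                    # damping factor lambda

standard = JTJ + lam * torch.eye(JTJ.shape[0])        # J^T J + lambda * I
fletcher = JTJ + lam * torch.diag(torch.diag(JTJ))    # J^T J + lambda * diag(J^T J)
```

The Fletcher variant scales the damping with the curvature already present in $\large J^T J$, which makes the step less sensitive to the absolute scale of $\large \lambda$.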
To install the library, use pip:
pip install torch-levenberg-marquardt
To contribute or modify the code, clone the repository, create the conda environment that matches your platform, and install the package in editable mode:
conda env create -f environment_cuda.yml   # CUDA
conda env create -f environment_cpu.yml    # CPU-only
conda env create -f environment_macos.yml  # macOS
The `utils.fit` function provides an example of how to implement a PyTorch training loop using the `training.LevenbergMarquardtModule`.
import torch_levenberg_marquardt as tlm

# The fit function provides an example of how to train your model in a PyTorch training loop
tlm.utils.fit(
    tlm.training.LevenbergMarquardtModule(
        model=model,
        loss_fn=tlm.loss.MSELoss(),
        learning_rate=1.0,
        attempts_per_step=10,
        solve_method='qr',
    ),
    train_loader,
    epochs=50,
)
The class `utils.CustomLightningModule` provides an example of how to implement a PyTorch Lightning module that uses the `training.LevenbergMarquardtModule`:
import torch_levenberg_marquardt as tlm
from pytorch_lightning import Trainer

# Wrap your model with the Levenberg-Marquardt training module
lm_module = tlm.utils.CustomLightningModule(
    tlm.training.LevenbergMarquardtModule(
        model=model,
        loss_fn=tlm.loss.MSELoss(),
        learning_rate=1.0,
        attempts_per_step=10,
        solve_method='qr',
    )
)

# Train using PyTorch Lightning
trainer = Trainer(max_epochs=50, accelerator='gpu', devices=1)
trainer.fit(lm_module, train_loader)
The Levenberg-Marquardt algorithm is designed to solve least-squares problems of the form:

$$\large \min_{W} \; \frac{1}{2} \sum_{i=1}^{N} r\left(y_i, f\left(x_i, W\right)\right)^2$$

This might seem to restrict it to Mean Squared Error loss. However, it is possible to use the square root trick to adapt LM to any loss that is guaranteed to always be positive. Suppose we have a loss function of the form:

$$\large E(W) = \sum_{i=1}^{N} \ell\left(y_i, f\left(x_i, W\right)\right), \qquad \ell \geq 0$$

Defining the residuals as $\large r_i = \sqrt{\ell\left(y_i, f\left(x_i, W\right)\right)}$ turns the loss into a sum of squared residuals, so LM can be applied unchanged. Each built-in loss class therefore implements both the loss value (`forward`) and the residuals used by the optimizer (`residuals`):
class MSELoss(Loss):
    def forward(self, y_true: Tensor, y_pred: Tensor) -> Tensor:
        return (y_pred - y_true).square().mean()

    def residuals(self, y_true: Tensor, y_pred: Tensor) -> Tensor:
        # For MSE the residuals are simply the signed errors.
        return y_pred - y_true


class CrossEntropyLoss(Loss):
    def forward(self, y_true: Tensor, y_pred: Tensor) -> Tensor:
        return torch.nn.functional.cross_entropy(y_pred, y_true, reduction='mean')

    def residuals(self, y_true: Tensor, y_pred: Tensor) -> Tensor:
        # Square root trick: the per-sample cross-entropy is non-negative,
        # so its square root can be used as the residual.
        return torch.sqrt(torch.nn.functional.cross_entropy(y_pred, y_true, reduction='none'))
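Following the same pattern, a user-defined loss only needs a `forward` method and non-negative per-sample `residuals`. A minimal sketch, assuming the `Loss` base class is importable from the library's loss module and can be subclassed directly like the built-in losses above:

```python
import torch
from torch import Tensor
from torch_levenberg_marquardt.loss import Loss  # assumed import path

class LogCoshLoss(Loss):
    """Custom loss via the square root trick: log(cosh(e)) is always >= 0,
    so its square root is a valid residual for the LM optimizer."""

    def forward(self, y_true: Tensor, y_pred: Tensor) -> Tensor:
        return torch.log(torch.cosh(y_pred - y_true)).mean()

    def residuals(self, y_true: Tensor, y_pred: Tensor) -> Tensor:
        # A small epsilon keeps the square root differentiable at zero error.
        return torch.sqrt(torch.log(torch.cosh(y_pred - y_true)) + 1e-12)
```

An instance can then be passed as `loss_fn=LogCoshLoss()` to `LevenbergMarquardtModule`, exactly like the built-in losses.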
The Gauss-Newton method provides an efficient way to optimize least-squares problems by approximating the second-order derivatives of the objective function:

$$\large S(W) = \frac{1}{2} \sum_{i=1}^{N} r_i^2, \qquad r_i = r\left(y_i, f\left(x_i, W\right)\right)$$

where:

- $\large r_i$ are the residuals derived from a general loss function,
- $\large f\left(x_i, W\right)$ represents the model output for input $\large x_i$ and parameters $\large W$,
- $\large N$ is the number of data points.
In what follows, the Gauss-Newton algorithm will be derived from Newton's method for function optimization via an approximation.
The recurrence relation for Newton's method for minimizing a function $\large S(W)$ of parameters $\large W$ is

$$\large W^{(s+1)} = W^{(s)} - H^{-1} g$$

where $\large g$ denotes the gradient vector of $\large S$ and $\large H$ its Hessian matrix.

The gradient of $\large S$ with respect to the parameters is

$$\large g_j = \frac{\partial S}{\partial W_j} = \sum_{i=1}^{N} r_i \frac{\partial r_i}{\partial W_j}$$

Elements of the Hessian are calculated by differentiating the gradient elements, $\large g_j$, with respect to $\large W_k$:

$$\large H_{jk} = \sum_{i=1}^{N} \left( \frac{\partial r_i}{\partial W_j} \frac{\partial r_i}{\partial W_k} + r_i \frac{\partial^2 r_i}{\partial W_j \partial W_k} \right)$$

The Gauss-Newton method is obtained by ignoring the second-order derivative terms (the second term in this expression). That is, the Hessian is approximated by

$$\large H_{jk} \approx \sum_{i=1}^{N} J_{ij} J_{ik}, \qquad J_{ij} = \frac{\partial r_i}{\partial W_j}$$

where $\large J$ is the Jacobian matrix of the residuals. In matrix notation, $\large g = J^T r$ and $\large H \approx J^T J$, so the Gauss-Newton update becomes

$$\large W^{(s+1)} = W^{(s)} - \left(J^T J\right)^{-1} J^T r$$
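As an illustrative sanity check of this approximation (not part of the library), autograd can compare the exact Hessian of a tiny least-squares objective with $\large J^T J$:

```python
import torch

# Compare the exact Hessian of 0.5 * sum(r_i^2) with the Gauss-Newton
# approximation J^T J on a tiny two-parameter model (illustrative only).
torch.manual_seed(0)
x = torch.linspace(-1, 1, 20).unsqueeze(1)
y = torch.sin(3 * x)

def residuals(w):
    # Model f(x, w) = w0 * tanh(w1 * x); residuals r_i = f(x_i, w) - y_i.
    return (w[0] * torch.tanh(w[1] * x) - y).squeeze(1)

def objective(w):
    return 0.5 * residuals(w).square().sum()

w = torch.randn(2)
J = torch.autograd.functional.jacobian(residuals, w)   # (N, P) Jacobian of the residuals
H = torch.autograd.functional.hessian(objective, w)    # exact (P, P) Hessian
print(H)
print(J.T @ J)  # close to H when the residuals are small
```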
While the Gauss-Newton method is powerful, its instability when $\large J^T J$ is singular or ill-conditioned motivated the Levenberg-Marquardt algorithm, which adds a damping term and solves

$$\large \left(J^T J + \lambda I\right) \delta = J^T r$$

for the update step $\large \delta$. This ensures numerical stability when $\large J^T J$ is close to singular: large values of $\large \lambda$ push the step towards gradient descent, while small values recover the Gauss-Newton step.

The damping factor $\large \lambda$ is adjusted dynamically during training. After each update attempt, if `new_loss < loss`, the new parameters are accepted. Otherwise, the old parameters are restored, and a new attempt is made with an adjusted damping factor.
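The following sketch shows the shape of that accept/reject loop for a model reduced to a single flat parameter vector. It is not the library's internal implementation: `batch_loss`, `solve_update`, and the increase/decrease factors are placeholders for what `LevenbergMarquardtModule` actually computes and configures.

```python
import torch

def lm_step(w, batch_loss, solve_update, damping=1e-3,
            attempts_per_step=10, dec_factor=0.1, inc_factor=10.0):
    """One LM training step with adaptive damping (illustrative sketch).

    batch_loss(w)        -> scalar loss on the current batch
    solve_update(w, lam) -> step delta from (J^T J + lam * I) delta = J^T r
    """
    loss = batch_loss(w)
    for _ in range(attempts_per_step):
        delta = solve_update(w, damping)
        w_new = w - delta                       # candidate parameters
        new_loss = batch_loss(w_new)
        if new_loss < loss:                     # improvement: accept and relax damping
            return w_new, damping * dec_factor
        damping = damping * inc_factor          # no improvement: keep w, retry with stronger damping
    return w, damping                           # every attempt failed: parameters unchanged

# Toy usage on a quadratic problem where the exact LM step is easy to write:
w0 = torch.tensor([2.0, -3.0])
target = torch.tensor([0.5, 1.0])
loss_fn = lambda w: 0.5 * (w - target).square().sum()
solve = lambda w, lam: (w - target) / (1.0 + lam)   # J = I, so delta = (I + lam I)^{-1} r
w1, new_damping = lm_step(w0, loss_fn, solve)
```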
To achieve optimal performance from the training algorithm, it is important to carefully choose the batch size and the number of model parameters.
The LM algorithm minimizes the least-squares objective:

$$\large \min_{W} \; \frac{1}{2} \sum_{i=1}^{N} r\left(y_i, f\left(x_i, W\right)\right)^2$$

The Jacobian matrix $\large J$ collects the first-order derivatives of the residuals with respect to the model parameters. For a batch of size $\large B$, it has shape $\large N \times P$, where:
- $\large W$: Model parameters (weights),
- $\large N$: Total number of residuals, determined by $\large N = B \cdot O$,
- $\large B$: Batch size,
- $\large O$: Number of outputs per sample,
- $\large r\left(y_i, f\left(x_i, W\right)\right)$: Residuals, which are computed differently depending on the chosen loss function,
- $\large P$: Total number of parameters in the model.
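The snippet below makes these shapes concrete for a toy linear model; it uses plain autograd utilities rather than the library's internals, and all names are illustrative:

```python
import torch
import torch.nn as nn

# With batch size B, O outputs per sample and P parameters, the residual
# Jacobian has shape (N, P) with N = B * O.
model = nn.Linear(3, 2)                      # P = 3*2 + 2 = 8 parameters
B, O = 4, 2
x, y = torch.randn(B, 3), torch.randn(B, O)

params = dict(model.named_parameters())

def residuals(*param_tensors):
    p = dict(zip(params.keys(), param_tensors))
    y_pred = torch.func.functional_call(model, p, (x,))
    return (y_pred - y).reshape(-1)          # N = B * O residuals (MSE-style)

J = torch.autograd.functional.jacobian(residuals, tuple(params.values()))
J = torch.cat([j.reshape(B * O, -1) for j in J], dim=1)
print(J.shape)                               # torch.Size([8, 8]) -> (N, P)
```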
The LM update is chosen based on whether the system is overdetermined ($\large N \geq P$) or underdetermined ($\large N < P$).

Overdetermined case ($\large N \geq P$). Update formula:

$$\large \delta = \left(J^T J + \lambda I\right)^{-1} J^T r$$

The size of the matrix to invert is $\large P \times P$.

Underdetermined case ($\large N < P$). Update formula:

$$\large \delta = J^T \left(J J^T + \lambda I\right)^{-1} r$$

The size of the matrix to invert is $\large N \times N$.

Both expressions yield the same step; choosing between them based on the shape of $\large J$ keeps the linear system to solve as small as possible (verified numerically below).
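A short numerical check of that equivalence, using a random matrix in place of a real Jacobian:

```python
import torch

# For lambda > 0: (J^T J + lambda I)^{-1} J^T r == J^T (J J^T + lambda I)^{-1} r
torch.manual_seed(0)
N, P, lam = 6, 10, 1e-2                      # underdetermined case: N < P
J = torch.randn(N, P, dtype=torch.float64)   # double precision keeps the comparison tight
r = torch.randn(N, dtype=torch.float64)

delta_p = torch.linalg.solve(J.T @ J + lam * torch.eye(P, dtype=torch.float64), J.T @ r)  # P x P system
delta_n = J.T @ torch.linalg.solve(J @ J.T + lam * torch.eye(N, dtype=torch.float64), r)  # N x N system
print(torch.allclose(delta_p, delta_n))      # True
```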
The memory required to store the Jacobian grows with the number of residuals $\large N = B \cdot O$ and the number of parameters $\large P$, and can quickly become the bottleneck for large batches. The number of Jacobian rows materialized at once can be limited with the `jacobian_max_num_rows` argument.
Rather than constructing the full $\large N \times P$ Jacobian at once, the batch is processed in chunks. Split the full batch into sub-batches, each contributing at most `jacobian_max_num_rows` rows of the Jacobian. For each sub-batch $\large k$, compute its sub-Jacobian $\large J_k$ and residual vector $\large r_k$, then accumulate

$$\large J^T J = \sum_{k} J_k^T J_k, \qquad J^T r = \sum_{k} J_k^T r_k$$

where the sums run over all sub-batches. Instead of storing the entire $\large N \times P$ Jacobian, only the current sub-Jacobian and the accumulated $\large P \times P$ matrix need to be kept in memory. (This accumulation is useful in the overdetermined case, where the $\large P \times P$ system is the one being solved.)

When using the split Jacobian computation, the memory usage is primarily determined by the size of $\large J^T J$ ($\large P \times P$) and of the largest sub-Jacobian, rather than by the full Jacobian.
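A self-contained sketch of this accumulation, using row-by-row autograd calls for clarity rather than the library's optimized implementation; the chunk size here plays the role of the `jacobian_max_num_rows` argument:

```python
import torch
import torch.nn as nn

# Accumulate J^T J and J^T r over sub-batches so the full (N x P) Jacobian
# is never materialized.
model = nn.Linear(3, 1)
params = list(model.parameters())
P = sum(p.numel() for p in params)

x, y = torch.randn(64, 3), torch.randn(64, 1)
max_rows = 16                                            # chunk size (residual rows)

JTJ = torch.zeros(P, P)
JTr = torch.zeros(P)
for xb, yb in zip(x.split(max_rows), y.split(max_rows)):
    r = (model(xb) - yb).reshape(-1)                     # sub-batch residuals (MSE-style)
    rows = []
    for i in range(r.numel()):                           # one Jacobian row per residual
        grads = torch.autograd.grad(r[i], params, retain_graph=True)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    J_k = torch.stack(rows)                              # (n_k, P) sub-Jacobian
    JTJ += J_k.T @ J_k                                   # accumulate J^T J
    JTr += J_k.T @ r.detach()                            # accumulate J^T r
```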
A simple curve-fitting example is implemented in `examples/sinc_curve_fitting.py` and `examples/sinc_curve_fitting_lightning.py`. The function `y = sinc(10 * x)` is fitted using a shallow neural network with 61 parameters.
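For orientation, a setup along these lines could look as follows. The dataset size, batch size, and architecture are assumptions rather than the contents of the example scripts, but a 1-20-1 tanh network does have exactly 61 parameters, and the `tlm` calls mirror the usage example shown earlier:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import torch_levenberg_marquardt as tlm

# Illustrative sinc curve-fitting setup; hyperparameters are assumptions,
# only the tlm API calls mirror the usage example above.
x = torch.linspace(-1, 1, 20000).unsqueeze(1)
y = torch.special.sinc(10 * x)                      # y = sinc(10 * x)
train_loader = DataLoader(TensorDataset(x, y), batch_size=1000, shuffle=True)

model = nn.Sequential(nn.Linear(1, 20), nn.Tanh(), nn.Linear(20, 1))  # 61 parameters

tlm.utils.fit(
    tlm.training.LevenbergMarquardtModule(
        model=model,
        loss_fn=tlm.loss.MSELoss(),
        learning_rate=1.0,
        attempts_per_step=10,
        solve_method='qr',
    ),
    train_loader,
    epochs=50,
)
```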
Despite the simplicity of the problem, first-order methods such as Adam fail to converge, whereas Levenberg-Marquardt converges rapidly with very low loss values. The learning rate values were chosen experimentally based on the results obtained by each algorithm.
Here are the results with Adam for 10000 epochs and `learning_rate=0.01`:
Training with Adam optimizer...
Epoch 9999: 100%|██████████| 20/20 [00:00<00:00, 81.07it/s, loss_step=0.000461, loss_epoch=0.000412]
`Trainer.fit` stopped: `max_epochs=10000` reached.
Epoch 9999: 100%|██████████| 20/20 [00:00<00:00, 80.21it/s, loss_step=0.000461, loss_epoch=0.000412]
Training completed. Elapsed time: 2604.56 seconds
Here are the results with Levenberg-Marquardt for 50 epochs and `learning_rate=1.0`:
Training with Levenberg-Marquardt...
Epoch 49: 100%|██████████| 20/20 [00:00<00:00, 64.07it/s, loss_step=2.79e-7, damping_factor=1e-6, attempts=3.000, loss_epoch=3.31e-7]
`Trainer.fit` stopped: `max_epochs=50` reached.
Epoch 49: 100%|██████████| 20/20 [00:00<00:00, 63.35it/s, loss_step=2.79e-7, damping_factor=1e-6, attempts=3.000, loss_epoch=3.31e-7]
Training completed. Elapsed time: 16.60 seconds
A common MNIST classification example is implemented in `examples/mnist_classification.py` and `examples/mnist_classification_lightning.py`. The classification is performed using a convolutional neural network with 1026 parameters.
Both optimization methods achieve roughly the same accuracy on the training and test sets; however, Levenberg-Marquardt requires significantly fewer epochs, automatically stopping the training at epoch 8.
Here are the results with Adam for 100 epochs and `learning_rate=0.01`:
Training with Adam optimizer...
Epoch 99: 100%|██████████| 12/12 [00:01<00:00, 8.90it/s, accuracy=0.970, loss_step=0.0977, loss_epoch=0.0986]
`Trainer.fit` stopped: `max_epochs=100` reached.
Epoch 99: 100%|██████████| 12/12 [00:01<00:00, 8.88it/s, accuracy=0.970, loss_step=0.0977, loss_epoch=0.0986]
Training completed. Elapsed time: 125.33 seconds
Adam - Test Loss: 0.089224, Test Accuracy: 97.32%
Here are the results with Levenberg-Marquardt for 10 epochs and `learning_rate=0.05`:
Training with Levenberg-Marquardt...
Epoch 8: 83%|████████▎ | 10/12 [00:02<00:00, 4.64it/s, accuracy=0.977, loss_step=0.0683, damping_factor=1e+10, attempts=3.000, loss_epoch=0.0742]
Training completed. Elapsed time: 22.10 seconds
Levenberg-Marquardt - Test Loss: 0.076580, Test Accuracy: 97.59%
- python>3.9
- torch>=2.0.0
- numpy>=1.22
- pytorch-lightning>=1.9
- tqdm
- torchmetrics>=0.11.0