The Model Optimizer's modelopt.torch.prune module provides state-of-the-art pruning algorithms that search for the best subnet architecture within the base model you provide.
Model Optimizer can be used in one of the following complementary pruning modes to create a search space for optimizing the model:
- Minitron: A pruning method developed by NVIDIA Research for GPT-style models in the NVIDIA NeMo or Megatron-LM frameworks that use pipeline or tensor parallelism. It uses activation magnitudes to prune the MLP, attention heads, GQA query groups, embedding hidden size, and number of layers of the model.
- GradNAS: A lightweight pruning method recommended for language models such as Hugging Face BERT and GPT-J. It uses gradient information to prune the model's linear layers and attention heads to meet the given constraints.
- FastNAS: A pruning method recommended for Computer Vision models. Given a pretrained model, FastNAS finds the subnet which maximizes the score function while meeting the given constraints.
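The two ideas behind the modes above (ranking components by an importance signal, and picking the best-scoring subnet that fits a constraint) can be illustrated with a toy sketch in plain Python. This is not Model Optimizer's implementation or API: the function names, the mean-absolute-activation importance measure, the exhaustive search, and the `width * depth` cost and `width + 10 * depth` score are all hypothetical stand-ins chosen for illustration.

```python
from itertools import product


def rank_channels_by_activation(activations):
    """Return channel indices sorted from most to least important.

    `activations` is a list of samples, each a list of per-channel values.
    Importance is assumed to be the mean absolute activation per channel.
    """
    num_channels = len(activations[0])
    importance = [
        sum(abs(sample[c]) for sample in activations) / len(activations)
        for c in range(num_channels)
    ]
    return sorted(range(num_channels), key=lambda c: importance[c], reverse=True)


def best_subnet_under_constraint(choices, score_fn, cost_fn, max_cost):
    """Exhaustively search a small space for the best config within budget.

    `choices` maps each searchable dimension to its candidate values.
    Configs whose cost exceeds `max_cost` are skipped.
    """
    names = list(choices)
    best_config, best_score = None, float("-inf")
    for values in product(*(choices[name] for name in names)):
        config = dict(zip(names, values))
        if cost_fn(config) > max_cost:
            continue  # violates the constraint
        score = score_fn(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Toy calibration activations: 3 samples x 4 channels (made-up numbers).
acts = [
    [0.1, -2.0, 0.5, 0.0],
    [0.2, 1.5, -0.4, 0.1],
    [-0.1, 1.0, 0.6, 0.0],
]
ranking = rank_channels_by_activation(acts)
kept_channels = sorted(ranking[:2])  # keep the two most active channels

# Toy search space with a hypothetical cost and score function.
search_space = {"width": [16, 32, 64], "depth": [2, 4]}
best_config, best_score = best_subnet_under_constraint(
    search_space,
    score_fn=lambda cfg: cfg["width"] + 10 * cfg["depth"],
    cost_fn=lambda cfg: cfg["width"] * cfg["depth"],
    max_cost=128,
)
print(ranking, kept_channels, best_config, best_score)
```

In a real workflow the importance signal comes from forward or backward passes over calibration data and the score from a validation metric, so the search strategy matters far more than in this exhaustive toy version.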
Check out the Quick Start: Pruning and the detailed Optimization Guide in the Model Optimizer documentation for more information on how to use these pruning algorithms.
Check out the Minitron pruning example in the NVIDIA NeMo repository, which showcases the Minitron pruning algorithm developed by NVIDIA Research for pruning LLMs such as Llama 3.1 8B or Mistral NeMo 12B.
You can also look at the tutorial notebooks here, which walk step by step through Minitron pruning followed by distillation for Llama 3.1 8B in the NeMo framework.
NOTE: To use Minitron for pruning Hugging Face LLMs, first convert the model with the HF to NeMo converters, then apply Minitron pruning (optionally followed by distillation) in the NeMo framework, and finally convert the result back to Hugging Face format. The converter scripts are available in the NeMo repository.
Check out the BERT pruning example in the chained_optimizations directory, which showcases GradNAS for pruning a BERT model for Question Answering, followed by fine-tuning with distillation and quantization. The example also demonstrates how to save and restore pruned models.
Check out the interactive FastNAS pruning notebook cifar_resnet in this directory, which showcases FastNAS for pruning a ResNet 20 model for the CIFAR-10 dataset. The notebook also shows how to profile the model to understand the search space of possible pruning options and demonstrates how to save and restore pruned models.