diff --git a/tutorials/llm/llama-3/README.rst b/tutorials/llm/llama-3/README.rst index 3bb1a0896b82..1d12b8847c0d 100755 --- a/tutorials/llm/llama-3/README.rst +++ b/tutorials/llm/llama-3/README.rst @@ -2,7 +2,7 @@ Getting Started with Llama 3 and Llama 3.1 ========================================== -This repository contains jupyter notebook tutorials using NeMo Framework for Llama-3 and Llama-3.1 models by Meta. +This repository contains Jupyter Notebook tutorials using the NeMo Framework for Llama-3 and Llama-3.1 models by Meta. .. list-table:: :widths: 100 25 100 @@ -16,7 +16,7 @@ This repository contains jupyter notebook tutorials using NeMo Framework for Lla - Perform LoRA PEFT on Llama 3 8B Instruct using a dataset for bio-medical domain question answering. Deploy multiple LoRA adapters with NVIDIA NIM. * - `Llama 3.1 Law-Domain LoRA Fine-Tuning and Deployment with NeMo Framework and NVIDIA NIM <./sdg-law-title-generation>`_ - `Law StackExchange `_ - - Perform LoRA PEFT on Llama 3.1 8B Instruct using a synthetically augmented version of Law StackExchange with NeMo Framework, followed by deployment with NVIDIA NIM. As a pre-requisite, follow the tutorial for `data curation using NeMo Curator `__. + - Perform LoRA PEFT on Llama 3.1 8B Instruct using a synthetically augmented version of Law StackExchange with NeMo Framework, followed by deployment with NVIDIA NIM. As a prerequisite, follow the tutorial for `data curation using NeMo Curator `_. * - `Llama 3.1 Pruning and Distillation with NeMo Framework <./pruning-distillation>`_ - `WikiText-103-v1 `_ - Perform pruning and distillation on Llama 3.1 8B using the WikiText-103-v1 dataset with NeMo Framework. diff --git a/tutorials/llm/llama-3/pruning-distillation/01_data_preparation.ipynb b/tutorials/llm/llama-3/pruning-distillation/01_data_preparation.ipynb index 1f84dd2719e6..8548c0cfb1d0 100644 --- a/tutorials/llm/llama-3/pruning-distillation/01_data_preparation.ipynb +++ b/tutorials/llm/llama-3/pruning-distillation/01_data_preparation.ipynb @@ -9,7 +9,7 @@ "\n", "The dataset has to be preprocessed using the [preprocess_data_for_megatron.py](https://github.com/NVIDIA/NeMo/blob/main/scripts/nlp_language_modeling/preprocess_data_for_megatron.py) script included in the NeMo Framework. This step will also tokenize data using the `meta-llama/Meta-Llama-3.1-8B` tokenizer model to convert the data into a memory map format.\n", "\n", - "> `NOTE:` In the block of code below, pass the paths to your train, test and validation data files." + "> `NOTE:` In the block of code below, pass the paths to your train, test, and validation data files." ] }, { diff --git a/tutorials/llm/llama-3/pruning-distillation/02_teacher_finetuning.ipynb b/tutorials/llm/llama-3/pruning-distillation/02_teacher_finetuning.ipynb index 8d08793bbe9a..7d58ac4779aa 100644 --- a/tutorials/llm/llama-3/pruning-distillation/02_teacher_finetuning.ipynb +++ b/tutorials/llm/llama-3/pruning-distillation/02_teacher_finetuning.ipynb @@ -6,15 +6,15 @@ "metadata": {}, "source": [ "\n", - "### Step 2: Finetune the teacher on the dataset\n", + "### Step 2: Fine-tune the teacher on the dataset\n", "\n", - "NeMo framework includes a standard python script [megatron_gpt_pretraining.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/megatron_gpt_pretraining.py) for training a model. 
Once you have your model downloaded and the dataset ready, fine-tuning the teacher model with NeMo is essentially just running this script!\n", + "NeMo Framework includes a standard Python script, [megatron_gpt_pretraining.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/megatron_gpt_pretraining.py), for training a model. Once you have your model downloaded and the dataset ready, fine-tuning the teacher model with NeMo is essentially just running this script!\n", "\n", - "We finetune the unpruned model on our dataset to correct the distribution shift across the original dataset the model was trained on. Per the [blog](https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/) and [tech report](https://arxiv.org/pdf/2408.11796), experiments showed that, without correcting for the distribution shift, the teacher provides suboptimal guidance on the dataset when being distilled.\n", + "We fine-tune the unpruned model on our dataset to correct the distribution shift from the original dataset the model was trained on. According to the [blog](https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/) and [tech report](https://arxiv.org/pdf/2408.11796), experiments showed that without correcting for this distribution shift, the teacher provides suboptimal guidance on the dataset during distillation.\n", "\n", "For this demonstration, this training run is capped by `STEPS`, and validation is carried out every `VAL_INTERVAL` steps.\n", "\n", - "> `NOTE:` In the block of code below, pass the paths to your pre-processed train, test and validation data files as well as path to the teacher .nemo model." + "> `NOTE:` In the block of code below, pass the paths to your pre-processed train, test, and validation data files, as well as the path to the teacher .nemo model." ] }, { @@ -124,8 +124,8 @@ "id": "3040a993-8423-475f-8bc6-d1dd1ce16a83", "metadata": {}, "source": [ - "This will create a finetuned teacher model named `megatron_llama_ft.nemo` in `./distill_trainings/megatron_llama_ft/checkpoints/`. We'll use this later.\n", - "> `NOTE:`This script takes at least 20 minutes to run (depending on GPU) and will generate the finetuned teacher model." + "This will create a fine-tuned teacher model named `megatron_llama_ft.nemo` in `./distill_trainings/megatron_llama_ft/checkpoints/`. We'll use this later.\n", + "> `NOTE:` This script takes at least 20 minutes to run (depending on GPU) and will generate the fine-tuned teacher model." ] } ], diff --git a/tutorials/llm/llama-3/pruning-distillation/03_a_depth_pruning.ipynb b/tutorials/llm/llama-3/pruning-distillation/03_a_depth_pruning.ipynb index a195c2f3a405..d64f8c15bd00 100644 --- a/tutorials/llm/llama-3/pruning-distillation/03_a_depth_pruning.ipynb +++ b/tutorials/llm/llama-3/pruning-distillation/03_a_depth_pruning.ipynb @@ -5,8 +5,8 @@ "id": "8bc99d2f-9ac6-40c2-b072-12b6cb7b9aca", "metadata": {}, "source": [ - "### Step 3: Prune the finetuned-teacher model to create a student\n", - "In this step, we will explore two methods to prune the finetuned teacher model. Refer to the ``NOTE`` in the **_step-by-step instructions_** section of [introduction.ipynb](./introduction.ipynb) to decide which pruning techniques you would like to explore.\n", + "### Step 3: Prune the fine-tuned teacher model to create a student\n", + "In this step, we will explore two methods to prune the fine-tuned teacher model. 
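The preprocessing and fine-tuning steps above both expect explicit train, test, and validation file paths. As a rough illustration of how those inputs can be produced from WikiText-103-v1, here is a minimal sketch; the output file names and the `text` JSON key are assumptions, so align them with whatever you actually pass to `preprocess_data_for_megatron.py`.

```python
# Minimal sketch: export WikiText-103-v1 splits to JSONL for preprocessing.
# The file names and the "text" JSON key are illustrative assumptions; keep
# them consistent with the inputs you hand to preprocess_data_for_megatron.py.
import json
from datasets import load_dataset  # pip install datasets

splits = {
    "train": "wikitext-train.jsonl",
    "validation": "wikitext-val.jsonl",
    "test": "wikitext-test.jsonl",
}

dataset = load_dataset("Salesforce/wikitext", "wikitext-103-v1")

for split, path in splits.items():
    with open(path, "w") as f:
        for example in dataset[split]:
            text = example["text"].strip()
            if text:  # WikiText contains many empty lines; skip them
                f.write(json.dumps({"text": text}) + "\n")
    print(f"wrote {path}")
```

The resulting JSONL files are then tokenized into the memory-mapped format by the preprocessing script before fine-tuning.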
Refer to the ``NOTE`` in the **_step-by-step instructions_** section of [introduction.ipynb](./introduction.ipynb) to decide which pruning techniques you would like to explore.\n", "\n", "In the first method, depth-pruning, we trim the layers of the model." ] }, @@ -21,7 +21,7 @@ "\n", "Per the [blog](https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/) and [tech report](https://arxiv.org/pdf/2408.11796), removing contiguous layers from the second last block (layers 16 to 31 continuously) yields the best overall results. \n", "\n", - "> `NOTE:` In the block of code below, pass the paths to your finetuned teacher .nemo model." + "> `NOTE:` In the block of code below, pass the paths to your fine-tuned teacher .nemo model." ] }, { diff --git a/tutorials/llm/llama-3/pruning-distillation/03_b_width_pruning.ipynb b/tutorials/llm/llama-3/pruning-distillation/03_b_width_pruning.ipynb index 7d91d36cbb32..5c4a47872afb 100644 --- a/tutorials/llm/llama-3/pruning-distillation/03_b_width_pruning.ipynb +++ b/tutorials/llm/llama-3/pruning-distillation/03_b_width_pruning.ipynb @@ -5,8 +5,8 @@ "id": "8bc99d2f-9ac6-40c2-b072-12b6cb7b9aca", "metadata": {}, "source": [ - "### Step 3: Prune the finetuned-teacher model to create a student\n", - "In the second method, we will width-prune. In width-pruning, we trim the neurons, attention heads and embedding channels. \n", + "### Step 3: Prune the fine-tuned teacher model to create a student\n", + "In the second method, we will width-prune. In width-pruning, we trim the neurons, attention heads, and embedding channels.\n", "\n", "Refer to the ``NOTE`` in the **_step-by-step instructions_** section of [introduction.ipynb](./introduction.ipynb) to decide which pruning techniques you would like to explore." ] }, @@ -20,15 +20,15 @@ "source": [ "#### Step 3.b.: Using width-pruning\n", "To width-prune the model, we do the following:\n", - "- prune (trim) the MLP intermediate dimension from 14336 to 9216.\n", - "- prune the hidden size from 4096 to 3072.\n", - "- and retrain the attention headcount and number of layers\n", + "- Prune (trim) the MLP intermediate dimension from 14336 to 9216.\n", + "- Prune the hidden size from 4096 to 3072.\n", + "- Retain the attention head count and the number of layers.\n", "\n", - "For width-pruning we will use the [megatron_gpt_prune.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/megatron_gpt_prune.py) script in the NeMo Framework. To see the detailed list of parameters for width-pruning, you can view the [megatron_gpt_prune.yaml](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/conf/megatron_gpt_prune.yaml) file.\n", + "For width-pruning, we will use the [megatron_gpt_prune.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/megatron_gpt_prune.py) script in the NeMo Framework. To see the detailed list of parameters for width-pruning, you can view the [megatron_gpt_prune.yaml](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/conf/megatron_gpt_prune.yaml) file.\n", "\n", "We use the above parameters to get a competitive model for this demonstration. You can use other strategies or parameters from the [blog](https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/) or the [tech report](https://arxiv.org/pdf/2408.11796) for your experiments. 
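To get an intuition for why trimming only the MLP intermediate dimension and the hidden size lands near a 4B-parameter student, here is a rough, back-of-the-envelope sketch. It assumes a Llama-3.1-8B-like layout (32 layers, 32 query heads of dimension 128, 8 KV heads, a 128256-token vocabulary, and an untied output head) and ignores small terms such as norms, so treat the totals as ballpark figures rather than the exact counts of the pruned checkpoint.

```python
# Rough parameter-count estimate for the width-pruning recipe described above.
# The architecture numbers are assumptions based on the Llama 3.1 8B config;
# layer norms, rotary embeddings, and biases are ignored.
def approx_params(hidden, ffn, layers=32, q_heads=32, kv_heads=8,
                  head_dim=128, vocab=128256):
    attn = hidden * q_heads * head_dim          # Q projection
    attn += 2 * hidden * kv_heads * head_dim    # K and V projections (GQA)
    attn += q_heads * head_dim * hidden         # output projection
    mlp = 3 * hidden * ffn                      # gate, up, and down projections
    embeddings = 2 * vocab * hidden             # input embedding + untied LM head
    return layers * (attn + mlp) + embeddings

teacher = approx_params(hidden=4096, ffn=14336)   # unpruned 8B teacher
student = approx_params(hidden=3072, ffn=9216)    # width-pruned student
print(f"teacher ~{teacher / 1e9:.2f}B parameters, student ~{student / 1e9:.2f}B")
```

Most of the savings come from the MLP blocks, which is why the recipe concentrates on the intermediate and hidden dimensions while leaving the head count and layer count untouched.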
\n", "\n", - "> `NOTE:` In the block of code below, pass the paths to your finetuned teacher .nemo model.\n", + "> `NOTE:` In the block of code below, pass the paths to your fine-tuned teacher .nemo model.\n", "\n", "> `TIP:` You can increase the ``batch_size`` (upto 1024) to speed up the width-pruning script execution." ] diff --git a/tutorials/llm/llama-3/pruning-distillation/04_a_distilling_depth_pruned_student.ipynb b/tutorials/llm/llama-3/pruning-distillation/04_a_distilling_depth_pruned_student.ipynb index ccbe1cbf394b..488225837731 100644 --- a/tutorials/llm/llama-3/pruning-distillation/04_a_distilling_depth_pruned_student.ipynb +++ b/tutorials/llm/llama-3/pruning-distillation/04_a_distilling_depth_pruned_student.ipynb @@ -6,9 +6,9 @@ "metadata": {}, "source": [ "### Step 4: Distill knowledge from teacher into student\n", - "Distillation of a model with NeMo Framework is also possible using a python script: [megatron_gpt_distillation.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/megatron_gpt_distillation.py). In this notebook, we will explore distillation with the depth-pruned model as the `STUDENT` model. \n", + "Distillation of a model with NeMo Framework is also possible using a Python script: [megatron_gpt_distillation.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/megatron_gpt_distillation.py). In this notebook, we will explore distillation with the depth-pruned model as the `STUDENT` model.\n", "\n", - "For this demonstration, the `TEACHER` would be the finetuned teacher model `megatron_llama_ft.nemo` and the `STUDENT` model would be the pruned 4B model. This training run is capped by `STEPS`, and validation is carried out every `VAL_INTERVAL` steps." + "For this demonstration, the `TEACHER` would be the fine-tuned teacher model `megatron_llama_ft.nemo` and the `STUDENT` model would be the pruned 4B model. This training run is capped by `STEPS`, and validation is carried out every `VAL_INTERVAL` steps." ] }, { @@ -19,7 +19,7 @@ "#### Step 4.a.: Using depth-pruned student\n", "While distilling knowledge from the teacher to depth-pruned model, the `STUDENT` model would be `4b_depth_pruned_model.nemo` as produced by the [depth-pruning](./03_a_depth_pruning.ipynb) notebook. This training run is capped by `STEPS`, and validation is carried out every `VAL_INTERVAL` steps.\n", "\n", - "> `NOTE:` In the block of code below, pass the paths to your pre-processed train, test and validation data files as well as path to the teacher and student .nemo models." + "> `NOTE:` In the block of code below, pass the paths to your pre-processed train, test, and validation data files, as well as path to the teacher and student .nemo models." ] }, { diff --git a/tutorials/llm/llama-3/pruning-distillation/04_b_distilling_width_pruned_student.ipynb b/tutorials/llm/llama-3/pruning-distillation/04_b_distilling_width_pruned_student.ipynb index 48e81c96cdcf..95110dd19dd9 100644 --- a/tutorials/llm/llama-3/pruning-distillation/04_b_distilling_width_pruned_student.ipynb +++ b/tutorials/llm/llama-3/pruning-distillation/04_b_distilling_width_pruned_student.ipynb @@ -6,10 +6,10 @@ "metadata": {}, "source": [ "### Step 4: Distill knowledge from teacher into student\n", - "Distillation of a model with NeMo Framework is also possible using a python script: [megatron_gpt_distillation.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/megatron_gpt_distillation.py). 
\n", + "Distillation of a model with NeMo Framework is also possible using a Python script: [megatron_gpt_distillation.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/megatron_gpt_distillation.py). \n", "In this notebook, we will explore distillation with the width-pruned model as the `STUDENT` model.\n", "\n", - "For this demonstration, the `TEACHER` would be the finetuned teacher model `megatron_llama_ft.nemo` and the `STUDENT` model would be the pruned 4B model. This training run is capped by `STEPS`, and validation is carried out every `VAL_INTERVAL` steps." + "For this demonstration, the `TEACHER` would be the fine-tuned teacher model `megatron_llama_ft.nemo` and the `STUDENT` model would be the pruned 4B model. This training run is capped by `STEPS`, and validation is carried out every `VAL_INTERVAL` steps." ] }, { @@ -20,7 +20,7 @@ "#### Step 4.b.: Using width-pruned student\n", "While distilling knowledge from the teacher to width-pruned model, the `STUDENT` model would be `4b_width_pruned_model.nemo` as produced by the [width-pruning](./03_b_width_pruning.ipynb) notebook. This training run is capped by `STEPS`, and validation is carried out every `VAL_INTERVAL` steps.\n", "\n", - "> `NOTE:` In the block of code below, pass the paths to your pre-processed train, test and validation data files as well as path to the teacher and student .nemo models." + "> `NOTE:` In the block of code below, pass the paths to your pre-processed train, test, and validation data files, as well as path to the teacher and student .nemo models." ] }, { diff --git a/tutorials/llm/llama-3/pruning-distillation/05_display_results.ipynb b/tutorials/llm/llama-3/pruning-distillation/05_display_results.ipynb index 0264cc288957..dcb483c55ab6 100644 --- a/tutorials/llm/llama-3/pruning-distillation/05_display_results.ipynb +++ b/tutorials/llm/llama-3/pruning-distillation/05_display_results.ipynb @@ -8,7 +8,8 @@ "### Step 5: Display the validation loss\n", "\n", "Now that the results are in, let's visualize the validation loss of the two distilled models using the `tensorboard` library. \n", - "> `NOTE:` This notebook demonstrates the use of the teacher finetuning, pruning and the distillation script. These scripts should ideally be run on a multi-node cluster with a larger `GLOBAL_BATCH_SIZE` and `STEPS` to see improvement in the validation loss." + "\n", + "> `NOTE:` This notebook demonstrates the use of the teacher fine-tuning, pruning, and the distillation script. These scripts should ideally be run on a multi-node cluster with a larger `GLOBAL_BATCH_SIZE` and `STEPS` to see improvement in the validation loss." ] }, { @@ -16,8 +17,8 @@ "id": "b5822d62-8131-4046-8c22-0bf0fce81df7", "metadata": {}, "source": [ - "#### Validation Loss using depth-pruned model as student in distillation script\n", - "Here is an image of the validation loss over 30 steps of running the training step in the distillation script when we distill the knowledge from the finetuned teacher model to the depth-pruned student." + "#### Validation Loss Using Depth-Pruned Model as Student in Distillation Script\n", + "Here is an image of the validation loss over 30 steps of running the training step in the distillation script, where we distill the knowledge from the fine-tuned teacher model to the depth-pruned student." 
] }, { @@ -35,7 +36,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 1, "id": "db6fcf26-8ae8-40e1-875a-0a10bf85be81", "metadata": { "tags": [] @@ -44,7 +45,7 @@ { "data": { "text/html": [ - "
Validation Loss over 30 Training Steps with Depth-Pruned model as Student
" + "
Validation Loss over 30 Training Steps with Depth-Pruned Model as Student
" ], "text/plain": [ "" @@ -68,7 +69,7 @@ ], "source": [ "from IPython.display import Image, display, HTML\n", - "title = \"Validation Loss over 30 Training Steps with Depth-Pruned model as Student\"\n", + "title = \"Validation Loss over 30 Training Steps with Depth-Pruned Model as Student\"\n", "display(HTML(f\"
{title}
\"))\n", "display(Image(url=\"https://github.com/NVIDIA/NeMo/releases/download/r2.0.0rc1/val_loss_depth_pruned_student_distillation.png\", width=400))" ] @@ -78,8 +79,8 @@ "id": "f10041ae-6533-47de-9f76-f97d4469c27a", "metadata": {}, "source": [ - "#### Validation Loss using width-pruned model as student in distillation script\n", - "Here is an image of the validation loss over 30 steps of running the training step in the distillation script when we distill the knowledge from the finetuned teacher model to the width-pruned student." + "#### Validation Loss Using Width-Pruned Model as Student in Distillation Script\n", + "Here is an image of the validation loss over 30 steps of running the training step in the distillation script, where we distill the knowledge from the fine-tuned teacher model to the width-pruned student." ] }, { @@ -97,7 +98,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 2, "id": "ecd79583-f662-40c6-a690-9f4bb847de4e", "metadata": { "tags": [] @@ -106,7 +107,7 @@ { "data": { "text/html": [ - "
Validation Loss over 30 Training Steps with Width-Pruned model as Student
" + "
Validation Loss over 30 Training Steps with Width-Pruned Model as Student
" ], "text/plain": [ "" @@ -130,18 +131,10 @@ ], "source": [ "from IPython.display import Image, display, HTML\n", - "title = \"Validation Loss over 30 Training Steps with Width-Pruned model as Student\"\n", + "title = \"Validation Loss over 30 Training Steps with Width-Pruned Model as Student\"\n", "display(HTML(f\"
{title}
\"))\n", "display(Image(url=\"https://github.com/NVIDIA/NeMo/releases/download/r2.0.0rc1/val_loss_width_pruned_student_distillation.png\", width=400))" ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7ab6ed6f-8bc3-4188-919f-7cee842635ed", - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": { diff --git a/tutorials/llm/llama-3/pruning-distillation/README.rst b/tutorials/llm/llama-3/pruning-distillation/README.rst index 34febcffa366..45cb119ffcd8 100644 --- a/tutorials/llm/llama-3/pruning-distillation/README.rst +++ b/tutorials/llm/llama-3/pruning-distillation/README.rst @@ -1,13 +1,13 @@ Llama 3.1 Pruning and Distillation with NeMo Framework ======================================================================================= -`Llama 3.1 `_ are open-source large language models by Meta that deliver state-of-the-art performance on popular industry benchmarks. They have been pretrained on over 15 trillion tokens, and support a 128K token context length. They are available in three sizes, 8B, 70B, and 405B, and each size has two variants—base pretrained and instruction tuned. +`Llama 3.1 `_ models, developed by Meta, are open-source large language models that deliver state-of-the-art performance on popular industry benchmarks. Pretrained on over 15 trillion tokens, they support a 128K token context length. These models are available in three sizes: 8B, 70B, and 405B. Each size offers two variants: base pretrained and instruction tuned. -`NVIDIA NeMo Framework `_ provides tools to perform teacher finetuning, pruning and distillation on Llama 3.1 to fit your use case. +`NVIDIA NeMo Framework `_ provides tools to perform teacher fine-tuning, pruning, and distillation on Llama 3.1 to fit your use case. `NVIDIA TensorRT Model Optimizer `_ is a library (referred to as **Model Optimizer**, or **ModelOpt**) comprising state-of-the-art model optimization techniques including `quantization `_, `sparsity `_, `distillation `_, and `pruning `_ to compress models. -`LLM Pruning and Distillation in Practice: The Minitron Approach `_ provides tools to perform teacher finetuning, pruning and distillation on Llama 3.1 as described in the `tech report `_. +`LLM Pruning and Distillation in Practice: The Minitron Approach `_ provides tools to perform teacher fine-tuning, pruning, and distillation on Llama 3.1 as described in the `tech report `_. `How to Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model `_ provides practical and effective structured compression best practices for LLMs that combine depth, width, attention, and MLP pruning with knowledge distillation-based retraining. These strategies are presented in the `Compact Language Models via Pruning and Knowledge Distillation `_ paper. @@ -16,30 +16,33 @@ Llama 3.1 Pruning and Distillation with NeMo Framework Objectives ---------- -This tutorial shows how to perform depth-pruning, teacher finetuning and distillation on **Llama 3.1 8B** using the `WikiText-103-v1 `_ dataset with NeMo Framework. The `WikiText-103-v1 `_ language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. For this demonstration, we will perform teacher correction by running a light finetuning procedure on the ``Meta Llama 3.1 8B`` teacher model to generate a finetuned teacher model ``megatron_llama_ft.nemo`` needed for optimal distillation. This finetuned teacher model is then trimmed. 
There are two methods to prune a model: depth-pruning and width-pruning. We will be exploring both pruning techniques which will yield ``4b_depth_pruned_model.nemo`` and ``4b_width_pruned_model.nemo`` respectively. These models will serve as a starting point for distillation to create the final distilled 4B models. +This tutorial demonstrates how to perform depth-pruning, width-pruning, teacher fine-tuning, and distillation on **Llama 3.1 8B** using the `WikiText-103-v1 `_ dataset with the NeMo Framework. The `WikiText-103-v1 `_ language modeling dataset comprises over 100 million tokens extracted from verified Good and Featured articles on Wikipedia. + +For this demonstration, we will perform teacher correction by running a light fine-tuning procedure on the ``Meta Llama 3.1 8B`` teacher model to generate a fine-tuned teacher model, ``megatron_llama_ft.nemo``, needed for optimal distillation. This fine-tuned teacher model is then trimmed. There are two methods to prune a model: depth-pruning and width-pruning. We will explore both techniques, yielding ``4b_depth_pruned_model.nemo`` and ``4b_width_pruned_model.nemo``, respectively. These models will serve as starting points for distillation to create the final distilled 4B models. + We are using models utilizing the ``meta-llama/Meta-Llama-3.1-8B`` tokenizer for this demonstration. -``NOTE:`` A subset of functions is being demonstrated in the notebooks. Some features like Neural Architecture Search (NAS) are unavailable but will be supported in future releases. +``NOTE:`` A subset of functions is being demonstrated in the notebooks. Some features like Neural Architecture Search (NAS) are unavailable, but will be supported in future releases. Requirements ------------- * System Configuration - * Access to at least 8 NVIDIA GPU with an individual memory of at least 80GB, for example: 8 x H100-80GB or 8 x A100-80GB. + * Access to at least 8 NVIDIA GPUs, each with a memory of at least 80GB (e.g., 8 x H100-80GB or 8 x A100-80GB). * A Docker-enabled environment, with `NVIDIA Container Runtime `_ installed, which will make the container GPU-aware. -* `Authenticate with NVIDIA NGC `_, and download `NGC CLI Tool `_. You will use this tool to download the model and customize it with NeMo Framework. +* `Authenticate with NVIDIA NGC `_ and download `NGC CLI Tool `_. You will use this tool to download the model and customize it with NeMo Framework. * Get your Hugging Face `access token `_, which will be used to obtain the tokenizer required during training. -``NOTE:`` The default configuration in the notebook runs on 8 x 80GB NVIDIA GPUs but you can potentially reduce Tensor Parallel size ``(TENSOR_PARALLEL_SIZE)`` along with the Micro-Batchsize ``(MICRO_BATCH_SIZE)`` in the teacher finetuning and distillation scripts to accommodate lower resource availability. +``NOTE:`` The default configuration in the notebook runs on 8 x 80GB NVIDIA GPUs. However, you can potentially reduce the Tensor Parallel size ``(TENSOR_PARALLEL_SIZE)`` along with the Micro-Batch Size ``(MICRO_BATCH_SIZE)`` in the teacher fine-tuning and distillation scripts to accommodate lower resource availability. -Create a pruned and distilled model with NeMo Framework +Create a Pruned and Distilled Model with NeMo Framework ------------------------------------------------------------------------------ -For pruning and distilling the model, you will use the NeMo Framework which is available as a `docker container `_. 
+For pruning and distilling the model, you will use the NeMo Framework, which is available as a `Docker container `_. -``NOTE:`` These notebooks use `NVIDIA TensorRT Model Optimizer `_ under the hood for pruning and distillation. +``NOTE:`` These notebooks use the `NVIDIA TensorRT Model Optimizer `_ under the hood for pruning and distillation. 1. Download the `Llama 3.1 8B .nemo `_ from NVIDIA NGC using the `NGC CLI `_. Generate the ``NGC_API_KEY`` following these `instructions `_. The following command saves the ``.nemo`` format model in a folder named ``llama-3_1-8b-nemo_v1.0`` in the current directory. You can specify another path using the ``-d`` option in the CLI tool. @@ -75,7 +78,7 @@ For pruning and distilling the model, you will use the NeMo Framework which is a 4. Then, navigate to `this notebook <./introduction.ipynb>`_ to get started. -This directory contains a list of notebooks which will go over all the steps to create a distilled 4B model. +This directory contains a list of notebooks that cover all the steps to create a distilled 4B model. :: @@ -91,7 +94,7 @@ This directory contains a list of notebooks which will go over all the steps to Results ------------------------------------------------------------------------------ -``NOTE:`` This notebook demonstrates the use of the teacher finetuning, pruning and the distillation scripts. These scripts should ideally be run on a multi-node cluster with a larger ``GLOBAL_BATCH_SIZE`` and ``STEPS`` to see improvement in the validation loss. +``NOTE:`` This notebook demonstrates the use of the teacher fine-tuning, pruning, and the distillation scripts. These scripts should ideally be run on a multi-node cluster with a larger ``GLOBAL_BATCH_SIZE`` and ``STEPS`` to see improvement in the validation loss. Here are the validation loss plots over 30 steps of running the training step in the distillation script (at the end of the `notebook <./05_display_results.ipynb>`_). @@ -100,11 +103,11 @@ Here are the validation loss plots over 30 steps of running the training step in :alt: Diagram showing the validation loss over 30 steps of running the training step in the distillation script when using the depth-pruned model as the student :align: center - Figure 1: Validation Loss Plot when using the depth-pruned model as the student + Figure 1: Validation Loss Plot When Using the Depth-Pruned Model as the Student .. 
figure:: https://github.com/NVIDIA/NeMo/releases/download/r2.0.0rc1/val_loss_width_pruned_student_distillation.png :width: 400px :alt: Diagram showing the validation loss over 30 steps of running the training step in the distillation script when using the width-pruned model as the student :align: center - Figure 2: Validation Loss Plot when using the width-pruned model as the student \ No newline at end of file + Figure 2: Validation Loss Plot When Using the Width-Pruned Model as the Student \ No newline at end of file diff --git a/tutorials/llm/llama-3/pruning-distillation/introduction.ipynb b/tutorials/llm/llama-3/pruning-distillation/introduction.ipynb index 1a3efc9f5f1e..71a5a6cfb03c 100644 --- a/tutorials/llm/llama-3/pruning-distillation/introduction.ipynb +++ b/tutorials/llm/llama-3/pruning-distillation/introduction.ipynb @@ -7,7 +7,7 @@ "tags": [] }, "source": [ - "# Pruning and Distillation of Llama 3.1 model with NeMo Framework" + "# Efficient Model Reduction with Pruning and Distillation of Llama 3.1 Using NeMo Framework" ] }, { @@ -15,15 +15,15 @@ "id": "03fd1cf4-c67a-4b8d-a5e5-46531be0f991", "metadata": {}, "source": [ - "This demonstration showcases performing pruning and distillation on **Llama 3.1-8B** with the [WikiText-103-v1](https://huggingface.co/datasets/Salesforce/wikitext/viewer/wikitext-103-v1) dataset using NeMo Framework. The [WikiText-103-v1](https://huggingface.co/datasets/Salesforce/wikitext/viewer/wikitext-103-v1) language modeling dataset is a collection of over 100 million tokens extracted from the set of verified 'Good' and 'Featured' articles on Wikipedia. \n", + "This tutorial demonstrates how to perform depth-pruning, teacher fine-tuning, and distillation on **Llama 3.1-8B** using the [WikiText-103-v1](https://huggingface.co/datasets/Salesforce/wikitext/viewer/wikitext-103-v1) dataset with NeMo Framework. The [WikiText-103-v1](https://huggingface.co/datasets/Salesforce/wikitext/viewer/wikitext-103-v1) language modeling dataset comprises over 100 million tokens extracted from verified Good and Featured articles on Wikipedia.\n", "\n", - "For this demonstration, we will perform a light finetuning procedure on the `Meta Llama 3.1 8B` teacher model to generate a finetuned teacher model. This finetuned teacher model will then be trimmed. There are two methods to prune a model: depth-pruning and width-pruning. This workflow will showcase both methods which will yield `4b_depth_pruned_model.nemo` and `4b_width_pruned_model.nemo` respectively, that will serve as a starting point for distillation to the final 4B models. \n", + "For this demonstration, we will perform teacher correction by running a light fine-tuning procedure on the `Meta Llama 3.1 8B` teacher model to generate a fine-tuned teacher model, `megatron_llama_ft.nemo`, needed for optimal distillation. This fine-tuned teacher model is then trimmed. There are two methods to prune a model: depth-pruning and width-pruning. We will explore both techniques, yielding `4b_depth_pruned_model.nemo` and `4b_width_pruned_model.nemo`, respectively. These models will serve as starting points for distillation to create the final distilled 4B models.\n", "\n", "> We are using models utilizing the `meta-llama/Meta-Llama-3.1-8B` tokenizer for this demonstration.\n", "\n", "> `NOTE:` Ensure that you run this notebook inside the [NeMo Framework container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) which has all the required dependencies. 
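Because the default configuration in these notebooks targets 8 x 80 GB GPUs, it can help to sanity-check the hardware visible inside the container before launching the longer fine-tuning and distillation runs. A minimal sketch using PyTorch (already included in the NeMo Framework container) is shown below; the 8-GPU and 80 GB thresholds simply mirror the requirements stated in the README.

```python
# Sketch: confirm GPU count and memory inside the NeMo Framework container.
# The 8-GPU / 80 GB figures mirror the README requirements; if you fall short,
# consider lowering TENSOR_PARALLEL_SIZE and MICRO_BATCH_SIZE as noted there.
import torch

num_gpus = torch.cuda.device_count()
print(f"visible GPUs: {num_gpus}")

for i in range(num_gpus):
    props = torch.cuda.get_device_properties(i)
    print(f"  GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GB")

if num_gpus < 8:
    print("Fewer than 8 GPUs detected; reduce parallelism settings accordingly.")
```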
\n", "\n", - "**Instructions are available in the associated tutorial README to download the model and the container.**" + "**Instructions for downloading the model and the container are available in the [README](./README.rst).**" ] }, { @@ -49,8 +49,8 @@ "source": [ "---\n", "## Prerequisites\n", - "Ensure you have the following -\n", - "1. **Get the teacher model**: Download the `Meta Llama 3.1 8B .nemo` model. You must follow the instructions in the associated README to download and mount the folder to the NeMo FW container." + "Ensure you meet the prerequisites listed in this section.\n", + "1. **Get the teacher model**: Download the `Meta Llama 3.1 8B .nemo` model. You must follow the instructions in the associated README to download and mount the folder to the NeMo Framework container." ] }, { @@ -149,12 +149,12 @@ }, "source": [ "---\n", - "## Step-by-step instructions\n", + "## Step-by-Step Instructions\n", "\n", "This workflow is structured into seven notebooks:\n", "1. [Prepare the dataset](./01_data_preparation.ipynb)\n", - "2. [Finetune the teacher on the dataset](./02_teacher_finetuning.ipynb)\n", - "3. Prune the finetuned-teacher model to create a student \n", + "2. [Fine-tune the teacher on the dataset](./02_teacher_finetuning.ipynb)\n", + "3. Prune the fine-tuned teacher model to create a student \n", " - 3.a. [Using depth-pruning](./03_a_depth_pruning.ipynb)\n", " - 3.b. [Using width-pruning](./03_b_width_pruning.ipynb)\n", "4. Distill knowledge from teacher into student\n", @@ -162,7 +162,7 @@ " - 4.b. [Using width-pruned student](./04_b_distilling_width_pruned_student.ipynb)\n", "5. [Display the validation loss](./05_display_results.ipynb)\n", "\n", - "> `NOTE:` We are exploring two methods to prune the finetuned teacher model: [depth-pruning](./03_a_depth_pruning.ipynb) and [width-pruning](./03_b_width_pruning.ipynb). Per the [tech report](https://arxiv.org/pdf/2408.11796), we can observe that width-pruning generally outperforms depth-pruning so users can choose to perform either [depth-pruning](./03_a_depth_pruning.ipynb) or [width-pruning](./03_b_width_pruning.ipynb) or both methods." + "> `NOTE:` We are exploring two methods to prune the fine-tuned teacher model: [depth-pruning](./03_a_depth_pruning.ipynb) and [width-pruning](./03_b_width_pruning.ipynb). Per the [tech report](https://arxiv.org/pdf/2408.11796), we can observe that width-pruning generally outperforms depth-pruning so users can choose to perform either [depth-pruning](./03_a_depth_pruning.ipynb) or [width-pruning](./03_b_width_pruning.ipynb) or both methods." ] } ],