PEFT support (inference/finetuning) (#1153)
* .

* .

* Update the default cublas behavior when CUDA_VERSION is not specified

* fix bugs in IncMHA peft_bwd kernel

* uncomment softmaxbackward

* add layernorm to align test

* add peft test scripts

* fix import

* fix

* add code to convert peft models

* add script to download peft for c++, fix bug

* fix

* add script to fine-tune models

* implement loading lora configs/weights from file

* remove peft_bwd assertion failure in embedding

* fix download script

* add peft dependencies in dockerfile

* fix softmax backward

* fix bc print indentation

* Temporarily Revert "Update the default cublas behavior when CUDA_VERSION is not specified"

This reverts commit 4ee710a.

* Fix cublas default (#1220)

* Fix Legion prebuild workflow (2) (#1208)

* fix

* fix

* fix

* fix

* Fix Legion prebuild workflow (3) (#1210)

* fix hip error

* use CUBLAS_COMPUTE_FAST_16F for full-precision gemm

---------

Co-authored-by: Zhihao Jia <[email protected]>

* fix bugs, work on align opt-lora

* update scripts

* add code to output peft tensors in hf

* update, fixes

* linting

* fix printing of tensors for numpy

* update save_inference_tensors_to_file

* linting

* update

* fix issue with save_inference_tensors_to_file

* fix layer names for save_inference_tensors_to_file

* fix peft

* fix bwd bugs

* linting

* fixes

* fix

* fix

* fix

* add bc fields for peft training

* linting

* fix

* remove ptr check

* fix

* implement save_operators for bwd

* fix bug

* implement save tensors for bwd

* .

* bug fix

* fix

* align linear

* fix

* bwd kernel updates

* undo use of CUBLAS_COMPUTE_32F_FAST_16F for now

* only send dataset entry once

* update peft test scripts

* loss

* .

* update generate/request api to take both inference and fine-tuning prompts

* linting

* alignment fixes in lora & linear layer

* alignment fix

* diagonal

* fix

* alignment fix ssm

* sigmoid-silu-multi now fully aligned

* rms norm kernel updates

* fix

* in-place residual rms

* bug fix and linting

* align backward of o_proj, attn_heads, qk_prods_softmax, and v_proj with huggingface

* cleanup

* finished all alignment fixes in attention backward kernel

* fix

* Update inc_multihead_self_attention.cu

* Update inc_multihead_self_attention.cu

* use grad to store peft in/output (#1241)

* use grad to store peft in/output

* format

* .

* format

* enable peft request

* several hacks for performance measurement; some of the changes should be reverted

* Update sigmoid_silu_multi.cu

* RoPE backward

* PEFT bug fixes and alignment (#1269)

* Revert "several hacks for performance measurement; some of the changes should be reverted"

This reverts commit b9c3926.

* backup

* backup

* updates

* update

* backup

* backup

* backup

* fix

* cleanup

* linting

* Fuse bias + relu in OPT (#1271)

* fuse bias and relu in opt

* fix

* fix

* fix

* fix

* Peft alignment & debugging tools (#1288)

* Revert "several hacks for performance measurement; some of the changes should be reverted"

This reverts commit b9c3926.

* backup

* backup

* updates

* update

* backup

* backup

* backup

* fix

* cleanup

* fix

* fix

* fix

* update

* simplify tensor names

* fix

* fixes and updates

* fixes

* fix

* cleanup

* .

* restore softmax

* cleanup

* update alignment scripts

* newline

* fix legion aliasing error

* fix warnings

* fix

* fix pipeline parallelism

* fix tp issue in combine op

* fix lora weight loading with tensor parallelism

* fixes, implement Combine::peft_bwd_task

* fix

* replicate peft bwd

* fixes

* fix

* fix combine and fwd-bwd pass dependencies

* fix replicate bwd

* fix

* let user control amount of peft memory

* only run peft_bwd if peft is enabled

* fix rms norm inference region reqs

* fix in-place fusion (part 1)

* fix inplace fusion (part 2)

* fix

* disable automatic inplace rms norm for now

* fix inf fusion inplace

* fix rest input grads for peft without inplace residuals

* fix

* fix

* fix residual rms

* fix

* fix

* enable inf debugging in fusion bwd

* hack to silence warning in fused bwd

* fix

* fix

* fix build

* fix

* fix

* add draft peft test

* Peft python interface (#1306)

* update script

* less model renaming

* fix

* fix

* fix

* backup

* .

* update

* .

* fixes

* fix

* fix build

* fix

* fix

* fix issues for downloading peft model

* solved issues for download peft model

* added printouts for debugging

* fix

* fix seg fault

* add test, separate peft script in cpp

* fix

* fixes

* fix

* update peft python interface

* update

* update

* update

* updates

* fix

* fixes

* fix

* fixes

---------

Co-authored-by: april-yyt <[email protected]>

* fix

* update

* fix

* fix to support prompts larger than max tokens per batch

* fixes to support benchmarking of finetuning throughput

* many upgrades and updates related to finetuning

* add ttft statistics

* add warmup phase

* add benchmarking code

* Add scripts for evaluation with Microsoft Azure trace (#1363)

* Add scripts for evaluation

* Add absolute request rate value

* Fix script for target arrival rate

* Fix cpp req rate benchmark

* update to use new dataset

* Fix infinite loop

* update

* add data

---------

Co-authored-by: Remi Delacourt <[email protected]>
Co-authored-by: Gabriele Oliaro <[email protected]>

* fix

* fix

* add peft tests to ci

* shellcheck

* fix

* fix python requirements

* fix

* fix

* update ci test

* update alignment doc

* fix cross entropy loss bug

* update alignment test

* update test

* add llama peft alignment test to ci

* Fix values for unused params in incr_decoding

* Add PEFTModelID NO_ID singleton instead of None

* Fix PEFTModelID::NO_ID reference

* reduce logging

* fix

* fix

* Add peft demo

* Add readme for demo

* fix alignment issue

* Peft optimizer (#1290)

* add optimizer config, only allocate weights for training

* sgd 1

* sgd 2

* update

* fix

* linting

* .

* .

* fix

* fix allreduce bug

* update

* update

* add optimizer hook in hf

* update

* update script

* .

* fix

* fwd

* bwd

* start grads

* fix gradient misalignment!

* update

* Add support for llama3

* various fixes

---------

Co-authored-by: Remi Delacourt <[email protected]>

* Optimizers python interface (#1441)

* python interface for optimizer

* update lora linear config to support python interface

* update python interface

* finished lora python interface

* fix

* fix

* update

* update

* more fixes

* fix

* initialize lora weights where needed

* Add notebook

* Update demo to use dataset

* Fix

* Save weights after end of finetuning (#1446)

* support accumulation of gradients without update

* add code to save peft weights

* fix

* save configs

* cleanup

* Fully use notebook for demo

* Parameterize generation and finetuning configs

* Comment out inference for now

* fix bug in lora inference only mode

* fix

* Add finetuning or inference only flags

* fix

* fix

* fix

* PEFT model upload (#1450)

* upload test

* fix

* Make demo_class.py executable

* fix

* add base_model_name_or_path

* fix

* fix

* support llama-3 tokenizer

* print output tokens when not benchmarking

* Use Llama3 in demo_class

* Use Llama3 in demo

* fix data loading for llama-3

* Add download models to demo

* return/print loss at each finetuning step

* fix

* Adjust demo parameters

* Fix for finetuning

* pass finetuning losses to python interface

* Update demo

* Fix upload

* Refactor demo

* rename demo_class to demo

* fix

* remove epoch from loss print

* Finish demo

* fix test

* rocm fixes

* more rocm fixes

* fix rocm build

* docker fix

* fix inference test

* fix workflow

* fix makefile

* fix peft test

* fix all-reduce issue with lora for TP scenario

* fix bwd lm head

* fixes

* more fixes

* update

* fix alignment up to input ln

* finished aligning all backward (tp>1)

* align all peft

* fix

* fix broken link

* formatting

* fix

* update

* Revert "update"

This reverts commit 90b2c87.

* update

* fix hip build

* fix gpu ci

* fix gpu ci

* update default gpu ci version to 12.0

* update ci to 12.0

* fix

* fix

* update

* fix

* fix

* update

* fix

* add cleanup

* downgrade to cuda=11.8

---------

Co-authored-by: Gabriele Oliaro <[email protected]>
Co-authored-by: xinhaoc <[email protected]>
Co-authored-by: Xinhao Cheng <[email protected]>
Co-authored-by: april-yyt <[email protected]>
Co-authored-by: Remi <[email protected]>
Co-authored-by: Remi Delacourt <[email protected]>
Co-authored-by: Rémi Delacourt <[email protected]>
8 people authored Sep 4, 2024
1 parent 49523d6 commit a0f1ed7
Showing 285 changed files with 35,212 additions and 6,650 deletions.
12 changes: 7 additions & 5 deletions .github/workflows/build.yml
@@ -52,13 +52,14 @@ jobs:
run: .github/workflows/helpers/free_space_on_runner.sh

- name: Install CUDA
uses: Jimver/cuda-toolkit@v0.2.11
uses: Jimver/cuda-toolkit@v0.2.16
if: ${{ matrix.gpu_backend == 'cuda' }}
id: cuda-toolkit
with:
cuda: "11.8.0"
cuda: "12.1.1"
# Disable caching of the CUDA binaries, since it does not give us any significant performance improvement
use-github-cache: "false"
log-file-suffix: 'cmake_${{matrix.gpu_backend}}.txt'

- name: Install system dependencies
run: .github/workflows/helpers/install_dependencies.sh
@@ -156,11 +157,12 @@ jobs:
run: .github/workflows/helpers/free_space_on_runner.sh

- name: Install CUDA
uses: Jimver/cuda-toolkit@v0.2.11
uses: Jimver/cuda-toolkit@v0.2.16
id: cuda-toolkit
with:
cuda: "11.8.0"
cuda: "12.1.1"
use-github-cache: "false"
log-file-suffix: 'makefile_${{matrix.gpu_backend}}.txt'

- name: Install system dependencies
run: .github/workflows/helpers/install_dependencies.sh
@@ -169,7 +171,7 @@
uses: conda-incubator/setup-miniconda@v2
with:
activate-environment: flexflow
environment-file: conda/environment.yml
environment-file: conda/flexflow.yml
auto-activate-base: false

- name: Build FlexFlow
10 changes: 10 additions & 0 deletions .github/workflows/gpu-ci.yml
@@ -181,6 +181,16 @@ jobs:
../config/config.linux
make -j
- name: Run PEFT tests
run: |
export PATH=$CONDA_PREFIX/bin:$PATH
export CUDNN_DIR=/usr/local/cuda
export CUDA_DIR=/usr/local/cuda
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib
source ./build/set_python_envs.sh
./tests/peft_test.sh
- name: Run inference tests
env:
CPP_INFERENCE_TESTS: ${{ vars.CPP_INFERENCE_TESTS }}
23 changes: 20 additions & 3 deletions .github/workflows/helpers/install_cudnn.sh
@@ -5,8 +5,11 @@ set -x
# Cd into directory holding this script
cd "${BASH_SOURCE[0]%/*}"

ubuntu_version=$(lsb_release -rs)
ubuntu_version=${ubuntu_version//./}

# Install CUDNN
cuda_version=${1:-11.8.0}
cuda_version=${1:-12.1.1}
cuda_version=$(echo "${cuda_version}" | cut -f1,2 -d'.')
echo "Installing CUDNN for CUDA version: ${cuda_version} ..."
CUDNN_LINK=http://developer.download.nvidia.com/compute/redist/cudnn/v8.0.5/cudnn-11.1-linux-x64-v8.0.5.39.tgz
@@ -44,8 +47,11 @@ elif [[ "$cuda_version" == "11.7" ]]; then
elif [[ "$cuda_version" == "11.8" ]]; then
CUDNN_LINK=https://developer.download.nvidia.com/compute/redist/cudnn/v8.7.0/local_installers/11.8/cudnn-linux-x86_64-8.7.0.84_cuda11-archive.tar.xz
CUDNN_TARBALL_NAME=cudnn-linux-x86_64-8.7.0.84_cuda11-archive.tar.xz
elif [[ "$cuda_version" == "12.0" ]]; then
echo "CUDNN support for CUDA version 12.0 not yet added"
elif [[ "$cuda_version" == "12.0" || "$cuda_version" == "12.1" || "$cuda_version" == "12.2" || "$cuda_version" == "12.3" || "$cuda_version" == "12.4" || "$cuda_version" == "12.5" ]]; then
CUDNN_LINK=https://developer.download.nvidia.com/compute/redist/cudnn/v8.8.0/local_installers/12.0/cudnn-local-repo-ubuntu2004-8.8.0.121_1.0-1_amd64.deb
CUDNN_TARBALL_NAME=cudnn-local-repo-ubuntu2004-8.8.0.121_1.0-1_amd64.deb
else
echo "CUDNN support for CUDA version above 12.5 not yet added"
exit 1
fi
wget -c -q $CUDNN_LINK
@@ -55,6 +61,17 @@ if [[ "$cuda_version" == "11.6" || "$cuda_version" == "11.7" || "$cuda_version"
sudo cp -r "$CUDNN_EXTRACTED_TARBALL_NAME"/include/* /usr/local/include
sudo cp -r "$CUDNN_EXTRACTED_TARBALL_NAME"/lib/* /usr/local/lib
rm -rf "$CUDNN_EXTRACTED_TARBALL_NAME"
elif [[ "$CUDNN_TARBALL_NAME" == *.deb ]]; then
wget -c -q "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${ubuntu_version}/x86_64/cuda-keyring_1.1-1_all.deb"
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update -y
rm -f cuda-keyring_1.1-1_all.deb
sudo dpkg -i $CUDNN_TARBALL_NAME
sudo cp /var/cudnn-local-repo-ubuntu2004-8.8.0.121/cudnn-local-A9E17745-keyring.gpg /usr/share/keyrings/
sudo apt update -y
sudo apt install -y libcudnn8
sudo apt install -y libcudnn8-dev
sudo apt install -y libcudnn8-samples
else
sudo tar -xzf $CUDNN_TARBALL_NAME -C /usr/local
fi
8 changes: 4 additions & 4 deletions .github/workflows/helpers/install_nccl.sh
@@ -8,13 +8,13 @@ cd "${BASH_SOURCE[0]%/*}"
# Add NCCL key ring
ubuntu_version=$(lsb_release -rs)
ubuntu_version=${ubuntu_version//./}
wget "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${ubuntu_version}/x86_64/cuda-keyring_1.0-1_all.deb"
sudo dpkg -i cuda-keyring_1.0-1_all.deb
wget "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${ubuntu_version}/x86_64/cuda-keyring_1.1-1_all.deb"
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update -y
rm -f cuda-keyring_1.0-1_all.deb
rm -f cuda-keyring_1.1-1_all.deb

# Install NCCL
cuda_version=${1:-11.8.0}
cuda_version=${1:-12.1.1}
cuda_version=$(echo "${cuda_version}" | cut -f1,2 -d'.')
echo "Installing NCCL for CUDA version: ${cuda_version} ..."

6 changes: 3 additions & 3 deletions .github/workflows/multinode-test.yml
@@ -38,7 +38,7 @@ jobs:
# 10h timeout, instead of default of 360min (6h)
timeout-minutes: 600
container:
image: ghcr.io/flexflow/flexflow-environment-cuda-11.8:latest
image: ghcr.io/flexflow/flexflow-environment-cuda-12.0:latest
options: --gpus all --shm-size=8192m
steps:
- name: Install updated git version
@@ -87,7 +87,7 @@ jobs:
runs-on: self-hosted
needs: gpu-ci-concierge
container:
image: ghcr.io/flexflow/flexflow-environment-cuda-11.8:latest
image: ghcr.io/flexflow/flexflow-environment-cuda-12.0:latest
options: --gpus all --shm-size=8192m
# 10h timeout, instead of default of 360min (6h)
timeout-minutes: 600
@@ -138,7 +138,7 @@ jobs:
runs-on: self-hosted
needs: gpu-ci-concierge
container:
image: ghcr.io/flexflow/flexflow-environment-cuda-11.8:latest
image: ghcr.io/flexflow/flexflow-environment-cuda-12.0:latest
options: --gpus all --shm-size=8192m
steps:
- name: Install updated git version
4 changes: 2 additions & 2 deletions .github/workflows/pip-install.yml
@@ -44,10 +44,10 @@ jobs:
run: .github/workflows/helpers/free_space_on_runner.sh

- name: Install CUDA
uses: Jimver/cuda-toolkit@v0.2.11
uses: Jimver/cuda-toolkit@v0.2.16
id: cuda-toolkit
with:
cuda: "11.8.0"
cuda: "12.1.1"
# Disable caching of the CUDA binaries, since it does not give us any significant performance improvement
use-github-cache: "false"

4 changes: 2 additions & 2 deletions .github/workflows/prebuild-legion.yml
@@ -23,13 +23,13 @@ jobs:
strategy:
matrix:
gpu_backend: ["cuda", "hip_rocm"]
gpu_backend_version: ["11.8", "5.6"]
gpu_backend_version: ["12.0", "5.6"]
python_version: ["3.11"]
exclude:
- gpu_backend: "cuda"
gpu_backend_version: "5.6"
- gpu_backend: "hip_rocm"
gpu_backend_version: "11.8"
gpu_backend_version: "12.0"
fail-fast: false
steps:
- name: Checkout Git Repository
5 changes: 5 additions & 0 deletions .gitignore
@@ -187,4 +187,9 @@ gpt_tokenizer
python/flexflow/version.txt

inference_tensors
hf_peft_tensors
lora_training_logs

Untitled-1.ipynb
Untitled-2.ipynb
tests/inference/python_test_configs/*.json
1 change: 1 addition & 0 deletions CMakeLists.txt
@@ -567,6 +567,7 @@ if(NOT BUILD_LEGION_ONLY)
if(FF_BUILD_ALL_INFERENCE_EXAMPLES OR FF_BUILD_ALL_EXAMPLES)
add_subdirectory(inference/spec_infer)
add_subdirectory(inference/incr_decoding)
add_subdirectory(inference/peft)
endif()


7 changes: 7 additions & 0 deletions conda/flexflow.yml
@@ -25,3 +25,10 @@ dependencies:
- sentencepiece
- einops
- requests
- scipy
- bitsandbytes
- datasets
- accelerate
- loralib
- triton
- peft
2 changes: 1 addition & 1 deletion config/config.inc
@@ -197,7 +197,7 @@ fi

# set ROCM path
if [ -n "$ROCM_PATH" ]; then
SET_ROCM_PATH="-DROCM_PATH=${ROCM_PATH}"
SET_ROCM_PATH="-DROCM_PATH=${ROCM_PATH} -DHIP_ROOT_DIR=${ROCM_PATH}"
fi

ADD_ROCM_TO_PATH=""
9 changes: 4 additions & 5 deletions docker/build.sh
@@ -56,15 +56,14 @@ if [[ "${FF_GPU_BACKEND}" == "cuda" || "${FF_GPU_BACKEND}" == "hip_cuda" ]]; the
cuda_version_input=${cuda_version}.3
elif [[ "$cuda_version" == @(11.8) ]]; then
cuda_version_input=${cuda_version}.0
elif [[ "$cuda_version" == @(12.3|12.4|12.5|12.6|12.7|12.8|12.9) ]]; then
# Use CUDA 12.2 for all versions greater or equal to 12.2 for now (the Docker machine with CUDNN is not yet available)
cuda_version=12.2
cuda_version_input=${cuda_version}.2
else
echo "cuda_version is not supported, please choose among {11.1|11.2|11.3|11.4|11.5|11.6|11.7|11.8|12.0|12.1|12.2}"
exit 1
fi
# Use CUDA 12.2 for all versions greater or equal to 12.2 for now (the Docker machine with CUDNN is not yet available)
if [[ "$cuda_version" == @(12.3|12.4|12.5|12.6|12.7|12.8|12.9) ]]; then
cuda_version=12.2
cuda_version_input=${cuda_version}.2
fi
echo "Building $image docker image with CUDA $cuda_version"
ff_environment_base_image="nvidia/cuda:${cuda_version_input}-cudnn8-devel-ubuntu20.04"
gpu_backend_version="-${cuda_version}"
2 changes: 2 additions & 0 deletions docker/flexflow-environment/Dockerfile
@@ -94,6 +94,8 @@ RUN conda install -c conda-forge cmake make pillow cmake-build-extension pybind1
RUN conda install pytorch torchvision torchaudio -c pytorch
RUN conda install -c conda-forge onnx transformers>=4.31.0 sentencepiece einops
RUN pip3 install tensorflow notebook
# PEFT-related
RUN pip3 install scipy bitsandbytes datasets accelerate loralib triton peft

# Install Rust
RUN curl https://sh.rustup.rs -sSf | sh -s -- -y
2 changes: 1 addition & 1 deletion docker/run.sh
@@ -58,7 +58,7 @@ if [[ "${FF_GPU_BACKEND}" == "cuda" || "${FF_GPU_BACKEND}" == "hip_cuda" ]]; the
fi
fi
# Check that CUDA version is supported
if [[ "$cuda_version" != @(11.1|11.2|11.3|11.4|11.5|11.6|11.7|11.8|12.0|12.1|12.2) ]]; then
if [[ "$cuda_version" != @(11.1|11.2|11.3|11.4|11.5|11.6|11.7|11.8|12.0|12.1|12.2|12.3|12.4|12.5|12.6|12.7|12.8|12.9) ]]; then
echo "cuda_version is not supported, please choose among {11.1|11.2|11.3|11.4|11.5|11.6|11.7|11.8|12.0|12.1|12.2}"
exit 1
fi
42 changes: 38 additions & 4 deletions include/flexflow/batch_config.h
@@ -16,6 +16,7 @@
#pragma once

#include "flexflow/ffconst.h"
#include "flexflow/fftype.h"
#include "legion.h"
#include <cstddef>
#include <cstdlib>
@@ -36,13 +37,27 @@ using BeamSearchBatchConfigFuture = Legion::Future;
using TreeVerifyBatchConfigFuture = Legion::Future;
using BeamInferenceResultFuture = Legion::Future;

struct OptimizerTasks {
bool compute_gradients = true;
bool reset_gradients_to_zero = false;
bool update_weights = false;
bool save_updated_weights = false;
};

void set_optimizer_tasks(OptimizerTasks &tasks,
int max_training_steps,
int completed_training_steps,
int gradient_accumulation_steps);

class BatchConfig {
public:
using RequestGuid = size_t;
using TokenId = int;
BatchConfig();
int num_active_requests() const;
int num_active_tokens() const;
int num_active_infr_tokens() const;
int num_active_peft_tokens() const;
static int max_requests_per_batch();
static int max_tokens_per_batch();
static int max_verify_tokens_per_batch();
@@ -56,26 +71,43 @@ class BatchConfig {
// Maximum possible values for different parameters
// These maximum values are used for copying BatchConfig
// across workers
static int const MAX_NUM_REQUESTS = 64;
static int const MAX_NUM_REQUESTS = 65;
static int const MAX_NUM_TOKENS = 1024;
static int const MAX_SPEC_TREE_TOKEN_NUM = 64;

// Set by update
int num_tokens;

int num_tokens = 0, num_peft_tokens = 0, num_peft_label_tokens = 0;
// number of tokens in prompt phase, start offset of tokens in inc_decoding
// phase. num_tokens - num_prompt_tokens = num_generation_tokens;
int num_generation_tokens;
int num_generation_tokens = 0;

struct PerRequestInfo {
PerRequestInfo() {
first_token_depth_in_request = 0;
first_token_offset_in_batch = 0;
num_tokens_in_batch = 0;
max_sequence_length = 0;
request_guid = 0;
prompt_phase = false;
batch_config_request_id = -1;
peft_model_id = PEFTModelID::NO_ID;
peft_bwd = false;
optimizer_tasks = {true, false, false, false};
}
int first_token_depth_in_request;
int first_token_offset_in_batch;
int num_tokens_in_batch;
int max_sequence_length;

// request id in batch config:
int batch_config_request_id;
int batch_config_request_id = -1;
bool prompt_phase = false;
RequestGuid request_guid;
// PEFT fields
PEFTModelID peft_model_id;
bool peft_bwd;
OptimizerTasks optimizer_tasks;
};
struct PerTokenInfo {
int abs_depth_in_request;
@@ -102,6 +134,7 @@ class BatchConfig {
BitMask causalMask[MAX_NUM_REQUESTS];
PerRequestInfo requestsInfo[MAX_NUM_REQUESTS];
PerTokenInfo tokensInfo[MAX_NUM_TOKENS];
PerTokenInfo labelsInfo[MAX_NUM_TOKENS];

bool request_completed[MAX_NUM_REQUESTS];
bool request_running[MAX_NUM_REQUESTS];
@@ -129,6 +162,7 @@
struct InferenceResult {
static int const MAX_NUM_TOKENS = BatchConfig::MAX_NUM_TOKENS;
BatchConfig::TokenId token_ids[MAX_NUM_TOKENS];
float finetuning_loss;
};

class BeamSearchBatchConfig : public BatchConfig {
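The `OptimizerTasks` struct and the `set_optimizer_tasks` declaration added to `include/flexflow/batch_config.h` above determine which optimizer work runs for a finetuning request at each step; the function body is not shown in this hunk. The following is a minimal, hypothetical sketch (not the PR's actual definition) of how those flags could be derived, assuming they exist to support gradient accumulation: zero the accumulator at the start of each accumulation window, apply the weight update at the end of a window, and save the updated LoRA weights only after the last training step.

```cpp
// Hypothetical sketch only; the real implementation lives elsewhere in this PR.
#include <cassert>

struct OptimizerTasks {
  bool compute_gradients = true;
  bool reset_gradients_to_zero = false;
  bool update_weights = false;
  bool save_updated_weights = false;
};

void set_optimizer_tasks(OptimizerTasks &tasks,
                         int max_training_steps,
                         int completed_training_steps,
                         int gradient_accumulation_steps) {
  assert(gradient_accumulation_steps > 0);
  assert(completed_training_steps >= 0 &&
         completed_training_steps < max_training_steps);
  // Every backward pass produces gradients for the trainable (LoRA) weights.
  tasks.compute_gradients = true;
  // Zero the gradient accumulator at the start of each accumulation window.
  tasks.reset_gradients_to_zero =
      (completed_training_steps % gradient_accumulation_steps) == 0;
  // Apply the optimizer update only on the last step of a window.
  tasks.update_weights =
      ((completed_training_steps + 1) % gradient_accumulation_steps) == 0;
  // Persist the updated weights once, after the final training step.
  tasks.save_updated_weights =
      (completed_training_steps + 1) == max_training_steps;
}
```

Under this reading, a request with `gradient_accumulation_steps == 1` updates weights on every step, while larger values accumulate gradients across several batches before updating, consistent with the "support accumulation of gradients without update" and "Save weights after end of finetuning" commits listed above.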