PEFT support (inference/finetuning) (#1153)
* .

* .

* Update the default cublas behavior when CUDA_VERSION is not specified

* fix bugs in IncMHA peft_bwd kernel

* uncomment softmaxbackward

* add layernorm to align test

* add peft test scripts

* fix import

* fix

* add code to convert peft models

* add script to download peft for c++, fix bug

* fix

* add script to fine-tune models

* implement loading lora configs/weights from file

* remove peft_bwd assertion failure in embedding

* fix download script

* add peft dependencies in dockerfile

* fix softmax backward

* fix bc print indentation

* Temporarily Revert "Update the default cublas behavior when CUDA_VERSION is not specified"

This reverts commit 4ee710a.

* Fix cublas default (#1220)

* Fix Legion prebuild workflow (2) (#1208)

* fix

* fix

* fix

* fix

* Fix Legion prebuild workflow (3) (#1210)

* fix hip error

* use CUBLAS_COMPUTE_FAST_16F for full-precision gemm

---------

Co-authored-by: Zhihao Jia <[email protected]>

* fix bugs, work on align opt-lora

* update scripts

* add code to output peft tensors in hf

* update, fixes

* linting

* fix printing of tensors for numpy

* update save_inference_tensors_to_file

* linting

* update

* fix issue with save_inference_tensors_to_file

* fix layer names for save_inference_tensors_to_file

* fix peft

* fix bwd bugs

* linting

* fixes

* fix

* fix

* fix

* add bc fields for peft training

* linting

* fix

* remove ptr check

* fix

* implement save_operators for bwd

* fix bug

* implement save tensors for bwd

* .

* bug fix

* fix

* align linear

* fix

* bwd kernel updates

* undo use of CUBLAS_COMPUTE_32F_FAST_16F for now

* only send dataset entry once

* update peft test scripts

* loss

* .

* update generate/request api to take both inference and fine-tuning prompts

* linting

* alignment fixes in lora & linear layer

* alignment fix

* diagonal

* fix

* alignment fix ssm

* sigmoid-silu-multi now fully aligned

* rms norm kernel updates

* fix

* in-place residual rms

* bug fix and linting

* align backward of o_proj, attn_heads, qk_prods_softmax, and v_proj with huggingface

* cleanup

* finished all alignment fixes in attention backward kernel

* fix

* Update inc_multihead_self_attention.cu

* Update inc_multihead_self_attention.cu

* use grad to store peft in/output (#1241)

* use grad to store peft in/output

* format

* .

* format

* enable peft request

* several hacks for performance measurement; some of the changes should be reverted

* Update sigmoid_silu_multi.cu

* RoPE backward

* PEFT bug fixes and alignment (#1269)

* Revert "several hacks for performance measurement; some of the changes should be reverted"

This reverts commit b9c3926.

* backup

* backup

* updates

* update

* backup

* backup

* backup

* fix

* cleanup

* linting

* Fuse bias + relu in OPT (#1271)

* fuse bias and relu in opt

* fix

* fix

* fix

* fix

* Peft alignment & debugging tools (#1288)

* Revert "several hacks for performance measurement; some of the changes should be reverted"

This reverts commit b9c3926.

* backup

* backup

* updates

* update

* backup

* backup

* backup

* fix

* cleanup

* fix

* fix

* fix

* update

* simplify tensor names

* fix

* fixes and updates

* fixes

* fix

* cleanup

* .

* restore softmax

* cleanup

* update alignment scripts

* newline

* fix legion aliasing error

* fix warnings

* fix

* fix pipeline parallelism

* fix tp issue in combine op

* fix lora weight loading with tensor parallelism

* fixes, implement Combine::peft_bwd_task

* fix

* replicate peft bwd

* fixes

* fix

* fix combine and fwd-bwd pass dependencies

* fix replicate bwd

* fix

* let user control amount of peft memory

* only run peft_bwd if peft is enabled

* fix rms norm inference region reqs

* fix in-place fusion (part 1)

* fix inplace fusion (part 2)

* fix

* disable automatic inplace rms norm for now

* fix inf fusion inplace

* fix rest input grads for peft without inplace residuals

* fix

* fix

* fix residual rms

* fix

* fix

* enable inf debugging in fusion bwd

* hack to silence warning in fused bwd

* fix

* fix

* fix build

* fix

* fix

* add draft peft test

* Peft python interface (#1306)

* update script

* less model renaming

* fix

* fix

* fix

* backup

* .

* update

* .

* fixes

* fix

* fix build

* fix

* fix

* fix issues for downloading peft model

* solved issues for download peft model

* added printouts for debugging

* fix

* fix seg fault

* add test, separate peft script in cpp

* fix

* fixes

* fix

* update peft python interface

* update

* update

* update

* updates

* fix

* fixes

* fix

* fixes

---------

Co-authored-by: april-yyt <[email protected]>

* fix

* update

* fix

* fix to support prompts larger than max tokens per batch

* fixes to support benchmarking of finetuning throughput

* many upgrades and updates related to finetuning

* add ttft statistics

* add warmup phase

* add benchmarking code

* Add scripts for evaluation with Microsoft Azure trace (#1363)

* Add scripts for evaluation

* Add absolute request rate value

* Fix script for target arrival rate

* Fix cpp req rate benchmark

* update to use new dataset

* Fix infinite loop

* update

* add data

---------

Co-authored-by: Remi Delacourt <[email protected]>
Co-authored-by: Gabriele Oliaro <[email protected]>

* fix

* fix

* add peft tests to ci

* shellcheck

* fix

* fix python requirements

* fix

* fix

* update ci test

* update alignment doc

* fix cross entropy loss bug

* update alignment test

* update test

* add llama peft alignment test to ci

* Fix values for unused params in incr_decoding

* Add PEFTModelID NO_ID singleton instead of None

* Fix PEFTModelID::NO_ID reference

* reduce logging

* fix

* fix

* Add peft demo

* Add readme for demo

* fix alignment issue

* Peft optimizer (#1290)

* add optimizer config, only allocate weights for training

* sgd 1

* sgd 2

* update

* fix

* linting

* .

* .

* fix

* fix allreduce bug

* update

* update

* add optimizer hook in hf

* update

* update script

* .

* fix

* fwd

* bwd

* start grads

* fix gradient misalignment!

* update

* Add support for llama3

* various fixes

---------

Co-authored-by: Remi Delacourt <[email protected]>

* Optimizers python interface (#1441)

* python interface for optimizer

* update lora linear config to support python interface

* update python interface

* finished lora python interface

* fix

* fix

* update

* update

* more fixes

* fix

* initialize lora weights where needed

* Add notebook

* Update demo to use dataset

* Fix

* Save weights after end of finetuning (#1446)

* support accumulation of gradients without update

* add code to save peft weights

* fix

* save configs

* cleanup

* Fully use notebook for demo

* Parameterize generation and finetuning configs

* Comment out inference for now

* fix bug in lora inference only mode

* fix

* Add finetuning or inference only flags

* fix

* fix

* fix

* PEFT model upload (#1450)

* upload test

* fix

* Make demo_class.py executable

* fix

* add base_model_name_or_path

* fix

* fix

* support llama-3 tokenizer

* print output tokens when not benchmarking

* Use Llama3 in demo_class

* Use Llama3 in demo

* fix data loading for llama-3

* Add download models to demo

* return/print loss at each finetuning step

* fix

* Adjust demo parameters

* Fix for finetuning

* pass finetuning losses to python interface

* Update demo

* Fix upload

* Refactor demo

* rename demo_class to demo

* fix

* remove epoch from loss print

* Finish demo

* fix test

* rocm fixes

* more rocm fixes

* fix rocm build

* docker fix

* fix inference test

* fix workflow

* fix makefile

* fix peft test

* fix all-reduce issue with lora for TP scenario

* fix bwd lm head

* fixes

* more fixes

* update

* fix alignment up to input ln

* finished aligning all backward (tp>1)

* align all peft

* fix

* fix broken link

* formatting

* fix

* update

* Revert "update"

This reverts commit 90b2c87.

* update

* fix hip build

* fix gpu ci

* fix gpu ci

* update default gpu ci version to 12.0

* update ci to 12.0

* fix

* fix

* update

* fix

* fix

* update

* fix

* add cleanup

* downgrade to cuda=11.8

---------

Co-authored-by: Gabriele Oliaro <[email protected]>
Co-authored-by: xinhaoc <[email protected]>
Co-authored-by: Xinhao Cheng <[email protected]>
Co-authored-by: april-yyt <[email protected]>
Co-authored-by: Remi <[email protected]>
Co-authored-by: Remi Delacourt <[email protected]>
Co-authored-by: Rémi Delacourt <[email protected]>
8 people authored Sep 4, 2024
1 parent 49523d6 commit a0f1ed7
Showing 285 changed files with 35,212 additions and 6,650 deletions.
12 changes: 7 additions & 5 deletions .github/workflows/build.yml
@@ -52,13 +52,14 @@ jobs:
run: .github/workflows/helpers/free_space_on_runner.sh

- name: Install CUDA
uses: Jimver/cuda-toolkit@v0.2.11
uses: Jimver/cuda-toolkit@v0.2.16
if: ${{ matrix.gpu_backend == 'cuda' }}
id: cuda-toolkit
with:
cuda: "11.8.0"
cuda: "12.1.1"
# Disable caching of the CUDA binaries, since it does not give us any significant performance improvement
use-github-cache: "false"
log-file-suffix: 'cmake_${{matrix.gpu_backend}}.txt'

- name: Install system dependencies
run: .github/workflows/helpers/install_dependencies.sh
@@ -156,11 +157,12 @@ jobs:
run: .github/workflows/helpers/free_space_on_runner.sh

- name: Install CUDA
uses: Jimver/cuda-toolkit@v0.2.11
uses: Jimver/cuda-toolkit@v0.2.16
id: cuda-toolkit
with:
cuda: "11.8.0"
cuda: "12.1.1"
use-github-cache: "false"
log-file-suffix: 'makefile_${{matrix.gpu_backend}}.txt'

- name: Install system dependencies
run: .github/workflows/helpers/install_dependencies.sh
@@ -169,7 +171,7 @@
uses: conda-incubator/setup-miniconda@v2
with:
activate-environment: flexflow
environment-file: conda/environment.yml
environment-file: conda/flexflow.yml
auto-activate-base: false

- name: Build FlexFlow
10 changes: 10 additions & 0 deletions .github/workflows/gpu-ci.yml
@@ -181,6 +181,16 @@ jobs:
../config/config.linux
make -j
- name: Run PEFT tests
run: |
export PATH=$CONDA_PREFIX/bin:$PATH
export CUDNN_DIR=/usr/local/cuda
export CUDA_DIR=/usr/local/cuda
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib
source ./build/set_python_envs.sh
./tests/peft_test.sh
- name: Run inference tests
env:
CPP_INFERENCE_TESTS: ${{ vars.CPP_INFERENCE_TESTS }}
23 changes: 20 additions & 3 deletions .github/workflows/helpers/install_cudnn.sh
@@ -5,8 +5,11 @@ set -x
# Cd into directory holding this script
cd "${BASH_SOURCE[0]%/*}"

ubuntu_version=$(lsb_release -rs)
ubuntu_version=${ubuntu_version//./}

# Install CUDNN
cuda_version=${1:-11.8.0}
cuda_version=${1:-12.1.1}
cuda_version=$(echo "${cuda_version}" | cut -f1,2 -d'.')
echo "Installing CUDNN for CUDA version: ${cuda_version} ..."
CUDNN_LINK=http://developer.download.nvidia.com/compute/redist/cudnn/v8.0.5/cudnn-11.1-linux-x64-v8.0.5.39.tgz
@@ -44,8 +47,11 @@ elif [[ "$cuda_version" == "11.7" ]]; then
elif [[ "$cuda_version" == "11.8" ]]; then
CUDNN_LINK=https://developer.download.nvidia.com/compute/redist/cudnn/v8.7.0/local_installers/11.8/cudnn-linux-x86_64-8.7.0.84_cuda11-archive.tar.xz
CUDNN_TARBALL_NAME=cudnn-linux-x86_64-8.7.0.84_cuda11-archive.tar.xz
elif [[ "$cuda_version" == "12.0" ]]; then
echo "CUDNN support for CUDA version 12.0 not yet added"
elif [[ "$cuda_version" == "12.0" || "$cuda_version" == "12.1" || "$cuda_version" == "12.2" || "$cuda_version" == "12.3" || "$cuda_version" == "12.4" || "$cuda_version" == "12.5" ]]; then
CUDNN_LINK=https://developer.download.nvidia.com/compute/redist/cudnn/v8.8.0/local_installers/12.0/cudnn-local-repo-ubuntu2004-8.8.0.121_1.0-1_amd64.deb
CUDNN_TARBALL_NAME=cudnn-local-repo-ubuntu2004-8.8.0.121_1.0-1_amd64.deb
else
echo "CUDNN support for CUDA version above 12.5 not yet added"
exit 1
fi
wget -c -q $CUDNN_LINK
@@ -55,6 +61,17 @@ if [[ "$cuda_version" == "11.6" || "$cuda_version" == "11.7" || "$cuda_version"
sudo cp -r "$CUDNN_EXTRACTED_TARBALL_NAME"/include/* /usr/local/include
sudo cp -r "$CUDNN_EXTRACTED_TARBALL_NAME"/lib/* /usr/local/lib
rm -rf "$CUDNN_EXTRACTED_TARBALL_NAME"
elif [[ "$CUDNN_TARBALL_NAME" == *.deb ]]; then
wget -c -q "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${ubuntu_version}/x86_64/cuda-keyring_1.1-1_all.deb"
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update -y
rm -f cuda-keyring_1.1-1_all.deb
sudo dpkg -i $CUDNN_TARBALL_NAME
sudo cp /var/cudnn-local-repo-ubuntu2004-8.8.0.121/cudnn-local-A9E17745-keyring.gpg /usr/share/keyrings/
sudo apt update -y
sudo apt install -y libcudnn8
sudo apt install -y libcudnn8-dev
sudo apt install -y libcudnn8-samples
else
sudo tar -xzf $CUDNN_TARBALL_NAME -C /usr/local
fi
8 changes: 4 additions & 4 deletions .github/workflows/helpers/install_nccl.sh
@@ -8,13 +8,13 @@ cd "${BASH_SOURCE[0]%/*}"
# Add NCCL key ring
ubuntu_version=$(lsb_release -rs)
ubuntu_version=${ubuntu_version//./}
wget "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${ubuntu_version}/x86_64/cuda-keyring_1.0-1_all.deb"
sudo dpkg -i cuda-keyring_1.0-1_all.deb
wget "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${ubuntu_version}/x86_64/cuda-keyring_1.1-1_all.deb"
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update -y
rm -f cuda-keyring_1.0-1_all.deb
rm -f cuda-keyring_1.1-1_all.deb

# Install NCCL
cuda_version=${1:-11.8.0}
cuda_version=${1:-12.1.1}
cuda_version=$(echo "${cuda_version}" | cut -f1,2 -d'.')
echo "Installing NCCL for CUDA version: ${cuda_version} ..."

6 changes: 3 additions & 3 deletions .github/workflows/multinode-test.yml
@@ -38,7 +38,7 @@ jobs:
# 10h timeout, instead of default of 360min (6h)
timeout-minutes: 600
container:
image: ghcr.io/flexflow/flexflow-environment-cuda-11.8:latest
image: ghcr.io/flexflow/flexflow-environment-cuda-12.0:latest
options: --gpus all --shm-size=8192m
steps:
- name: Install updated git version
@@ -87,7 +87,7 @@ jobs:
runs-on: self-hosted
needs: gpu-ci-concierge
container:
image: ghcr.io/flexflow/flexflow-environment-cuda-11.8:latest
image: ghcr.io/flexflow/flexflow-environment-cuda-12.0:latest
options: --gpus all --shm-size=8192m
# 10h timeout, instead of default of 360min (6h)
timeout-minutes: 600
@@ -138,7 +138,7 @@ jobs:
runs-on: self-hosted
needs: gpu-ci-concierge
container:
image: ghcr.io/flexflow/flexflow-environment-cuda-11.8:latest
image: ghcr.io/flexflow/flexflow-environment-cuda-12.0:latest
options: --gpus all --shm-size=8192m
steps:
- name: Install updated git version
4 changes: 2 additions & 2 deletions .github/workflows/pip-install.yml
@@ -44,10 +44,10 @@ jobs:
run: .github/workflows/helpers/free_space_on_runner.sh

- name: Install CUDA
uses: Jimver/cuda-toolkit@v0.2.11
uses: Jimver/cuda-toolkit@v0.2.16
id: cuda-toolkit
with:
cuda: "11.8.0"
cuda: "12.1.1"
# Disable caching of the CUDA binaries, since it does not give us any significant performance improvement
use-github-cache: "false"

4 changes: 2 additions & 2 deletions .github/workflows/prebuild-legion.yml
@@ -23,13 +23,13 @@ jobs:
strategy:
matrix:
gpu_backend: ["cuda", "hip_rocm"]
gpu_backend_version: ["11.8", "5.6"]
gpu_backend_version: ["12.0", "5.6"]
python_version: ["3.11"]
exclude:
- gpu_backend: "cuda"
gpu_backend_version: "5.6"
- gpu_backend: "hip_rocm"
gpu_backend_version: "11.8"
gpu_backend_version: "12.0"
fail-fast: false
steps:
- name: Checkout Git Repository
5 changes: 5 additions & 0 deletions .gitignore
@@ -187,4 +187,9 @@ gpt_tokenizer
python/flexflow/version.txt

inference_tensors
hf_peft_tensors
lora_training_logs

Untitled-1.ipynb
Untitled-2.ipynb
tests/inference/python_test_configs/*.json
1 change: 1 addition & 0 deletions CMakeLists.txt
@@ -567,6 +567,7 @@ if(NOT BUILD_LEGION_ONLY)
if(FF_BUILD_ALL_INFERENCE_EXAMPLES OR FF_BUILD_ALL_EXAMPLES)
add_subdirectory(inference/spec_infer)
add_subdirectory(inference/incr_decoding)
add_subdirectory(inference/peft)
endif()


7 changes: 7 additions & 0 deletions conda/flexflow.yml
@@ -25,3 +25,10 @@ dependencies:
- sentencepiece
- einops
- requests
- scipy
- bitsandbytes
- datasets
- accelerate
- loralib
- triton
- peft
2 changes: 1 addition & 1 deletion config/config.inc
@@ -197,7 +197,7 @@ fi

# set ROCM path
if [ -n "$ROCM_PATH" ]; then
SET_ROCM_PATH="-DROCM_PATH=${ROCM_PATH}"
SET_ROCM_PATH="-DROCM_PATH=${ROCM_PATH} -DHIP_ROOT_DIR=${ROCM_PATH}"
fi

ADD_ROCM_TO_PATH=""
9 changes: 4 additions & 5 deletions docker/build.sh
@@ -56,15 +56,14 @@ if [[ "${FF_GPU_BACKEND}" == "cuda" || "${FF_GPU_BACKEND}" == "hip_cuda" ]]; the
cuda_version_input=${cuda_version}.3
elif [[ "$cuda_version" == @(11.8) ]]; then
cuda_version_input=${cuda_version}.0
elif [[ "$cuda_version" == @(12.3|12.4|12.5|12.6|12.7|12.8|12.9) ]]; then
# Use CUDA 12.2 for all versions greater or equal to 12.2 for now (the Docker machine with CUDNN is not yet available)
cuda_version=12.2
cuda_version_input=${cuda_version}.2
else
echo "cuda_version is not supported, please choose among {11.1|11.2|11.3|11.4|11.5|11.6|11.7|11.8|12.0|12.1|12.2}"
exit 1
fi
# Use CUDA 12.2 for all versions greater or equal to 12.2 for now (the Docker machine with CUDNN is not yet available)
if [[ "$cuda_version" == @(12.3|12.4|12.5|12.6|12.7|12.8|12.9) ]]; then
cuda_version=12.2
cuda_version_input=${cuda_version}.2
fi
echo "Building $image docker image with CUDA $cuda_version"
ff_environment_base_image="nvidia/cuda:${cuda_version_input}-cudnn8-devel-ubuntu20.04"
gpu_backend_version="-${cuda_version}"
2 changes: 2 additions & 0 deletions docker/flexflow-environment/Dockerfile
@@ -94,6 +94,8 @@ RUN conda install -c conda-forge cmake make pillow cmake-build-extension pybind1
RUN conda install pytorch torchvision torchaudio -c pytorch
RUN conda install -c conda-forge onnx transformers>=4.31.0 sentencepiece einops
RUN pip3 install tensorflow notebook
# PEFT-related
RUN pip3 install scipy bitsandbytes datasets accelerate loralib triton peft

# Install Rust
RUN curl https://sh.rustup.rs -sSf | sh -s -- -y
2 changes: 1 addition & 1 deletion docker/run.sh
@@ -58,7 +58,7 @@ if [[ "${FF_GPU_BACKEND}" == "cuda" || "${FF_GPU_BACKEND}" == "hip_cuda" ]]; the
fi
fi
# Check that CUDA version is supported
if [[ "$cuda_version" != @(11.1|11.2|11.3|11.4|11.5|11.6|11.7|11.8|12.0|12.1|12.2) ]]; then
if [[ "$cuda_version" != @(11.1|11.2|11.3|11.4|11.5|11.6|11.7|11.8|12.0|12.1|12.2|12.3|12.4|12.5|12.6|12.7|12.8|12.9) ]]; then
echo "cuda_version is not supported, please choose among {11.1|11.2|11.3|11.4|11.5|11.6|11.7|11.8|12.0|12.1|12.2}"
exit 1
fi
42 changes: 38 additions & 4 deletions include/flexflow/batch_config.h
@@ -16,6 +16,7 @@
#pragma once

#include "flexflow/ffconst.h"
#include "flexflow/fftype.h"
#include "legion.h"
#include <cstddef>
#include <cstdlib>
@@ -36,13 +37,27 @@ using BeamSearchBatchConfigFuture = Legion::Future;
using TreeVerifyBatchConfigFuture = Legion::Future;
using BeamInferenceResultFuture = Legion::Future;

struct OptimizerTasks {
bool compute_gradients = true;
bool reset_gradients_to_zero = false;
bool update_weights = false;
bool save_updated_weights = false;
};

void set_optimizer_tasks(OptimizerTasks &tasks,
int max_training_steps,
int completed_training_steps,
int gradient_accumulation_steps);

class BatchConfig {
public:
using RequestGuid = size_t;
using TokenId = int;
BatchConfig();
int num_active_requests() const;
int num_active_tokens() const;
int num_active_infr_tokens() const;
int num_active_peft_tokens() const;
static int max_requests_per_batch();
static int max_tokens_per_batch();
static int max_verify_tokens_per_batch();
@@ -56,26 +71,43 @@ class BatchConfig {
// Maximum possible values for different parameters
// These maximum values are used for copying BatchConfig
// across workers
static int const MAX_NUM_REQUESTS = 64;
static int const MAX_NUM_REQUESTS = 65;
static int const MAX_NUM_TOKENS = 1024;
static int const MAX_SPEC_TREE_TOKEN_NUM = 64;

// Set by update
int num_tokens;

int num_tokens = 0, num_peft_tokens = 0, num_peft_label_tokens = 0;
// number of tokens in prompt phase, start offset of tokens in inc_decoding
// phase. num_tokens - num_prompt_tokens = num_generation_tokens;
int num_generation_tokens;
int num_generation_tokens = 0;

struct PerRequestInfo {
PerRequestInfo() {
first_token_depth_in_request = 0;
first_token_offset_in_batch = 0;
num_tokens_in_batch = 0;
max_sequence_length = 0;
request_guid = 0;
prompt_phase = false;
batch_config_request_id = -1;
peft_model_id = PEFTModelID::NO_ID;
peft_bwd = false;
optimizer_tasks = {true, false, false, false};
}
int first_token_depth_in_request;
int first_token_offset_in_batch;
int num_tokens_in_batch;
int max_sequence_length;

// request id in batch config:
int batch_config_request_id;
int batch_config_request_id = -1;
bool prompt_phase = false;
RequestGuid request_guid;
// PEFT fields
PEFTModelID peft_model_id;
bool peft_bwd;
OptimizerTasks optimizer_tasks;
};
struct PerTokenInfo {
int abs_depth_in_request;
@@ -102,6 +134,7 @@ class BatchConfig {
BitMask causalMask[MAX_NUM_REQUESTS];
PerRequestInfo requestsInfo[MAX_NUM_REQUESTS];
PerTokenInfo tokensInfo[MAX_NUM_TOKENS];
PerTokenInfo labelsInfo[MAX_NUM_TOKENS];

bool request_completed[MAX_NUM_REQUESTS];
bool request_running[MAX_NUM_REQUESTS];
@@ -129,6 +162,7 @@
struct InferenceResult {
static int const MAX_NUM_TOKENS = BatchConfig::MAX_NUM_TOKENS;
BatchConfig::TokenId token_ids[MAX_NUM_TOKENS];
float finetuning_loss;
};

class BeamSearchBatchConfig : public BatchConfig {
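The `OptimizerTasks` struct and the `set_optimizer_tasks` declaration added to `include/flexflow/batch_config.h` above determine which optimizer work runs for a finetuning request at each step; the function body is not shown in this hunk. The following is a minimal, hypothetical sketch (not the PR's actual definition) of how those flags could be derived, assuming they exist to support gradient accumulation: zero the accumulator at the start of each accumulation window, apply the weight update at the end of a window, and save the updated LoRA weights only after the last training step.

```cpp
// Hypothetical sketch only; the real implementation lives elsewhere in this PR.
#include <cassert>

struct OptimizerTasks {
  bool compute_gradients = true;
  bool reset_gradients_to_zero = false;
  bool update_weights = false;
  bool save_updated_weights = false;
};

void set_optimizer_tasks(OptimizerTasks &tasks,
                         int max_training_steps,
                         int completed_training_steps,
                         int gradient_accumulation_steps) {
  assert(gradient_accumulation_steps > 0);
  assert(completed_training_steps >= 0 &&
         completed_training_steps < max_training_steps);
  // Every backward pass produces gradients for the trainable (LoRA) weights.
  tasks.compute_gradients = true;
  // Zero the gradient accumulator at the start of each accumulation window.
  tasks.reset_gradients_to_zero =
      (completed_training_steps % gradient_accumulation_steps) == 0;
  // Apply the optimizer update only on the last step of a window.
  tasks.update_weights =
      ((completed_training_steps + 1) % gradient_accumulation_steps) == 0;
  // Persist the updated weights once, after the final training step.
  tasks.save_updated_weights =
      (completed_training_steps + 1) == max_training_steps;
}
```

Under this reading, a request with `gradient_accumulation_steps == 1` updates weights on every step, while larger values accumulate gradients across several batches before updating, consistent with the "support accumulation of gradients without update" and "Save weights after end of finetuning" commits listed above.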