Update TensorRT-LLM (NVIDIA#2094)
* Update TensorRT-LLM

---------

Co-authored-by: akhoroshev <[email protected]>
Co-authored-by: Fabian Joswig <[email protected]>
Co-authored-by: Tayef Shah <[email protected]>
Co-authored-by: lfz941 <[email protected]>
5 people authored Aug 7, 2024
1 parent a681853 commit be9cd71
Showing 1,916 changed files with 9,927,840 additions and 9,296,820 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -46,5 +46,5 @@ repos:
args:
- --skip=".git,3rdparty"
- --exclude-file=examples/whisper/tokenizer.py
- --ignore-words-list=rouge,inout,atleast,strat,nd,subtile
- --ignore-words-list=rouge,inout,atleast,strat,nd,subtile,thrid
exclude: 'tests/llm-test-defs/turtle/test_input_files'
16 changes: 10 additions & 6 deletions README.md
@@ -17,14 +17,18 @@ TensorRT-LLM
<div align="left">

## Latest News
* [2024/07/30] Introducing🍊 @SliceXAI ELM Turbo 🤖 train ELM once ⚡ #TensorRT #LLM optimize ☁️ deploy anywhere
[➡️ link](https://developer.nvidia.com/blog/supercharging-llama-3-1-across-nvidia-platforms)
<div align="center">
<img src="docs/source/media/picture-07-30-2024.png" width="70%">
<div align="left">

* [2024/07/23] 👀 @AIatMeta Llama 3.1 405B trained on 16K NVIDIA H100s - inference is #TensorRT #LLM optimized ⚡
🦙 400 tok/s - per node
🦙 37 tok/s - per user
🦙 1 node inference
➡️ [link](https://developer.nvidia.com/blog/supercharging-llama-3-1-across-nvidia-platforms)
<div align="center">
<img src="docs/source/media/picture-07-23-2024.png" width="45%">
<div align="left">
[➡️ link](https://developer.nvidia.com/blog/supercharging-llama-3-1-across-nvidia-platforms)


* [2024/07/09] Checklist to maximize multi-language performance of @meta #Llama3 with #TensorRT #LLM inference:
✅ MultiLingual
@@ -88,9 +92,9 @@ for integration with the
a production-quality system to serve LLMs. Models built with TensorRT-LLM can
be executed on a wide range of configurations going from a single GPU to
multiple nodes with multiple GPUs (using
[Tensor Parallelism](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/nemo_megatron/parallelisms.html#tensor-parallelism)
[Tensor Parallelism](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/features/parallelisms.html#tensor-parallelism)
and/or
[Pipeline Parallelism](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/nemo_megatron/parallelisms.html#pipeline-parallelism)).
[Pipeline Parallelism](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/features/parallelisms.html#pipeline-parallelism)).

The TensorRT-LLM Python API architecture looks similar to the
[PyTorch](https://pytorch.org) API. It provides a
25 changes: 20 additions & 5 deletions benchmarks/cpp/README.md
@@ -20,9 +20,11 @@ instead, and be sure to set DLL paths as specified in

#### Prepare dataset

Run a preprocessing script to prepare/generate dataset into a json that gptManagerBenchmark can consume later. The processed output json has *input tokens length, input token ids and output tokens length*.
Run a preprocessing script to prepare/generate a dataset into a JSON file that `gptManagerBenchmark` can consume later. The processed output JSON contains the *input token lengths, input token IDs, and output token lengths*.

This tool can be used in 2 different modes of traffic generation.
For `tokenizer`, either the path to a locally downloaded tokenizer or the name of a tokenizer on HuggingFace, such as `meta-llama/Llama-2-7b`, will work. In the latter case the tokenizer is downloaded automatically.
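As a rough sketch (assuming tokenizer resolution behaves like Hugging Face's `AutoTokenizer.from_pretrained`; the local path below is only an example), both forms look like this:

```
from transformers import AutoTokenizer

# Hypothetical local path to a tokenizer that was downloaded beforehand.
local_tokenizer = AutoTokenizer.from_pretrained("/path/to/tokenizer")

# HuggingFace model name; the tokenizer files are fetched automatically.
hub_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b")
```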

This tool can be used in 3 different modes of traffic generation: `dataset`, `token-norm-dist` and `token-unif-dist`.

##### 1 – Dataset

@@ -63,8 +65,8 @@ python3 prepare_dataset.py \

##### 2 – Normal token length distribution

This mode allows the user to generate normal token length distributions with a mean and std deviation specified.
For example, setting mean=100 and std dev=10 would generate requests where 95.4% of values are in <80,120> range following the normal probability distribution. Setting std dev=0 will generate all requests with the same mean number of tokens.
This mode allows the user to generate normally distributed token lengths with a mean and std deviation specified.
For example, setting `mean=100` and `stdev=10` would generate requests where 95.4% of values fall in the [80, 120] range, following the normal probability distribution. Setting `stdev=0` will generate all requests with the same mean number of tokens.

```
python prepare_dataset.py \
@@ -76,7 +78,20 @@ python prepare_dataset.py \
--output-mean 15 --output-stdev 0
```
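For intuition, the sketch below mirrors the `get_norm_dist_lengths` helper in `benchmarks/cpp/utils/utils.py` (see the diff of that file further down); the numbers are illustrative only:

```
import math
import numpy as np

def sample_norm_lengths(mean, stdev, num_reqs, seed=0):
    # Draw normally distributed lengths and clamp each one to at least 1 token.
    np.random.seed(seed)
    lengths = np.random.normal(loc=mean, scale=stdev, size=num_reqs)
    return [max(1, math.ceil(x)) for x in lengths]

lengths = sample_norm_lengths(mean=100, stdev=10, num_reqs=10000)
# About 95.4% of samples fall within mean +/- 2*stdev, i.e. the [80, 120] range.
print(sum(80 <= n <= 120 for n in lengths) / len(lengths))
```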

For `tokenizer`, specifying the path to the local tokenizer that have already been downloaded, or simply the name of the tokenizer from HuggingFace like `meta-llama/Llama-2-7b` will both work. The tokenizer will be downloaded automatically for the latter case.
##### 3 – Uniform token length distribution

This mode allows the user to generate uniformly distributed token lengths with min and max lengths specified.
For example, setting `min=50` and `max=100` would generate requests whose lengths fall in the range `[50, 100]`, following the uniform probability distribution. Setting `min=x` and `max=x` will generate all requests with the same number of tokens `x`.

```
python prepare_dataset.py \
--output token-unif-dist.json \
--tokenizer <path/to/tokenizer> \
token-unif-dist \
--num-requests 100 \
--input-min 50 --input-max 100 \
--output-min 10 --output-max 15
```
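The sampling behind this mode is a plain inclusive uniform draw; a minimal sketch matching the `get_unif_dist_lengths` helper added in `benchmarks/cpp/utils/utils.py`:

```
import numpy as np

def sample_unif_lengths(min_len, max_len, num_reqs, seed=0):
    # high is exclusive in rng.integers, so pass max_len + 1 to include max_len.
    rng = np.random.default_rng(seed)
    return rng.integers(low=min_len, high=max_len + 1, size=num_reqs).tolist()

print(sample_unif_lengths(50, 100, 5))  # five lengths in [50, 100]
print(sample_unif_lengths(60, 60, 3))   # min == max gives [60, 60, 60]
```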


#### Prepare TensorRT-LLM engines
14 changes: 12 additions & 2 deletions benchmarks/cpp/gptManagerBenchmark.cpp
@@ -157,6 +157,7 @@ struct BenchmarkParams
std::optional<int> maxAttentionWindow{std::nullopt};
std::optional<int> sinkTokenLength{std::nullopt};
bool multiBlockMode{false};
bool enableContextFMHAFP32Acc{false};

// lora / peft params
std::optional<std::string> loraDir{std::nullopt};
@@ -806,6 +807,8 @@ class ExecutorServer
benchmarkParams.kvHostCacheSize, benchmarkParams.kvOnboardBlocks);
texec::PeftCacheConfig peftCacheConfig(0, benchmarkParams.loraDeviceNumModLayers, 8, 64, 4, 4, 4, 24, 8,
std::nullopt, benchmarkParams.loraHostCacheSize);
texec::ExtendedRuntimePerfKnobConfig extendedRuntimePerfKnobConfig(
benchmarkParams.multiBlockMode, benchmarkParams.enableContextFMHAFP32Acc);
texec::ExecutorConfig executorConfig(
maxBeamWidth, schedulerConfig, kvCacheConfig, benchmarkParams.enableChunkedContext, true);
executorConfig.setGpuWeightsPercent(benchmarkParams.gpuWeightsPercent);
@@ -824,7 +827,7 @@ class ExecutorServer
executorConfig.setDecodingConfig(texec::DecodingConfig(
benchmarkParams.medusaChoices.has_value() ? texec::DecodingMode::Medusa() : texec::DecodingMode::Auto(),
std::nullopt, benchmarkParams.medusaChoices));
executorConfig.setMultiBlockMode(benchmarkParams.multiBlockMode);
executorConfig.setExtendedRuntimePerfKnobConfig(extendedRuntimePerfKnobConfig);

if (executorModelType == texec::ModelType::kDECODER_ONLY)
{
@@ -1429,7 +1432,8 @@ void benchmarkGptManager(std::filesystem::path const& engineDir, TrtGptModelType
optionalParams.decodingConfig = texec::DecodingConfig(
benchmarkParams.medusaChoices.has_value() ? texec::DecodingMode::Medusa() : texec::DecodingMode::Auto(),
std::nullopt, benchmarkParams.medusaChoices);
optionalParams.multiBlockMode = benchmarkParams.multiBlockMode;
optionalParams.extendedRuntimePerfKnobConfig = texec::ExtendedRuntimePerfKnobConfig(
benchmarkParams.multiBlockMode, benchmarkParams.enableContextFMHAFP32Acc);

auto const jsonConfig = GptJsonConfig::parse(engineDir / "config.json");
auto const worldConfig = WorldConfig::mpi(jsonConfig.getGpusPerNode(), jsonConfig.getTensorParallelism(),
@@ -1891,6 +1895,9 @@ int main(int argc, char* argv[])
options.add_options()(
"encoder_engine_dir", "Directory that store the engines of the encoder models.", cxxopts::value<std::string>());

options.add_options()("enable_context_fmha_fp32_acc", "Enable FMHA runner FP32 accumulation",
cxxopts::value<bool>()->default_value("false"));

auto result = options.parse(argc, argv);

if (result.count("help"))
@@ -2051,6 +2058,9 @@ int main(int argc, char* argv[])
// Argument: multi_block_mode
benchmarkParams.multiBlockMode = result["multi_block_mode"].as<bool>();

// Argument: enable_context_fmha_fp32_acc
benchmarkParams.enableContextFMHAFP32Acc = result["enable_context_fmha_fp32_acc"].as<bool>();

std::optional<TokenIdType> padId;
// Argument: Padding token id
if (result.count("pad_id"))
3 changes: 2 additions & 1 deletion benchmarks/cpp/prepare_dataset.py
@@ -21,7 +21,7 @@
from transformers.tokenization_utils import PreTrainedTokenizer
from transformers.tokenization_utils_fast import PreTrainedTokenizerFast
from utils.prepare_real_data import dataset
from utils.prepare_synthetic_data import token_norm_dist
from utils.prepare_synthetic_data import token_norm_dist, token_unif_dist


class RootArgs(BaseModel):
@@ -97,6 +97,7 @@ def cli(ctx, **kwargs):

cli.add_command(dataset)
cli.add_command(token_norm_dist)
cli.add_command(token_unif_dist)

if __name__ == "__main__":
cli()
6 changes: 3 additions & 3 deletions benchmarks/cpp/utils/prepare_real_data.py
@@ -6,7 +6,7 @@
import click
from datasets import load_dataset
from pydantic import BaseModel, model_validator
from utils.utils import dataset_dump, get_norm_dist_tokens, print_dataset
from utils.utils import dataset_dump, get_norm_dist_lengths, print_dataset


def validate_output_len_dist(ctx, param, value):
@@ -214,8 +214,8 @@ def dataset(root_args, **kwargs):
# output if randomized
if kwargs['output_len_dist'] is not None:
osl_mean, osl_stdev = kwargs['output_len_dist']
output_lens = get_norm_dist_tokens(osl_mean, osl_stdev, len(input_ids),
root_args.random_seed)
output_lens = get_norm_dist_lengths(osl_mean, osl_stdev, len(input_ids),
root_args.random_seed)

logging.debug(f"Input lengths: {[len(i) for i in input_ids]}")
logging.debug(f"Output lengths: {output_lens}")
90 changes: 80 additions & 10 deletions benchmarks/cpp/utils/prepare_synthetic_data.py
@@ -1,8 +1,8 @@
import random

import click
from utils.utils import (dataset_dump, gen_random_tokens, get_norm_dist_tokens,
print_dataset)
from utils.utils import (dataset_dump, gen_random_tokens, get_norm_dist_lengths,
get_unif_dist_lengths, print_dataset)


@click.command()
@@ -28,21 +28,21 @@
help='normal dist stdev for output tokens')
@click.pass_obj
def token_norm_dist(root_args, **kwargs):
"""Prepare dataset by generating random tokens."""
"""Prepare synthetic dataset by generating random tokens with normal dist lengths."""
input_ids = []
input_lens = []
output_lens = []
task_ids = []

input_lens = get_norm_dist_tokens(kwargs['input_mean'],
kwargs['input_stdev'],
kwargs['num_requests'],
root_args.random_seed)
input_lens = get_norm_dist_lengths(kwargs['input_mean'],
kwargs['input_stdev'],
kwargs['num_requests'],
root_args.random_seed)

num_reqs = len(input_lens)
output_lens = get_norm_dist_tokens(kwargs['output_mean'],
kwargs['output_stdev'], num_reqs,
root_args.random_seed)
output_lens = get_norm_dist_lengths(kwargs['output_mean'],
kwargs['output_stdev'], num_reqs,
root_args.random_seed)

max_input_len = max(input_lens)
max_output_len = max(output_lens)
@@ -74,3 +74,73 @@ def token_norm_dist(root_args, **kwargs):
input_ids,
output_lens,
)


@click.command()
@click.option("--num-requests",
required=True,
type=int,
help='Number of requests to be generated')
@click.option('--input-min',
required=True,
type=int,
help='uniform dist (inclusive) min for input tokens')
@click.option('--input-max',
required=True,
type=int,
help='uniform dist (inclusive) max for input tokens')
@click.option('--output-min',
required=True,
type=int,
help='uniform dist (inclusive) min for output tokens')
@click.option('--output-max',
required=True,
type=int,
help='uniform dist (inclusive) max for output tokens')
@click.pass_obj
def token_unif_dist(root_args, **kwargs):
"""Prepare synthetic dataset by generating random tokens with normal uniformly lengths."""
input_ids = []
input_lens = []
output_lens = []
task_ids = []

input_lens = get_unif_dist_lengths(kwargs['input_min'], kwargs['input_max'],
kwargs['num_requests'],
root_args.random_seed)

num_reqs = len(input_lens)
output_lens = get_unif_dist_lengths(kwargs['output_min'],
kwargs['output_max'], num_reqs,
root_args.random_seed)

max_input_len = max(input_lens)
max_output_len = max(output_lens)

input_ids = gen_random_tokens(input_lens, root_args.tokenizer,
root_args.random_seed)

if root_args.rand_task_id is None:
task_ids = [root_args.task_id for _ in range(num_reqs)]
else:
min_id, max_id = root_args.rand_task_id
task_ids = [random.randint(min_id, max_id) for _ in range(num_reqs)]

if not root_args.std_out:
dataset_dump(
input_lens, input_ids, output_lens, task_ids, {
"workload_type": "token-unif-dist",
"input_min": kwargs['input_min'],
"input_max": kwargs['input_max'],
"output_min": kwargs['output_min'],
"output_max": kwargs['output_max'],
"num_requests": kwargs['num_requests'],
"tokenize_vocabsize": root_args.tokenizer.vocab_size,
"max_input_len": max_input_len,
"max_output_len": max_output_len
}, root_args.output)
else:
print_dataset(
input_ids,
output_lens,
)
9 changes: 8 additions & 1 deletion benchmarks/cpp/utils/utils.py
@@ -72,14 +72,21 @@ def get_exponential_dist_delays(mean_time_bet_reqs, num_reqs, random_seed):
return np.random.exponential(mean_time_bet_reqs, num_reqs).tolist()


def get_norm_dist_tokens(mean, stdev, num_reqs, random_seed):
def get_norm_dist_lengths(mean, stdev, num_reqs, random_seed):
# set seed for determinism
np.random.seed(random_seed)
numbers_list = np.random.normal(loc=mean, scale=stdev,
size=num_reqs).tolist()
return [max(1, math.ceil(x)) for x in numbers_list]


def get_unif_dist_lengths(min_len, max_len, num_reqs, random_seed):
# set seed for determinism
rng = np.random.default_rng(random_seed)
numbers = rng.integers(low=min_len, high=max_len + 1, size=num_reqs)
return numbers.tolist()


def gen_random_tokens(ip_lens, tokenizer, random_seed):

def get_sample_from_population(population_range, sample_size):
12 changes: 12 additions & 0 deletions cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
@@ -553,8 +553,15 @@ class KVCacheManager

void addContextTokens(SizeType32 seqSlotIdx, SizeType32 numTokens);

/// @brief Increase size for request at seqSlotIdx. Allocate new KV cache block(s) if needed.
void addToken(SizeType32 seqSlotIdx);

/// @brief Add new request to the KV cache manager.
/// @param inputLength Input length for which KV cache needs to be allocated.
/// @param beamWidth Beam width for which KV cache needs to be allocated.
/// @param llmRequest Optional request to use for KV cache lookup.
/// @details If llmRequest is supplied and KV cache reuse is enabled, try to recover KV cache blocks for
/// inputLength - 1 tokens and populate prepopulatedPromptLen.
void addSequence(SizeType32 seqSlotIdx, SizeType32 inputLength, SizeType32 beamWidth,
std::shared_ptr<LlmRequest> const& llmRequest = nullptr);

@@ -604,6 +611,11 @@ class KVCacheManager
return mCacheType == CacheType::kCROSS;
}

[[nodiscard]] static SizeType32 getSinkBubbleLength(SizeType32 sinkTokenLen, SizeType32 tokensPerBlock);

[[nodiscard]] static SizeType32 getMaxAttentionWindowUpperBound(SizeType32 blocksInPrimaryPool,
SizeType32 tokensPerBlock, SizeType32 maxBeamWidth, SizeType32 sinkTokenLen, bool useOneMoreBlock);

private:
void setOffsets(kernels::KVCacheIndex* offsetsPtr, nvinfer1::Dims const& offsetsShape, SizeType32 seqSlotIdx,
SizeType32 beamIdx, SizeType32 blockIdx, KVCacheBlock::IdType blockId) const;