Update TensorRT-LLM (NVIDIA#2094)
* Update TensorRT-LLM

---------

Co-authored-by: akhoroshev <[email protected]>
Co-authored-by: Fabian Joswig <[email protected]>
Co-authored-by: Tayef Shah <[email protected]>
Co-authored-by: lfz941 <[email protected]>
5 people authored Aug 7, 2024
1 parent a681853 commit be9cd71
Showing 1,916 changed files with 9,927,840 additions and 9,296,820 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -46,5 +46,5 @@ repos:
args:
- --skip=".git,3rdparty"
- --exclude-file=examples/whisper/tokenizer.py
- --ignore-words-list=rouge,inout,atleast,strat,nd,subtile
- --ignore-words-list=rouge,inout,atleast,strat,nd,subtile,thrid
exclude: 'tests/llm-test-defs/turtle/test_input_files'
16 changes: 10 additions & 6 deletions README.md
@@ -17,14 +17,18 @@ TensorRT-LLM
<div align="left">

## Latest News
* [2024/07/30] Introducing🍊 @SliceXAI ELM Turbo 🤖 train ELM once ⚡ #TensorRT #LLM optimize ☁️ deploy anywhere
[➡️ link](https://developer.nvidia.com/blog/supercharging-llama-3-1-across-nvidia-platforms)
<div align="center">
<img src="docs/source/media/picture-07-30-2024.png" width="70%">
<div align="left">

* [2024/07/23] 👀 @AIatMeta Llama 3.1 405B trained on 16K NVIDIA H100s - inference is #TensorRT #LLM optimized ⚡
🦙 400 tok/s - per node
🦙 37 tok/s - per user
🦙 1 node inference
➡️ [link](https://developer.nvidia.com/blog/supercharging-llama-3-1-across-nvidia-platforms)
<div align="center">
<img src="docs/source/media/picture-07-23-2024.png" width="45%">
<div align="left">
[➡️ link](https://developer.nvidia.com/blog/supercharging-llama-3-1-across-nvidia-platforms)


* [2024/07/09] Checklist to maximize multi-language performance of @meta #Llama3 with #TensorRT #LLM inference:
✅ MultiLingual
@@ -88,9 +92,9 @@ for integration with the
a production-quality system to serve LLMs. Models built with TensorRT-LLM can
be executed on a wide range of configurations going from a single GPU to
multiple nodes with multiple GPUs (using
[Tensor Parallelism](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/nemo_megatron/parallelisms.html#tensor-parallelism)
[Tensor Parallelism](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/features/parallelisms.html#tensor-parallelism)
and/or
[Pipeline Parallelism](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/nemo_megatron/parallelisms.html#pipeline-parallelism)).
[Pipeline Parallelism](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/features/parallelisms.html#pipeline-parallelism)).

The TensorRT-LLM Python API architecture looks similar to the
[PyTorch](https://pytorch.org) API. It provides a
25 changes: 20 additions & 5 deletions benchmarks/cpp/README.md
@@ -20,9 +20,11 @@ instead, and be sure to set DLL paths as specified in

#### Prepare dataset

Run a preprocessing script to prepare/generate dataset into a json that gptManagerBenchmark can consume later. The processed output json has *input tokens length, input token ids and output tokens length*.
Run a preprocessing script to prepare/generate a dataset into a JSON file that `gptManagerBenchmark` can consume later. The processed output JSON contains the *input token lengths, input token IDs, and output token lengths*.

This tool can be used in 2 different modes of traffic generation.
For `tokenizer`, either the path to a locally downloaded tokenizer or the name of a tokenizer on HuggingFace, such as `meta-llama/Llama-2-7b`, will work. In the latter case the tokenizer is downloaded automatically.
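As a rough sketch (assuming tokenizer resolution behaves like Hugging Face's `AutoTokenizer.from_pretrained`; the local path below is only an example), both forms look like this:

```
from transformers import AutoTokenizer

# Hypothetical local path to a tokenizer that was downloaded beforehand.
local_tokenizer = AutoTokenizer.from_pretrained("/path/to/tokenizer")

# HuggingFace model name; the tokenizer files are fetched automatically.
hub_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b")
```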

This tool can be used in 3 different modes of traffic generation: `dataset`, `token-norm-dist` and `token-unif-dist`.

##### 1 – Dataset

@@ -63,8 +65,8 @@ python3 prepare_dataset.py \

##### 2 – Normal token length distribution

This mode allows the user to generate normal token length distributions with a mean and std deviation specified.
For example, setting mean=100 and std dev=10 would generate requests where 95.4% of values are in <80,120> range following the normal probability distribution. Setting std dev=0 will generate all requests with the same mean number of tokens.
This mode allows the user to generate normally distributed token lengths with a mean and std deviation specified.
For example, setting `mean=100` and `stdev=10` would generate requests where 95.4% of values fall in the [80, 120] range, following the normal probability distribution. Setting `stdev=0` will generate all requests with the same mean number of tokens.

```
python prepare_dataset.py \
@@ -76,7 +78,20 @@ python prepare_dataset.py \
--output-mean 15 --output-stdev 0
```
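For intuition, the sketch below mirrors the `get_norm_dist_lengths` helper in `benchmarks/cpp/utils/utils.py` (see the diff of that file further down); the numbers are illustrative only:

```
import math
import numpy as np

def sample_norm_lengths(mean, stdev, num_reqs, seed=0):
    # Draw normally distributed lengths and clamp each one to at least 1 token.
    np.random.seed(seed)
    lengths = np.random.normal(loc=mean, scale=stdev, size=num_reqs)
    return [max(1, math.ceil(x)) for x in lengths]

lengths = sample_norm_lengths(mean=100, stdev=10, num_reqs=10000)
# About 95.4% of samples fall within mean +/- 2*stdev, i.e. the [80, 120] range.
print(sum(80 <= n <= 120 for n in lengths) / len(lengths))
```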

For `tokenizer`, specifying the path to the local tokenizer that have already been downloaded, or simply the name of the tokenizer from HuggingFace like `meta-llama/Llama-2-7b` will both work. The tokenizer will be downloaded automatically for the latter case.
##### 3 – Uniform token length distribution

This mode allows the user to generate uniformly distributed token lengths with min and max lengths specified.
For example, setting `min=50` and `max=100` would generate requests whose lengths fall in the range `[50, 100]`, following the uniform probability distribution. Setting `min=x` and `max=x` will generate all requests with the same number of tokens `x`.

```
python prepare_dataset.py \
--output token-unif-dist.json \
--tokenizer <path/to/tokenizer> \
token-unif-dist \
--num-requests 100 \
--input-min 50 --input-max 100 \
--output-min 10 --output-max 15
```
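The sampling behind this mode is a plain inclusive uniform draw; a minimal sketch matching the `get_unif_dist_lengths` helper added in `benchmarks/cpp/utils/utils.py`:

```
import numpy as np

def sample_unif_lengths(min_len, max_len, num_reqs, seed=0):
    # high is exclusive in rng.integers, so pass max_len + 1 to include max_len.
    rng = np.random.default_rng(seed)
    return rng.integers(low=min_len, high=max_len + 1, size=num_reqs).tolist()

print(sample_unif_lengths(50, 100, 5))  # five lengths in [50, 100]
print(sample_unif_lengths(60, 60, 3))   # min == max gives [60, 60, 60]
```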


#### Prepare TensorRT-LLM engines
14 changes: 12 additions & 2 deletions benchmarks/cpp/gptManagerBenchmark.cpp
@@ -157,6 +157,7 @@ struct BenchmarkParams
std::optional<int> maxAttentionWindow{std::nullopt};
std::optional<int> sinkTokenLength{std::nullopt};
bool multiBlockMode{false};
bool enableContextFMHAFP32Acc{false};

// lora / peft params
std::optional<std::string> loraDir{std::nullopt};
@@ -806,6 +807,8 @@ class ExecutorServer
benchmarkParams.kvHostCacheSize, benchmarkParams.kvOnboardBlocks);
texec::PeftCacheConfig peftCacheConfig(0, benchmarkParams.loraDeviceNumModLayers, 8, 64, 4, 4, 4, 24, 8,
std::nullopt, benchmarkParams.loraHostCacheSize);
texec::ExtendedRuntimePerfKnobConfig extendedRuntimePerfKnobConfig(
benchmarkParams.multiBlockMode, benchmarkParams.enableContextFMHAFP32Acc);
texec::ExecutorConfig executorConfig(
maxBeamWidth, schedulerConfig, kvCacheConfig, benchmarkParams.enableChunkedContext, true);
executorConfig.setGpuWeightsPercent(benchmarkParams.gpuWeightsPercent);
@@ -824,7 +827,7 @@ class ExecutorServer
executorConfig.setDecodingConfig(texec::DecodingConfig(
benchmarkParams.medusaChoices.has_value() ? texec::DecodingMode::Medusa() : texec::DecodingMode::Auto(),
std::nullopt, benchmarkParams.medusaChoices));
executorConfig.setMultiBlockMode(benchmarkParams.multiBlockMode);
executorConfig.setExtendedRuntimePerfKnobConfig(extendedRuntimePerfKnobConfig);

if (executorModelType == texec::ModelType::kDECODER_ONLY)
{
@@ -1429,7 +1432,8 @@ void benchmarkGptManager(std::filesystem::path const& engineDir, TrtGptModelType
optionalParams.decodingConfig = texec::DecodingConfig(
benchmarkParams.medusaChoices.has_value() ? texec::DecodingMode::Medusa() : texec::DecodingMode::Auto(),
std::nullopt, benchmarkParams.medusaChoices);
optionalParams.multiBlockMode = benchmarkParams.multiBlockMode;
optionalParams.extendedRuntimePerfKnobConfig = texec::ExtendedRuntimePerfKnobConfig(
benchmarkParams.multiBlockMode, benchmarkParams.enableContextFMHAFP32Acc);

auto const jsonConfig = GptJsonConfig::parse(engineDir / "config.json");
auto const worldConfig = WorldConfig::mpi(jsonConfig.getGpusPerNode(), jsonConfig.getTensorParallelism(),
@@ -1891,6 +1895,9 @@ int main(int argc, char* argv[])
options.add_options()(
"encoder_engine_dir", "Directory that store the engines of the encoder models.", cxxopts::value<std::string>());

options.add_options()("enable_context_fmha_fp32_acc", "Enable FMHA runner FP32 accumulation",
cxxopts::value<bool>()->default_value("false"));

auto result = options.parse(argc, argv);

if (result.count("help"))
@@ -2051,6 +2058,9 @@ int main(int argc, char* argv[])
// Argument: multi_block_mode
benchmarkParams.multiBlockMode = result["multi_block_mode"].as<bool>();

// Argument: enable_context_fmha_fp32_acc
benchmarkParams.enableContextFMHAFP32Acc = result["enable_context_fmha_fp32_acc"].as<bool>();

std::optional<TokenIdType> padId;
// Argument: Padding token id
if (result.count("pad_id"))
3 changes: 2 additions & 1 deletion benchmarks/cpp/prepare_dataset.py
@@ -21,7 +21,7 @@
from transformers.tokenization_utils import PreTrainedTokenizer
from transformers.tokenization_utils_fast import PreTrainedTokenizerFast
from utils.prepare_real_data import dataset
from utils.prepare_synthetic_data import token_norm_dist
from utils.prepare_synthetic_data import token_norm_dist, token_unif_dist


class RootArgs(BaseModel):
@@ -97,6 +97,7 @@ def cli(ctx, **kwargs):

cli.add_command(dataset)
cli.add_command(token_norm_dist)
cli.add_command(token_unif_dist)

if __name__ == "__main__":
cli()
6 changes: 3 additions & 3 deletions benchmarks/cpp/utils/prepare_real_data.py
@@ -6,7 +6,7 @@
import click
from datasets import load_dataset
from pydantic import BaseModel, model_validator
from utils.utils import dataset_dump, get_norm_dist_tokens, print_dataset
from utils.utils import dataset_dump, get_norm_dist_lengths, print_dataset


def validate_output_len_dist(ctx, param, value):
@@ -214,8 +214,8 @@ def dataset(root_args, **kwargs):
# output if randomized
if kwargs['output_len_dist'] is not None:
osl_mean, osl_stdev = kwargs['output_len_dist']
output_lens = get_norm_dist_tokens(osl_mean, osl_stdev, len(input_ids),
root_args.random_seed)
output_lens = get_norm_dist_lengths(osl_mean, osl_stdev, len(input_ids),
root_args.random_seed)

logging.debug(f"Input lengths: {[len(i) for i in input_ids]}")
logging.debug(f"Output lengths: {output_lens}")
90 changes: 80 additions & 10 deletions benchmarks/cpp/utils/prepare_synthetic_data.py
@@ -1,8 +1,8 @@
import random

import click
from utils.utils import (dataset_dump, gen_random_tokens, get_norm_dist_tokens,
print_dataset)
from utils.utils import (dataset_dump, gen_random_tokens, get_norm_dist_lengths,
get_unif_dist_lengths, print_dataset)


@click.command()
@@ -28,21 +28,21 @@
help='normal dist stdev for output tokens')
@click.pass_obj
def token_norm_dist(root_args, **kwargs):
"""Prepare dataset by generating random tokens."""
"""Prepare synthetic dataset by generating random tokens with normal dist lengths."""
input_ids = []
input_lens = []
output_lens = []
task_ids = []

input_lens = get_norm_dist_tokens(kwargs['input_mean'],
kwargs['input_stdev'],
kwargs['num_requests'],
root_args.random_seed)
input_lens = get_norm_dist_lengths(kwargs['input_mean'],
kwargs['input_stdev'],
kwargs['num_requests'],
root_args.random_seed)

num_reqs = len(input_lens)
output_lens = get_norm_dist_tokens(kwargs['output_mean'],
kwargs['output_stdev'], num_reqs,
root_args.random_seed)
output_lens = get_norm_dist_lengths(kwargs['output_mean'],
kwargs['output_stdev'], num_reqs,
root_args.random_seed)

max_input_len = max(input_lens)
max_output_len = max(output_lens)
@@ -74,3 +74,73 @@ def token_norm_dist(root_args, **kwargs):
input_ids,
output_lens,
)


@click.command()
@click.option("--num-requests",
required=True,
type=int,
help='Number of requests to be generated')
@click.option('--input-min',
required=True,
type=int,
help='uniform dist (inclusive) min for input tokens')
@click.option('--input-max',
required=True,
type=int,
help='uniform dist (inclusive) max for input tokens')
@click.option('--output-min',
required=True,
type=int,
help='uniform dist (inclusive) min for output tokens')
@click.option('--output-max',
required=True,
type=int,
help='uniform dist (inclusive) max for output tokens')
@click.pass_obj
def token_unif_dist(root_args, **kwargs):
"""Prepare synthetic dataset by generating random tokens with normal uniformly lengths."""
input_ids = []
input_lens = []
output_lens = []
task_ids = []

input_lens = get_unif_dist_lengths(kwargs['input_min'], kwargs['input_max'],
kwargs['num_requests'],
root_args.random_seed)

num_reqs = len(input_lens)
output_lens = get_unif_dist_lengths(kwargs['output_min'],
kwargs['output_max'], num_reqs,
root_args.random_seed)

max_input_len = max(input_lens)
max_output_len = max(output_lens)

input_ids = gen_random_tokens(input_lens, root_args.tokenizer,
root_args.random_seed)

if root_args.rand_task_id is None:
task_ids = [root_args.task_id for _ in range(num_reqs)]
else:
min_id, max_id = root_args.rand_task_id
task_ids = [random.randint(min_id, max_id) for _ in range(num_reqs)]

if not root_args.std_out:
dataset_dump(
input_lens, input_ids, output_lens, task_ids, {
"workload_type": "token-unif-dist",
"input_min": kwargs['input_min'],
"input_max": kwargs['input_max'],
"output_min": kwargs['output_min'],
"output_max": kwargs['output_max'],
"num_requests": kwargs['num_requests'],
"tokenize_vocabsize": root_args.tokenizer.vocab_size,
"max_input_len": max_input_len,
"max_output_len": max_output_len
}, root_args.output)
else:
print_dataset(
input_ids,
output_lens,
)
9 changes: 8 additions & 1 deletion benchmarks/cpp/utils/utils.py
@@ -72,14 +72,21 @@ def get_exponential_dist_delays(mean_time_bet_reqs, num_reqs, random_seed):
return np.random.exponential(mean_time_bet_reqs, num_reqs).tolist()


def get_norm_dist_tokens(mean, stdev, num_reqs, random_seed):
def get_norm_dist_lengths(mean, stdev, num_reqs, random_seed):
# set seed for determinism
np.random.seed(random_seed)
numbers_list = np.random.normal(loc=mean, scale=stdev,
size=num_reqs).tolist()
return [max(1, math.ceil(x)) for x in numbers_list]


def get_unif_dist_lengths(min_len, max_len, num_reqs, random_seed):
# set seed for determinism
rng = np.random.default_rng(random_seed)
numbers = rng.integers(low=min_len, high=max_len + 1, size=num_reqs)
return numbers.tolist()


def gen_random_tokens(ip_lens, tokenizer, random_seed):

def get_sample_from_population(population_range, sample_size):
12 changes: 12 additions & 0 deletions cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
@@ -553,8 +553,15 @@ class KVCacheManager

void addContextTokens(SizeType32 seqSlotIdx, SizeType32 numTokens);

/// @brief Increase size for request at seqSlotIdx. Allocate new KV cache block(s) if needed.
void addToken(SizeType32 seqSlotIdx);

/// @brief Add new request to the KV cache manager.
/// @param inputLength Input length for which KV cache needs to be allocated.
/// @param beamWidth Beam width for which KV cache needs to be allocated.
/// @param llmRequest Optional request to use for KV cache lookup.
/// @details If llmRequest is supplied and KV cache reuse is enabled, try to recover KV cache blocks for
/// inputLength - 1 tokens and populate prepopulatedPromptLen.
void addSequence(SizeType32 seqSlotIdx, SizeType32 inputLength, SizeType32 beamWidth,
std::shared_ptr<LlmRequest> const& llmRequest = nullptr);

@@ -604,6 +611,11 @@ class KVCacheManager
return mCacheType == CacheType::kCROSS;
}

[[nodiscard]] static SizeType32 getSinkBubbleLength(SizeType32 sinkTokenLen, SizeType32 tokensPerBlock);

[[nodiscard]] static SizeType32 getMaxAttentionWindowUpperBound(SizeType32 blocksInPrimaryPool,
SizeType32 tokensPerBlock, SizeType32 maxBeamWidth, SizeType32 sinkTokenLen, bool useOneMoreBlock);

private:
void setOffsets(kernels::KVCacheIndex* offsetsPtr, nvinfer1::Dims const& offsetsShape, SizeType32 seqSlotIdx,
SizeType32 beamIdx, SizeType32 blockIdx, KVCacheBlock::IdType blockId) const;