Update TensorRT-LLM Release branch #1445

Merged 3 commits into rel from kaiyu/update-rel on Apr 12, 2024
Conversation

@kaiyux (Member) commented Apr 12, 2024

  • Model Support
  • Features
    • Add support for context chunking to work with KV cache reuse
    • Enable different rewind tokens per sequence for Medusa
    • BART LoRA support (limited to the Python runtime)
    • Enable multi-LoRA for BART LoRA
    • Support early_stopping=False in beam search for C++ Runtime
    • Add a logits post processor to the batch manager (see docs/source/batch_manager.md#logits-post-processor-optional); a conceptual sketch follows the release notes
    • Support importing and converting HuggingFace Gemma checkpoints, thanks to the contribution from @mfuntowicz in #1147 ("Make Gemma importable from transformers Gemma implementation")
    • Support loading Gemma from HuggingFace
    • Support the auto parallelism planner for the high-level API and unified builder workflow
    • Support running GptSession without OpenMPI (#1220, "Run GptSession without openmpi?")
    • [BREAKING CHANGE] TopP sampling optimization with the deterministic AIR TopP algorithm is enabled by default
    • Medusa IFB support
    • [Experimental] Support FP8 FMHA; note that the performance is not yet optimal, and we will keep optimizing it
    • [BREAKING CHANGE] Support embedding sharing for Gemma
    • More head sizes support for LLaMA-like models
      • Ampere (sm80, sm86), Ada (sm89), and Hopper (sm90) now all support head sizes [32, 40, 64, 80, 96, 104, 128, 160, 256]
    • OOTB functionality support
      • T5
      • Mixtral 8x7B
  • API
    • C++ executor API
      • Add Python bindings; see documentation and examples in examples/bindings, and a minimal usage sketch after the release notes
      • Add advanced and multi-GPU examples for the Python bindings of the executor C++ API; see examples/bindings/README.md
      • Add documentation for the C++ executor API; see docs/source/executor.md
    • High-level API (refer to examples/high-level-api/README.md for guidance; an illustrative sketch also follows the release notes)
      • [BREAKING CHANGE] Reuse the QuantConfig used in the trtllm-build tool, supporting broader quantization features
      • Support engines built by the trtllm-build command in the LLM() API
      • Add support for the TensorRT-LLM checkpoint as model input
      • Refine the SamplingConfig used in the LLM.generate and LLM.generate_async APIs, with support for beam search, a variety of penalties, and more features
      • Add support for the StreamingLLM feature; enable it with LLM(streaming_llm=...)
      • Migrate Mixtral to the high-level API and unified builder workflow
    • [BREAKING CHANGE] Refactored Qwen model to the unified build workflow, see examples/qwen/README.md for the latest commands
    • [BREAKING CHANGE] Move LLaMA convert checkpoint script from examples directory into the core library
    • [BREAKING CHANGE] Refactor GPT with unified building workflow, see examples/gpt/README.md for the latest commands
    • [BREAKING CHANGE] Moved all the LoRA-related flags from the convert_checkpoint.py script, and the checkpoint content, to the trtllm-build command, to better generalize the feature to more models
    • [BREAKING CHANGE] Removed the use_prompt_tuning flag and options from the convert_checkpoint.py script and the checkpoint content, to better generalize the feature to more models; use trtllm-build --max_prompt_embedding_table_size instead
    • [BREAKING CHANGE] Changed the trtllm-build --world_size flag to --auto_parallel; the option is used for the auto parallel planner only
    • [BREAKING CHANGE] AsyncLLMEngine is removed; the tensorrt_llm.GenerationExecutor class is refactored to work both when launched explicitly with mpirun at the application level and when given an MPI communicator created by mpi4py
    • [BREAKING CHANGE] examples/server is removed; see examples/app instead
    • [BREAKING CHANGE] Remove LoRA related parameters from convert checkpoint scripts
    • [BREAKING CHANGE] Simplify Qwen convert checkpoint script
    • [BREAKING CHANGE] Remove model parameter from gptManagerBenchmark and gptSessionBenchmark
  • Bug fixes
  • Benchmark
    • Add emulated static batching in gptManagerBenchmark
    • Support arbitrary dataset from HuggingFace for C++ benchmarks, see “Prepare dataset” section in benchmarks/cpp/README.md
    • Add percentile latency report to gptManagerBenchmark
  • Performance
    • Optimize gptDecoderBatch to support batched sampling
    • Enable FMHA for models in the BART, Whisper, and NMT families
    • Remove the router tensor parallelism to improve performance for MoE models, thanks to the contribution from @megha95 in #1091 ("moe router tp removed")
    • Improve custom all-reduce kernel
  • Infra
    • Base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.02-py3
    • Base Docker image for TensorRT-LLM backend is updated to nvcr.io/nvidia/tritonserver:24.02-py3
    • The dependent TensorRT version is updated to 9.3
    • The dependent PyTorch version is updated to 2.2
    • The dependent CUDA version is updated to 12.3.2 (a.k.a. 12.3 Update 2)
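
To make a few of the items above concrete, here are short, hedged sketches. First, the logits post processor added to the batch manager: the actual interface is C++ (see docs/source/batch_manager.md#logits-post-processor-optional), so the Python below is only a conceptual illustration of what such a callback does; every name in it is invented for the example.

```python
# Conceptual illustration of a logits post processor: a callback that
# edits next-token logits before sampling. The real interface lives in
# the C++ batch manager; all names here are hypothetical.
import math

def make_ban_token_processor(banned_token_id: int):
    """Build a callback that prevents one token id from being sampled."""
    def post_process(request_id: int, logits: list[float]) -> list[float]:
        # A logit of -inf becomes zero probability after the softmax,
        # so the sampler can never pick the banned token.
        out = list(logits)
        out[banned_token_id] = -math.inf
        return out
    return post_process

# Hypothetical usage: suppress token id 2 on every decoding step.
processor = make_ban_token_processor(2)
print(processor(request_id=0, logits=[0.1, 0.5, 3.2, -0.2]))
# -> [0.1, 0.5, -inf, -0.2]
```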
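Next, a minimal sketch of submitting a request through the new Python bindings for the C++ executor API. The class and argument names follow the shipped examples, but treat them as assumptions and defer to examples/bindings/README.md for the authoritative usage.

```python
# Minimal sketch of the C++ executor API via its Python bindings.
# Names follow examples/bindings; exact signatures may differ by version.
import tensorrt_llm.bindings.executor as trtllm

# Create an executor from an engine directory produced by trtllm-build.
executor = trtllm.Executor(
    "/path/to/engine_dir",            # assumed engine location
    trtllm.ModelType.DECODER_ONLY,
    trtllm.ExecutorConfig(max_beam_width=1),
)

# Enqueue a pre-tokenized request and wait for its responses.
request = trtllm.Request(input_token_ids=[1, 2, 3, 4], max_new_tokens=16)
request_id = executor.enqueue_request(request)

for response in executor.await_responses(request_id):
    if not response.has_error():
        print(response.result.output_token_ids)
```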
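Finally, an illustrative sketch of the refreshed high-level API. The import path, ModelConfig usage, and output fields are assumptions based on this release's notes; examples/high-level-api/README.md is the source of truth for the current names.

```python
# Illustrative sketch only; import path and argument names are assumed.
from tensorrt_llm.hlapi import LLM, ModelConfig  # assumed import path

# The LLM() API can now accept either an engine built by `trtllm-build`
# or a TensorRT-LLM checkpoint as the model input.
config = ModelConfig(model_dir="/path/to/engine_or_checkpoint")
llm = LLM(config)

# The refined SamplingConfig (not shown) controls beam search,
# penalties, and other sampling features for generate()/generate_async().
for output in llm.generate(["Hello, my name is"]):
    print(output.text)
```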

@MartinMarciniszyn MartinMarciniszyn self-requested a review April 12, 2024 08:52
@tp5uiuc tp5uiuc self-requested a review April 12, 2024 09:46
@kaiyux kaiyux merged commit 250d9c2 into rel Apr 12, 2024
@kaiyux kaiyux deleted the kaiyu/update-rel branch April 12, 2024 09:59