TensorRT-LLM 0.14.0 Release #2403

kaiyux · 2024-11-01T12:01:21Z

kaiyux
Nov 1, 2024
Maintainer

Hi,

We are very pleased to announce the 0.14.0 version of TensorRT-LLM. This update includes:

Key Features and Enhancements

Enhanced the LLM class in the LLM API:
- Added support for calibration with offline dataset.
- Added support for Mamba2.
- Added support for finish_reason and stop_reason.
Added FP8 support for CodeLlama.
Added __repr__ methods for class Module, thanks to the contribution from @1ytic in Add module __repr__ methods #2191.
Added BFloat16 support for fused gated MLP.
Updated ReDrafter beam search logic to match Apple ReDrafter v1.1.
Improved customAllReduce performance.
The draft model can now copy logits directly over MPI to the target model's process in orchestrator mode. This fast logits copy reduces the delay between draft token generation and the start of target model inference.
NVIDIA Volta GPU support is deprecated and will be removed in a future release.

API Changes

[BREAKING CHANGE] The default max_batch_size of the trtllm-build command is set to 2048.
[BREAKING CHANGE] Removed builder_opt from the BuildConfig class and the trtllm-build command.
Added logits post-processor support to the ModelRunnerCpp class.
Added isParticipant method to the C++ Executor API to check if the current process is a participant in the executor instance.
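
Since the default max_batch_size of trtllm-build has changed, build scripts that relied on the previous default may want to pass the value explicitly. Below is a minimal sketch that assembles such a command line in Python. The helper name build_command, the directory paths, and the value 256 are hypothetical and only for illustration; pick a batch size that fits your own deployment.

```python
import shlex


def build_command(checkpoint_dir, output_dir, max_batch_size):
    """Assemble a trtllm-build invocation with an explicit max_batch_size.

    build_command is a hypothetical helper, not part of TensorRT-LLM.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be a positive integer")
    return [
        "trtllm-build",
        "--checkpoint_dir", checkpoint_dir,
        "--output_dir", output_dir,
        "--max_batch_size", str(max_batch_size),
    ]


# Hypothetical paths; replace them with your own checkpoint and engine directories.
cmd = build_command("./ckpt", "./engine", 256)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where trtllm-build is installed.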

Model Updates

Added support for NemotronNas, see examples/nemotron_nas/README.md.
Added support for Deepseek-v1, see examples/deepseek_v1/README.md.
Added support for Phi-3.5 models, see examples/phi/README.md.

Fixed Issues

Fixed a typo in tensorrt_llm/models/model_weights_loader.py, thanks to the contribution from @wangkuiyi in Update model_weights_loader.py #2152.
Fixed duplicated import module in tensorrt_llm/runtime/generation.py, thanks to the contribution from @lkm2835 in Fix duplicated import module #2182.
Enabled share_embedding for the models that have no lm_head in legacy checkpoint conversion path, thanks to the contribution from @lkm2835 in Fix check_share_embedding #2232.
Fixed kv_cache_type issue in the Python benchmark, thanks to the contribution from @qingquansong in Fix kv_cache_type issue #2219.
Fixed an issue with SmoothQuant calibration with custom datasets. Thanks to the contribution by @Bhuvanesh09 in fix: add support for passing calib sequence length, and num samples + fixing use of custom calibration dataset for smoothquant in llama #2243.
Fixed an issue surrounding trtllm-build --fast-build with fake or random weights. Thanks to @ZJLi2013 for flagging it in trtllm-build with --fast-build ignore transformer layers #2135.
Fixed missing use_fused_mlp when constructing BuildConfig from dict, thanks for the fix from @ethnzhng in Include use_fused_mlp when constructing BuildConfig from dict #2081.
Fixed lookahead batch layout for numNewTokensCumSum. ([Bug] Lookahead decoding is nondeterministic and wrong after the first call to runner.generate #2263)

Infrastructure Changes

The dependent ModelOpt version is updated to v0.17.

Documentation

@Sherlock113 added a tech blog to the latest news in Add blog for Tuning TensorRT-LLM for Optimal Serving #2169, thanks for the contribution.

Known Issues

Replit Code is not supported with transformers 4.45+.
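
If you run the Replit Code example, a small guard can fail early when the installed transformers version falls in the affected range. This is a sketch under the assumption that "4.45+" means version 4.45 and anything later; the function names are hypothetical, and the release note does not state which earlier versions are known to work.

```python
def parse_major_minor(version):
    """Return (major, minor) from a version string such as '4.45.2' or '4.46.0.dev0'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def replit_code_supported(transformers_version):
    """Return True when the given transformers version predates 4.45."""
    return parse_major_minor(transformers_version) < (4, 45)


# Typical use (requires transformers to be installed):
#   import transformers
#   if not replit_code_supported(transformers.__version__):
#       raise RuntimeError("Replit Code is not supported with transformers 4.45+")
```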

We update the main branch regularly with new features, bug fixes, and performance optimizations. The rel branch will be updated less frequently, and the exact frequency depends on your feedback.

Thanks,
The TensorRT-LLM Engineering Team

This discussion was created from the release TensorRT-LLM 0.14.0 Release.
