Update TensorRT-LLM #2436

kaiyux · 2024-11-12T07:14:54Z

Model Support
- Added support for Minitron, see examples/nemotron.
- Added support for a GPT variant, Granite (20B and 34B), see “GPT Variant - Granite” section in examples/gpt/README.md.
- Added support for LLaVA-OneVision model, see “LLaVA, LLaVa-NeXT, LLaVA-OneVision and VILA” section in examples/multimodal/README.md.
Features
- Added a trtllm-serve command to launch a FastAPI-based server.
- Added support for prompt-lookup speculative decoding, see examples/prompt_lookup/README.md.
- Added FP8 support for Nemotron NAS 51B, see examples/nemotron_nas/README.md.
- Integrated the QServe w4a8 per-group/per-channel quantization, see “w4aINT8 quantization (QServe)” section in examples/llama/README.md.
- Added a C++ example for fast logits using the executor API, see “executorExampleFastLogits” section in examples/cpp/executor/README.md.
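
For readers unfamiliar with prompt-lookup speculative decoding, the general idea can be sketched as follows: the trailing n-gram of the token sequence is searched for earlier in the same sequence, and the tokens that followed the match are proposed as draft tokens for the target model to verify. This is a minimal illustration of the technique only, not the TensorRT-LLM implementation; the function name prompt_lookup_draft and the parameters max_ngram and num_draft are hypothetical.

```python
from typing import List


def prompt_lookup_draft(
    tokens: List[int], max_ngram: int = 3, num_draft: int = 4
) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram earlier in `tokens`.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    # Try the longest n-gram first, then fall back to shorter ones.
    for n in range(min(max_ngram, len(tokens) - 1), 0, -1):
        suffix = tokens[-n:]
        # Scan backwards so the most recent earlier match wins.
        # The last valid start is len(tokens) - n - 1, which excludes the suffix itself.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                follow = tokens[start + n:start + n + num_draft]
                if follow:
                    return follow
    # No earlier occurrence found: propose nothing and decode normally.
    return []
```

In a full speculative decoding loop, the target model would score the proposed tokens in one forward pass and accept the longest prefix it agrees with; that verification step is omitted here.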
API
- [BREAKING CHANGE] auto is now the default value for the --dtype option in the quantize and checkpoint conversion scripts.
- [BREAKING CHANGE] Deprecated the gptManager API path in gptManagerBenchmark.
Bug fixes
- Fixed an issue where the moeTopK() kernel could not find the correct expert when the number of experts is not a power of two. Thanks @dongjiyingdjy for reporting this bug.
- Fixed an assertion failure on crossKvCacheFraction. (Assertion failed: Must set crossKvCacheFraction for encoder-decoder model #2419)
- Fixed an issue when using SmoothQuant to quantize the Qwen2 model. (Fix errors when using smoothquant to quantize Qwen2 model #2370)
- Fixed a PDL typo in docs/source/performance/perf-benchmarking.md, thanks @MARD1NO for pointing it out in Small Typo #2425.
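
As context for the moeTopK() fix, a plain reference for top-k expert selection that makes no assumption about the expert count can be useful when checking results with, say, 6 or 12 experts. The sketch below assumes a softmax-then-top-k router, which is an assumption for illustration and not a description of the actual kernel; the function name moe_top_k_reference is hypothetical.

```python
import math
from typing import List, Tuple


def moe_top_k_reference(logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Return (expert_index, softmax_weight) pairs for the k highest-scoring experts.

    Hypothetical reference helper; works for any number of experts,
    including counts that are not a power of two.
    """
    if not 0 < k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Numerically stable softmax over the router logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort by descending probability; ties go to the lower expert index.
    order = sorted(range(len(probs)), key=lambda i: (-probs[i], i))
    return [(i, probs[i]) for i in order[:k]]
```

For example, with six experts and logits [0.1, 2.0, -1.0, 0.5, 3.0, 1.5], the top two experts selected are indices 4 and 1.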
Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.10-py3.
- The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:24.10-py3.
- The dependent TensorRT version is updated to 10.6.
- The dependent CUDA version is updated to 12.6.2.
- The dependent PyTorch version is updated to 2.5.1.

kaiyux added 2 commits November 12, 2024 06:48

open source 1c2eb102257f836cd50faf985e693241d7a84dbe

06255cb

Bump version for deepseek_v1

4e11a67

Shixiaowei02 approved these changes Nov 12, 2024

kaiyux merged commit c629546 into main Nov 12, 2024

kaiyux deleted the preview/main branch November 12, 2024 07:27
