TensorRT-LLM 0.15.0 Release #2531
Pinned
Shixiaowei02
announced in
Announcements
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
Hi,
We are very pleased to announce the 0.15.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
examples/eagle/README.md
.trtllm-serve
command to start a FastAPI based server.examples/nemotron_nas/README.md
.examples/exaone/README.md
.examples/medusa/README.md
.Qwen2ForSequenceClassification
model architecture.examples/python_plugin/README.md
.docs/source/performance/perf-best-practices.md
for information about the required conditions for embedding sharing.beam_width
to256
.examples/multimodal/README.md
.examples/prompt_lookup/README.md
.examples/llama/README.md
.executor
API. Refer to “executorExampleFastLogits” section inexamples/cpp/executor/README.md
.LLM.generate
toLLM.__init__
for better generation performance without warmup.n
andbest_of
arguments to theSamplingParams
class. These arguments enable returning multiple generations for a single request.ignore_eos
,detokenize
,skip_special_tokens
,spaces_between_special_tokens
, andtruncate_prompt_tokens
arguments to theSamplingParams
class. These arguments enable more control over the tokenizer behavior.enable_prompt_adapter
argument to theLLM
class and theprompt_adapter_request
argument for theLLM.generate
method. These arguments enable prompt tuning.gpt_variant
argument to theexamples/gpt/convert_checkpoint.py
file. This enhancement enables checkpoint conversion with more GPT model variants. Thanks to the contribution from @tonylek in Passing gpt_variant to model conversion #2352.API Changes
builder_force_num_profiles
intrtllm-build
command to theBUILDER_FORCE_NUM_PROFILES
environment variable.BuildConfig
class so that they are aligned with thetrtllm-build
command.GptManager
.auto
is used as the default value for--dtype
option in quantize and checkpoints conversion scripts.gptManager
API path ingptManagerBenchmark
.beam_width
andnum_return_sequences
arguments to theSamplingParams
class in the LLM API. Use then
,best_of
anduse_beam_search
arguments instead.--trust_remote_code
argument to the OpenAI API server. (openai_server error #2357)Model Updates
examples/mllama/README.md
for more details on the llama 3.2-Vision model.examples/deepseek_v2/README.md
.examples/commandr/README.md
.examples/falcon/README.md
, thanks to the contribution from @puneeshkhanna in Add support for falcon2 #1926.examples/multimodal/README.md
.examples/nemotron
.examples/gpt/README.md
.examples/multimodal/README.md
.Fixed Issues
moeTopK()
cannot find the correct expert when the number of experts is not a power of two. Thanks @dongjiyingdjy for reporting this bug.crossKvCacheFraction
. (Assertion failed: Must set crossKvCacheFraction for encoder-decoder model #2419)docs/source/performance/perf-benchmarking.md
, thanks @MARD1NO for pointing it out in Small Typo #2425.Infrastructure Changes
nvcr.io/nvidia/pytorch:24.10-py3
.nvcr.io/nvidia/tritonserver:24.10-py3
.Documentation
We are updating the
main
branch regularly with new features, bug fixes and performance optimizations. Therel
branch will be updated less frequently, and the exact frequencies depend on your feedback.Thanks,
The TensorRT-LLM Engineering Team
This discussion was created from the release TensorRT-LLM 0.15.0 Release.
Beta Was this translation helpful? Give feedback.
All reactions