
TensorRT-LLM Release 0.15.0 #2529

Merged
merged 2 commits into rel from preview/rel on Dec 4, 2024
Conversation

Shixiaowei02 (Collaborator) commented Dec 4, 2024

TensorRT-LLM Release 0.15.0

Key Features and Enhancements

  • Added support for EAGLE. Refer to examples/eagle/README.md.
  • Added functional support for GH200 systems.
  • Added AutoQ (mixed precision) support.
  • Added a trtllm-serve command to start a FastAPI-based server (see the client sketch after this list).
  • Added FP8 support for Nemotron NAS 51B. Refer to examples/nemotron_nas/README.md.
  • Added INT8 support for GPTQ quantization.
  • Added TensorRT native support for INT8 Smooth Quantization.
  • Added quantization support for Exaone model. Refer to examples/exaone/README.md.
  • Enabled Medusa for Qwen2 models. Refer to “Medusa with Qwen2” section in examples/medusa/README.md.
  • Optimized pipeline parallelism with ReduceScatter and AllGather for Mixtral models.
  • Added support for Qwen2ForSequenceClassification model architecture.
  • Added Python plugin support to simplify plugin development efforts. Refer to examples/python_plugin/README.md.
  • Added support for LoRA modules with different rank dimensions when using the Hugging Face format. Thanks to the contribution from @AlessioNetti in Allow for LoRA modules with different rank dimensions when using HF format #2366.
  • Enabled embedding sharing by default. Refer to "Embedding Parallelism, Embedding Sharing, and Look-Up Plugin" section in docs/source/performance/perf-best-practices.md for information about the required conditions for embedding sharing.
  • Added support for per-token per-channel FP8 (namely row-wise FP8) on Ada.
  • Extended the maximum supported beam_width to 256.
  • Added FP8 and INT8 SmoothQuant quantization support for the InternVL2-4B variant (LLM model only). Refer to examples/multimodal/README.md.
  • Added support for prompt-lookup speculative decoding. Refer to examples/prompt_lookup/README.md.
  • Integrated the QServe w4a8 per-group/per-channel quantization. Refer to “w4aINT8 quantization (QServe)” section in examples/llama/README.md.
  • Added a C++ example for fast logits using the executor API. Refer to “executorExampleFastLogits” section in examples/cpp/executor/README.md.
  • [BREAKING CHANGE] NVIDIA Volta GPU support is removed in this and future releases.
  • Added the following enhancements to the LLM API (see the usage sketch after this list):
    • [BREAKING CHANGE] Moved the runtime initialization from the first invocation of LLM.generate to LLM.__init__ for better generation performance without warmup.
    • Added n and best_of arguments to the SamplingParams class. These arguments enable returning multiple generations for a single request.
    • Added ignore_eos, detokenize, skip_special_tokens, spaces_between_special_tokens, and truncate_prompt_tokens arguments to the SamplingParams class. These arguments enable more control over the tokenizer behavior.
    • Added support for incremental detokenization to improve the detokenization performance for streaming generation.
    • Added the enable_prompt_adapter argument to the LLM class and the prompt_adapter_request argument for the LLM.generate method. These arguments enable prompt tuning.
  • Added support for a gpt_variant argument to the examples/gpt/convert_checkpoint.py file. This enhancement enables checkpoint conversion with more GPT model variants. Thanks to the contribution from @tonylek in Passing gpt_variant to model conversion #2352.
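The new trtllm-serve command starts the server process; a minimal client sketch follows. It assumes the server is already running locally on port 8000 and exposes an OpenAI-style /v1/completions route — the actual flags, port, and endpoint paths may differ, so check trtllm-serve --help. The model name is a placeholder.

```python
# Minimal client sketch for a locally running `trtllm-serve` instance.
# Assumptions: server at http://localhost:8000 with an OpenAI-style
# /v1/completions endpoint; "my-model" is a placeholder model name.
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "my-model",
        "prompt": "Hello, my name is",
        "max_tokens": 32,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json())
```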
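The LLM API enhancements above combine as in the following sketch. It assumes a local checkpoint path (placeholder) and that the argument names match this release; consult the LLM API documentation for the authoritative signatures.

```python
# Sketch of the 0.15 LLM API enhancements; the checkpoint path is a
# placeholder and argument names are taken from the notes above.
from tensorrt_llm import LLM, SamplingParams

# Runtime initialization now happens in LLM.__init__ (moved out of the
# first LLM.generate call), so the first generate needs no warmup.
llm = LLM(model="/path/to/hf_or_trtllm_checkpoint")

params = SamplingParams(
    max_tokens=64,
    n=2,                      # return two generations per request
    best_of=4,                # sample four candidates, keep the best two
    ignore_eos=False,         # new tokenizer-behavior controls
    skip_special_tokens=True,
)

for output in llm.generate(["The capital of France is"], params):
    print(output)
```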

API Changes

  • [BREAKING CHANGE] Moved the builder_force_num_profiles flag of the trtllm-build command to the BUILDER_FORCE_NUM_PROFILES environment variable.
  • [BREAKING CHANGE] Modified defaults for BuildConfig class so that they are aligned with the trtllm-build command.
  • [BREAKING CHANGE] Removed Python bindings of GptManager.
  • [BREAKING CHANGE] auto is now the default value for the --dtype option in the quantization and checkpoint conversion scripts.
  • [BREAKING CHANGE] Deprecated gptManager API path in gptManagerBenchmark.
  • [BREAKING CHANGE] Deprecated the beam_width and num_return_sequences arguments to the SamplingParams class in the LLM API. Use the n, best_of, and use_beam_search arguments instead (see the migration sketch after this list).
  • Exposed --trust_remote_code argument to the OpenAI API server. (openai_server error #2357)
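For the deprecated SamplingParams arguments, a migration sketch follows. The mapping is illustrative (two returned sequences out of four beams); verify the exact semantics against the LLM API documentation.

```python
from tensorrt_llm import SamplingParams

# Before (deprecated in 0.15):
#   SamplingParams(beam_width=4, num_return_sequences=2, max_tokens=64)

# After: express the same request via n / best_of and enable beam search.
params = SamplingParams(
    max_tokens=64,
    n=2,                  # number of sequences to return
    best_of=4,            # number of beams to search
    use_beam_search=True,
)
```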

Model Updates

  • Added support for the Llama 3.2 and Llama 3.2-Vision models. Refer to examples/mllama/README.md for more details on the Llama 3.2-Vision model.
  • Added support for Deepseek-v2. Refer to examples/deepseek_v2/README.md.
  • Added support for Cohere Command R models. Refer to examples/commandr/README.md.
  • Added support for Falcon 2. Refer to examples/falcon/README.md. Thanks to the contribution from @puneeshkhanna in Add support for falcon2 #1926.
  • Added support for InternVL2. Refer to examples/multimodal/README.md.
  • Added support for the Qwen2-0.5B and Qwen2.5-1.5B models. (Qwen2-1.5B-Instruct convert_checkpoint.py failed #2388)
  • Added support for Minitron. Refer to examples/nemotron.
  • Added a GPT variant: Granite (20B and 34B). Refer to the “GPT Variant - Granite” section in examples/gpt/README.md.
  • Added support for LLaVA-OneVision model. Refer to “LLaVA, LLaVa-NeXT, LLaVA-OneVision and VILA” section in examples/multimodal/README.md.
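Many of the newly supported text-only models can be tried directly through the LLM API. A minimal sketch, assuming the LLM API accepts a locally downloaded Hugging Face checkpoint directory for these architectures; the path is a placeholder, and the model-specific READMEs above remain the authoritative workflow (especially for the multimodal variants, which need the dedicated examples).

```python
# Placeholder path: a locally downloaded checkpoint of one of the newly
# supported models, e.g. Qwen2-0.5B; see the matching example README.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="/path/to/Qwen2-0.5B")
print(llm.generate(["Hello"], SamplingParams(max_tokens=16)))
```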

Fixed Issues

Infrastructure Changes

  • The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.10-py3.
  • The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:24.10-py3.
  • The dependent TensorRT version is updated to 10.6.
  • The dependent CUDA version is updated to 12.6.2.
  • The dependent PyTorch version is updated to 2.5.1.
  • The dependent ModelOpt version is updated to 0.19 for the Linux platform, while 0.17 is still used on the Windows platform.
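To confirm that a local installation picked up the updated dependencies, a quick check is sketched below (CUDA 12.6.2 comes from the base image; the expected values are those listed above):

```python
# Print the dependency versions this release pins (expected: TensorRT 10.6.x,
# PyTorch 2.5.1, CUDA 12.6.2 as reported by torch).
import tensorrt
import torch
import tensorrt_llm

print("TensorRT:", tensorrt.__version__)
print("PyTorch:", torch.__version__)
print("TensorRT-LLM:", tensorrt_llm.__version__)
print("CUDA (torch):", torch.version.cuda)
```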

Documentation

@Shixiaowei02 Shixiaowei02 requested a review from kaiyux December 4, 2024 05:33
@kaiyux kaiyux changed the title from “Update TensorRT-LLM v0.15.0” to “TensorRT-LLM Release 0.15.0” on Dec 4, 2024
@Shixiaowei02 Shixiaowei02 merged commit 8f91cff into rel Dec 4, 2024
@Shixiaowei02 Shixiaowei02 deleted the preview/rel branch December 4, 2024 05:45