Update TensorRT-LLM Release branch #1445

Merged 3 commits into rel from kaiyu/update-rel on Apr 12, 2024
Conversation

@kaiyux (Member) commented Apr 12, 2024

  • Model Support
  • Features
    • Add support for context chunking to work with KV cache reuse
    • Enable different rewind tokens per sequence for Medusa
    • BART LoRA support (limited to the Python runtime)
    • Enable multi-LoRA for BART LoRA
    • Support early_stopping=False in beam search for C++ Runtime
    • Add a logits post processor to the batch manager (see docs/source/batch_manager.md#logits-post-processor-optional); a conceptual sketch follows the release notes
    • Support importing and converting HuggingFace Gemma checkpoints, thanks to the contribution from @mfuntowicz in #1147 ("Make Gemma importable from transformers Gemma implementation")
    • Support loading Gemma from HuggingFace
    • Support the auto parallelism planner for the high-level API and unified builder workflow
    • Support running GptSession without OpenMPI (#1220, "Run GptSession without openmpi?")
    • [BREAKING CHANGE] TopP sampling optimization with the deterministic AIR TopP algorithm is enabled by default
    • Medusa IFB support
    • [Experimental] Support FP8 FMHA; note that the performance is not yet optimal, and we will keep optimizing it
    • [BREAKING CHANGE] Support embedding sharing for Gemma
    • More head sizes support for LLaMA-like models
      • Ampere (sm80, sm86), Ada (sm89), and Hopper (sm90) now all support head sizes [32, 40, 64, 80, 96, 104, 128, 160, 256]
    • OOTB functionality support
      • T5
      • Mixtral 8x7B
  • API
    • C++ executor API
      • Add Python bindings; see documentation and examples in examples/bindings, and a minimal usage sketch after the release notes
      • Add advanced and multi-GPU examples for the Python bindings of the executor C++ API; see examples/bindings/README.md
      • Add documentation for the C++ executor API; see docs/source/executor.md
    • High-level API (refer to examples/high-level-api/README.md for guidance; an illustrative sketch also follows the release notes)
      • [BREAKING CHANGE] Reuse the QuantConfig used in the trtllm-build tool, supporting broader quantization features
      • Support engines built by the trtllm-build command in the LLM() API
      • Add support for the TensorRT-LLM checkpoint as model input
      • Refine the SamplingConfig used in the LLM.generate and LLM.generate_async APIs, with support for beam search, a variety of penalties, and more features
      • Add support for the StreamingLLM feature; enable it with LLM(streaming_llm=...)
      • Migrate Mixtral to the high-level API and unified builder workflow
    • [BREAKING CHANGE] Refactored Qwen model to the unified build workflow, see examples/qwen/README.md for the latest commands
    • [BREAKING CHANGE] Move LLaMA convert checkpoint script from examples directory into the core library
    • [BREAKING CHANGE] Refactor GPT with unified building workflow, see examples/gpt/README.md for the latest commands
    • [BREAKING CHANGE] Moved all the LoRA-related flags from the convert_checkpoint.py script, and the checkpoint content, to the trtllm-build command, to better generalize the feature to more models
    • [BREAKING CHANGE] Removed the use_prompt_tuning flag and options from the convert_checkpoint.py script and the checkpoint content, to better generalize the feature to more models; use trtllm-build --max_prompt_embedding_table_size instead
    • [BREAKING CHANGE] Changed the trtllm-build --world_size flag to --auto_parallel; the option is used for the auto parallel planner only
    • [BREAKING CHANGE] AsyncLLMEngine is removed; the tensorrt_llm.GenerationExecutor class is refactored to work both when launched explicitly with mpirun at the application level and when given an MPI communicator created by mpi4py
    • [BREAKING CHANGE] examples/server is removed; see examples/app instead
    • [BREAKING CHANGE] Remove LoRA related parameters from convert checkpoint scripts
    • [BREAKING CHANGE] Simplify Qwen convert checkpoint script
    • [BREAKING CHANGE] Remove model parameter from gptManagerBenchmark and gptSessionBenchmark
  • Bug fixes
  • Benchmark
    • Add emulated static batching in gptManagerBenchmark
    • Support arbitrary dataset from HuggingFace for C++ benchmarks, see “Prepare dataset” section in benchmarks/cpp/README.md
    • Add percentile latency report to gptManagerBenchmark
  • Performance
    • Optimize gptDecoderBatch to support batched sampling
    • Enable FMHA for models in the BART, Whisper, and NMT families
    • Remove the router tensor parallelism to improve performance for MoE models, thanks to the contribution from @megha95 in #1091 ("moe router tp removed")
    • Improve custom all-reduce kernel
  • Infra
    • Base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.02-py3
    • Base Docker image for TensorRT-LLM backend is updated to nvcr.io/nvidia/tritonserver:24.02-py3
    • The dependent TensorRT version is updated to 9.3
    • The dependent PyTorch version is updated to 2.2
    • The dependent CUDA version is updated to 12.3.2 (a.k.a. 12.3 Update 2)
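
To make a few of the items above concrete, here are short, hedged sketches. First, the logits post processor added to the batch manager: the actual interface is C++ (see docs/source/batch_manager.md#logits-post-processor-optional), so the Python below is only a conceptual illustration of what such a callback does; every name in it is invented for the example.

```python
# Conceptual illustration of a logits post processor: a callback that
# edits next-token logits before sampling. The real interface lives in
# the C++ batch manager; all names here are hypothetical.
import math

def make_ban_token_processor(banned_token_id: int):
    """Build a callback that prevents one token id from being sampled."""
    def post_process(request_id: int, logits: list[float]) -> list[float]:
        # A logit of -inf becomes zero probability after the softmax,
        # so the sampler can never pick the banned token.
        out = list(logits)
        out[banned_token_id] = -math.inf
        return out
    return post_process

# Hypothetical usage: suppress token id 2 on every decoding step.
processor = make_ban_token_processor(2)
print(processor(request_id=0, logits=[0.1, 0.5, 3.2, -0.2]))
# -> [0.1, 0.5, -inf, -0.2]
```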
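Next, a minimal sketch of submitting a request through the new Python bindings for the C++ executor API. The class and argument names follow the shipped examples, but treat them as assumptions and defer to examples/bindings/README.md for the authoritative usage.

```python
# Minimal sketch of the C++ executor API via its Python bindings.
# Names follow examples/bindings; exact signatures may differ by version.
import tensorrt_llm.bindings.executor as trtllm

# Create an executor from an engine directory produced by trtllm-build.
executor = trtllm.Executor(
    "/path/to/engine_dir",            # assumed engine location
    trtllm.ModelType.DECODER_ONLY,
    trtllm.ExecutorConfig(max_beam_width=1),
)

# Enqueue a pre-tokenized request and wait for its responses.
request = trtllm.Request(input_token_ids=[1, 2, 3, 4], max_new_tokens=16)
request_id = executor.enqueue_request(request)

for response in executor.await_responses(request_id):
    if not response.has_error():
        print(response.result.output_token_ids)
```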
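Finally, an illustrative sketch of the refreshed high-level API. The import path, ModelConfig usage, and output fields are assumptions based on this release's notes; examples/high-level-api/README.md is the source of truth for the current names.

```python
# Illustrative sketch only; import path and argument names are assumed.
from tensorrt_llm.hlapi import LLM, ModelConfig  # assumed import path

# The LLM() API can now accept either an engine built by `trtllm-build`
# or a TensorRT-LLM checkpoint as the model input.
config = ModelConfig(model_dir="/path/to/engine_or_checkpoint")
llm = LLM(config)

# The refined SamplingConfig (not shown) controls beam search,
# penalties, and other sampling features for generate()/generate_async().
for output in llm.generate(["Hello, my name is"]):
    print(output.text)
```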

@MartinMarciniszyn MartinMarciniszyn self-requested a review April 12, 2024 08:52
@tp5uiuc tp5uiuc self-requested a review April 12, 2024 09:46
@kaiyux kaiyux merged commit 250d9c2 into rel Apr 12, 2024
@kaiyux kaiyux deleted the kaiyu/update-rel branch April 12, 2024 09:59