
Update TensorRT-LLM #1122

Merged 2 commits into main from kaiyu/update on Feb 21, 2024

Conversation

@kaiyux (Member) commented Feb 21, 2024

  • Features
    • Enable different rewind tokens per sequence for Medusa
    • OOTB functionality support
      • T5
      • Mixtral 8x7B
    • Experimental: Weightless engine support (see examples/weightless_engine/README.md)
  • API
    • Add high-level C++ API for inflight batching
    • Migrate Mixtral to high level API and unified builder workflow
  • Bug fixes
  • Benchmark/Performance
    • Optimize gptDecoderBatch to support batched sampling
    • Enable FMHA for models in the BART, Whisper, and NMT families
    • Add emulated static batching in gptManagerBenchmark
  • Documentation
    • Blog: Speed up inference with SOTA quantization techniques in TRT-LLM (see docs/source/blogs/quantization-in-TRT-LLM.md)
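The "emulated static batching" item above contrasts with the inflight batching that the rest of the PR targets. As a rough illustration of why that distinction matters for benchmarking (this is a hypothetical sketch of the two scheduling policies, not code from gptManagerBenchmark or the TensorRT-LLM API):

```python
import math

def static_batches(request_lengths, batch_size):
    """Static batching: requests are grouped into fixed-size batches, and the
    next batch starts only after every sequence in the current batch finishes.
    Each batch therefore costs as many decoding steps as its longest sequence."""
    batches = [request_lengths[i:i + batch_size]
               for i in range(0, len(request_lengths), batch_size)]
    return sum(max(batch) for batch in batches)

def inflight_steps(request_lengths, batch_size):
    """Idealized inflight (continuous) batching: a finished slot is refilled
    immediately, so total steps approach ceil(total tokens / batch size)."""
    return math.ceil(sum(request_lengths) / batch_size)

# Hypothetical per-request output lengths, in tokens.
lengths = [10, 100, 12, 95, 8, 90]
print(static_batches(lengths, 2))  # 285 steps: short requests wait on long ones
print(inflight_steps(lengths, 2))  # 158 steps: idealized lower bound
```

The gap between the two numbers is the slot time a static scheduler wastes waiting for the longest sequence in each batch, which is the overhead an emulated-static-batching mode lets a benchmark measure against inflight batching on the same requests.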

@kaiyux kaiyux merged commit eb8f26c into main Feb 21, 2024
@kaiyux kaiyux deleted the kaiyu/update branch February 21, 2024 13:31