
Update TensorRT-LLM #1122

Merged 2 commits into main from kaiyu/update on Feb 21, 2024

Conversation

@kaiyux (Member) commented Feb 21, 2024

  • Features
    • Enable different rewind tokens per sequence for Medusa
    • OOTB functionality support
      • T5
      • Mixtral 8x7B
    • Experimental: Weightless engine support (see examples/weightless_engine/README.md)
  • API
    • Add high-level C++ API for inflight batching
    • Migrate Mixtral to high level API and unified builder workflow
  • Bug fixes
  • Benchmark/Performance
    • Optimize gptDecoderBatch to support batched sampling
    • Enable FMHA for models in the BART, Whisper, and NMT families
    • Add emulated static batching in gptManagerBenchmark
  • Documentation
    • Blog: Speed up inference with SOTA quantization techniques in TRT-LLM (see docs/source/blogs/quantization-in-TRT-LLM.md)
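The "emulated static batching" item above contrasts with the inflight batching that the rest of the PR targets. As a rough illustration of why that distinction matters for benchmarking (this is a hypothetical sketch of the two scheduling policies, not code from gptManagerBenchmark or the TensorRT-LLM API):

```python
import math

def static_batches(request_lengths, batch_size):
    """Static batching: requests are grouped into fixed-size batches, and the
    next batch starts only after every sequence in the current batch finishes.
    Each batch therefore costs as many decoding steps as its longest sequence."""
    batches = [request_lengths[i:i + batch_size]
               for i in range(0, len(request_lengths), batch_size)]
    return sum(max(batch) for batch in batches)

def inflight_steps(request_lengths, batch_size):
    """Idealized inflight (continuous) batching: a finished slot is refilled
    immediately, so total steps approach ceil(total tokens / batch size)."""
    return math.ceil(sum(request_lengths) / batch_size)

# Hypothetical per-request output lengths, in tokens.
lengths = [10, 100, 12, 95, 8, 90]
print(static_batches(lengths, 2))  # 285 steps: short requests wait on long ones
print(inflight_steps(lengths, 2))  # 158 steps: idealized lower bound
```

The gap between the two numbers is the slot time a static scheduler wastes waiting for the longest sequence in each batch, which is the overhead an emulated-static-batching mode lets a benchmark measure against inflight batching on the same requests.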

@kaiyux kaiyux merged commit eb8f26c into main Feb 21, 2024
@kaiyux kaiyux deleted the kaiyu/update branch February 21, 2024 13:31