Update TensorRT-LLM Release branch #1445
Merged
- See `examples/multimodal` for the multimodal examples
- Support `early_stopping=False` in beam search for the C++ Runtime
- `transformers` Gemma implementation (#1147)
- Run `GptSession` without OpenMPI (#1220)
- `executor` API: see `examples/bindings`
  - For the `executor` C++ API, see `examples/bindings/README.md`
  - For the `executor` API, see `docs/source/executor.md`
- High-level API (see `examples/high-level-api/README.md` for guidance)
  - Reuse the `QuantConfig` used in the `trtllm-build` tool, to support broader quantization features
  - Support the `LLM()` API to accept engines built by the `trtllm-build` command
  - Add `SamplingConfig`, used in the `LLM.generate` or `LLM.generate_async` APIs, with support for beam search, a variety of penalties, and more features
  - Support `LLM(streaming_llm=...)`
- See `examples/qwen/README.md` for the latest commands
- See `examples/gpt/README.md` for the latest commands
- Moved to the `trtllm-build` command, to generalize the feature better to more models; use `trtllm-build --max_prompt_embedding_table_size` instead
- Change the `trtllm-build --world_size` flag to the `--auto_parallel` flag; the option is used for the auto parallel planner only
- `AsyncLLMEngine` is removed; the `tensorrt_llm.GenerationExecutor` class is refactored to work both when launched explicitly with `mpirun` at the application level and when given an MPI communicator created by `mpi4py`
- `examples/server` is removed; see `examples/app` instead
- Remove the `model` parameter from `gptManagerBenchmark` and `gptSessionBenchmark`
- Ensure `encoder_input_len_range` is not 0, thanks to the contribution from @Eddie-Wang1120 in #992
- Fix the `end_id` issue for Qwen (#987)
- Fix a wrong `head_size` when importing a Gemma model from the HuggingFace Hub, thanks to the contribution from @mfuntowicz in #1148
- Fix wrong `SamplingConfig` tensors in `ModelRunnerCpp` (#1183)
- Fix an issue where `examples/run.py` only loads one line from `--input_file`
- Fix an issue where `ModelRunnerCpp` does not transfer `SamplingConfig` tensor fields correctly (#1183)
- `gptManagerBenchmark`: see `benchmarks/cpp/README.md`
- Refactor `gptDecoderBatch` to support batched sampling
- Base Docker image updated to `nvcr.io/nvidia/pytorch:24.02-py3`
- Triton Inference Server base Docker image updated to `nvcr.io/nvidia/tritonserver:24.02-py3`
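The high-level API changes mentioned above (`LLM()`, `SamplingConfig`, `LLM.generate`) might look roughly like the following. This is a non-runnable, pseudocode-style sketch built only from the names in these notes: the import path, constructor arguments, and `SamplingConfig` field names are assumptions and may differ from the actual release; see `examples/high-level-api/README.md` for the authoritative usage.

```python
# Illustrative sketch only -- import location and argument names are assumed,
# not verified against this release.
from tensorrt_llm import LLM, SamplingConfig  # assumed import path

# Per the notes, LLM() accepts an engine built by the `trtllm-build` command.
llm = LLM("./my_engine_dir")  # hypothetical engine directory

# SamplingConfig drives beam search and penalties in generate/generate_async.
cfg = SamplingConfig(num_beams=4, repetition_penalty=1.1)  # field names assumed

for out in llm.generate(["Hello, my name is"], sampling_config=cfg):
    print(out)
```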
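As a hedged illustration of the `trtllm-build` flag changes above, the commands below are a command-line sketch: the flags `--max_prompt_embedding_table_size` and `--auto_parallel` come from these notes, while the checkpoint/output paths and the other flags are placeholders that may not match the actual CLI of this release.

```
# Prompt tuning: the embedding table size is now set at build time
# (paths and surrounding flags are placeholders).
trtllm-build --checkpoint_dir ./ckpt \
             --max_prompt_embedding_table_size 1024 \
             --output_dir ./engine

# Auto parallel: the former --world_size flag is replaced by --auto_parallel,
# which is used by the auto parallel planner only (value assumed).
trtllm-build --checkpoint_dir ./ckpt \
             --auto_parallel 4 \
             --output_dir ./engine
```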