cc: @liangfu
Hello,
There is a concept of bucketing in Transformers NeuronX that we can leverage in vLLM to improve both first-token and subsequent-token latency.
Bucketing works by building the HLOs for a list of sequence lengths, for both context encoding and token generation, instead of a single HLO for the maximum sequence length. This lets us dynamically select the appropriate HLO for context encoding and token generation at runtime.
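For illustration, here is a minimal sketch of how bucketed compilation can be configured directly with Transformers NeuronX (outside of vLLM). The parameter names (`context_length_estimate` for context-encoding buckets, `n_positions` for the maximum generation length) reflect my understanding of the transformers-neuronx API; the model path, `tp_degree`, and bucket values are placeholders.

```python
# Illustrative sketch (not vLLM code): bucketed compilation with transformers-neuronx.
# Verify parameter names against the transformers-neuronx version you have installed.
from transformers_neuronx.llama.model import LlamaForSampling

model = LlamaForSampling.from_pretrained(
    "/path/to/llama-70b",          # placeholder model path
    batch_size=1,
    tp_degree=32,
    amp="f16",
    # Context-encoding buckets: one HLO is compiled per entry instead of a
    # single HLO for the maximum sequence length.
    context_length_estimate=[512, 1024, 2048, 3072, 4096],
    # Maximum total sequence length for token generation.
    n_positions=4096,
)
model.to_neuron()  # compiles the HLOs for the configured buckets
```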
Example:
If we have context-length buckets of [512, 1024, 2048, 3072, 4096] instead of [4096] for the LLaMA 70B model, and the input sequence is ~500 tokens, we can select the HLO compiled for 512. This gives a significant latency boost: from offline experimentation, we see roughly a 20% overall latency improvement on Neuron devices.
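The runtime selection step itself is simple; here is a minimal sketch (the function name and fallback policy are my own, not existing vLLM or NeuronX code):

```python
# Minimal sketch of the runtime selection step: pick the smallest compiled
# bucket that can hold the input, falling back to the largest bucket.
def select_bucket(buckets: list[int], input_len: int) -> int:
    for bucket in sorted(buckets):
        if input_len <= bucket:
            return bucket
    return max(buckets)

# A ~500-token prompt maps to the 512 bucket rather than the 4096 HLO.
assert select_bucket([512, 1024, 2048, 3072, 4096], 500) == 512
```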
Suggestions:
Suggestion 1: Since these configs are not currently available, introducing a new set of bucketing-specific configs would decouple the code from the existing feature and avoid breaking existing workflows.
Suggestion 2: Update `max_model_len` to accept either an int or a list. If it is a list, we create a bucket for each item in the list.
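A hypothetical sketch of what Suggestion 2 could look like; `resolve_buckets` and its placement are illustrative, not an existing vLLM config field or function:

```python
# Hypothetical sketch of Suggestion 2: let `max_model_len` (or a new,
# bucketing-specific config field) accept either an int or a list of ints.
from typing import List, Union

def resolve_buckets(max_model_len: Union[int, List[int]]) -> List[int]:
    """Return the sorted list of sequence-length buckets to compile."""
    if isinstance(max_model_len, int):
        return [max_model_len]           # current behaviour: a single bucket
    return sorted(set(max_model_len))    # bucketed behaviour: one HLO per entry

# resolve_buckets(4096)                    -> [4096]
# resolve_buckets([512, 1024, 2048, 4096]) -> [512, 1024, 2048, 4096]
```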
NOTE:
Bucketing is not limited to context encoding and token generation; it can also be leveraged for batch sizes and speculation (speculative decoding).
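For example, batch-size bucketing can be combined with sequence-length bucketing by compiling one HLO per (batch size, sequence length) pair; the grid and selection policy below are purely illustrative assumptions:

```python
# Illustrative extension of the same idea to batch size: compile one HLO per
# (batch_size, sequence_length) pair and select the smallest pair that fits.
from itertools import product

BATCH_BUCKETS = [1, 4, 8]
SEQ_BUCKETS = [512, 1024, 2048, 4096]

def select_2d_bucket(batch: int, seq_len: int) -> tuple[int, int]:
    candidates = [
        (b, s)
        for b, s in product(BATCH_BUCKETS, SEQ_BUCKETS)
        if batch <= b and seq_len <= s
    ]
    # Prefer the tightest fit; fall back to the largest compiled pair.
    return min(candidates, default=(max(BATCH_BUCKETS), max(SEQ_BUCKETS)))

# select_2d_bucket(3, 900) -> (4, 1024)
```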