cc: @liangfu
Hello,
There is a concept of bucketing in Transformers NeuronX that we can leverage in vLLM to improve both first-token and subsequent-token latency.
Bucketing works by building the HLOs for a list of sequence lengths, for both context encoding and token generation, instead of a single HLO for the maximum sequence length. This lets us dynamically select the appropriate HLO for context encoding and token generation at runtime.
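For illustration, here is a minimal sketch of how bucketed compilation can be configured directly with Transformers NeuronX (outside of vLLM). The parameter names (`context_length_estimate` for context-encoding buckets, `n_positions` for the maximum generation length) reflect my understanding of the transformers-neuronx API; the model path, `tp_degree`, and bucket values are placeholders.

```python
# Illustrative sketch (not vLLM code): bucketed compilation with transformers-neuronx.
# Verify parameter names against the transformers-neuronx version you have installed.
from transformers_neuronx.llama.model import LlamaForSampling

model = LlamaForSampling.from_pretrained(
    "/path/to/llama-70b",          # placeholder model path
    batch_size=1,
    tp_degree=32,
    amp="f16",
    # Context-encoding buckets: one HLO is compiled per entry instead of a
    # single HLO for the maximum sequence length.
    context_length_estimate=[512, 1024, 2048, 3072, 4096],
    # Maximum total sequence length for token generation.
    n_positions=4096,
)
model.to_neuron()  # compiles the HLOs for the configured buckets
```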
Example:
If we have context-length buckets of [512, 1024, 2048, 3072, 4096] instead of [4096] for the LLaMA 70B model, and the input sequence is ~500 tokens, we can select the HLO compiled for 512. This gives a significant latency boost: from offline experimentation, we see roughly a 20% overall latency improvement on Neuron devices.
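The runtime selection step itself is simple; here is a minimal sketch (the function name and fallback policy are my own, not existing vLLM or NeuronX code):

```python
# Minimal sketch of the runtime selection step: pick the smallest compiled
# bucket that can hold the input, falling back to the largest bucket.
def select_bucket(buckets: list[int], input_len: int) -> int:
    for bucket in sorted(buckets):
        if input_len <= bucket:
            return bucket
    return max(buckets)

# A ~500-token prompt maps to the 512 bucket rather than the 4096 HLO.
assert select_bucket([512, 1024, 2048, 3072, 4096], 500) == 512
```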
Suggestions:
Suggestion 1: Since these configs are not currently available, introducing a new set of bucketing-specific configs would decouple the code from the existing feature and avoid breaking existing workflows.
Suggestion 2: Update `max_model_len` to accept either an int or a list. If it is a list, we create a bucket for each item in the list.
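A hypothetical sketch of what Suggestion 2 could look like; `resolve_buckets` and its placement are illustrative, not an existing vLLM config field or function:

```python
# Hypothetical sketch of Suggestion 2: let `max_model_len` (or a new,
# bucketing-specific config field) accept either an int or a list of ints.
from typing import List, Union

def resolve_buckets(max_model_len: Union[int, List[int]]) -> List[int]:
    """Return the sorted list of sequence-length buckets to compile."""
    if isinstance(max_model_len, int):
        return [max_model_len]           # current behaviour: a single bucket
    return sorted(set(max_model_len))    # bucketed behaviour: one HLO per entry

# resolve_buckets(4096)                    -> [4096]
# resolve_buckets([512, 1024, 2048, 4096]) -> [512, 1024, 2048, 4096]
```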
NOTE:
Bucketing is not limited to context encoding and token generation; it can also be leveraged for batch sizes and speculation (speculative decoding).
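For example, batch-size bucketing can be combined with sequence-length bucketing by compiling one HLO per (batch size, sequence length) pair; the grid and selection policy below are purely illustrative assumptions:

```python
# Illustrative extension of the same idea to batch size: compile one HLO per
# (batch_size, sequence_length) pair and select the smallest pair that fits.
from itertools import product

BATCH_BUCKETS = [1, 4, 8]
SEQ_BUCKETS = [512, 1024, 2048, 4096]

def select_2d_bucket(batch: int, seq_len: int) -> tuple[int, int]:
    candidates = [
        (b, s)
        for b, s in product(BATCH_BUCKETS, SEQ_BUCKETS)
        if batch <= b and seq_len <= s
    ]
    # Prefer the tightest fit; fall back to the largest compiled pair.
    return min(candidates, default=(max(BATCH_BUCKETS), max(SEQ_BUCKETS)))

# select_2d_bucket(3, 900) -> (4, 1024)
```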