[BUG fix] Rebase caused spec decode fix #613
base: habana_main
Conversation
@michalkuligowski, please help to review.
@kzawora-intel, please check the fix here:
vllm/worker/hpu_model_runner.py
Outdated
@@ -741,6 +749,8 @@ def load_model(self) -> None:
get_decoder_layer_suffix(model_config.model_type if
                         model_config is not None else None),
hidden_layer_markstep_interval)
recompute_cos_sin = os.getenv('VLLM_COS_SIN_RECOMPUTE',
                              'false').lower() == 'true'
I'd rather have "in ['1', 'true']" instead of "== 'true'"
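As a quick illustration of the difference (a hypothetical standalone snippet, not part of the PR): with the strict comparison, setting the variable to 1 would silently be treated as disabled, while the membership check accepts it.

```python
import os

# Hypothetical check, not part of the PR: '1' fails the strict comparison
# but passes the membership test suggested above.
os.environ['VLLM_COS_SIN_RECOMPUTE'] = '1'
value = os.environ['VLLM_COS_SIN_RECOMPUTE'].lower()
print(value == 'true')           # False
print(value in ['1', 'true'])    # True
```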
On another note, do you need to get the value of this env var here? Can it be done in init of RotaryEmbedding instead? I don't think it's necessary to pass this value to rope.prepare_cos_sin from here
The reason I want to pass the value from hpu_model_runner is that it is easier to notice there. And I think RotaryEmbedding.__init__() is general for all HW, so I don't want to make any change there.
offsets: Optional[torch.Tensor] = None):
offsets: Optional[torch.Tensor] = None,
recompute_cos_sin: bool = False):
self.recompute_cos_sin = recompute_cos_sin
I think if you set
self.recompute_cos_sin = os.getenv('VLLM_COS_SIN_RECOMPUTE', 'false').lower() in ['1', 'true']
in the __init__ method, you don't have to pass the recompute_cos_sin parameter here (see my other comment on hpu_model_runner.py:753)
I think RotaryEmbedding.__init__() is general for all HW, so I am thinking of only passing this new argument in prepare_cos_sin(), which is added by us?
It's a valid point, it might be easier to upstream if we don't touch the constructor
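A rough sketch of that direction, assuming the flag is threaded only through the HPU-added prepare_cos_sin() while the constructor stays untouched (the method body is abbreviated and assumed here, not the actual implementation):

```python
from typing import Optional

import torch


class RotaryEmbedding:
    # Sketch only: the real class keeps its existing constructor and
    # rotary-embedding logic; nothing upstream-visible changes here.
    def prepare_cos_sin(self,
                        positions: torch.Tensor,
                        offsets: Optional[torch.Tensor] = None,
                        recompute_cos_sin: bool = False):
        # The flag is stored only when this HPU-added helper is called,
        # so the upstream __init__ signature is left alone.
        self.recompute_cos_sin = recompute_cos_sin
        # ... compute and cache cos/sin for the given positions ...
```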
vllm/worker/hpu_model_runner.py
Outdated
@@ -741,6 +749,8 @@ def load_model(self) -> None:
get_decoder_layer_suffix(model_config.model_type if
                         model_config is not None else None),
hidden_layer_markstep_interval)
recompute_cos_sin = os.getenv('VLLM_COS_SIN_RECOMPUTE',
You explained why you don't want to touch the init method of RotaryEmbedding, but maybe we can at least move this getter to the init of HpuModelAdapter?
Makes sense, I've moved it into the HpuModelAdapter __init__ function.
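A simplified sketch of what that move could look like; the constructor signature below is assumed for illustration (the real HpuModelAdapter takes more arguments), and only the env-var read is the point:

```python
import os

import torch


class HpuModelAdapter:
    # Sketch only: read the env var once when the adapter is built,
    # instead of in load_model(), and forward it later to
    # rope.prepare_cos_sin().
    def __init__(self, model: torch.nn.Module):
        self.model = model
        self.recompute_cos_sin = os.getenv('VLLM_COS_SIN_RECOMPUTE',
                                           'false').lower() in ['1', 'true']
```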
upstream PR10555 Signed-off-by: Chendi.Xue <[email protected]>
Signed-off-by: Chendi.Xue <[email protected]>
For spec decode eagle mode, need to set VLLM_COS_SIN_RECOMPUTE=true Signed-off-by: Chendi.Xue <[email protected]>
Force-pushed from ef256b5 to 54ba9f1
@michalkuligowski, since I'll be on a long leave starting next week, I'd like to check with you whether we can get this fix merged?
test_spec.sh
qa.sh
Error reported in https://jira.habana-labs.com/browse/SW-212516
Found that two recently merged PRs break Spec Decode functionality:
is no longer needed, since we now use the code below to init model_runner_cls, following the upstream design.
The input tensors now differ from the previous assumption that decode_fwd provides only one token per seq; Spec Decode provides multiple candidate tokens as q.
To fix that, added a new env var, VLLM_COS_SIN_RECOMPUTE=true, which must be set to trigger recomputation of cos and sin for spec decode.
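For example, a minimal illustration of enabling it from Python before the engine is constructed (exporting VLLM_COS_SIN_RECOMPUTE=true in the shell works the same way):

```python
import os

# Spec decode (e.g. eagle mode) feeds multiple candidate tokens per
# sequence into decode_fwd, so the cos/sin values cached under the
# one-token-per-seq assumption must be recomputed.
os.environ['VLLM_COS_SIN_RECOMPUTE'] = 'true'
```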