
[BUG fix] Rebase caused spec decode fix #613

Open · xuechendi wants to merge 3 commits into habana_main from rebase_caused_spec_decode_fix
Conversation

@xuechendi xuechendi commented Dec 11, 2024

Error reported in https://jira.habana-labs.com/browse/SW-212516

Two recently merged PRs broke Spec Decode functionality:

  1. Support mllama (llama 3.2) model for HPU #491 overrides the existing WorkerWrapperBase design for speculative decoding. The override

         if model_runner_cls is not None:
             ModelRunnerClass = model_runner_cls

     is no longer needed, since we now initialize model_runner_cls as below to follow the upstream design:

         if model_runner_cls is not None:
             self.model_runner = model_runner_cls(self.model_runner)

  2. Prepare sin/cos buffers for rope outside model forward #566 does not work in Spec Decode Eagle mode.
     The input tensors no longer match the assumption that decode_fwd provides only one token per sequence; Spec Decode passes multiple candidate tokens as q.
     To fix this, a new env var, VLLM_COS_SIN_RECOMPUTE=true, is added; it must be set to trigger recomputation of cos and sin for spec decode (see the sketch after this list).
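
For illustration, here is a minimal, self-contained sketch of how the recompute flag is meant to behave; the helper function, caching scheme, and tensor shapes below are assumptions made for the sketch, not the actual patch.

# Sketch: recompute cos/sin when spec decode passes multiple candidate
# tokens per sequence, gated by the new VLLM_COS_SIN_RECOMPUTE env var.
import os

import torch

RECOMPUTE_COS_SIN = os.getenv('VLLM_COS_SIN_RECOMPUTE',
                              'false').lower() in ['1', 'true']

_cached_cos_sin = None  # buffers prepared once outside model forward (per #566)


def get_cos_sin(positions: torch.Tensor,
                inv_freq: torch.Tensor,
                recompute: bool = RECOMPUTE_COS_SIN):
    """Return (cos, sin) for the given positions.

    With #566 the buffers are prepared assuming one decode token per
    sequence. Spec decode provides several candidate tokens as q, so the
    cached buffers no longer match and must be recomputed when the flag
    is set.
    """
    global _cached_cos_sin
    if _cached_cos_sin is None or recompute:
        freqs = torch.einsum('i,j->ij', positions.flatten().float(), inv_freq)
        _cached_cos_sin = (freqs.cos(), freqs.sin())
    return _cached_cos_sin


if __name__ == '__main__':
    inv_freq = 1.0 / (10000**(torch.arange(0, 64, 2).float() / 64))
    positions = torch.tensor([[5, 6, 7]])  # 3 candidate tokens in one sequence
    cos, sin = get_cos_sin(positions, inv_freq)
    print(cos.shape, sin.shape)  # torch.Size([3, 32]) torch.Size([3, 32])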

@xuechendi
Author

@michalkuligowski, please help to review.

@xuechendi
Author

@kzawora-intel, please check the fix here:
the previous mllama PR breaks spec decode, so I added a fix in
de79b5c

@@ -741,6 +749,8 @@ def load_model(self) -> None:
get_decoder_layer_suffix(model_config.model_type if
model_config is not None else None),
hidden_layer_markstep_interval)
recompute_cos_sin = os.getenv('VLLM_COS_SIN_RECOMPUTE',
'false').lower() == 'true'


I'd rather have "in ['1', 'true']" instead of "== 'true'"


On another note, do you need to get the value of this env var here? Can it be done in __init__ of RotaryEmbedding instead? I don't think it's necessary to pass this value to rope.prepare_cos_sin from here.

Author


The reason I want to pass the value from hpu_model_runner is that it is easier to notice there. Also, I think RotaryEmbedding.__init__() is generic for all HW, so I don't want to make any change there.

-                          offsets: Optional[torch.Tensor] = None):
+                          offsets: Optional[torch.Tensor] = None,
+                          recompute_cos_sin: bool = False):
+        self.recompute_cos_sin = recompute_cos_sin


I think if you set
self.recompute_cos_sin = os.getenv('VLLM_COS_SIN_RECOMPUTE', 'false').lower() in ['1', 'true']
in the __init__ method, you don't have to pass the recompute_cos_sin parameter here (see my other comment on hpu_model_runner.py:753).

Author


I think RotaryEmbedding.__init__() is generic for all HW, so I'm thinking of only passing this new argument in prepare_cos_sin(), which is added by us?


It's a valid point; it might be easier to upstream if we don't touch the constructor.


@@ -741,6 +749,8 @@ def load_model(self) -> None:
get_decoder_layer_suffix(model_config.model_type if
model_config is not None else None),
hidden_layer_markstep_interval)
recompute_cos_sin = os.getenv('VLLM_COS_SIN_RECOMPUTE',
                              'false').lower() == 'true'


You explained why you don't want to touch the __init__ method of RotaryEmbedding, but maybe we can at least move this getter to the __init__ of HpuModelAdapter?

Author


Makes sense, I've moved it into the HpuModelAdapter __init__ function.
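
A rough, illustrative sketch of what moving the getter into the adapter's constructor could look like (the class layout and the 'rotary_emb' attribute are assumptions, not the actual commit):

import os


class HpuModelAdapter:

    def __init__(self, model):
        self.model = model
        # Read the flag once in the adapter's constructor, instead of in
        # load_model() of hpu_model_runner.
        self.recompute_cos_sin = os.getenv('VLLM_COS_SIN_RECOMPUTE',
                                           'false').lower() in ['1', 'true']

    def _prepare_cos_sin(self, positions):
        # Forward the flag to the HPU-specific rope helper added by this fork,
        # so the RotaryEmbedding constructor itself stays untouched.
        rope = getattr(self.model, 'rotary_emb', None)
        if rope is not None:
            rope.prepare_cos_sin(positions,
                                 recompute_cos_sin=self.recompute_cos_sin)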

For spec decode Eagle mode, need to set VLLM_COS_SIN_RECOMPUTE=true

Signed-off-by: Chendi.Xue <[email protected]>
@xuechendi xuechendi force-pushed the rebase_caused_spec_decode_fix branch from ef256b5 to 54ba9f1 on December 19, 2024 21:12
@xuechendi xuechendi requested a review from vivekgoe as a code owner December 19, 2024 21:12
@xuechendi
Author

@michalkuligowski, since I'll be taking a long leave starting next week, I'd like to check with you whether we can get this fix merged.
I rebased this PR today and tested it locally with the two scripts below; both passed.

test_spec.sh

VLLM_CONTIGUOUS_PA=false VLLM_SKIP_WARMUP=True pytest -v tests/spec_decode/e2e/test_mlp_correctness.py::test_mlp_e2e_greedy_correctness
VLLM_CONTIGUOUS_PA=false VLLM_SKIP_WARMUP=True pytest -v tests/spec_decode/e2e/test_medusa_correctness.py::test_medusa_e2e_greedy_correctness
VLLM_COS_SIN_RECOMPUTE=true VLLM_CONTIGUOUS_PA=false VLLM_SKIP_WARMUP=True pytest -v tests/spec_decode/e2e/test_eagle_correctness.py::test_eagle_e2e_greedy_correctness

qa.sh

VLLM_SKIP_WARMUP=true \
VLLM_CONTIGUOUS_PA=false \
python3 benchmarks/benchmark_throughput.py --model=meta-llama/Llama-2-13b-chat-hf --device=hpu --seed=2024 --backend=vllm --input_len=1024 --num-prompts=128 --output_len=128 --dtype=bfloat16 --num_scheduler_steps=1 --gpu-memory-util=0.9 --tensor_parallel_size=1 --max-model-len=4096 --speculative_model=ibm-fms/llama-13b-accelerator --use-v2-block-manager
