super-ahn changed the title from "[ISSUE] The Pull Request at https://github.com/FasterDecoding/Medusa/pull/97 from Narsil/medusa2 should be rolled back." to "[ISSUE] The Pull Request at https://github.com/FasterDecoding/Medusa/pull/97 from Narsil/medusa2 needs to be rolled back." on Jul 11, 2024
Hello.
After fine-tuning the Medusa head, I discovered an issue affecting inference performance and would like to share my findings. When the Medusa head is trained correctly, serving the model with TGI should reduce inference latency compared to the base model.
I followed the guide at https://github.com/huggingface/text-generation-inference/blob/main/docs/source/basic_tutorials/train_medusa.md for training. I used the same dataset as the guide, specifically ShareGPT_V4.3_unfiltered_cleaned_split.json, for self-distillation and then trained the Medusa head using the resulting dataset.
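For reference, the self-distillation step I ran is conceptually equivalent to the sketch below; the file names, chat-template handling, and generation settings here are illustrative, not the exact ones from the guide.

```python
# Conceptual sketch of the self-distillation step: regenerate the assistant
# turns with the base model so the Medusa head is trained on the model's own
# output distribution. File names and generation settings are illustrative.
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

with open("ShareGPT_V4.3_unfiltered_cleaned_split.json") as f:
    conversations = json.load(f)

distilled = []
for conv in conversations:
    # Keep the human prompt and let the base model produce the answer.
    prompt = next(t["value"] for t in conv["conversations"] if t["from"] == "human")
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=512, do_sample=False)
    answer = tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True)
    distilled.append({"conversations": [
        {"from": "human", "value": prompt},
        {"from": "gpt", "value": answer},
    ]})

with open("self_distilled.json", "w") as f:
    json.dump(distilled, f)
```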
However, when I served the trained Medusa head with TGI (v2.0.4), I did not observe any reduction in inference latency compared to the original model. Examining the Mistral-7B-Instruct-v0.2-medusa model uploaded to Hugging Face at https://huggingface.co/text-generation-inference/, I noticed that both the size of my medusa_lm_head.safetensors file and the contents of my config.json differ from those of that reference model.
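To make the comparison concrete, this is roughly how I diffed the two checkpoints; the local paths are placeholders for my checkpoint and the reference one from the organization above.

```python
# Compare the tensors in two Medusa checkpoints and their configs.
# The paths below are placeholders for my local checkpoint and the
# reference checkpoint downloaded from Hugging Face.
import json

from safetensors import safe_open

def describe(path):
    # Map each tensor name to its shape without loading the full model.
    with safe_open(path, framework="pt") as f:
        return {name: tuple(f.get_tensor(name).shape) for name in f.keys()}

mine = describe("my-medusa/medusa_lm_head.safetensors")
reference = describe("reference-medusa/medusa_lm_head.safetensors")

# Tensor names present in one file but not the other point at an
# architecture mismatch (e.g. a missing per-head output projection).
print("only in mine:", sorted(mine.keys() - reference.keys()))
print("only in reference:", sorted(reference.keys() - mine.keys()))

for name in sorted(mine.keys() & reference.keys()):
    if mine[name] != reference[name]:
        print(f"shape mismatch for {name}: {mine[name]} vs {reference[name]}")

# config.json differences show up the same way.
with open("my-medusa/config.json") as f:
    my_cfg = json.load(f)
with open("reference-medusa/config.json") as f:
    ref_cfg = json.load(f)
for key in sorted(my_cfg.keys() | ref_cfg.keys()):
    if my_cfg.get(key) != ref_cfg.get(key):
        print(f"config differs on {key!r}: {my_cfg.get(key)} vs {ref_cfg.get(key)}")
```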
Reviewing the Medusa code, I found that the head architecture was changed to v2 in pull request #97 by @Narsil.
When I reverted the architecture to its original form with an additional linear layer and re-trained it on the same dataset, serving it through TGI resulted in the expected latency reduction.
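For clarity, the "original form with an additional linear layer" that I restored corresponds, as I understand it, to something like the minimal PyTorch sketch below (illustrative only, not the actual Medusa source); the additional linear layer is the per-head projection onto the vocabulary.

```python
# Minimal sketch of the original (v1-style) Medusa head as I understand it:
# each head is a residual block followed by its own linear projection to the
# vocabulary. Illustrative only, not the actual Medusa source code.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the head close to the identity at init.
        return x + self.act(self.linear(x))

class MedusaHeadV1(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int, num_layers: int = 1):
        super().__init__()
        self.blocks = nn.Sequential(*[ResBlock(hidden_size) for _ in range(num_layers)])
        # The "additional linear layer": a dedicated lm_head per Medusa head.
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.lm_head(self.blocks(hidden_states))
```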
I cannot say exactly why latency did not improve after the switch to the v2 head architecture, since TGI does not expose any metrics on speculative decoding efficiency, but in my tests the current Medusa head v2 architecture clearly hurts inference performance.
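Since TGI exposes no speculative-decoding metrics, I measured end-to-end latency directly against the /generate endpoint, roughly as in the sketch below; the endpoint URL, prompt, and token budget are placeholders.

```python
# Rough end-to-end latency check against a running TGI instance.
# The endpoint URL, prompt, and token budget are placeholders.
import time

import requests

TGI_URL = "http://localhost:8080/generate"
payload = {
    "inputs": "Explain speculative decoding in two sentences.",
    "parameters": {"max_new_tokens": 256, "do_sample": False},
}

latencies = []
for _ in range(10):
    start = time.perf_counter()
    response = requests.post(TGI_URL, json=payload, timeout=120)
    response.raise_for_status()
    latencies.append(time.perf_counter() - start)

print(f"median latency: {sorted(latencies)[len(latencies) // 2]:.2f}s")
```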
I believe a rollback of this change on the main branch is necessary.