[Performance]: Results from the vLLM Blog article "How Speculative Decoding Boosts vLLM Performance by up to 2.8x" are unreproducible #10318

yeonjoon-jung01 opened this issue Nov 14, 2024 · 1 comment
Labels
performance Performance-related issues


yeonjoon-jung01 commented Nov 14, 2024

Proposal to improve performance

No response

Report of performance regression

We attempted to reproduce the results from the vLLM Blog article "How Speculative Decoding Boosts vLLM Performance by up to 2.8x", but we were unable to reach the performance levels reported there. The experimental setup we used is listed below (a configuration sketch follows the list):

vLLM version: 0.6.3
Device: 4 x H100 PCIe
Target Model: meta-llama/Meta-Llama-3-70B-Instruct with TP=4
Draft Model: turboderp/Qwama-0.5B-Instruct with TP=1
Dataset: anon8231489123/ShareGPT_Vicuna_unfiltered
QPS: 1
Num Speculative Tokens: 4
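For reference, here is a minimal offline-inference sketch of the speculative decoding configuration we benchmarked. The numbers in this issue were measured against the online server at QPS = 1, not with this snippet; it only mirrors the engine configuration, and the argument names reflect the vLLM 0.6.3-era API, so they may differ in other versions.

```python
from vllm import LLM, SamplingParams

# Illustrative sketch only: mirrors the engine configuration used for the
# serving benchmark (vLLM 0.6.3-era argument names).
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=4,                             # 4 x H100 PCIe
    speculative_model="turboderp/Qwama-0.5B-Instruct",  # draft model
    speculative_draft_tensor_parallel_size=1,           # draft model runs with TP=1
    num_speculative_tokens=4,
    use_v2_block_manager=True,  # historically needed for spec decode; may already be the default
)

outputs = llm.generate(
    ["Explain speculative decoding in one paragraph."],
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```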

Since the article did not specify the maximum batch size, we experimented with various batch sizes to align with the conditions implied in the blog. The results are summarized in the table below:

| NUM_REQUESTS | BATCH_SIZE | latency (s) | output_token_throughput (tok/s) | total_token_throughput (tok/s) | sequence_throughput (req/s) | p99_ttft_ms | mean_tpot_ms | mean_e2e_ms |
|---|---|---|---|---|---|---|---|---|
| 32 | 1 | 379.36 | 16.39 | 42.82 | 0.08 | 342058.66 | 59.24 | 192514.28 |
| 32 | 1 | 269.17 | 23.1 | 60.35 | 0.12 | 231910.81 | 48.47 | 132276.49 |
| 128 | 4 | 466.57 | 55.38 | 133.21 | 0.27 | 301887.07 | 68.16 | 158126.15 |
| 128 | 4 | 448.78 | 57.58 | 138.49 | 0.28 | 289373.59 | 71.17 | 151403.83 |
| 128 | 8 | 306.15 | 84.4 | 203.01 | 0.41 | 142675.01 | 85.46 | 80552.87 |
| 128 | 8 | 301.58 | 85.68 | 206.09 | 0.42 | 149036.84 | 94.52 | 83421.69 |
| 128 | 16 | 209.41 | 123.39 | 296.79 | 0.6 | 44303.73 | 103.11 | 37825.59 |
| 128 | 16 | 198.52 | 130.16 | 313.07 | 0.63 | 38716.87 | 107.46 | 33761.1 |
| 512 | 256 | 554.79 | 182.35 | 398.75 | 0.9 | 1598.33 | 127.45 | 25150.41 |
| 512 | 256 | 541.01 | 187 | 408.91 | 0.92 | 1659.16 | 117.04 | 20934.63 |

Our results do show that speculative decoding improves performance on the ShareGPT dataset, but the largest speedup we observed was about 1.4x, at a batch size of 1. The end-to-end (E2E) latency also differs sharply from the blog: even at a maximum batch size of 256, our mean E2E latency exceeded 20 seconds, whereas the article reports an average of only 1–2 seconds.
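To make the 1.4x figure concrete, here is a short script that redoes the arithmetic from the output_token_throughput column, under the assumption (not labeled explicitly in the table) that each pair of rows is (without speculative decoding, with speculative decoding) for the same setting:

```python
# Speedup arithmetic for the table above. Assumption: each pair of rows is
# (baseline, speculative decoding) for the same NUM_REQUESTS / BATCH_SIZE.
output_tok_throughput = {
    # batch size: (baseline tok/s, speculative-decoding tok/s)
    1:   (16.39, 23.10),
    4:   (55.38, 57.58),
    8:   (84.40, 85.68),
    16:  (123.39, 130.16),
    256: (182.35, 187.00),
}

for batch_size, (baseline, spec) in output_tok_throughput.items():
    print(f"batch size {batch_size:>3}: speedup = {spec / baseline:.2f}x")
# Batch size 1 yields ~1.41x; every other setting stays below 1.1x,
# far from the up-to-2.8x speedup reported in the blog post.
```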

We would therefore appreciate it if the vLLM team could share more details about the experimental setup behind the blog post, in particular the maximum batch size, the input and output lengths of the dataset, the number of requests, and the CPU configuration. This would make the published performance results easier to reproduce and more useful to other users.

Thank you for your assistance.

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

```text
vLLM version: 0.6.3
PyTorch version: 2.4.0+cu121
OS: Ubuntu 20.04.6 LTS (x86_64)

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.77
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.46.2
[pip3] triton==3.0.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi
[conda] nvidia-ml-py 12.560.30 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.6.77 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi
[conda] pyzmq 26.2.0 pypi_0 pypi
[conda] torch 2.4.0 pypi_0 pypi
[conda] torchvision 0.19.0 pypi_0 pypi
[conda] transformers 4.46.2 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi
```

mgoin (Collaborator) commented Nov 15, 2024

@LiuXiaoxuanPKU could you help with this?
