[V1] TPU Prototype #10241

robertgshaw2-neuralmagic · 2024-11-12T03:16:13Z

SUMMARY:

Prototype of TPU support on vLLM V1.
Correctness check: pytest -s tests/entrypoints/openai/test_accuracy.py::test_lm_eval_accuracy_v1_engine
Nice speedup for CPU-intensive workloads vs. V0 (33% on Qwen2.5-1.5B-Instruct on TPU v5e for ShareGPT; see BENCHMARKS).

TODOS:

LONG TERM TODO:

Get a true variable-length append kernel to enable chunked prefill.
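For context, chunked prefill processes a long prompt in several steps rather than one, so each step appends a different number of new tokens to the KV cache at a different offset. That is why a variable-length append is needed. A minimal pure-Python sketch of the chunking itself follows; `split_prefill` and `chunk_size` are hypothetical names for illustration, not vLLM's scheduler or kernel API.

```python
def split_prefill(prompt_len: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split a prompt of `prompt_len` tokens into [start, end) chunks.

    Hypothetical helper for illustration only. Each chunk would be one
    prefill step that appends (end - start) tokens to the KV cache at
    offset `start`; the last chunk may be shorter than the others.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, prompt_len))
        for start in range(0, prompt_len, chunk_size)
    ]


# A 5-token prompt with a 2-token budget per step yields three appends
# of lengths 2, 2 and 1.
chunks = split_prefill(5, 2)
append_lengths = [end - start for start, end in chunks]
```

The append lengths differ from step to step, which a kernel compiled for a single fixed shape cannot express without padding.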

BENCHMARKS:

export MODEL=Qwen/Qwen2.5-1.5B-Instruct

VLLM_USE_V1=1, VLLM_ENABLE_V1_MULTIPROCESSING=1

VLLM_ENABLE_V1_MULTIPROCESSING=1 VLLM_USE_V1=1 python3 benchmarks/benchmark_throughput.py --model $MODEL --dataset benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json --max-model-len 2048

# Throughput: 27.81 requests/s

VLLM_USE_V1=1, VLLM_ENABLE_V1_MULTIPROCESSING=0

VLLM_ENABLE_V1_MULTIPROCESSING=0 VLLM_USE_V1=1 python3 benchmarks/benchmark_throughput.py --model $MODEL --dataset benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json --max-model-len 2048

# Throughput: 25.98 requests/s

VLLM_USE_V1=0

VLLM_USE_V1=0 python3 benchmarks/benchmark_throughput.py --model $MODEL --dataset benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json --max-model-len 2048

# Throughput: 20.87 requests/s
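The relative speedups implied by the three throughput figures above can be checked with a few lines of Python. This only restates the reported numbers; the dictionary keys are labels chosen here for illustration.

```python
# Reported throughput in requests/s, copied from the runs above.
throughput = {
    "v1_multiprocessing": 27.81,  # VLLM_USE_V1=1, VLLM_ENABLE_V1_MULTIPROCESSING=1
    "v1_single_process": 25.98,   # VLLM_USE_V1=1, VLLM_ENABLE_V1_MULTIPROCESSING=0
    "v0": 20.87,                  # VLLM_USE_V1=0
}


def speedup_pct(new: float, base: float) -> float:
    """Relative throughput gain of `new` over `base`, in percent."""
    return (new / base - 1.0) * 100.0


v1_mp_vs_v0 = speedup_pct(throughput["v1_multiprocessing"], throughput["v0"])
v1_sp_vs_v0 = speedup_pct(throughput["v1_single_process"], throughput["v0"])

print(f"V1 (multiprocessing) vs V0: {v1_mp_vs_v0:.1f}%")
print(f"V1 (single process) vs V0: {v1_sp_vs_v0:.1f}%")
```

The first figure is the roughly 33% gain quoted in the summary; without multiprocessing the gain over V0 is about 24%.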

Signed-off-by: Robert Shaw <[email protected]>

github-actions · 2024-11-12T03:16:24Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀


robertgshaw2-neuralmagic added 6 commits November 12, 2024 00:16

prototype tpu on v1

f69bdea

profile run complete

1142c89

actually dummy run

9cc4fbe

stash

61f7792

update workflow

1887d81

updated

b8c6444

Signed-off-by: Robert Shaw <[email protected]>

mergify bot added the ci/build label Nov 12, 2024

robertgshaw2-neuralmagic added 22 commits November 12, 2024 03:17

updated

75e2e53

more cleaning

bebabfc

cleanup llmengine

338e11c

Revert "cleanup llmengine"

db49d3b

This reverts commit 338e11c.

fixt

4ade5b0

warmup is working!

dc78451

stash

7f8fdee

stash

f7de1b4

workin for prefill, except when I compile decode cudagraphs?

5de1d9f

working! It was the type of the position ids!

15a2f74

forward pass

14b9500

correct output for single prompt with --enforce-eager

6eeecb7

end to end passing working for single request with CUDAGraphs!

0b256c2

yay! working with multiple requests! the issue was copy_() does not seem to work for xla

b44227d

yay! working end to end via lm eval harness!

451dfbf

we have end to end correctness

d2ae4a5

nits

7dd18e0

updated

d89200d

update to call .cpu() before slicing to avoid recompilation

75c44b4

a bit faster

58e85eb

better performance due to better input processing

fcf4681

cleanup PR

d9dc36a

robertgshaw2-neuralmagic added 6 commits November 17, 2024 21:28

cleanup

85bc154

cleanup pr

25fff99

formatting

5a87b99

updated

63b301a

updated

1af03e0

fixed accuracy bug

02ee304
