[Core] Implementing disaggregated prefilling, and caching KV cache in CPU/disk/database. #8498

Open

wants to merge 323 commits into base: main
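In brief, the benchmark scripts in this PR exercise disaggregated prefill by launching two vLLM instances that share KV caches: a KV producer (the prefill instance) and a KV consumer (the decode instance), plus a proxy server that forwards each request to the prefill instance first and then streams the decode instance's response. The following is a condensed sketch of that launch pattern, based on the benchmark scripts in this PR's diff (an illustration, not a standalone recipe; $model stands for the served model name):

# Prefill instance (KV producer) on GPU 0.
VLLM_PORT=12345 VLLM_DISTRIBUTED_KV_ROLE=producer CUDA_VISIBLE_DEVICES=0 \
  python3 -m vllm.entrypoints.openai.api_server --model $model --port 8100 &

# Decode instance (KV consumer) on GPU 1.
VLLM_PORT=12345 VLLM_DISTRIBUTED_KV_ROLE=consumer CUDA_VISIBLE_DEVICES=1 \
  python3 -m vllm.entrypoints.openai.api_server --model $model --port 8200 &

# Proxy: forwards each request to the prefill instance with max_tokens=1,
# then streams the decode instance's output. Listens on port 8000.
python3 disagg_prefill_proxy_server.py &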
Changes from all commits

323 commits
bb8c08a
check
KuntaiDu Jul 19, 2024
25a7cf3
locate the hanging line
KuntaiDu Jul 19, 2024
999bd72
add rank to CPU group
KuntaiDu Jul 19, 2024
3428ea6
narrow case
KuntaiDu Jul 19, 2024
91e3ed2
bug fix: need to align the distributed groups between prefill and dec…
KuntaiDu Jul 20, 2024
3dd2275
add disaggregated prefilling for flashinfer
KuntaiDu Jul 23, 2024
2b13f3c
adjust comments
KuntaiDu Jul 23, 2024
8c3f209
add logging for send and recv
KuntaiDu Jul 23, 2024
c6a5e57
turn off chunked prefill to use flashinfer kernel
KuntaiDu Jul 23, 2024
b3c47f3
confirm which backend is being used
KuntaiDu Jul 23, 2024
f05540c
remove debugging from parallel_state, it's too much...
KuntaiDu Jul 23, 2024
eb96fe7
add disagg prefill for flash attn backend
KuntaiDu Jul 23, 2024
09d5588
edit flash attn to assign prefill_meta first
KuntaiDu Jul 23, 2024
43077e7
use print instead of attn
KuntaiDu Jul 23, 2024
f716737
make data contiguous
KuntaiDu Jul 23, 2024
0d07251
add more debug message
KuntaiDu Jul 23, 2024
2177737
turn on logging
KuntaiDu Jul 23, 2024
a293bd0
more debug prints in flash_attn
KuntaiDu Jul 23, 2024
cc7f646
remove enforce eager
KuntaiDu Jul 23, 2024
68f3d16
adjust printing order in flash attn
KuntaiDu Jul 23, 2024
21a61b9
avoid sending & receiving output tensor during profile run
KuntaiDu Jul 23, 2024
691cad7
also log the device
KuntaiDu Jul 23, 2024
c057f19
adjust implementation
KuntaiDu Jul 23, 2024
82b73bb
finish adjustment
KuntaiDu Jul 23, 2024
6db1d48
fall back to original flashinfer
KuntaiDu Jul 23, 2024
9e53071
Merge branch 'vllm-project:main' into kuntai-disagg
KuntaiDu Jul 23, 2024
dbaade7
add space
KuntaiDu Jul 23, 2024
f572db8
clean config.py
KuntaiDu Jul 23, 2024
9ebf3ad
keep flashattn implementation
KuntaiDu Jul 23, 2024
67b1c2e
commit changes that will be merged
KuntaiDu Jul 23, 2024
4acad6a
Merge branch 'kuntai-disagg' of https://github.com/KuntaiDu/vllm into…
KuntaiDu Jul 23, 2024
3abca47
revert custom allreduce changes
KuntaiDu Jul 23, 2024
0ce251b
remove debug logs from the file
KuntaiDu Jul 23, 2024
1f3ac2b
revert changes to prefix_caching_block --- unnecessary
KuntaiDu Jul 23, 2024
c93bf33
revert changes
KuntaiDu Jul 23, 2024
8dcaf43
fix typos
KuntaiDu Jul 23, 2024
4d83813
add example usage to disaggregated prefill
KuntaiDu Jul 23, 2024
11c3ace
can only use print instead of log.debug...
KuntaiDu Jul 23, 2024
0bd0cc9
kill vllm instance after run
KuntaiDu Jul 23, 2024
39973bb
add proxy server for disaggregated prefilling
KuntaiDu Jul 24, 2024
13a6d12
update disagg proxy server
KuntaiDu Jul 24, 2024
81cad25
add debug message for proxy server
KuntaiDu Jul 24, 2024
198931b
fix bug
KuntaiDu Jul 24, 2024
7412767
increase nccl buff size
KuntaiDu Jul 24, 2024
bd6f41b
increase nccl buffer size
KuntaiDu Jul 24, 2024
20f9de1
add debug flag
KuntaiDu Jul 24, 2024
11850d5
reduce gpu memory usage
KuntaiDu Jul 24, 2024
d6ad9bd
fix syntax bug
KuntaiDu Jul 24, 2024
57dd656
temporarily lift up nccl buffer size for send and recv
KuntaiDu Jul 24, 2024
9379fbb
reduce nccl buffer size and see if bug fixed
KuntaiDu Jul 24, 2024
c23d841
fix
KuntaiDu Jul 24, 2024
7fc62b4
add debug info -- see which layer the prefill instance got stuck
KuntaiDu Jul 24, 2024
e542366
remove nccl debug -- it is too loud
KuntaiDu Jul 24, 2024
e9f7dc2
change buffer size only for disagg communicator
KuntaiDu Jul 24, 2024
18ded4c
disable nccl debug
KuntaiDu Jul 24, 2024
e814f82
use isend and irecv
KuntaiDu Jul 24, 2024
a3399b3
try to increase the buffer size
KuntaiDu Jul 24, 2024
5e18bd7
Merge branch 'main' into kuntai-disagg
KuntaiDu Jul 30, 2024
e4e60d9
bug fix, now disaggregated prefill should work as expected
KuntaiDu Jul 31, 2024
87fbfae
add proxy server
KuntaiDu Jul 31, 2024
fa664c0
start slow -- using pp=1 and tp=1
KuntaiDu Aug 1, 2024
6bf7583
adjust the API
KuntaiDu Aug 1, 2024
6aad5cc
support batch size >1
KuntaiDu Aug 2, 2024
e934286
update model runner
KuntaiDu Aug 2, 2024
b68435a
move group coordinator to a separate file, move disagg implementation…
KuntaiDu Aug 4, 2024
e54f7a3
no need to send during attention
KuntaiDu Aug 4, 2024
23c9949
debug tp
KuntaiDu Aug 4, 2024
87cb78b
resolve conflicts
KuntaiDu Aug 4, 2024
06a526a
Fix several bugs: tensor device placement, misc performance optimizat…
KuntaiDu Aug 5, 2024
34e6bb3
remove useless comments
KuntaiDu Aug 5, 2024
55bf3bf
update disaggregated prefill example
KuntaiDu Aug 5, 2024
b525510
add disaggregated prefill overhead benchmark
KuntaiDu Aug 6, 2024
ee6a6ec
change disagg prefill proxy server to support non-streaming case
KuntaiDu Aug 7, 2024
f3cc91d
avoid detokenizing the first token in prefill instance -- for shorter…
KuntaiDu Aug 7, 2024
0582265
add failure test cases --- try switching to another machine
KuntaiDu Aug 7, 2024
89d4ca4
update
KuntaiDu Aug 7, 2024
9f4dba2
remove debugging information
KuntaiDu Aug 8, 2024
aa55883
avoid broadcast by finding seqlen inside the attn metadata
KuntaiDu Aug 9, 2024
95df023
update examples
KuntaiDu Aug 9, 2024
d92223a
support pipeline parallel
KuntaiDu Aug 9, 2024
a8c202c
update benchmark --- compare chunked prefill w.r.t. disagg prefill
KuntaiDu Aug 10, 2024
310f3a3
mute round_robin_proxy -- too loud
KuntaiDu Aug 10, 2024
118aab1
bug fix: racing conditions, and rare cases where input hash is not ca…
KuntaiDu Aug 10, 2024
96d38b4
add visualization script
KuntaiDu Aug 11, 2024
3fc0c5c
fix bug: when KV transfer fails, do not return hidden state
KuntaiDu Aug 11, 2024
f9aadd8
add new abstractions
KuntaiDu Aug 26, 2024
db66a1e
major revision: add 3-layer abstractions. Transport, lookup buffer, a…
KuntaiDu Aug 28, 2024
e04430c
add kv transfer test
KuntaiDu Aug 28, 2024
30f9bb6
add test cases for pipe
KuntaiDu Aug 28, 2024
bbce62e
bug fix
KuntaiDu Aug 28, 2024
927800d
finalize send-recv test
KuntaiDu Aug 29, 2024
6680ea7
update test case so that there are both send and recv
KuntaiDu Aug 29, 2024
dfbfe80
update kv lookup buffer --- I am TOOOOOOO sleepy
KuntaiDu Aug 29, 2024
b566b18
add lookup buffer test
KuntaiDu Sep 4, 2024
fc2c972
update lookup buffer
KuntaiDu Sep 4, 2024
b2c765c
finish lookup buffer test
KuntaiDu Sep 6, 2024
8aef9dc
update parallel state to use the new class method
KuntaiDu Sep 7, 2024
1b6125d
move the implementation to worker_base.py
KuntaiDu Sep 8, 2024
c4102ef
update test
KuntaiDu Sep 8, 2024
a576532
update a new implementation for distributed pipe. Much less CPU commu…
KuntaiDu Sep 8, 2024
24a231e
update tensor sending and receiving. Use CPU to transfer metadata ins…
KuntaiDu Sep 8, 2024
dca877a
update benchmark: use small model for quick iteration
KuntaiDu Sep 8, 2024
9f81f41
update implementation
KuntaiDu Sep 8, 2024
bb86588
[Add] optimized implementation for KV transfer pipe
ApostaC Sep 10, 2024
1377912
Merge pull request #3 from KuntaiDu/yihua-kv-pipe
KuntaiDu Sep 11, 2024
ffb792b
[Fix] the implementation of KV lookup buffer
ApostaC Sep 13, 2024
d7d32c1
remove unused file
ApostaC Sep 13, 2024
c5b7232
Merge pull request #4 from KuntaiDu/yihua-lookup-buffer
KuntaiDu Sep 13, 2024
4db6446
Merge pull request #5 from KuntaiDu/kuntai-disagg-refactor
YaoJiayi Sep 13, 2024
417ccb3
update vllm adapter
YaoJiayi Sep 13, 2024
0176ebb
update worker_base
YaoJiayi Sep 13, 2024
84fd0b8
update comm initialization
YaoJiayi Sep 13, 2024
826ca70
update
ApostaC Sep 13, 2024
3425ab6
update documentation
ApostaC Sep 13, 2024
9f3a3a5
adjust vllm adapter: now we separate CPU and device into different pipes
ApostaC Sep 13, 2024
ce79d59
build 2 pipes in vLLM adapter
ApostaC Sep 13, 2024
34dfdde
documentation change
ApostaC Sep 13, 2024
80b4200
Merge branch 'jiayi-dev-v2' into kuntai-disagg-refactor
YaoJiayi Sep 14, 2024
9eefec2
Merge pull request #6 from KuntaiDu/kuntai-disagg-refactor
YaoJiayi Sep 14, 2024
9355be3
update vllm_adapter
YaoJiayi Sep 14, 2024
54b68c9
minor fix
YaoJiayi Sep 15, 2024
2dff658
fix type hint
YaoJiayi Sep 15, 2024
c6a6714
fix comm init
YaoJiayi Sep 15, 2024
fef35b2
bug fix: remove self from bypass_model_exec
ApostaC Sep 15, 2024
4d0b5cd
bug fix: should init SimpleKVLookupBuffer with signal pipe first and …
ApostaC Sep 15, 2024
31b891d
adjust torch distributed logging
ApostaC Sep 15, 2024
7e68d08
remove unnecessary comments
ApostaC Sep 15, 2024
85c7a64
remove unnecessary comments
ApostaC Sep 15, 2024
01fe335
update documentation
ApostaC Sep 15, 2024
1f47731
Merge pull request #7 from KuntaiDu/jiayi-dev-v2
KuntaiDu Sep 15, 2024
caaaeb8
update overhead benchmark
ApostaC Sep 15, 2024
0dd3571
Merge pull request #8 from KuntaiDu/jiayi-dev-v2
KuntaiDu Sep 15, 2024
9c98d5f
resolve merge conflict
ApostaC Sep 15, 2024
515c47b
remove group coordinator import
ApostaC Sep 15, 2024
f166cf8
remove syntax bug
ApostaC Sep 15, 2024
f320518
update round robin proxy. Prior bash-based impl is buggy
ApostaC Sep 15, 2024
5b4a3e3
update docs for disagg overhead benchmark
ApostaC Sep 15, 2024
01b2fd3
use new round robin proxy in performance benchmark
ApostaC Sep 15, 2024
54bd11f
update
ApostaC Sep 15, 2024
b19f346
update benchmarking script
ApostaC Sep 15, 2024
cb7ff06
revert changes in model_runner.py --- no change needed for disagg pre…
ApostaC Sep 15, 2024
dd8c86d
no I was wrong
ApostaC Sep 15, 2024
4e8043c
update benchmark
ApostaC Sep 15, 2024
b51f891
remove sonnet 4x --- it can be automatically generated via benchmarki…
ApostaC Sep 15, 2024
168452f
revert change in flash attn and flash infer to clean up the diff
ApostaC Sep 15, 2024
784d905
update the example
ApostaC Sep 15, 2024
17d2505
make format checker happy
ApostaC Sep 15, 2024
36a382c
resolve circular import
ApostaC Sep 15, 2024
a0867dd
fix redundant import
ApostaC Sep 15, 2024
7f90903
rename to a shorter name
ApostaC Sep 15, 2024
5ca22fb
remove unnecessary file
ApostaC Sep 16, 2024
073642b
update kv transfer test
ApostaC Sep 16, 2024
70d6571
update tests
ApostaC Sep 16, 2024
4d6b00a
make fmt checker happy
ApostaC Sep 16, 2024
7c13e03
constrain the model length
ApostaC Sep 16, 2024
cf5b84c
adjust path
ApostaC Sep 16, 2024
eb751d6
add disagg prefill test to test pipeline
ApostaC Sep 16, 2024
f101b40
Merge pull request #9 from KuntaiDu/kuntai-disagg-refactor
KuntaiDu Sep 16, 2024
1e23e99
use new round robin proxy in performance benchmark
KuntaiDu Sep 15, 2024
b4225f8
update
KuntaiDu Sep 15, 2024
fa47857
update benchmarking script
KuntaiDu Sep 15, 2024
46f82a4
revert changes in model_runner.py --- no change needed for disagg pre…
KuntaiDu Sep 15, 2024
8d7bb78
no I was wrong
KuntaiDu Sep 15, 2024
b5f9db5
update benchmark
KuntaiDu Sep 15, 2024
0fc0091
remove sonnet 4x --- it can be automatically generated via benchmarki…
KuntaiDu Sep 15, 2024
afd7a29
revert change in flash attn and flash infer to clean up the diff
KuntaiDu Sep 15, 2024
cbf24b3
update the example
KuntaiDu Sep 15, 2024
4f4ea50
make format checker happy
KuntaiDu Sep 15, 2024
f78a2eb
resolve circular import
KuntaiDu Sep 15, 2024
44dfa3f
fix redundant import
KuntaiDu Sep 15, 2024
822f3dc
rename to a shorter name
KuntaiDu Sep 15, 2024
7682269
remove unnecessary file
KuntaiDu Sep 16, 2024
b6e5eb3
update kv transfer test
KuntaiDu Sep 16, 2024
58f5080
update tests
KuntaiDu Sep 16, 2024
8f0538c
make fmt checker happy
KuntaiDu Sep 16, 2024
dda1f31
constrain the model length
KuntaiDu Sep 16, 2024
85d72fa
adjust path
KuntaiDu Sep 16, 2024
60ede08
add disagg prefill test to test pipeline
KuntaiDu Sep 16, 2024
0d81aaf
Merge pull request #10 from KuntaiDu/kuntai-disagg-refactor
KuntaiDu Sep 16, 2024
0df7566
bugfix
YaoJiayi Sep 16, 2024
73c1683
bugfix
YaoJiayi Sep 16, 2024
2297c19
Merge pull request #11 from KuntaiDu/jiayi-dev-v2
KuntaiDu Sep 16, 2024
b2e0254
Merge branch 'main' into kuntai-disagg-refactor
KuntaiDu Sep 18, 2024
70bec94
rename the environment variable to KV producer and KV consumer, for m…
KuntaiDu Sep 19, 2024
e787e42
revert worker to vllm main
KuntaiDu Sep 19, 2024
9874b42
bug fix
KuntaiDu Sep 19, 2024
5950ad5
fix typo: Distributerd -> Distributed
KuntaiDu Sep 19, 2024
c116684
remove the debug flag in example -- users don't need it
KuntaiDu Sep 19, 2024
44e8875
fix typo
KuntaiDu Sep 19, 2024
181928f
fixing benchmark_serving.py
KuntaiDu Sep 19, 2024
c17d18d
fix the example
KuntaiDu Sep 19, 2024
0b00876
update build partial prefill input
KuntaiDu Sep 19, 2024
94a5086
bug fix for LMCache -- adjust vLLM's rebuild input, and merge the log…
KuntaiDu Sep 20, 2024
8099fb3
make format checker happy
KuntaiDu Sep 20, 2024
603864e
make ruff and yapf happy, also fix test bug
KuntaiDu Sep 20, 2024
1d7a1c9
remove empty file
KuntaiDu Sep 20, 2024
10ad09c
fix bug when world_size == -1
KuntaiDu Sep 20, 2024
38e3a57
adjust comments
KuntaiDu Sep 20, 2024
e2bd481
make yapf and ruff happy
KuntaiDu Sep 20, 2024
4979337
relaunch CI
KuntaiDu Sep 20, 2024
a2007dc
change get_open_port so that it is easier to understand
KuntaiDu Sep 24, 2024
ce434f5
adjust comment
KuntaiDu Sep 24, 2024
f224c71
make format checker happy
KuntaiDu Sep 24, 2024
5d9b007
adjust model runner docstring
KuntaiDu Sep 24, 2024
6255dca
make format checker happy
KuntaiDu Sep 24, 2024
71ae275
change data == [] to not data (thanks Cody)
KuntaiDu Sep 24, 2024
80164ea
fix misleading to available
KuntaiDu Sep 24, 2024
52c2d10
add new line and run format checker
KuntaiDu Sep 24, 2024
09478ef
add docstring for lookup buffer
KuntaiDu Sep 24, 2024
06cb15c
align docstring syntax
KuntaiDu Sep 24, 2024
7c11a39
add docstring for abstract classes
KuntaiDu Sep 24, 2024
37bac34
put assertion at the end of the function
KuntaiDu Sep 24, 2024
111abb4
add fp8 support to pipe
KuntaiDu Sep 24, 2024
394afaa
adjust docstrings
KuntaiDu Sep 24, 2024
76019f1
bug fix: check isinstance(torch.Tensor) before checking None
KuntaiDu Sep 24, 2024
93ec62b
make format check happy
KuntaiDu Sep 24, 2024
87b82cc
Merge branch 'main' into kuntai-disagg-refactor
KuntaiDu Oct 8, 2024
c5bdf64
Adjust to latest changes of `kv_caches`: it is now always a tensor.
KuntaiDu Oct 10, 2024
596eb64
debug
KuntaiDu Oct 10, 2024
683bd9c
bug fix: kv_caches will be list of torch.tensor([]) in profile run.
KuntaiDu Oct 10, 2024
81aa825
Merge branch 'vllm-project:main' into kuntai-disagg-refactor
KuntaiDu Oct 10, 2024
521daba
Relax server start timeout limit
KuntaiDu Oct 10, 2024
516f9ca
Merge branch 'kuntai-disagg-refactor' of https://github.com/KuntaiDu/…
KuntaiDu Oct 10, 2024
6edb723
Merge branch 'main' into kuntai-disagg-refactor
KuntaiDu Nov 7, 2024
7efdf60
Adjust folder format
KuntaiDu Nov 8, 2024
1c608e6
config fix
KuntaiDu Nov 9, 2024
303ff85
misc fixes
KuntaiDu Nov 10, 2024
0f172d5
stage changes
KuntaiDu Nov 13, 2024
228f78d
Merge remote-tracking branch 'upstream/main' into kuntai-disagg-refactor
KuntaiDu Nov 13, 2024
6f3d1b3
debugging pynccl pipe
KuntaiDu Nov 14, 2024
20e0450
bug found: NONE Tensor did not return none
KuntaiDu Nov 14, 2024
cc9e8f4
save code for Kaichao to debug
KuntaiDu Nov 14, 2024
3e7e341
adjust
KuntaiDu Nov 14, 2024
49e89a2
NCCL pipe bug fix: only transmit metadata when the tensor is None
KuntaiDu Nov 15, 2024
b6e83a2
Update docstring using GPT, and clean up unnecessary variables
KuntaiDu Nov 15, 2024
8d20116
Bug fix on PyNcclPipe: the device of sending tensor should be inferre…
KuntaiDu Nov 15, 2024
fdc4aad
Fix lookup buffer
KuntaiDu Nov 15, 2024
c0b9574
Merge remote-tracking branch 'upstream/main' into kuntai-disagg-refactor
KuntaiDu Nov 15, 2024
e7432e9
Adjust init parameters for connector.
KuntaiDu Nov 15, 2024
d1ce09d
Move KVTransferConfig outside ParallelConfig
KuntaiDu Nov 17, 2024
a478522
Merge remote-tracking branch 'upstream/main' into kuntai-disagg-refactor
KuntaiDu Nov 17, 2024
9c4cbc5
A series of bug fix: previous merge is buggy and need to manually rev…
KuntaiDu Nov 17, 2024
9e8affc
Fix typo (input_token wrongfully typed as input) and make default kv_…
KuntaiDu Nov 18, 2024
d8e79fa
Make sure the output is shown at the end of the run by sleeping longe…
KuntaiDu Nov 18, 2024
62f3966
A series of changes to clean up the diff
KuntaiDu Nov 18, 2024
6478fb2
Clean up the diff
KuntaiDu Nov 18, 2024
c2ebcb6
Clean up the diff in several executor files
KuntaiDu Nov 18, 2024
9158549
clean up weird spaces in tokenizer groups
KuntaiDu Nov 18, 2024
98a44df
Remove previous environment variable -- now we initialize distributed…
KuntaiDu Nov 18, 2024
744a40f
add a new line at the end of parallel_state.py to clean up the diff
KuntaiDu Nov 18, 2024
12 changes: 12 additions & 0 deletions .buildkite/test-pipeline.yaml
@@ -474,6 +474,18 @@ steps:
  - pytest -v -s distributed/test_pp_cudagraph.py
  - pytest -v -s distributed/test_pipeline_parallel.py

- label: Disaggregated Prefill Test # 4min
  working_dir: "/vllm-workspace/tests"
  num_gpus: 4
  source_file_dependencies:
  - vllm/distributed/parallel_state.py
  - vllm/distributed/kv_transfer
  - vllm/worker/worker_base.py
  - vllm/worker/model_runner.py
  commands:
  - pytest -v -s kv_transfer/module_test.py
  - pytest -v -s kv_transfer/disagg_test.py

- label: LoRA Long Context (Distributed) # 11min
  # This test runs llama 13B, so it is required to run on 4 GPUs.
  num_gpus: 4
140 changes: 140 additions & 0 deletions benchmarks/disagg_benchmarks/disagg_overhead_benchmark.sh
@@ -0,0 +1,140 @@
#!/bin/bash

# Benchmark the overhead of disaggregated prefill.
# Methodology:
# - Send all requests to the prefill vLLM instance. It will buffer the KV cache.
# - Then send all requests to the decode instance.
# - The TTFT of the decode instance is the overhead.

set -ex

kill_gpu_processes() {
  # kill all processes on GPU.
  pkill pt_main_thread
  sleep 10

  # remove vllm config file
  rm -rf ~/.config/vllm

  # Print the GPU memory usage
  # so that we know if all GPU processes are killed.
  gpu_memory_usage=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i 0)
  # The memory usage should be 0 MB.
  echo "GPU 0 Memory Usage: $gpu_memory_usage MB"
}

wait_for_server() {
  # wait for the vllm server to start
  # return 1 if the vllm server crashes
  local port=$1
  timeout 1200 bash -c "
    until curl -s localhost:${port}/v1/completions > /dev/null; do
      sleep 1
    done" && return 0 || return 1
}


benchmark() {

  export VLLM_LOGGING_LEVEL=DEBUG
  export VLLM_HOST_IP=$(hostname -I | awk '{print $1}')
  export VLLM_PORT=12345

  # compare chunked prefill with disaggregated prefill

  results_folder="./results"
  model="meta-llama/Meta-Llama-3.1-8B-Instruct"
  dataset_name="sonnet"
  dataset_path="../sonnet_4x.txt"
  num_prompts=10
  qps=$1
  prefix_len=50
  input_len=2048
  output_len=$2


  VLLM_DISTRIBUTED_KV_ROLE=producer CUDA_VISIBLE_DEVICES=0 python3 \
    -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --port 8100 \
    --max-model-len 10000 \
    --gpu-memory-utilization 0.8 &

  VLLM_DISTRIBUTED_KV_ROLE=consumer CUDA_VISIBLE_DEVICES=1 python3 \
    -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --port 8200 \
    --max-model-len 10000 \
    --gpu-memory-utilization 0.8 &

  wait_for_server 8100
  wait_for_server 8200

  # let the prefill instance finish prefill
  python3 ../benchmark_serving.py \
    --backend vllm \
    --model $model \
    --dataset-name $dataset_name \
    --dataset-path $dataset_path \
    --sonnet-input-len $input_len \
    --sonnet-output-len $output_len \
    --sonnet-prefix-len $prefix_len \
    --num-prompts $num_prompts \
    --port 8100 \
    --save-result \
    --result-dir $results_folder \
    --result-filename disagg_prefill_2xtp4.json \
    --request-rate "inf"


  # send the requests to decode.
  # The TTFT of this command will be the overhead of the disagg prefill implementation.
  python3 ../benchmark_serving.py \
    --backend vllm \
    --model $model \
    --dataset-name $dataset_name \
    --dataset-path $dataset_path \
    --sonnet-input-len $input_len \
    --sonnet-output-len $output_len \
    --sonnet-prefix-len $prefix_len \
    --num-prompts $num_prompts \
    --port 8200 \
    --save-result \
    --result-dir $results_folder \
    --result-filename disagg_prefill_2xtp4.json \
    --request-rate $qps
  kill_gpu_processes

}


main() {

  (which wget && which curl) || (apt-get update && apt-get install -y wget curl)
  (which jq) || (apt-get -y install jq)
  (which socat) || (apt-get -y install socat)

  pip install quart httpx

  cd "$(dirname "$0")"

  cd ..
  # create sonnet_4x.txt
  echo "" > sonnet_4x.txt
  for _ in {1..4}
  do
    cat sonnet.txt >> sonnet_4x.txt
  done
  cd disagg_benchmarks

  rm -rf results
  mkdir results

  default_qps=1
  default_output_len=1
  benchmark $default_qps $default_output_len

}


main "$@"
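
Both benchmark_serving.py invocations above write to the same result filename, so the JSON left in ./results comes from the decode run; its TTFT approximates the KV transfer overhead. As a quick way to inspect it (a sketch, assuming the result JSON carries a mean_ttft_ms field as recent versions of benchmark_serving.py emit; jq is installed by main() above):

# Mean time-to-first-token of the decode run, i.e. the disagg prefill overhead.
jq '.mean_ttft_ms' results/disagg_prefill_2xtp4.json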
172 changes: 172 additions & 0 deletions benchmarks/disagg_benchmarks/disagg_performance_benchmark.sh
@@ -0,0 +1,172 @@
#!/bin/bash

# Requirement: 8x H100 GPUs.


# Model: neuralmagic/Meta-Llama-3-70B-Instruct-FP8-KV
# Query: 2048 input tokens, 11 output tokens, QPS 4, 500 requests
# Resource: 8x H100
# Approaches:
# 1. Chunked prefill: 1 vllm instance with tp=8
# 2. Chunked prefill: 2 vllm instances with tp=4, equivalent to 1 tp=4 instance with QPS 4
# 3. Disaggregated prefill: 1 prefilling instance and 1 decoding instance
#    Prefilling instance: max_output_token=1
#    Decoding instance: force the input tokens to be the same across requests to bypass prefilling

set -ex

kill_gpu_processes() {
  # kill all processes on GPU.
  pkill -f pt_main_thread
  pkill -f python3
  ps -e | grep pt_main_thread | awk '{print $1}' | xargs kill -9
  for port in 8000 8100 8200; do lsof -t -i:$port | xargs -r kill -9; done
  sleep 1
}

wait_for_server() {
  # wait for the vllm server to start
  # return 1 if the vllm server crashes
  local port=$1
  timeout 1200 bash -c "
    until curl -s localhost:${port}/v1/completions > /dev/null; do
      sleep 1
    done" && return 0 || return 1
}


launch_chunked_prefill() {
  model="meta-llama/Meta-Llama-3.1-70B-Instruct"
  # chunked prefill
  CUDA_VISIBLE_DEVICES=0,1,2,3 python3 \
    -m vllm.entrypoints.openai.api_server \
    --model $model \
    --port 8100 \
    -tp 4 \
    --max-model-len 10000 \
    --disable-log-stats \
    --disable-log-requests \
    --enable-chunked-prefill \
    --gpu-memory-utilization 0.8 &
  CUDA_VISIBLE_DEVICES=4,5,6,7 python3 \
    -m vllm.entrypoints.openai.api_server \
    --model $model \
    --port 8200 \
    -tp 4 \
    --max-model-len 10000 \
    --disable-log-stats \
    --disable-log-requests \
    --enable-chunked-prefill \
    --gpu-memory-utilization 0.8 &
  wait_for_server 8100
  wait_for_server 8200
  python3 round_robin_proxy.py &
  sleep 1
}


launch_disagg_prefill() {
  model="meta-llama/Meta-Llama-3.1-70B-Instruct"
  # disagg prefill
  VLLM_PORT=12345 VLLM_DISTRIBUTED_KV_ROLE=producer CUDA_VISIBLE_DEVICES=0,1,2,3 python3 \
    -m vllm.entrypoints.openai.api_server \
    --model $model \
    --port 8100 \
    -tp 4 \
    --max-model-len 10000 \
    --disable-log-stats \
    --disable-log-requests \
    --gpu-memory-utilization 0.8 &
  VLLM_PORT=12345 VLLM_DISTRIBUTED_KV_ROLE=consumer CUDA_VISIBLE_DEVICES=4,5,6,7 python3 \
    -m vllm.entrypoints.openai.api_server \
    --model $model \
    --port 8200 \
    -tp 4 \
    --max-model-len 10000 \
    --disable-log-stats \
    --disable-log-requests \
    --gpu-memory-utilization 0.8 &
  wait_for_server 8100
  wait_for_server 8200
  python3 disagg_prefill_proxy_server.py &
  sleep 1
}


benchmark() {
  results_folder="./results"
  model="meta-llama/Meta-Llama-3.1-70B-Instruct"
  dataset_name="sonnet"
  dataset_path="../sonnet_4x.txt"
  num_prompts=200
  qps=$1
  prefix_len=50
  input_len=1024
  output_len=$2
  tag=$3

  python3 ../benchmark_serving.py \
    --backend vllm \
    --model $model \
    --dataset-name $dataset_name \
    --dataset-path $dataset_path \
    --sonnet-input-len $input_len \
    --sonnet-output-len $output_len \
    --sonnet-prefix-len $prefix_len \
    --num-prompts $num_prompts \
    --port 8000 \
    --save-result \
    --result-dir $results_folder \
    --result-filename $tag-qps-$qps.json \
    --request-rate $qps

  sleep 2

}


main() {

  (which wget && which curl) || (apt-get update && apt-get install -y wget curl)
  (which jq) || (apt-get -y install jq)
  (which socat) || (apt-get -y install socat)

  pip install quart httpx matplotlib aiohttp

  cd "$(dirname "$0")"

  cd ..
  # create sonnet_4x.txt so that we can sample 2048 tokens for input
  echo "" > sonnet_4x.txt
  for _ in {1..4}
  do
    cat sonnet.txt >> sonnet_4x.txt
  done
  cd disagg_benchmarks

  rm -rf results
  mkdir results

  default_output_len=6

  export VLLM_LOGGING_LEVEL=DEBUG
  export VLLM_HOST_IP=$(hostname -I | awk '{print $1}')

  launch_chunked_prefill
  for qps in 2 4 6 8; do
    benchmark $qps $default_output_len chunked_prefill
  done
  kill_gpu_processes

  launch_disagg_prefill
  for qps in 2 4 6 8; do
    benchmark $qps $default_output_len disagg_prefill
  done
  kill_gpu_processes

  python3 visualize_benchmark_results.py

}


main "$@"
61 changes: 61 additions & 0 deletions benchmarks/disagg_benchmarks/disagg_prefill_proxy_server.py
@@ -0,0 +1,61 @@
import os

import aiohttp
from quart import Quart, make_response, request

AIOHTTP_TIMEOUT = aiohttp.ClientTimeout(total=6 * 60 * 60)

app = Quart(__name__)


async def forward_request(url, data):
    async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
        headers = {
            "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"
        }
        async with session.post(url=url, json=data,
                                headers=headers) as response:
            if response.status == 200:
                # Always stream the response chunk by chunk; the non-streaming
                # branch below is intentionally left unreachable for now.
                # if response.headers.get('Transfer-Encoding') == 'chunked':
                if True:
                    async for chunk_bytes in response.content.iter_chunked(
                            1024):
                        yield chunk_bytes
                else:
                    content = await response.read()
                    yield content


@app.route('/v1/completions', methods=['POST'])
async def handle_request():
    try:
        original_request_data = await request.get_json()

        prefill_request = original_request_data.copy()
        # change max_tokens = 1 to let it only do prefill
        prefill_request['max_tokens'] = 1

        # finish prefill
        async for _ in forward_request('http://localhost:8100/v1/completions',
                                       prefill_request):
            continue

        # return decode
        generator = forward_request('http://localhost:8200/v1/completions',
                                    original_request_data)
        response = await make_response(generator)
        response.timeout = None

        return response

    except Exception as e:
        import sys
        import traceback
        exc_info = sys.exc_info()
        print("Error occurred in disagg prefill proxy server")
        print(e)
        print("".join(traceback.format_exception(*exc_info)))


if __name__ == '__main__':
    app.run(port=8000)
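
For reference, a client request to this proxy might look like the following (a hypothetical example; the model name must match whatever the two vLLM instances were launched with, e.g. the 70B model in the performance benchmark above):

# The proxy first replays this request against the prefill instance (port 8100)
# with max_tokens=1, then streams the real completion from the decode
# instance (port 8200).
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 16
      }'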