Decode latency is notably slower with `--kv-cache-dtype fp8_e5m2`, due to the design choice of reinterpreting the KV cache with `torch.view(dtype=)`.
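For context, here is a minimal sketch of the pattern in question (illustrative only; the buffer layout, shapes, and function name are hypothetical, not SGLang's actual kernel code). The `view(dtype=)` reinterpret itself is zero-copy, but the fp8 values must then be upcast before attention, which puts an extra elementwise kernel on the decode hot path:

```python
# Hedged sketch of an fp8_e5m2 KV-cache read path, assuming the cache is
# held as raw bytes. Shapes and names are hypothetical, for illustration.
import torch

num_tokens, num_heads, head_dim = 4096, 8, 128  # hypothetical shapes

# KV cache stored as raw bytes so one buffer can back multiple fp8 formats.
kv_cache = torch.empty(
    num_tokens, num_heads, head_dim, dtype=torch.uint8, device="cuda"
)

def read_kv_fp8(cache: torch.Tensor) -> torch.Tensor:
    # Zero-copy: reinterpret the same bytes as fp8_e5m2.
    fp8 = cache.view(dtype=torch.float8_e5m2)
    # Not free: the upcast to fp16 launches an elementwise conversion
    # kernel on every decode step before attention can consume the values.
    return fp8.to(torch.float16)
```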
This is observed on H100 as well as on MI300X. Expect some design changes (feel free to assign this to me).
On H100:
```
# python3 -m sglang.bench_latency --batch-size 32 --input 1024 --output 256 --model amd/Meta-Llama-3.1-70B-Instruct-FP8-KV --tp 8 --quantization fp8
Benchmark ...
Prefill. latency: 1.15925 s, throughput: 28266.67 token/s
Decode.  latency: 0.01402 s, throughput:  2281.72 token/s
Decode.  latency: 0.01353 s, throughput:  2365.70 token/s
Decode.  latency: 0.01350 s, throughput:  2369.66 token/s
Decode.  latency: 0.01346 s, throughput:  2377.09 token/s
Decode.  latency: 0.01354 s, throughput:  2363.53 token/s
Decode.  median latency: 0.01364 s, median throughput: 2346.63 token/s
Total.   latency: 4.614 s, throughput: 8876.73 token/s

# python3 -m sglang.bench_latency --batch-size 32 --input 1024 --output 256 --model amd/Meta-Llama-3.1-70B-Instruct-FP8-KV --tp 8 --quantization fp8 --kv-cache-dtype fp8_e5m2
Benchmark ...
Prefill. latency: 1.16278 s, throughput: 28180.77 token/s
Decode.  latency: 0.01554 s, throughput:  2059.15 token/s
Decode.  latency: 0.01456 s, throughput:  2197.55 token/s
Decode.  latency: 0.01453 s, throughput:  2202.10 token/s
Decode.  latency: 0.01452 s, throughput:  2204.34 token/s
Decode.  latency: 0.01453 s, throughput:  2202.13 token/s
Decode.  median latency: 0.01471 s, median throughput: 2175.89 token/s
Total.   latency: 4.886 s, throughput: 8383.15 token/s
```
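For reference, the median decode latency goes from 0.01364 s to 0.01471 s with fp8_e5m2 enabled (0.01471 / 0.01364 ≈ 1.08), i.e. roughly an 8% decode slowdown, while prefill latency is essentially unchanged.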