Llama: Merge query/key/value projection layers #498

mryab · 2023-09-02T21:47:26Z

This PR improves inference throughput by about 7% (measured on a single A100-80GB) by merging the query, key, and value projections into a single larger matrix multiplication. Merging reduces the overhead of launching several matmul kernels, which turns out to be substantial for single-sequence, single-token inference steps. The PR also adds a `--throughput dry_run` option that estimates throughput without starting a server.
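The idea behind the merge can be illustrated with a minimal NumPy sketch. This is not the code from this PR, which operates on the PyTorch Llama attention layers. The sizes and the names `w_q`, `w_k`, `w_v`, and `w_qkv` are illustrative assumptions. The sketch stacks the three weight matrices along the output dimension once, runs one matmul per step, and splits the result back into query, key, and value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed for illustration); key/value outputs may be narrower
# than the query output, e.g. with grouped-query attention.
hidden, q_out, kv_out = 16, 16, 4

# Separate projection weights, stored as (out_features, in_features).
w_q = rng.standard_normal((q_out, hidden))
w_k = rng.standard_normal((kv_out, hidden))
w_v = rng.standard_normal((kv_out, hidden))

# Merged weight: concatenate along the output dimension, once, at load time.
w_qkv = np.concatenate([w_q, w_k, w_v], axis=0)

# Single-sequence, single-token input, as in an inference step.
x = rng.standard_normal((1, hidden))

# Baseline: three separate matmuls.
q_ref, k_ref, v_ref = x @ w_q.T, x @ w_k.T, x @ w_v.T

# Merged: one matmul, then split the output at the q/k and k/v boundaries.
qkv = x @ w_qkv.T
q, k, v = np.split(qkv, [q_out, q_out + kv_out], axis=-1)

# Both paths produce the same projections.
assert np.allclose(q, q_ref)
assert np.allclose(k, k_ref)
assert np.allclose(v, v_ref)
```

The arithmetic is the same in both paths. On a GPU the expected saving comes from issuing one kernel instead of three, which matters most when each matmul is as small as a single-token step.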

Sample results with and without the optimization (each run used the command `CUDA_VISIBLE_DEVICES=0 python -m petals.cli.run_server petals-team/StableBeluga2 --throughput dry_run`):

Current code (branch https://github.com/bigscience-workshop/petals/tree/no_qkv_merge):

Sep 03 00:50:35.135 [INFO] Inference throughput: 532.7 tokens/sec per block (1 tokens/batch, NVIDIA A100-SXM-80GB GPU, bfloat16, quantized to nf4)
Sep 03 00:50:47.722 [INFO] Forward pass throughput: 51749.0 tokens/sec per block (1024 tokens/batch, NVIDIA A100-SXM-80GB GPU, bfloat16, quantized to nf4)

Sep 03 00:52:07.524 [INFO] Inference throughput: 576.4 tokens/sec per block (1 tokens/batch, NVIDIA A100-SXM-80GB GPU, bfloat16, quantized to nf4)
Sep 03 00:52:20.919 [INFO] Forward pass throughput: 36552.9 tokens/sec per block (1024 tokens/batch, NVIDIA A100-SXM-80GB GPU, bfloat16, quantized to nf4)

Sep 03 00:53:54.616 [INFO] Inference throughput: 512.7 tokens/sec per block (1 tokens/batch, NVIDIA A100-SXM-80GB GPU, bfloat16, quantized to nf4)
Sep 03 00:54:14.464 [INFO] Forward pass throughput: 50242.5 tokens/sec per block (1024 tokens/batch, NVIDIA A100-SXM-80GB GPU, bfloat16, quantized to nf4)

Code from this PR:

Sep 03 00:55:25.680 [INFO] Inference throughput: 564.7 tokens/sec per block (1 tokens/batch, NVIDIA A100-SXM-80GB GPU, bfloat16, quantized to nf4)
Sep 03 00:55:38.648 [INFO] Forward pass throughput: 33023.0 tokens/sec per block (1024 tokens/batch, NVIDIA A100-SXM-80GB GPU, bfloat16, quantized to nf4)

Sep 03 00:56:45.526 [INFO] Inference throughput: 578.4 tokens/sec per block (1 tokens/batch, NVIDIA A100-SXM-80GB GPU, bfloat16, quantized to nf4)
Sep 03 00:56:59.632 [INFO] Forward pass throughput: 54655.0 tokens/sec per block (1024 tokens/batch, NVIDIA A100-SXM-80GB GPU, bfloat16, quantized to nf4)

Sep 03 00:58:18.783 [INFO] Inference throughput: 593.1 tokens/sec per block (1 tokens/batch, NVIDIA A100-SXM-80GB GPU, bfloat16, quantized to nf4)
Sep 03 00:58:33.015 [INFO] Forward pass throughput: 36200.4 tokens/sec per block (1024 tokens/batch, NVIDIA A100-SXM-80GB GPU, bfloat16, quantized to nf4)
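As a quick check of the roughly 7% figure, the short script below averages the inference throughput values copied from the logs above. The helper `relative_change` is a hypothetical name introduced only for this sketch. The script also reports how widely the forward pass numbers vary across the six runs, which suggests that three samples per branch are probably too few to compare forward pass throughput.

```python
from statistics import mean

# Inference throughput (tokens/sec per block, 1 token/batch), copied from the logs.
baseline_inference = [532.7, 576.4, 512.7]
merged_inference = [564.7, 578.4, 593.1]

# Forward pass throughput (tokens/sec per block, 1024 tokens/batch), same logs.
baseline_forward = [51749.0, 36552.9, 50242.5]
merged_forward = [33023.0, 54655.0, 36200.4]


def relative_change(before, after):
    """Relative change of the mean, as a fraction (0.07 means +7%)."""
    return mean(after) / mean(before) - 1.0


inference_gain = relative_change(baseline_inference, merged_inference)

# Ratio of the fastest to the slowest forward pass run, over both branches.
all_forward = baseline_forward + merged_forward
forward_spread = max(all_forward) / min(all_forward)

print(
    f"inference: {mean(baseline_inference):.1f} -> "
    f"{mean(merged_inference):.1f} tokens/sec ({inference_gain:+.1%})"
)
print(f"forward pass max/min across all runs: {forward_spread:.2f}x")
```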

mryab added 3 commits September 3, 2023 00:45

Add dry_run option to --throughput

b2ab84c

Merge query/key/value projection layers

57119bb

Remove unused import in throughput.py

c666a97

mryab marked this pull request as ready for review September 2, 2023 22:06

mryab added 5 commits September 3, 2023 01:08

Ignore missing qkv_proj.weight when loading a checkpoint

4644131

Reformat code with black

f100915

Fix removal of nonexistent keys

16fb547

Fix checking for nonexistent keys

9cb4c72

Create dummy data when materializing qkv_proj

4159e55

borzunov changed the title ~~Merge query/key/value projection layers~~ Llama: Merge query/key/value projection layers Sep 4, 2023
