Implement 2d tiled matmulnbits specialized for prefill #23058

sushraja-msft · 2024-12-09T23:08:59Z

Description

This change implements matmul4bits with tiling for both A and B. Tiling helps prefill on Intel integrated GPUs because every row of A has to run through the same set of rows of B, so those rows can be shared across the rows of A instead of being reloaded for each one. This change should improve core occupancy, and model_benchmark does show improved prefill throughput.

The same shader is not used for generation: when A has just a single row, the other threads in the workgroup sit idle, and that hurts performance.
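The reuse argument above can be illustrated with a small CPU-side sketch. This is not the shader from this PR: it skips 4-bit dequantization entirely, and the function name `tiled_matmul`, the `tile_m`/`tile_n` parameters, and the `b_loads` counter are hypothetical names introduced only for illustration. It counts how many elements of B are fetched when a tile of B is staged once and shared by a tile of A rows.

```python
def tiled_matmul(a, b, tile_m, tile_n):
    """Multiply a (M x K) by b (K x N) one output tile at a time.

    Returns (c, b_loads), where b_loads counts the elements of b copied
    into the per-tile cache (a stand-in for workgroup shared memory).
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    b_loads = 0
    for row0 in range(0, m, tile_m):
        rows = range(row0, min(row0 + tile_m, m))
        for col0 in range(0, n, tile_n):
            cols = range(col0, min(col0 + tile_n, n))
            # Stage this tile of B once; every row in `rows` reuses it.
            b_tile = [[b[kk][j] for j in cols] for kk in range(k)]
            b_loads += k * len(cols)
            for i in rows:
                for jj, j in enumerate(cols):
                    c[i][j] = sum(a[i][kk] * b_tile[kk][jj] for kk in range(k))
    return c, b_loads


a = [[1, 2], [3, 4], [5, 6], [7, 8]]  # 4 x 2, think "4 prompt tokens"
b = [[1, 0, 2], [0, 1, 3]]            # 2 x 3

_, loads_row_at_a_time = tiled_matmul(a, b, tile_m=1, tile_n=3)
_, loads_tiled = tiled_matmul(a, b, tile_m=4, tile_n=3)
print(loads_row_at_a_time, loads_tiled)  # 24 6
```

With `tile_m=1` each row of A triggers its own pass over B, which mirrors the single-row generation case; with a larger `tile_m` the same B tile serves several rows, which is the situation prefill is in.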

-- Baseline run on an Alderlake GPU --

C:\onnxruntime>C:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web -l 500
Batch size: 1, prompt tokens: 501, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       1.72338e+07
        avg (tokens/s): 29.0707                          << 
        p50 (us):       1.72548e+07
        stddev (us):    57012.8
        n:              5 * 501 token(s)
Token generation:
        avg (us):       79227.5
        avg (tokens/s): 12.6219
        p50 (us):       79284.4
        stddev (us):    2109.72
        n:              635 * 1 token(s)
Token sampling:
        avg (us):       15.8198
        avg (tokens/s): 63211.8
        p50 (us):       14.3
        stddev (us):    8.67178
        n:              640 * 1 token(s)
E2E generation (entire generation loop):
        avg (ms):       27297.8
        p50 (ms):       27269.8
        stddev (ms):    89.4322
        n:              5
Peak working set size (bytes): 5490987008
WebGPU device lost (2): Device was destroyed.

----------------------------------- With Prefill Optimization ----

C:\onnxruntime>C:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web -l 500                                                                                                                                                               
Batch size: 1, prompt tokens: 501, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       1.2135e+07
        avg (tokens/s): 41.2856                                 << 
        p50 (us):       1.21288e+07
        stddev (us):    21282.1
        n:              5 * 501 token(s)
Token generation:
        avg (us):       78945.3
        avg (tokens/s): 12.667
        p50 (us):       78900.7
        stddev (us):    2232.43
        n:              635 * 1 token(s)
Token sampling:
        avg (us):       20.5608
        avg (tokens/s): 48636.3
        p50 (us):       18.7
        stddev (us):    19.0409
        n:              640 * 1 token(s)
E2E generation (entire generation loop):
        avg (ms):       22163.8
        p50 (ms):       22160.1
        stddev (ms):    31.3122
        n:              5
Peak working set size (bytes): 5478862848
WebGPU device lost (2): Device was destroyed.
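
For a quick read of the two runs above, the ratios can be computed directly from the reported averages. The snippet only restates numbers from the logs; the dictionary keys are hypothetical labels chosen for this sketch.

```python
# Averages copied from the two model_benchmark runs above.
baseline = {"prefill_tokens_per_s": 29.0707, "e2e_ms": 27297.8}
optimized = {"prefill_tokens_per_s": 41.2856, "e2e_ms": 22163.8}

# Throughput: higher is better, so divide optimized by baseline.
prefill_speedup = optimized["prefill_tokens_per_s"] / baseline["prefill_tokens_per_s"]
# Latency: lower is better, so divide baseline by optimized.
e2e_speedup = baseline["e2e_ms"] / optimized["e2e_ms"]

print(f"prefill: {prefill_speedup:.2f}x, E2E: {e2e_speedup:.2f}x")
# prints "prefill: 1.42x, E2E: 1.23x"
```

Token generation throughput is essentially unchanged between the runs (12.6219 vs 12.667 tokens/s), which is consistent with the generation path not using the new shader.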

guschmue · 2024-12-09T23:24:34Z

/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline

guschmue · 2024-12-09T23:24:48Z

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

azure-pipelines · 2024-12-09T23:24:50Z

Azure Pipelines successfully started running 2 pipeline(s).

guschmue · 2024-12-09T23:24:56Z

/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models

guschmue · 2024-12-09T23:25:07Z

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2024-12-09T23:25:16Z

Azure Pipelines successfully started running 4 pipeline(s).

azure-pipelines · 2024-12-09T23:25:20Z

Azure Pipelines successfully started running 3 pipeline(s).

azure-pipelines · 2024-12-09T23:25:24Z

Azure Pipelines successfully started running 9 pipeline(s).

guschmue · 2024-12-10T18:30:31Z

just needs a fix for the macos build

guschmue · 2024-12-10T22:19:13Z

/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline

azure-pipelines · 2024-12-10T22:19:28Z

Azure Pipelines successfully started running 2 pipeline(s).

guschmue · 2024-12-10T22:19:28Z

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

guschmue · 2024-12-10T22:19:40Z

/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models

guschmue · 2024-12-10T22:19:53Z

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2024-12-10T22:19:58Z

Azure Pipelines successfully started running 4 pipeline(s).

azure-pipelines · 2024-12-10T22:20:08Z

Azure Pipelines successfully started running 3 pipeline(s).

azure-pipelines · 2024-12-10T22:20:12Z

Azure Pipelines successfully started running 9 pipeline(s).

guschmue · 2024-12-10T23:15:57Z

I see similar prefill improvements on Xe TGL, ~1.5x


sushraja-msft added 2 commits December 9, 2024 14:13

Implement 2d tiled matmulnbits specialized for prefill

1ee552c

Run linter

ffb2dab

guschmue added the ep:WebGPU ort-web webgpu provider label Dec 10, 2024

guschmue previously approved these changes Dec 10, 2024


Mac fix and improve comments

aa51ec8

sushraja-msft dismissed guschmue’s stale review via aa51ec8 December 10, 2024 22:13

sushraja-msft mentioned this pull request Dec 10, 2024

Improves 2d tiled matmulnbits by repeating A, loads N times for each B load #23071

Open

guschmue approved these changes Dec 11, 2024


guschmue merged commit 8800830 into microsoft:main Dec 11, 2024
77 checks passed

