onnx export with dynamic shapes, fast attention #324

jpata · 2024-05-25T07:39:23Z

* Factorize the ONNX export into notebooks/cms/cms-validate-onnx.py, which contains a minimal, standalone version of the attention-based model, validated against the base model, together with the code that exports it to ONNX with either unfused (default, slow) or fused (fast on GPU) attention.
* Dynamic shapes: ONNX export now works through TorchScript (the older, non-beta export path in PyTorch).
* Fused attention: we override the aten::scaled_dot_product_attention op to use com.microsoft.MultiHeadAttention from the ONNX Runtime contrib ops.
* It's fast: on an A100, the MultiHeadAttention op executes efficient flash attention kernels, so the whole model runs in about 15 ms and under 2 GB of memory for 5000 inputs; see Integrate new pytorch attention model in CMSSW via ONNX #216 (comment).
* The outputs are equivalent, in terms of physics, to those of the original model.
* Added to Hugging Face in https://huggingface.co/jpata/particleflow/commit/65729d798b0e51598916b4ae9c8c4f712820c79d

Here's what the direct export of torch.nn.functional.scaled_dot_product_attention to an unfused ONNX model, with full matrix multiplications, looks like:

With the fused SDPA operation, which uses flash attention on sufficiently new GPUs, the MatMul->Softmax->MatMul part at the very end is rolled into a single SDPA op that calls MultiHeadAttention:
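For readers without the graph screenshots, the MatMul->Softmax->MatMul pattern that the unfused export spells out can be sketched in plain Python. This is only an illustrative single-head reference, not the code from notebooks/cms/cms-validate-onnx.py; the function names `unfused_attention` and `score_matrix_bytes` are hypothetical, and masking, multiple heads, and batching are left out.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def unfused_attention(q, k, v):
    """Single-head attention; q, k, v are lists of N vectors (hypothetical helper)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    # MatMul 1: the full N x N matrix of scaled dot products is materialized.
    scores = [
        [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        for q_row in q
    ]
    # Softmax: normalize every row of the N x N matrix.
    weights = [softmax(row) for row in scores]
    # MatMul 2: weighted sum of the value vectors.
    dim_v = len(v[0])
    return [
        [sum(w * v_row[j] for w, v_row in zip(w_row, v)) for j in range(dim_v)]
        for w_row in weights
    ]


def score_matrix_bytes(n_elem, bytes_per_value=4):
    """Size of one N x N score matrix; fp32 (4 bytes) is the assumed default."""
    return n_elem * n_elem * bytes_per_value
```

The intermediate `scores` and `weights` matrices are what a fused kernel avoids writing out in full. Under the fp32 assumption, `score_matrix_bytes(10240)` gives about 419 MB for a single matrix; how many such matrices the real model holds at once depends on the number of heads and layers, which is not stated here.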

Here are the timings, showing the benefit of the fused model:

timing/gpu_fp32_fused.txt:Nelem=2560 mean_time=6.99 ms stddev_time=2.89 ms mem_used=1678 MB
timing/gpu_fp32_fused.txt:Nelem=5120 mean_time=16.59 ms stddev_time=0.15 ms mem_used=1946 MB
timing/gpu_fp32_fused.txt:Nelem=10240 mean_time=53.13 ms stddev_time=0.23 ms mem_used=1946 MB

timing/gpu_fp32_unfused.txt:Nelem=2560 mean_time=39.31 ms stddev_time=1.73 ms mem_used=3817 MB
timing/gpu_fp32_unfused.txt:Nelem=5120 mean_time=130.18 ms stddev_time=6.52 ms mem_used=12407 MB
timing/gpu_fp32_unfused.txt:Nelem=10240 mean_time=465.09 ms stddev_time=25.82 ms mem_used=46766 MB
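To make the comparison explicit, the ratios can be computed from the lines above with a few lines of Python. This is a small sketch that assumes the `key=value` layout shown here; the helper names `parse` and `ratios` are hypothetical and not part of the repository.

```python
import re

TIMINGS = """\
timing/gpu_fp32_fused.txt:Nelem=2560 mean_time=6.99 ms stddev_time=2.89 ms mem_used=1678 MB
timing/gpu_fp32_fused.txt:Nelem=5120 mean_time=16.59 ms stddev_time=0.15 ms mem_used=1946 MB
timing/gpu_fp32_fused.txt:Nelem=10240 mean_time=53.13 ms stddev_time=0.23 ms mem_used=1946 MB
timing/gpu_fp32_unfused.txt:Nelem=2560 mean_time=39.31 ms stddev_time=1.73 ms mem_used=3817 MB
timing/gpu_fp32_unfused.txt:Nelem=5120 mean_time=130.18 ms stddev_time=6.52 ms mem_used=12407 MB
timing/gpu_fp32_unfused.txt:Nelem=10240 mean_time=465.09 ms stddev_time=25.82 ms mem_used=46766 MB
"""

PATTERN = re.compile(
    r"gpu_fp32_(?P<kind>fused|unfused)\.txt:Nelem=(?P<n>\d+) "
    r"mean_time=(?P<t>[\d.]+) ms .* mem_used=(?P<mem>\d+) MB"
)


def parse(text):
    """Map (kind, Nelem) -> (mean time in ms, memory in MB)."""
    table = {}
    for line in text.splitlines():
        m = PATTERN.search(line)
        if m:
            table[(m["kind"], int(m["n"]))] = (float(m["t"]), int(m["mem"]))
    return table


def ratios(table):
    """Map Nelem -> (unfused/fused time ratio, unfused/fused memory ratio)."""
    out = {}
    for (kind, n), (t, mem) in table.items():
        if kind == "unfused":
            fused_t, fused_mem = table[("fused", n)]
            out[n] = (t / fused_t, mem / fused_mem)
    return out


if __name__ == "__main__":
    for n, (speedup, mem_ratio) in sorted(ratios(parse(TIMINGS)).items()):
        print(f"Nelem={n}: {speedup:.1f}x faster, {mem_ratio:.1f}x less memory")
```

On these numbers the fused model is roughly 5.6x, 7.8x and 8.8x faster at Nelem=2560, 5120 and 10240, and uses roughly 2.3x, 6.4x and 24x less memory. The unfused memory grows nearly fourfold when Nelem doubles from 5120 to 10240, which is consistent with materializing the full attention matrix.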


* enable onnx export via dynamo with dynamic shapes
* added standalone export script
* fp16 quantization sort of works also
* use sdpa
* MultiheadAttention op runs
* update timing study
* cleanup
* model closes
* update timing study
* onnx is factorized
* update onnx script
* revert main model code
* move to notebook

jpata and others added 6 commits May 25, 2024 09:43

enable onnx export via dynamo with dynamic shapes

6f8fa22

added standalone export script

5388864

fp16 quantization sort of works also

275ff49

up

bda2938

up

4aab293

use sdpa

6eaad81

jpata changed the title ~~enable onnx export via dynamo with dynamic shapes~~ onnx export of quantized model with dynamic shapes May 25, 2024

jpata mentioned this pull request May 25, 2024

Integrate new pytorch attention model in CMSSW via ONNX #216

Closed

3 tasks

jpata and others added 9 commits May 26, 2024 13:02

MultiheadAttention op runs

7a301da

update timing study

2d89ad2

cleanup

67ae302

Merge branch 'fix_onnx_export' of https://github.com/jpata/particleflow…

eb3ee0a

… into fix_onnx_export

model closes

e40f5c3

update timing study

cfa6e4f

onnx is factorized

ebba0d4

update onnx script

eec765e

revert main model code

f8cce36

jpata changed the title ~~onnx export of quantized model with dynamic shapes~~ onnx export with dynamic shapes, fast attention May 27, 2024

move to notebook

8d4595c

jpata linked an issue May 27, 2024 that may be closed by this pull request

Integrate new pytorch attention model in CMSSW via ONNX #216

Closed

3 tasks

jpata added the hard label May 27, 2024

jpata marked this pull request as ready for review May 27, 2024 15:27

jpata merged commit a7b00c1 into main May 27, 2024
5 checks passed

jpata added hard and removed hard labels May 27, 2024

jpata deleted the fix_onnx_export branch July 13, 2024 09:16
