
vit executorch inference speed much slower than onnx #6961

Open
salvadog opened this issue Nov 19, 2024 · 11 comments

@salvadog

salvadog commented Nov 19, 2024

🐛 Describe the bug

I've encountered a performance issue where ExecuTorch's inference speed is significantly slower than ONNX Runtime's, both on a Linux PC and on an Android phone. I believe this is a critical issue that needs to be addressed, as it affects the efficiency of our model deployment.

Environment:

onnx==1.17.0
onnxruntime==1.20.0
executorch==0.3.0
torch==2.4.0+cu121
python=3.10.15

Linux PC hardware: NVIDIA A100 80GB, Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
Android phone hardware: Qualcomm Snapdragon 8+ Gen 1

Reproduction Steps:

The ViT is an InternViT-300M model with a 7 × 3 × 448 × 448 input size.

I export the ViT model with:

python -m examples.xnnpack.aot_compiler --model_name="internvit" --delegate --quantize

And run inference on the Linux PC with:

./cmake-out/backends/xnnpack/xnn_executor_runner --model_path=./internvit_xnnpack_q8.pte

And run inference on Android with:

adb shell ./data/local/tmp/vit/xnn_executor_runner_android --model_path /data/local/tmp/vit/internvit_xnnpack_q8.pte

Expected Behavior:

I'm not deeply familiar with the expected inference times for either ONNX or ExecuTorch, but I assumed they would be within an acceptable margin of each other. I've already exported a Llama2-2B model, which runs at a reasonable speed (TTFT 0.5 s + 30 tokens/s) on my Android phone, so I expected ViT-300M inference to be somewhat comparable.

Actual Behavior:

ONNX inference time on Linux PC: 12 s
ViT ExecuTorch inference time on Linux PC: 450 s
ViT ExecuTorch inference time on Android: 200 s

Questions:

Is there any known performance regression in ExecuTorch compared to ONNX?
Are there any optimization techniques or configurations that can improve the ViT's ExecuTorch performance?

I would appreciate any guidance on how to resolve this performance discrepancy. Thank you for your attention to this issue.

Versions

Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (conda-forge gcc 13.3.0-1) 13.3.0
Clang version: Could not collect
CMake version: version 3.30.3
Libc version: glibc-2.31

Python version: 3.10.15 | packaged by conda-forge | (main, Sep 20 2024, 16:37:05) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-4.15.0-191-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.6.68
CUDA_MODULE_LOADING set to: LAZY

@metascroy
Contributor

@salvadog is ONNX using CUDA or CPU? ExecuTorch is designed for mobile deployment and does not have a CUDA backend.

@salvadog
Author

salvadog commented Nov 20, 2024

> @salvadog is ONNX using CUDA or CPU? ExecuTorch is designed for mobile deployment and does not have a CUDA backend.

Yeah, I know; ONNX is also using the CPU. I ran the 300M ViT model on Android with an 8 × 3 × 448 × 448 input, and the inference latency is quite high, about 200 s, much slower than Llama-2B on Android (TTFT 0.5 s + 30 tokens/s) and also much slower than ViT ONNX inference on Android. Both Llama-2B and the ViT are running with the XNNPACK backend.

@metascroy
Contributor

> > @salvadog is ONNX using CUDA or CPU? ExecuTorch is designed for mobile deployment and does not have a CUDA backend.
>
> Yeah, I know; ONNX is also using the CPU. I ran the 300M ViT model on Android with an 8 × 3 × 448 × 448 input, and the inference latency is quite high, about 200 s, much slower than Llama-2B on Android (TTFT 0.5 s + 30 tokens/s) and also much slower than ViT ONNX inference on Android. Both Llama-2B and the ViT are running with the XNNPACK backend.

Some things to check: make sure ExecuTorch is built in release mode, and check how much of the model graph is lowered to XNNPACK vs. running on the portable ops in ExecuTorch (print the graph after running to_backend).
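
For reference, here is a minimal sketch of what that check could look like in Python, assuming model and example_inputs come from your export script and skipping the quantization step for brevity; the XnnpackPartitioner import path is the one used by the XNNPACK examples and may differ between releases:

import torch
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# 'model' and 'example_inputs' are assumed to come from your export script.
exported = torch.export.export(model, example_inputs)
edge = to_edge(exported)
edge = edge.to_backend(XnnpackPartitioner())

# Subgraphs lowered to XNNPACK show up as executorch_call_delegate nodes;
# anything still printed as an edge aten op will run on ExecuTorch's own kernels.
print(edge.exported_program().graph_module)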

Another thing to call out: because ExecuTorch is focused on mobile, we usually have better performance on ARM CPUs than on x86.

cc @digantdesai for ExecuTorch vs. ONNX perf issues with XNNPACK

@salvadog
Author

> > > @salvadog is ONNX using CUDA or CPU? ExecuTorch is designed for mobile deployment and does not have a CUDA backend.
> >
> > Yeah, I know; ONNX is also using the CPU. I ran the 300M ViT model on Android with an 8 × 3 × 448 × 448 input, and the inference latency is quite high, about 200 s, much slower than Llama-2B on Android (TTFT 0.5 s + 30 tokens/s) and also much slower than ViT ONNX inference on Android. Both Llama-2B and the ViT are running with the XNNPACK backend.
>
> Some things to check: make sure ExecuTorch is built in release mode, and check how much of the model graph is lowered to XNNPACK vs. running on the portable ops in ExecuTorch (print the graph after running to_backend).
>
> Another thing to call out: because ExecuTorch is focused on mobile, we usually have better performance on ARM CPUs than on x86.
>
> cc @digantdesai for ExecuTorch vs. ONNX perf issues with XNNPACK

I've made sure ExecuTorch is built in release mode. My main concern is that inference speed is good for ExecuTorch Llama2-2B on Android but quite slow for the ViT under a similar export method and settings. Is this expected behavior, or has something gone wrong? @metascroy @digantdesai

@digantdesai
Contributor

digantdesai commented Nov 21, 2024

Thanks @salvadog for trying this out. And I am glad Llama is running with decent perf for you on the Android phone.

> ONNX inference time on Linux PC: 12 s
> ViT ExecuTorch inference time on Linux PC: 450 s
> ViT ExecuTorch inference time on Android: 200 s

This is not what I would expect. I guess some operators could be running on the reference (and slow) implementation and not on XNNPACK.

> check how much of the model graph is lowered to XNNPACK vs. running on the portable ops in ExecuTorch (print the graph after running to_backend)

As @metascroy suggested, can we try this?

@salvadog
Author

> Thanks @salvadog for trying this out. And I am glad Llama is running with decent perf for you on the Android phone.
>
> > ONNX inference time on Linux PC: 12 s
> > ViT ExecuTorch inference time on Linux PC: 450 s
> > ViT ExecuTorch inference time on Android: 200 s
>
> This is not what I would expect. I guess some operators could be running on the reference (and slow) implementation and not on XNNPACK.
>
> > check how much of the model graph is lowered to XNNPACK vs. running on the portable ops in ExecuTorch (print the graph after running to_backend)
>
> As @metascroy suggested, can we try this?

Thanks for helping out! My export commands are

Llama:
python -m examples.models.llama2.export_llama --checkpoint /XXX/checkpoint.pth \
  -p /XXX/config.json \
  -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 \
  --metadata '{"get_bos_id":1, "get_eos_id":2}' \
  --embedding-quantize 4,32 --output_name="internlm2_2B_kv_sdpa_xnn_qe_4_32.pte"

VIT:

python -m examples.xnnpack.aot_compiler --model_name="internvit" --delegate --quantize

I've attached the Llama and ViT export logs; the ViT log is quite long, so I only attached the beginning and ending parts. I didn't see information about the model graph in the ViT log. Could you tell me how to modify the code to

> check how much of the model graph is lowered to XNNPACK vs. running on the portable ops in ExecuTorch (print the graph after running to_backend)

And let me know if any other information is needed.

llama_export_log.txt
vit_export_log.txt

@digantdesai
Contributor

Thanks a ton for sharing the output. The ViT log file vit_export_log.txt does contain both the exported graph and the graph with delegation.

So looking at the graph post delegation,

# line 336 where the export graph with delegate starts in your file.
$ awk 'NR > 336' vit_export_log.txt \
  | grep -o "call_function\[target=.*\](" \
  | sed -r "s/call_function\[target=(.*)\]\(/\1/g" \
  | sort -h | uniq -c | sort -n
   1 executorch.exir.dialects.edge._ops.aten.select_copy.int
   6 executorch.exir.dialects.edge._ops.aten.gelu.default
  11 executorch.exir.dialects.edge._ops.aten.native_layer_norm.default
  12 executorch.exir.dialects.edge._ops.aten.bmm.default
  16 executorch.exir.dialects.edge._ops.aten.squeeze_copy.dims
  18 executorch.exir.dialects.edge._ops.aten.clone.default
  24 executorch.exir.dialects.edge._ops.aten.expand_copy.default
  47 executorch.exir.dialects.edge._ops.aten.view_copy.default
  52 torch.ops.higher_order.executorch_call_delegate # these lower to XNNPACK
  96 operator.getitem

So a bunch of operators from the ViT graph are running outside XNNPACK. In ET they can run either in the Optimized library or in the Portable library, and the Portable implementations of bmm or gelu can be slow.
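
For completeness, a rough Python equivalent of the awk count above, assuming edge is the lowered EdgeProgramManager from the sketch earlier in the thread; it just tallies the call_function targets left in the top-level graph:

from collections import Counter

# 'edge' is assumed to be the lowered EdgeProgramManager from the export sketch above.
counts = Counter(
    str(node.target)
    for node in edge.exported_program().graph_module.graph.nodes
    if node.op == "call_function"
)
for target, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{n:4d} {target}")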

You can validate this by doing something like
adb shell "cd /data/local/tmp; simpleperf record xnn_executor_runner_android --model_path ./vit/internvit_xnnpack_q8.pte && simpleperf report" | less

And skimming the CMake file, it seems like we may not be linking the optimized library (optimized_ops_lib) into xnn_executor_runner.

@salvadog
Author

> Thanks a ton for sharing the output. The ViT log file vit_export_log.txt does contain both the exported graph and the graph with delegation.
>
> So looking at the graph post delegation,
>
> # line 336 where the export graph with delegate starts in your file.
> $ awk 'NR > 336' vit_export_log.txt \
>   | grep -o "call_function\[target=.*\](" \
>   | sed -r "s/call_function\[target=(.*)\]\(/\1/g" \
>   | sort -h | uniq -c | sort -n
>    1 executorch.exir.dialects.edge._ops.aten.select_copy.int
>    6 executorch.exir.dialects.edge._ops.aten.gelu.default
>   11 executorch.exir.dialects.edge._ops.aten.native_layer_norm.default
>   12 executorch.exir.dialects.edge._ops.aten.bmm.default
>   16 executorch.exir.dialects.edge._ops.aten.squeeze_copy.dims
>   18 executorch.exir.dialects.edge._ops.aten.clone.default
>   24 executorch.exir.dialects.edge._ops.aten.expand_copy.default
>   47 executorch.exir.dialects.edge._ops.aten.view_copy.default
>   52 torch.ops.higher_order.executorch_call_delegate # these lower to XNNPACK
>   96 operator.getitem
>
> So a bunch of operators from the ViT graph are running outside XNNPACK. In ET they can run either in the Optimized library or in the Portable library, and the Portable implementations of bmm or gelu can be slow.
>
> You can validate this by doing something like
> adb shell "cd /data/local/tmp; simpleperf record xnn_executor_runner_android --model_path ./vit/internvit_xnnpack_q8.pte && simpleperf report" | less
>
> And skimming the CMake file, it seems like we may not be linking the optimized library (optimized_ops_lib) into xnn_executor_runner.

Thank you so much for your invaluable help, @digantdesai! I've included the output from the Android ET runner and the simpleperf report. It currently takes 55 seconds to process a tensor of shape [8,3,448,448] with a 300M VIT model. The simpleperf report indicates that the BMM operation is not leveraging XNNPACK and is responsible for 70% of the total time expenditure.

Does this imply that if we were to optimize the BMM operation with XNNPACK, we could potentially reduce the total time to roughly 55 s × 0.3 ≈ 16 s? Even so, this would still be a significant amount of time. I'm curious about the expected performance for executing a ViT model of this scale with ET and whether there are any benchmarks or examples I could use for reference. Additionally, I am eager to explore optimization strategies to achieve my ideal running speed of 1 second. Is this goal attainable, and if so, what steps should I take to optimize the performance further?

vit_simpleperf_log.txt

@salvadog
Author

salvadog commented Dec 2, 2024

Hi @digantdesai @metascroy

I was wondering if there's any update or if there's anything additional we can provide to help move this forward. We're eager to get this resolved, and any information you can share would be greatly appreciated.

Thank you for your time and effort on this.

@digantdesai
Contributor

> Overhead  Command  Pid  Tid  Shared Object  Symbol
> 72.03%  ./xnn_executor_runner_android  32703  32703  /data/local/tmp/internlm2-2B-pte/vit/xnn_executor_runner_android  torch::executor::native::bmm_out(torch::executor::KernelRuntimeContext&, torch::executor::Tensor const&, torch::executor::Tensor const&, torch::executor::Tensor&)

Looks like this is the problem.

@mcr229 - don't we lower bmm to xnnpack now? Or at least we should use optimized op.

@mcr229
Contributor

mcr229 commented Dec 2, 2024

I think if you link the optimized kernel lib, there should be a fast bmm, which should help with perf. I do believe bmm support was added; if you're using v0.3.0 of ExecuTorch, it might not have it. Perhaps you can try exporting and running the model on v0.4.0?
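
A quick way to sanity-check the re-export, sketched under the assumption that edge is the lowered EdgeProgramManager from the earlier export sketch: after lowering with v0.4.0, bmm should no longer appear among the non-delegated ops.

# 'edge' is assumed to be the lowered EdgeProgramManager produced with executorch v0.4.0.
remaining_bmm = [
    node
    for node in edge.exported_program().graph_module.graph.nodes
    if node.op == "call_function" and "bmm" in str(node.target)
]
print("non-delegated bmm nodes:", len(remaining_bmm))  # expect 0 once bmm is lowered to XNNPACK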
