Slow model loading time for CoreML quantized model #5718
Comments
Could you please clarify what's the goal of this issue? Is the 3.10 ms cold-start time too much?
Thanks @YifanShenSZ, I'd like to correct the above numbers. The goal of this issue is to resolve the long load time for quantized models using the CoreML delegate. Load times:
Sorry, the description is not clear; David's response is clearer, and it was originally from him.
Thanks David. So the issue is that using the ExecuTorch CoreML delegate has a much longer loading time than directly using the Core ML runtime? Handing it over to @cymbalrush to investigate where the overhead comes from.
Thanks @YifanShenSZ @cymbalrush. Both models are using the ExecuTorch CoreML delegate. The quantized model takes much longer.
@d-findlay how are you getting the load time? Is it from the devtools?
I just asked @d-findlay, and he said both the devtools and Xcode Instruments showed the long load time.
@cymbalrush, we are using devtools while specifying profile so we can inspect it with Instruments. We can see that the quantized model takes 1.3 seconds to Load (prepare and cache) the model on CoreML, of which 1.14 seconds is spent on the Neural Engine Compile. By comparison, the unquantized model takes 464 ms to Load (prepare and cache) the model on CoreML, of which 297 ms is spent on the Neural Engine Compile.
@cymbalrush It's also worth noting that when we try to use MODEL_TYPE.COMPILED_MODEL, we get a failure. However, this is unrelated to the above concern: with the default MODEL_TYPE we still get longer load times for quantized models.
Thanks @d-findlay! Could you try iOS 18? There is an optimization that was part of the iOS 18 release that improves Neural Engine compile time. I am seeing a load improvement when I test locally on iOS 18, but it would be great if you could confirm it. This is a one-time cost, as you know; the subsequent loads should be faster.
Investigating
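The "one-time cost" point can be illustrated with a cached loader: only the first load pays the compile step, and subsequent loads reuse the cached artifact. This is a minimal pure-Python sketch of the pattern, not the ExecuTorch or Core ML API; `compile_model`, `load_model`, and the `"model.pte"` name are all hypothetical stand-ins.

```python
import time

COMPILE_COUNT = 0  # tracks how often the expensive step actually runs


def compile_model(source: str) -> str:
    """Stand-in for the expensive Neural Engine compile step (hypothetical)."""
    global COMPILE_COUNT
    COMPILE_COUNT += 1
    time.sleep(0.05)  # simulate compile latency
    return source.upper()  # pretend this is the compiled artifact


def load_model(source: str, cache: dict) -> str:
    """Compile on the first load, then serve the cached artifact."""
    if source not in cache:
        cache[source] = compile_model(source)
    return cache[source]


cache: dict = {}

t0 = time.perf_counter()
load_model("model.pte", cache)  # cold load: pays the compile cost
cold = time.perf_counter() - t0

t0 = time.perf_counter()
load_model("model.pte", cache)  # warm load: cache hit, no compile
warm = time.perf_counter() - t0

print(COMPILE_COUNT)   # the compile ran exactly once
print(warm < cold)     # warm loads skip the compile latency
```

The same shape applies here: the Neural Engine compile dominates the cold load, while warm loads should hit the on-device cache.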
Thank you very much @cymbalrush. Do you recommend the
It won't improve the Neural Engine compile time, but it could improve the model load time. If the type is
🐛 Describe the bug
Get #5710 and run
The FP32 model runs fully resident on the ANE at 0.9 ms on average and 11.13 ms cold-start (first inference).
The int8 quantized model also runs fully resident on the ANE, at 0.54 ms on average and 3.10 ms cold-start. Looking at the layers, there appear to be many quantize ops followed immediately by dequantize ops.
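A back-to-back quantize → dequantize pair is an identity up to rounding error, which is why seeing many such pairs in the layer list suggests a missed fusion opportunity: a graph optimizer can fold each pair away instead of executing both ops. Here is a small sketch of symmetric int8 quantization showing the roundtrip is near-identity; the scale choice is illustrative only, not what Core ML actually uses.

```python
def quantize(xs, scale):
    """Affine-quantize floats to int8 with a symmetric scale."""
    return [max(-128, min(127, round(x / scale))) for x in xs]


def dequantize(qs, scale):
    """Map int8 values back to floats."""
    return [q * scale for q in qs]


xs = [0.10, -0.25, 0.73, 1.00]
scale = 1.0 / 127  # symmetric scale covering roughly [-1, 1]

roundtrip = dequantize(quantize(xs, scale), scale)

# Each quantize->dequantize pair only introduces rounding error of at
# most scale/2 per element, so adjacent pairs in a graph are effectively
# no-ops and can be fused away.
max_err = max(abs(a - b) for a, b in zip(xs, roundtrip))
print(max_err <= scale / 2)  # → True
```

If the pairs survive into the compiled graph, each one still costs ANE work at runtime and adds ops for the Neural Engine compiler to process, which may contribute to the longer compile time observed above.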
Versions