
CoreML - Writing CoreML Model on every inference session creation #21761

Open
henryruhs opened this issue Aug 15, 2024 · 15 comments · May be fixed by #23065
Labels
contributions welcome (lower priority issues for the core ORT teams), ep:CoreML (issues related to CoreML execution provider)

Comments

@henryruhs

henryruhs commented Aug 15, 2024

Describe the issue

I set the onnxruntime log level to verbose to understand why the CoreML provider is so slow. It turns out the ONNX model is converted to an mlmodel on every inference session creation, without any caching.

2024-08-15 23:01:33.924908 [I:onnxruntime:, model_builder.cc:926 SaveModel] Writing CoreML Model to /var/folders/nz/t7blntb55gv6k4hzbd91kl5h0000gn/T/onnxruntime-72B02264-8413-46DF-8067-45E9808043DF-16638-000199F2EF486AAE.model.mlmodel

To reproduce

Just use set_default_logger_severity(0).

I use an ONNX model with opset 17, but the issue is not specific to a single model.
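
A minimal sketch of what I do (the model path is a placeholder; any ONNX model shows the same log line):

import onnxruntime

# Severity 0 = VERBOSE, which surfaces the CoreML EP log lines.
onnxruntime.set_default_logger_severity(0)

# Placeholder model path; the issue is not specific to a single model.
session = onnxruntime.InferenceSession(
    'model.onnx',
    providers = ['CoreMLExecutionProvider', 'CPUExecutionProvider'],
)
# Every session creation logs "Writing CoreML Model to /var/folders/...".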

Urgency

No response

Platform

Mac

OS Version

14.5

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.18.0

ONNX Runtime API

Python

Architecture

ARM64

Execution Provider

CoreML

Execution Provider Library Version

No response

@github-actions github-actions bot added the ep:CoreML (issues related to CoreML execution provider) label Aug 15, 2024

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

@github-actions github-actions bot added the stale (issues that have not been addressed in a while; categorized by a bot) label Sep 15, 2024
@henryruhs
Author

Any updates on this?

@github-actions github-actions bot removed the stale (issues that have not been addressed in a while; categorized by a bot) label Sep 16, 2024
@yuygfgg

yuygfgg commented Sep 19, 2024

I'm having the same issue here. And disk usage is also a problem.

AmusementClub/vs-mlrt#107

It seems onnxruntime converts the .onnx model to CoreML models and then loads them back from disk.

Status ModelBuilder::Build(const GraphViewer& graph_viewer, const logging::Logger& logger,
                           int32_t coreml_version, uint32_t coreml_flags,
                           std::vector<std::string>&& onnx_input_names,
                           std::vector<std::string>&& onnx_output_names,
                           std::unique_ptr<Model>& model) {
  ModelBuilder builder(graph_viewer, logger, coreml_version, coreml_flags,
                       std::move(onnx_input_names), std::move(onnx_output_names));
  ORT_RETURN_IF_ERROR(builder.CreateModel());
  ORT_RETURN_IF_ERROR(builder.SaveModel());  // writes the converted model to a temp .mlmodel on every session creation
  return builder.LoadModel(model);
}

@henryruhs
Author

henryruhs commented Oct 2, 2024

@skottmckay Could you please take a look? Caching the converted models can't be that big of a deal to implement.

@skottmckay skottmckay added the contributions welcome (lower priority issues for the core ORT teams) label Oct 3, 2024
@skottmckay
Contributor

The basics are simple. Making it robust is more work.

For example, if the source ONNX model changes, how do you detect that? e.g. an app could download a new version of the ONNX model in the background, invalidating the cached version.

An app could have multiple onnx models so the caching logic would need a way to ensure that a specific input model is cached in a unique location.

An inference session could be created with model bytes instead of a path. Should the solution handle this as well? If not, we need to explain to users how/when/why they can/can't cache the CoreML model via log messages and documentation.
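
For illustration, both ways of creating a session exist in the Python API (the model path below is just a placeholder):

import onnxruntime

# From a file path - there is a file on disk that could be checksummed.
session_from_path = onnxruntime.InferenceSession(
    'model.onnx',
    providers = ['CoreMLExecutionProvider'],
)

# From in-memory bytes - no file is available by the time the EP runs.
with open('model.onnx', 'rb') as f:
    model_bytes = f.read()
session_from_bytes = onnxruntime.InferenceSession(
    model_bytes,
    providers = ['CoreMLExecutionProvider'],
)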

I've added the request to the backlog.

@henryruhs
Author

henryruhs commented Oct 3, 2024

I suggest creating CRC32-based hashes, as SHA and even MD5 are quite slow for larger models.

import zlib

def create_hash(content: bytes) -> str:
    return format(zlib.crc32(content), '08x')

Save the converted model as HASH.mlmodel once an ONNX model has been converted. Later you can perform a simple file lookup to either reuse an existing mlmodel or convert a new one.

Let me share the FaceFusion code.

https://github.com/facefusion/facefusion/blob/master/facefusion/hash_helper.py

Both path and bytes inputs can be supported easily with that approach.
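
A rough sketch of the lookup I have in mind, reusing create_hash from above (the cache directory and convert_onnx_to_mlmodel are made up for illustration):

import os

CACHE_DIR = '/tmp/coreml_cache'  # made-up cache location, purely for illustration

def resolve_cached_model(onnx_source) -> str:
    # Accept either a path or raw bytes, mirroring the two InferenceSession inputs.
    if isinstance(onnx_source, (bytes, bytearray)):
        content = bytes(onnx_source)
    else:
        with open(onnx_source, 'rb') as f:
            content = f.read()
    model_path = os.path.join(CACHE_DIR, create_hash(content) + '.mlmodel')
    if not os.path.exists(model_path):
        convert_onnx_to_mlmodel(content, model_path)  # hypothetical conversion step inside the EP
    return model_path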

@henryruhs
Author

@skottmckay Could you please add this to the roadmap? I have many macOS users suffering from the performance of CoreML.

@skottmckay
Contributor

We'll look at adding it in the 1.21 release which will be early next year.

We need to handle a mix of scenarios where a simple checksum may not be possible. We don't necessarily have a nice path to the model in the CoreML EP to checksum a file. e.g. an InferenceSession can be created with bytes instead of a path and those bytes are converted to an onnxruntime Graph instance way before the CoreML EP is involved. That means the most consistent thing the CoreML EP has to work with is the Graph instance, and that's not one contiguous block of memory.
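
(Illustration only: the checksum itself could be computed incrementally over byte chunks, e.g. with zlib.crc32's running value; the hard part is deciding which bytes of the Graph to feed it consistently.)

import zlib

def crc32_of_chunks(chunks) -> str:
    # Running CRC32 over an iterable of byte chunks, e.g. serialized pieces
    # of a graph that do not live in one contiguous buffer.
    value = 0
    for chunk in chunks:
        value = zlib.crc32(chunk, value)
    return format(value, '08x')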

On the performance side of things, even if there's a cached model we still need to run the logic in CoreMLExecutionProvider::GetCapability to figure out how many partitions there are (there's one CoreML model per partition) and which parts of the ONNX model map to which cached CoreML models, in order to know how and when to execute them.

@wejoncy
Contributor

wejoncy commented Nov 28, 2024

Hi @henryruhs @yuygfgg

I am debugging the issue and have some findings about why this initialization takes so much time.

  1. When I use the default backend "NeuralNetwork", it looks like ORT spends a few minutes converting the model, but the time is actually spent in the Apple SDK's model compilation. That step tries to dispatch different ops to different hardware (CPU/GPU/ANE), especially for the ANE. The ANE uses float16 by default, so Apple has to convert the model from float32 to float16 at runtime.
  2. When I set the compute unit to CPUOnly, it works well.

Could you please give it a try with either 1. setting the backend to MLProgram instead of NeuralNetwork, or 2. setting the compute unit to CPUOnly, and check whether the CoreML provider is still as slow as before?
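
A sketch of how that could look from Python, assuming the string-based CoreML provider options ("ModelFormat", "MLComputeUnits") available in recent onnxruntime releases; older builds expose equivalent COREML_FLAG_* values through the C/C++ API instead:

import onnxruntime

session = onnxruntime.InferenceSession(
    'model.onnx',  # placeholder path
    providers = [
        ('CoreMLExecutionProvider', {
            'ModelFormat': 'MLProgram',     # option 1: use the ML Program backend
            # 'MLComputeUnits': 'CPUOnly',  # option 2: restrict to CPU only
        }),
        'CPUExecutionProvider',
    ],
)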

@wejoncy
Contributor

wejoncy commented Nov 28, 2024

Besides, the CoreML EP supports float16 models directly if you select MLProgram as the model format.

@yuygfgg

yuygfgg commented Nov 29, 2024

Could you please give it a try with either 1. setting the backend to MLProgram instead of NeuralNetwork, or 2. setting the compute unit to CPUOnly, and check whether the CoreML provider is still as slow as before?

AmusementClub/vs-mlrt#116 (comment)

ml_program=1 + fp16=True: ANE 120% usage, 10.52fps
ml_program=1 + fp16=False: GPU 97% usage, 5.40fps
ml_program=0 + fp16=True: CPU 100% usage, I have no patience to wait
ml_program=0 + fp16=False: ANE 114% usage, 10.23fps

@wejoncy
Contributor

wejoncy commented Nov 29, 2024

Hi @yuygfgg, thanks for the update.

More operators are supported once ml_program is enabled. Does this resolve the issue caused by writing the CoreML model to disk? ORT will get the subgraphs that can be captured by the CoreML EP, and it shouldn't take too much time since there is no FP16 conversion at runtime.

@wejoncy
Contributor

wejoncy commented Nov 29, 2024

With a caching mechanism we could of course save the cost of writing to disk, but I don't think it costs much time.

Please let me know if it's still a problem in any production scenario.

@wejoncy
Contributor

wejoncy commented Nov 29, 2024

I'm having the same issue here. And disk usage is also a problem.
ModelBuilder builder(graph_viewer, logger, coreml_version, coreml_flags,
std::move(onnx_input_names), std::move(onnx_output_names));

ORT_RETURN_IF_ERROR(builder.CreateModel());
ORT_RETURN_IF_ERROR(builder.SaveModel());

return builder.LoadModel(model);
}

There is no way to avoid loading the CoreML models from disk into the system framework. Given that, I did a lot of profiling to measure the time cost of builder.SaveModel(). This step costs at most 100ms even when the model size is more than 300MB.
And once a session is initialized, it can be re-used for as long as you want, so this shouldn't be a bottleneck.

LoadModel is indeed a bottleneck and can cost a lot of time, but the Apple framework doesn't expose any API to cache it.

@henryruhs
Author

henryruhs commented Dec 4, 2024

Is there something we can do nowadays, like monkey-patching onnxruntime?

@wejoncy wejoncy linked a pull request Dec 10, 2024 that will close this issue