
CoreML EP inference result is improperly scaled #21170

Open · frenetj opened this issue Jun 25, 2024 · 4 comments
Labels

- ep:CoreML: issues related to CoreML execution provider
- platform:mobile: issues related to ONNX Runtime mobile; typically submitted using template
- stale: issues that have not been addressed in a while; categorized by a bot

Comments


frenetj commented Jun 25, 2024

Describe the issue

When running inference of a specific dynamic-shape image filter model using the CoreML EP, output pixels are slightly shifted towards the bottom left of the image. Pixels at the bottom left are not shifted at all, while pixels at the top right are shifted by almost a whole pixel to the left and downwards.

I cannot reproduce the issue with small images (size of ~1024 pixels or less). The issue is quite apparent using a 2048x2048 colour noise image as input.

Here is the top right portion of the input and output images:

[Image: TopRightPixels_InVsOut]

Here is the shift over the whole image (absolute difference of the input vs output pixels). Notice the shift is present across the whole image, but more pronounced in the top right area:

[Image: InVsOutAbsDiff]

I will provide the specific model to Microsoft directly as it has some proprietary content.

I cannot reproduce this issue when using the native CPU execution provider on macOS. The issue is also NOT reproducible when using the CUDA or TensorRT execution providers on Linux, nor with the CoreML EP on macOS when setting the COREML_FLAG_USE_CPU_ONLY flag.

Note, however, that I am using the COREML_FLAG_ONLY_ALLOW_STATIC_INPUT_SHAPES flag. I am thus surprised to see a rendering difference from the CPU implementation, since the model uses dynamic shapes and should therefore NOT run using CoreML.

To reproduce

1. On macOS, set up the CoreML EP with the COREML_FLAG_ONLY_ALLOW_STATIC_INPUT_SHAPES flag (a minimal setup sketch follows this list).
2. Run inference using the given model on a 2048x2048 image.
3. Notice that the output pixels are shifted to the left and towards the bottom of the image.
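A minimal sketch of that setup using the ORT C API (assuming a build with the CoreML EP enabled; error handling and the rest of session creation are elided):

```c
#include <onnxruntime_c_api.h>
#include <coreml_provider_factory.h>

// Create session options and attach the CoreML EP so that only nodes
// with static input shapes are handed to CoreML.
static OrtSessionOptions* make_session_options(void) {
  const OrtApi* ort = OrtGetApiBase()->GetApi(ORT_API_VERSION);
  OrtSessionOptions* so = NULL;
  ort->CreateSessionOptions(&so);

  uint32_t coreml_flags = COREML_FLAG_ONLY_ALLOW_STATIC_INPUT_SHAPES;
  OrtSessionOptionsAppendExecutionProvider_CoreML(so, coreml_flags);
  return so;
}
```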

Urgency

The issue is not urgent as we are currently using the native CPU implementation.

Platform

Mac

OS Version

Sonoma 14.5

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.18.0

ONNX Runtime API

C

Architecture

ARM64

Execution Provider

CoreML

Execution Provider Library Version

No response

github-actions bot added the ep:CUDA, ep:TensorRT, and platform:mobile labels Jun 25, 2024
skottmckay (Contributor) commented Jun 25, 2024

COREML_FLAG_USE_CPU_ONLY results in CoreML executing the same nodes using its reference CPU implementation. We set this as the MLModelConfiguration.computeUnits. The rest of the ORT CoreML EP code runs exactly the same. That would strongly suggest an issue with the internal CoreML handling of a large input when running on GPU/NPU.
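For comparison, forcing that CPU path is the same C API call with a different flag value (a sketch, reusing the hypothetical `so` session options from the reproduction sketch above):

```c
// CoreML executes the same nodes, but with its CPU reference
// implementation (mapped to MLModelConfiguration.computeUnits).
OrtSessionOptionsAppendExecutionProvider_CoreML(so, COREML_FLAG_USE_CPU_ONLY);
```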

COREML_FLAG_ONLY_ALLOW_STATIC_INPUT_SHAPES is applied on a per-node basis. Parts of the model may have fixed shapes leading to CoreML executing them. If you set the session logging severity to VERBOSE it will print out details of which nodes are/aren't assigned to CoreML. That would at least narrow down which CoreML operator could be going wrong.
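A sketch of turning that on with the C API (error handling elided; `ort` and `so` are the handles from the earlier sketch):

```c
// Create the environment and session options with VERBOSE severity so
// the log shows which nodes are / aren't assigned to the CoreML EP.
OrtEnv* env = NULL;
ort->CreateEnv(ORT_LOGGING_LEVEL_VERBOSE, "coreml-debug", &env);
ort->SetSessionLogSeverityLevel(so, ORT_LOGGING_LEVEL_VERBOSE);
```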

skottmckay removed the ep:CUDA and ep:TensorRT labels Jun 26, 2024
skottmckay (Contributor) commented Jun 27, 2024

This appears to be a CoreML NeuralNetwork-specific problem. There are only a few Div and Sub nodes assigned to CoreML, as the rest have dynamic input shapes. Most of those produce the expected output.

There are two Div nodes (Div_185 and Div_143) that end up computing 2 / (2048 - 1) (one for the height and one for the width). For some reason the NeuralNetwork Div is somewhat inaccurate for this floating-point operation.

Python as a reference (double precision): `2.0 / 2047.0 = 0.0009770395701025891`

| EP                   | Value name | Value         |
| -------------------- | ---------- | ------------- |
| CPU EP               | Mul_340    | 0.00097703957 |
| CoreML NeuralNetwork | Mul_340    | 0.00097751617 |
| CoreML ML Program    | Mul_340    | 0.00097703957 |

That difference must become significant across all the other downstream operations in the model, leading to the output discrepancies. I would guess it comes down to floating-point inaccuracy when dividing 2 by a large number, which would explain why smaller heights or widths don't trigger the issue.
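As a hedged side note: 0.00097751617 is exactly 2/2047 rounded to half precision (2^-10 + 2^-20), which would be consistent with CoreML evaluating this Div in fp16 on the GPU/ANE, though the thread does not confirm this. A quick check, assuming Clang on arm64 where `_Float16` arithmetic is native:

```c
#include <stdio.h>

int main(void) {
  double   d = 2.0 / 2047.0;                        // 0.0009770395701025891 (reference)
  float    f = 2.0f / 2047.0f;                      // 0.00097703957 (CPU EP / ML Program)
  _Float16 h = (_Float16)2.0f / (_Float16)2047.0f;  // 0.00097751617 (matches NeuralNetwork)

  printf("fp64=%.16g  fp32=%.8g  fp16=%.8g\n", d, f, (double)h);
  return 0;
}
```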

sophies927 added the ep:CoreML label Jun 27, 2024
skottmckay (Contributor) commented

FWIW, it's possible to get a good result from NeuralNetwork, but the model would need to be updated and you might need some experimentation to figure out what works best.

If I scale down the input size value (the 2047 in this case) first, do the Div, and then scale the result back, it's happy. I'm guessing it's due to the difference in floating point representation caused by the range between '2' and '2047'.

e.g. scaling the 2047 by 1000 (arbitrarily chosen) would be a = 2047 / 1000, b = 2 / a, c = b / 1000 (b is 1000x the target quotient, so the final step scales it back down)
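A hypothetical sketch of that trick, using `_Float16` (Clang on arm64) to mimic a half-precision Div, which is one guess at what the NeuralNetwork backend does internally; the factor 1000 and the final fp32 rescale are illustrative choices, not necessarily what CoreML does:

```c
#include <stdio.h>

int main(void) {
  // Direct fp16 division: 2 and 2047 are both exactly representable,
  // but the quotient rounds to 0.00097751617 (relative error ~5e-4).
  _Float16 direct = (_Float16)2.0f / (_Float16)2047.0f;

  // Rescaled version of the same computation.
  _Float16 a = (_Float16)2047.0f / (_Float16)1000.0f;  // ~2.047
  _Float16 b = (_Float16)2.0f / a;                     // ~0.97705
  float c = (float)b / 1000.0f;  // scale the quotient back down (fp32 here)

  // In this instance the rescaled result (~0.00097705) is much closer
  // to the fp32 reference than the direct fp16 quotient.
  printf("direct=%.8g rescaled=%.8g expected=%.8g\n",
         (float)direct, c, 2.0f / 2047.0f);
  return 0;
}
```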

github-actions bot commented

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

github-actions bot added the stale label Jul 31, 2024