
[BUG] [OpenVino EP] Only first result in session is correct. #19975

Open
debugmenot opened this issue Mar 19, 2024 · 16 comments
Labels
ep:OpenVINO issues related to OpenVINO execution provider

Comments


debugmenot commented Mar 19, 2024

Describe the issue

When running an inference session with the OpenVINO EP and ORT > 1.13.1, every result except the first is incorrect. There are no issues with ORT == 1.13.1, or with CPU/CUDA/XNNPACK on any ORT version.

I'm hitting this with only one model (an Attention OCR model; you can find its structure at the bottom); other models work fine. It seems some layers/ops in it broke after the 1.13.1 build...

Description:

Ubuntu 22.04, ONNX Runtime 1.17.1, OpenVINO 2023.3, C++
Model: an attention-decoder OCR model, converted to ONNX from PyTorch.

Issue:
I'm running inference on the same image repeatedly (and also tried a sequence of different images within the same session). Only the FIRST result is correct. The second and subsequent results look like a partially "cropped" copy of the first result, no matter whether the next input is new...
For example, inferencing a sequence of images with the text "1234567890", "ABCDEFGHJK", "7777777777" yields "1234567890", "1200120012", "1200120012"...

Downgrading to ORT 1.13.1 solved the issue, so something seems to have broken after the 1.13.1 build.
All other EPs (CPU, CUDA, XNNPACK) work correctly with the same code.
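
A minimal Python sketch of the same loop (my real harness is C++; the onnxruntime-openvino package, the "model.onnx" path and the 1x3x100x100 shape below are placeholders, while "input.1" is the model's actual input name):

import numpy as np
import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.log_severity_level = 0  # verbose logging, prints per-node EP placements

sess = ort.InferenceSession(
    "model.onnx",  # placeholder path for the Attention OCR model
    sess_options,
    providers=["OpenVINOExecutionProvider", "CPUExecutionProvider"],
)

image = np.random.rand(1, 3, 100, 100).astype(np.float32)  # the same tensor on every run
first = None
for i in range(3):
    out = sess.run(None, {"input.1": image})[0]
    if first is None:
        first = out
    # On the affected ORT versions only iteration 0 matches; later runs diverge.
    print(i, np.allclose(out, first))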

Found one reference to a similar issue in the OpenVINO GitHub: openvinotoolkit/openvino#12966

I enabled verbose mode and found that the node placements differ between the 1.17.1 (incorrect) and 1.13.1 (correct) inference sessions. Maybe this matters, but it doesn't explain why the first result is always correct:

Correct inference session, node placements (1.13.1):

* Node placements
*Node(s) placed on [OpenVINOExecutionProvider]. Number of nodes: 11

OpenVINO-EP-subgraph_1 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_1_0)
OpenVINO-EP-subgraph_2 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_2_1)
OpenVINO-EP-subgraph_3 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_3_2)
OpenVINO-EP-subgraph_4 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_4_3)
OpenVINO-EP-subgraph_5 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_5_4)
OpenVINO-EP-subgraph_6 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_6_5)
OpenVINO-EP-subgraph_7 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_7_6)
OpenVINO-EP-subgraph_8 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_8_7)
OpenVINO-EP-subgraph_9 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_9_8)
OpenVINO-EP-subgraph_10 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_10_9)
OpenVINO-EP-subgraph_11 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_11_10)
*Node(s) placed on [CPUExecutionProvider]. Number of nodes: 167
GRU (/decoder/rnn/GRU)
LogSoftmax (/decoder/LogSoftmax)
ArgMax (/decoder/ArgMax)
Unsqueeze (/decoder/Unsqueeze)
Transpose (/decoder/Transpose_2)
Gather (/decoder/emb_1/Gather)
Expand (/decoder/attention_1/Expand)
Transpose (/decoder/attention_1/Transpose)
Concat (/decoder/attention_1/Concat)
MatMul (/decoder/attention/attn_1/MatMul)
Add (/decoder/attention/attn_1/Add)
Tanh (/decoder/attention_1/Tanh)
Softmax (/decoder/attention_1/Softmax)
MatMul (/decoder/MatMul_1)
Transpose (/decoder/Transpose_3)
Concat (/decoder/Concat_1)
GRU (/decoder/rnn_1/GRU)
LogSoftmax (/decoder/LogSoftmax_1)
ArgMax (/decoder/ArgMax_1)
Unsqueeze (/decoder/Unsqueeze_1)
Transpose (/decoder/Transpose_4)
Gather (/decoder/emb_2/Gather)
Expand (/decoder/attention_2/Expand)
Transpose (/decoder/attention_2/Transpose)
Concat (/decoder/attention_2/Concat)
MatMul (/decoder/attention/attn_2/MatMul)
Add (/decoder/attention/attn_2/Add)
Tanh (/decoder/attention_2/Tanh)
Softmax (/decoder/attention_2/Softmax)
MatMul (/decoder/MatMul_2)
Transpose (/decoder/Transpose_5)
Concat (/decoder/Concat_2)
GRU (/decoder/rnn_2/GRU)
LogSoftmax (/decoder/LogSoftmax_2)
ArgMax (/decoder/ArgMax_2)
Unsqueeze (/decoder/Unsqueeze_2)
Transpose (/decoder/Transpose_6)
Gather (/decoder/emb_3/Gather)
Expand (/decoder/attention_3/Expand)
Transpose (/decoder/attention_3/Transpose)
Concat (/decoder/attention_3/Concat)
MatMul (/decoder/attention/attn_3/MatMul)
Add (/decoder/attention/attn_3/Add)
Tanh (/decoder/attention_3/Tanh)
Softmax (/decoder/attention_3/Softmax)
MatMul (/decoder/MatMul_3)
Transpose (/decoder/Transpose_7)
Concat (/decoder/Concat_3)
GRU (/decoder/rnn_3/GRU)
LogSoftmax (/decoder/LogSoftmax_3)
ArgMax (/decoder/ArgMax_3)
Unsqueeze (/decoder/Unsqueeze_3)
Transpose (/decoder/Transpose_8)
Gather (/decoder/emb_4/Gather)
Expand (/decoder/attention_4/Expand)
Transpose (/decoder/attention_4/Transpose)
Concat (/decoder/attention_4/Concat)
MatMul (/decoder/attention/attn_4/MatMul)
Add (/decoder/attention/attn_4/Add)
Tanh (/decoder/attention_4/Tanh)
Softmax (/decoder/attention_4/Softmax)
MatMul (/decoder/MatMul_4)
Transpose (/decoder/Transpose_9)
Concat (/decoder/Concat_4)
GRU (/decoder/rnn_4/GRU)
LogSoftmax (/decoder/LogSoftmax_4)
ArgMax (/decoder/ArgMax_4)
Unsqueeze (/decoder/Unsqueeze_4)
Transpose (/decoder/Transpose_10)
Gather (/decoder/emb_5/Gather)
Expand (/decoder/attention_5/Expand)
Transpose (/decoder/attention_5/Transpose)
Concat (/decoder/attention_5/Concat)
MatMul (/decoder/attention/attn_5/MatMul)
Add (/decoder/attention/attn_5/Add)
Tanh (/decoder/attention_5/Tanh)
Softmax (/decoder/attention_5/Softmax)
MatMul (/decoder/MatMul_5)
Transpose (/decoder/Transpose_11)
Concat (/decoder/Concat_5)
GRU (/decoder/rnn_5/GRU)
LogSoftmax (/decoder/LogSoftmax_5)
ArgMax (/decoder/ArgMax_5)
Unsqueeze (/decoder/Unsqueeze_5)
Transpose (/decoder/Transpose_12)
Gather (/decoder/emb_6/Gather)
Expand (/decoder/attention_6/Expand)
Transpose (/decoder/attention_6/Transpose)
Concat (/decoder/attention_6/Concat)
MatMul (/decoder/attention/attn_6/MatMul)
Add (/decoder/attention/attn_6/Add)
Tanh (/decoder/attention_6/Tanh)
Softmax (/decoder/attention_6/Softmax)
MatMul (/decoder/MatMul_6)
Transpose (/decoder/Transpose_13)
Concat (/decoder/Concat_6)
GRU (/decoder/rnn_6/GRU)
LogSoftmax (/decoder/LogSoftmax_6)
ArgMax (/decoder/ArgMax_6)
Unsqueeze (/decoder/Unsqueeze_6)
Transpose (/decoder/Transpose_14)
Gather (/decoder/emb_7/Gather)
Expand (/decoder/attention_7/Expand)
Transpose (/decoder/attention_7/Transpose)
Concat (/decoder/attention_7/Concat)
MatMul (/decoder/attention/attn_7/MatMul)
Add (/decoder/attention/attn_7/Add)
Tanh (/decoder/attention_7/Tanh)
Softmax (/decoder/attention_7/Softmax)
MatMul (/decoder/MatMul_7)
Transpose (/decoder/Transpose_15)
Concat (/decoder/Concat_7)
GRU (/decoder/rnn_7/GRU)
LogSoftmax (/decoder/LogSoftmax_7)
ArgMax (/decoder/ArgMax_7)
Unsqueeze (/decoder/Unsqueeze_7)
Transpose (/decoder/Transpose_16)
Gather (/decoder/emb_8/Gather)
Expand (/decoder/attention_8/Expand)
Transpose (/decoder/attention_8/Transpose)
Concat (/decoder/attention_8/Concat)
MatMul (/decoder/attention/attn_8/MatMul)
Add (/decoder/attention/attn_8/Add)
Tanh (/decoder/attention_8/Tanh)
Softmax (/decoder/attention_8/Softmax)
MatMul (/decoder/MatMul_8)
Transpose (/decoder/Transpose_17)
Concat (/decoder/Concat_8)
GRU (/decoder/rnn_8/GRU)
LogSoftmax (/decoder/LogSoftmax_8)
ArgMax (/decoder/ArgMax_8)
Unsqueeze (/decoder/Unsqueeze_8)
Transpose (/decoder/Transpose_18)
Gather (/decoder/emb_9/Gather)
Expand (/decoder/attention_9/Expand)
Transpose (/decoder/attention_9/Transpose)
Concat (/decoder/attention_9/Concat)
MatMul (/decoder/attention/attn_9/MatMul)
Add (/decoder/attention/attn_9/Add)
Tanh (/decoder/attention_9/Tanh)
Softmax (/decoder/attention_9/Softmax)
MatMul (/decoder/MatMul_9)
Transpose (/decoder/Transpose_19)
Concat (/decoder/Concat_9)
GRU (/decoder/rnn_9/GRU)
LogSoftmax (/decoder/LogSoftmax_9)
Unsqueeze (/decoder/Unsqueeze_9)
Unsqueeze (/decoder/Unsqueeze_10)
Unsqueeze (/decoder/Unsqueeze_11)
Unsqueeze (/decoder/Unsqueeze_12)
Unsqueeze (/decoder/Unsqueeze_13)
Unsqueeze (/decoder/Unsqueeze_14)
Unsqueeze (/decoder/Unsqueeze_15)
Unsqueeze (/decoder/Unsqueeze_16)
Unsqueeze (/decoder/Unsqueeze_17)
Unsqueeze (/decoder/Unsqueeze_18)
Concat (/decoder/Concat_10)
Transpose (/decoder/Transpose_20)
FusedMatMul (MatMul_With_Transpose)
FusedMatMul (MatMul_With_Transpose_token_0)
FusedMatMul (MatMul_With_Transpose_token_1)
FusedMatMul (MatMul_With_Transpose_token_2)
FusedMatMul (MatMul_With_Transpose_token_3)
FusedMatMul (MatMul_With_Transpose_token_4)
FusedMatMul (MatMul_With_Transpose_token_5)
FusedMatMul (MatMul_With_Transpose_token_6)
FusedMatMul (MatMul_With_Transpose_token_7)

Incorrect inference session, node placements (1.17.1):

* Node placements
*Node(s) placed on [OpenVINOExecutionProvider]. Number of nodes: 11

OpenVINO-EP-subgraph_1 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_1_0)
OpenVINO-EP-subgraph_2 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_2_1)
OpenVINO-EP-subgraph_3 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_3_2)
OpenVINO-EP-subgraph_4 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_4_3)
OpenVINO-EP-subgraph_5 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_5_4)
OpenVINO-EP-subgraph_6 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_6_5)
OpenVINO-EP-subgraph_7 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_7_6)
OpenVINO-EP-subgraph_8 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_8_7)
OpenVINO-EP-subgraph_9 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_9_8)
OpenVINO-EP-subgraph_10 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_10_9)
OpenVINO-EP-subgraph_11 (OpenVINOExecutionProvider_OpenVINO-EP-subgraph_11_10)
*Node(s) placed on [CPUExecutionProvider]. Number of nodes: 167
GRU (/decoder/rnn/GRU)
LogSoftmax (/decoder/LogSoftmax)
ArgMax (/decoder/ArgMax)
Unsqueeze (/decoder/Unsqueeze)
Transpose (/decoder/Transpose_2)
Gather (/decoder/emb_1/Gather)
Expand (/decoder/attention_1/Expand)
Transpose (/decoder/attention_1/Transpose)
Concat (/decoder/attention_1/Concat)
MatMul (/decoder/attention/attn_1/MatMul)
Add (/decoder/attention/attn_1/Add)
Tanh (/decoder/attention_1/Tanh)
Softmax (/decoder/attention_1/Softmax)
MatMul (/decoder/MatMul_1)
Transpose (/decoder/Transpose_3)
Concat (/decoder/Concat_1)
GRU (/decoder/rnn_1/GRU)
LogSoftmax (/decoder/LogSoftmax_1)
ArgMax (/decoder/ArgMax_1)
Unsqueeze (/decoder/Unsqueeze_1)
Transpose (/decoder/Transpose_4)
Gather (/decoder/emb_2/Gather)
Expand (/decoder/attention_2/Expand)
Transpose (/decoder/attention_2/Transpose)
Concat (/decoder/attention_2/Concat)
MatMul (/decoder/attention/attn_2/MatMul)
Add (/decoder/attention/attn_2/Add)
Tanh (/decoder/attention_2/Tanh)
Softmax (/decoder/attention_2/Softmax)
MatMul (/decoder/MatMul_2)
Transpose (/decoder/Transpose_5)
Concat (/decoder/Concat_2)
GRU (/decoder/rnn_2/GRU)
LogSoftmax (/decoder/LogSoftmax_2)
ArgMax (/decoder/ArgMax_2)
Unsqueeze (/decoder/Unsqueeze_2)
Transpose (/decoder/Transpose_6)
Gather (/decoder/emb_3/Gather)
Expand (/decoder/attention_3/Expand)
Transpose (/decoder/attention_3/Transpose)
Concat (/decoder/attention_3/Concat)
MatMul (/decoder/attention/attn_3/MatMul)
Add (/decoder/attention/attn_3/Add)
Tanh (/decoder/attention_3/Tanh)
Softmax (/decoder/attention_3/Softmax)
MatMul (/decoder/MatMul_3)
Transpose (/decoder/Transpose_7)
Concat (/decoder/Concat_3)
GRU (/decoder/rnn_3/GRU)
LogSoftmax (/decoder/LogSoftmax_3)
ArgMax (/decoder/ArgMax_3)
Unsqueeze (/decoder/Unsqueeze_3)
Transpose (/decoder/Transpose_8)
Gather (/decoder/emb_4/Gather)
Expand (/decoder/attention_4/Expand)
Transpose (/decoder/attention_4/Transpose)
Concat (/decoder/attention_4/Concat)
MatMul (/decoder/attention/attn_4/MatMul)
Add (/decoder/attention/attn_4/Add)
Tanh (/decoder/attention_4/Tanh)
Softmax (/decoder/attention_4/Softmax)
MatMul (/decoder/MatMul_4)
Transpose (/decoder/Transpose_9)
Concat (/decoder/Concat_4)
GRU (/decoder/rnn_4/GRU)
LogSoftmax (/decoder/LogSoftmax_4)
ArgMax (/decoder/ArgMax_4)
Unsqueeze (/decoder/Unsqueeze_4)
Transpose (/decoder/Transpose_10)
Gather (/decoder/emb_5/Gather)
Expand (/decoder/attention_5/Expand)
Transpose (/decoder/attention_5/Transpose)
Concat (/decoder/attention_5/Concat)
MatMul (/decoder/attention/attn_5/MatMul)
Add (/decoder/attention/attn_5/Add)
Tanh (/decoder/attention_5/Tanh)
Softmax (/decoder/attention_5/Softmax)
MatMul (/decoder/MatMul_5)
Transpose (/decoder/Transpose_11)
Concat (/decoder/Concat_5)
GRU (/decoder/rnn_5/GRU)
LogSoftmax (/decoder/LogSoftmax_5)
ArgMax (/decoder/ArgMax_5)
Unsqueeze (/decoder/Unsqueeze_5)
Transpose (/decoder/Transpose_12)
Gather (/decoder/emb_6/Gather)
Expand (/decoder/attention_6/Expand)
Transpose (/decoder/attention_6/Transpose)
Concat (/decoder/attention_6/Concat)
MatMul (/decoder/attention/attn_6/MatMul)
Add (/decoder/attention/attn_6/Add)
Tanh (/decoder/attention_6/Tanh)
Softmax (/decoder/attention_6/Softmax)
MatMul (/decoder/MatMul_6)
Transpose (/decoder/Transpose_13)
Concat (/decoder/Concat_6)
GRU (/decoder/rnn_6/GRU)
LogSoftmax (/decoder/LogSoftmax_6)
ArgMax (/decoder/ArgMax_6)
Unsqueeze (/decoder/Unsqueeze_6)
Transpose (/decoder/Transpose_14)
Gather (/decoder/emb_7/Gather)
Expand (/decoder/attention_7/Expand)
Transpose (/decoder/attention_7/Transpose)
Concat (/decoder/attention_7/Concat)
MatMul (/decoder/attention/attn_7/MatMul)
Add (/decoder/attention/attn_7/Add)
Tanh (/decoder/attention_7/Tanh)
Softmax (/decoder/attention_7/Softmax)
MatMul (/decoder/MatMul_7)
Transpose (/decoder/Transpose_15)
Concat (/decoder/Concat_7)
GRU (/decoder/rnn_7/GRU)
LogSoftmax (/decoder/LogSoftmax_7)
ArgMax (/decoder/ArgMax_7)
Unsqueeze (/decoder/Unsqueeze_7)
Transpose (/decoder/Transpose_16)
Gather (/decoder/emb_8/Gather)
Expand (/decoder/attention_8/Expand)
Transpose (/decoder/attention_8/Transpose)
Concat (/decoder/attention_8/Concat)
MatMul (/decoder/attention/attn_8/MatMul)
Add (/decoder/attention/attn_8/Add)
Tanh (/decoder/attention_8/Tanh)
Softmax (/decoder/attention_8/Softmax)
MatMul (/decoder/MatMul_8)
Transpose (/decoder/Transpose_17)
Concat (/decoder/Concat_8)
GRU (/decoder/rnn_8/GRU)
LogSoftmax (/decoder/LogSoftmax_8)
ArgMax (/decoder/ArgMax_8)
Unsqueeze (/decoder/Unsqueeze_8)
Transpose (/decoder/Transpose_18)
Gather (/decoder/emb_9/Gather)
Expand (/decoder/attention_9/Expand)
Transpose (/decoder/attention_9/Transpose)
Concat (/decoder/attention_9/Concat)
MatMul (/decoder/attention/attn_9/MatMul)
Add (/decoder/attention/attn_9/Add)
Tanh (/decoder/attention_9/Tanh)
Softmax (/decoder/attention_9/Softmax)
MatMul (/decoder/MatMul_9)
Transpose (/decoder/Transpose_19)
Concat (/decoder/Concat_9)
GRU (/decoder/rnn_9/GRU)
LogSoftmax (/decoder/LogSoftmax_9)
Unsqueeze (/decoder/Unsqueeze_9)
Unsqueeze (/decoder/Unsqueeze_10)
Unsqueeze (/decoder/Unsqueeze_11)
Unsqueeze (/decoder/Unsqueeze_12)
Unsqueeze (/decoder/Unsqueeze_13)
Unsqueeze (/decoder/Unsqueeze_14)
Unsqueeze (/decoder/Unsqueeze_15)
Unsqueeze (/decoder/Unsqueeze_16)
Unsqueeze (/decoder/Unsqueeze_17)
Unsqueeze (/decoder/Unsqueeze_18)
Concat (/decoder/Concat_10)
Transpose (/decoder/Transpose_20)
FusedMatMul (MatMul_With_Transpose)
FusedMatMul (MatMul_With_Transpose_token_18)
FusedMatMul (MatMul_With_Transpose_token_19)
FusedMatMul (MatMul_With_Transpose_token_20)
FusedMatMul (MatMul_With_Transpose_token_21)
FusedMatMul (MatMul_With_Transpose_token_22)
FusedMatMul (MatMul_With_Transpose_token_23)
FusedMatMul (MatMul_With_Transpose_token_24)
FusedMatMul (MatMul_With_Transpose_token_25)

As you can see, the difference is only in the last 8 lines (the FusedMatMul token ids differ). Hope it helps...


To reproduce

See the description above.

Urgency

Urgent

Platform

Linux

OS Version

Ubuntu 22.04

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.17.1 release

ONNX Runtime API

C++

Architecture

X64

Execution Provider

OpenVINO

Execution Provider Library Version

2023.3

@github-actions github-actions bot added ep:CUDA issues related to the CUDA execution provider ep:OpenVINO issues related to OpenVINO execution provider labels Mar 19, 2024
@debugmenot
Author

Just to note: the issue looks independent of the OpenVINO version - I experimented with several. I also built everything from scratch many times on different systems, with the same results.

@hariharans29 hariharans29 removed the ep:CUDA issues related to the CUDA execution provider label Mar 19, 2024

debugmenot commented Mar 19, 2024

Update: 1.14.1 also works, but performance is about 10-15% lower. 1.15 and higher are affected by the issue.

@jywu-msft
Member

+@sfatimar, @preetha-intel

@debugmenot
Author

any update?


sfatimar commented Apr 2, 2024

Can we have access to the model? It seems 11 subgraphs are being formed and 167 nodes are being placed on the CPU EP, but it is hard to debug without the model.

@debugmenot
Author

@sfatimar

dumbmodel.onnx.zip
The dumb model is attached.
To illustrate the issue, here is a small log of a test run:

Here I'm iterating over the same image. Every result except the first is broken.

f1race@build_server_nvidia:/opt/ort_dev$ ./test --image images/test/dumb100x100text.jpg
[info] Wellcome to first 0.0.1
[info] Available provider: CUDAExecutionProvider
[info] Available provider: OpenVINOExecutionProvider
[info] Available provider: XnnpackExecutionProvider
[info] Available provider: CPUExecutionProvider
[-] Selected provider: OpenVINOExecutionProvider
Input 0 : name=input.1
Output 0 : name=1389
[-] Output tensor element count: 390
[info] CHAR: A, CLASS: 13, CONF: -0.11442014
[info] CHAR: A, CLASS: 13, CONF: -0.5359584
[info] CHAR: 4, CLASS: 7, CONF: -2.073846
[info] CHAR: 6, CLASS: 9, CONF: -2.010087
[info] CHAR: 6, CLASS: 9, CONF: -1.8180711
[info] CHAR: D, CLASS: 16, CONF: -2.448421
[info] CHAR: S, CLASS: 31, CONF: -2.7345552
[info] CHAR: , CLASS: 2, CONF: -0.009441723
[info] CHAR: , CLASS: 2, CONF: -0.05160664
[info] CHAR: , CLASS: 2, CONF: -0.097647004

[-] Output tensor element count: 390
[info] CHAR: A, CLASS: 13, CONF: -0.11442014
[info] CHAR: B, CLASS: 14, CONF: -2.1106374
[info] CHAR: , CLASS: 2, CONF: -2.3829944
[info] CHAR: , CLASS: 0, CONF: -0.31160322
[info] CHAR: , CLASS: 0, CONF: -2.2568073
[info] CHAR: , CLASS: 0, CONF: -2.5611315
[info] CHAR: , CLASS: 0, CONF: -2.2948604
[info] CHAR: , CLASS: 0, CONF: -2.2516015
[info] CHAR: , CLASS: 0, CONF: -2.5611215
[info] CHAR: , CLASS: 0, CONF: -2.294854

[-] Output tensor element count: 390
[info] CHAR: A, CLASS: 13, CONF: -0.11442014
[info] CHAR: B, CLASS: 14, CONF: -2.1106374
[info] CHAR: , CLASS: 2, CONF: -2.3829944
[info] CHAR: , CLASS: 0, CONF: -0.31160322
[info] CHAR: , CLASS: 0, CONF: -2.2568073
[info] CHAR: , CLASS: 0, CONF: -2.5611315
[info] CHAR: , CLASS: 0, CONF: -2.2948604
[info] CHAR: , CLASS: 0, CONF: -2.2516015
[info] CHAR: , CLASS: 0, CONF: -2.5611215
[info] CHAR: , CLASS: 0, CONF: -2.294854

@debugmenot
Author

Once again, this happens ONLY with the OpenVINO EP, with ONNX Runtime >= 1.15 and any version of OpenVINO.

No issues with ONNX Runtime 1.13.1 and 1.14.1 (lower versions not tested).

The CPU, Xnnpack, and CUDA EPs work correctly with this model and the same inference code on any ORT version, including the latest one.
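
A quick way to see the per-provider difference (Python sketch; "dumbmodel.onnx" is the attached model, the input name comes from my test log, the input shape is a placeholder):

import numpy as np
import onnxruntime as ort

def run_twice(provider):
    # One session per provider; run the identical input twice.
    sess = ort.InferenceSession("dumbmodel.onnx", providers=[provider])
    x = np.random.rand(1, 3, 100, 100).astype(np.float32)
    return [sess.run(None, {"input.1": x})[0] for _ in range(2)]

for provider in ("CPUExecutionProvider", "OpenVINOExecutionProvider"):
    a, b = run_twice(provider)
    # CPU (and CUDA/Xnnpack) stay repeatable; OpenVINO EP on ORT >= 1.15 does not.
    print(provider, "repeatable:", np.allclose(a, b))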


henxing commented Apr 15, 2024

I'm seeing a similar issue in Python with onnxruntime-openvino version 1.16.0. I am currently stuck on Python 3.8, so I cannot test 1.17, but see the following test script with three very simple models, which shows how one of them (BrokenModel) generates different results from PyTorch when run with onnxruntime. If this behavior is different enough from this issue, I'm happy to open another issue to track it.

import numpy as np
import onnxruntime as rt
import torch
from torch import nn


class BrokenModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
        self.conv_2 = nn.Conv2d(64, 1, kernel_size=1, stride=1, padding=0)

    def forward(self, x):
        x = self.conv_1(x)
        output = self.conv_2(x)
        return output.mean(dim=(1, 2, 3))


class BatchMeanModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
        self.conv_2 = nn.Conv2d(64, 1, kernel_size=1, stride=1, padding=0)

    def forward(self, x):
        x = self.conv_1(x)
        output = self.conv_2(x)
        return output.mean(dim=(1, 2, 3)), output.mean()


class FewChannelModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_1 = nn.Conv2d(3, 3, kernel_size=3, stride=1, padding=1)
        self.conv_2 = nn.Conv2d(3, 1, kernel_size=1, stride=1, padding=0)

    def forward(self, x):
        x = self.conv_1(x)
        output = self.conv_2(x)
        return output.mean(dim=(1, 2, 3))


def run_model_pytorch_onnxruntime(arch, path):
    model = arch()
    model.eval()
    print("=" * 80)
    print(model)

    data = torch.ones(2, 3, 224, 224)
    data[0] *= 0

    print("Torch:")
    for _ in range(2):
        result = model(data)
        print(result)
    print()

    torch.onnx.export(
        model,
        data,
        path,
        input_names=["input"],
        output_names=["output"],
        export_params=True,
        dynamic_axes={name: {0: "batch_size"} for name in ("input", "output")},
        verbose=False,
    )

    sess_options = rt.SessionOptions()
    sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_DISABLE_ALL

    print("Onnxruntime:")
    rt_sess = rt.InferenceSession(
        path, sess_options, providers=["OpenVINOExecutionProvider"], provider_options=[{"device_id": "GPU"}]
    )
    for _ in range(2):
        outputs = rt_sess.run(None, {"input": data.numpy()})
        print(outputs)
    print()


if __name__ == "__main__":
    run_model_pytorch_onnxruntime(BrokenModel, "broken_model.onnx")
    print()
    run_model_pytorch_onnxruntime(BatchMeanModel, "batch_mean_model.onnx")
    print()
    run_model_pytorch_onnxruntime(FewChannelModel, "few_channel_model.onnx")

You'll need to install torch, onnxruntime-openvino, and numpy to run this script.

@debugmenot
Author

@sfatimar, Hi! Any updates? I've uploaded the model for bug investigation.


ankitm3k commented Apr 25, 2024

Hi @debugmenot, I have tested the script suggested by @henxing using OpenVINO Toolkit v2024.1 (w_openvino_toolkit_windows_2024.1.0.dev20240405_x86_64) and OVEP v1.18.0 (this version update is now merged and available on the latest main of the microsoft/onnxruntime repo) on a Windows machine. I ran inference for 5 iterations; the PyTorch and ORT OpenVINO EP results were identical across iterations, and the OVEP results matched the torch results to about 3 decimal places. Please find the run log below:

================================================================================
BrokenModel(
(conv_1): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(conv_2): Conv2d(64, 1, kernel_size=(1, 1), stride=(1, 1))
)
Torch:
tensor([-0.1026, -0.0569], grad_fn=)
tensor([-0.1026, -0.0569], grad_fn=)

Onnxruntime:
[array([-0.1026001 , -0.05670166], dtype=float32)]
[array([-0.1026001 , -0.05670166], dtype=float32)]
[array([-0.1026001 , -0.05670166], dtype=float32)]
[array([-0.1026001 , -0.05670166], dtype=float32)]
[array([-0.1026001 , -0.05670166], dtype=float32)]

================================================================================
BatchMeanModel(
(conv_1): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(conv_2): Conv2d(64, 1, kernel_size=(1, 1), stride=(1, 1))
)
Torch:
(tensor([0.1573, 0.1438], grad_fn=), tensor(0.1506, grad_fn=))
(tensor([0.1573, 0.1438], grad_fn=), tensor(0.1506, grad_fn=))

Onnxruntime:
[array([0.1573365 , 0.14381096], dtype=float32), array(0.15057378, dtype=float32)]
[array([0.1573365 , 0.14381096], dtype=float32), array(0.15057378, dtype=float32)]
[array([0.1573365 , 0.14381096], dtype=float32), array(0.15057378, dtype=float32)]
[array([0.1573365 , 0.14381096], dtype=float32), array(0.15057378, dtype=float32)]
[array([0.1573365 , 0.14381096], dtype=float32), array(0.15057378, dtype=float32)]

================================================================================
FewChannelModel(
(conv_1): Conv2d(3, 3, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(conv_2): Conv2d(3, 1, kernel_size=(1, 1), stride=(1, 1))
)
Torch:
tensor([-0.1036, -0.1638], grad_fn=)
tensor([-0.1036, -0.1638], grad_fn=)

Onnxruntime:
[array([-0.10357666, -0.16418457], dtype=float32)]
[array([-0.10357666, -0.16418457], dtype=float32)]
[array([-0.10357666, -0.16418457], dtype=float32)]
[array([-0.10357666, -0.16418457], dtype=float32)]
[array([-0.10357666, -0.16418457], dtype=float32)]

@ankitm3k
Contributor

We are investigating the issues seen while running your model with the OpenVINO execution provider.

@debugmenot
Author

@ankitm3k Hi! Did you confirm the bug? If so, is there any ETA for a patch?


ankitm3k commented May 7, 2024

Hi @debugmenot,
I have investigated the issues with your ONNX model file, dumbmodel.onnx. When performing inference with it, the graph was split into many subgraph partitions, so most of the nodes fell back to the CPU EP. This lowers performance because the model graph does not run entirely on the OpenVINO EP. The above fix enables the whole model to be supported on OpenVINOExecutionProvider and improves performance for your model.

I recommend using the latest OpenVINO Toolkit v2024.1 along with the above patch. I have also checked the tensor outputs across multiple inference iterations over the same input data, and in my build they were consistent with the first inference results.

sfatimar added a commit to intel/onnxruntime that referenced this issue Jun 24, 2024
fix: updated data ops to support the complete graph on OVEP (microsoft#19975)
@debugmenot
Author

debugmenot commented Aug 13, 2024

@ankitm3k
Hi.
Update: the issue is still not fixed... I just checked. Performance is better now... but:
Onnxruntime 1.14.1 + OV:
[02:42:09.361] [I] [74706] [4] [car] HOMEP: T454BE199
[02:42:11.675] [I] [74706] [6] [car] HOMEP: X212EX197
[02:42:14.785] [I] [74706] [13] [car] HOMEP: O353XM199
[02:42:16.420] [I] [74706] [16] [car] HOMEP: H002XC199
[02:42:17.709] [I] [74706] [18] [car] HOMEP: P346AB197
[02:42:18.525] [I] [74706] [20] [car] HOMEP: A001OT197
[02:42:19.709] [I] [74706] [21] [car] HOMEP: E072MK199
[02:42:21.144] [I] [74706] [23] [car] HOMEP: B797HK197
[02:42:22.028] [I] [74706] [25] [car] HOMEP: O369CX177
[02:42:24.947] [I] [74706] [30] [car] HOMEP: B410KA17
[02:42:25.968] [I] [74706] [33] [car] HOMEP: K558AT197
[02:42:36.141] [I] [74706] [52] [car] HOMEP: C159XT199
[02:42:41.442] [I] [74706] [60] [car] HOMEP: O905OT190
[02:42:43.093] [I] [74706] [63] [car] HOMEP: Y902OA190
[02:42:46.568] [I] [74706] [68] [car] HOMEP: E159YY150
[02:42:47.770] [I] [74706] [71] [car] HOMEP: M181YA197

ONNX Runtime 1.18.1 + OpenVINO EP 2024.3 + your GRU op patch:
[01:58:34.495] [I] [7342] [6] [car] HOMEP: T454BE199
[01:58:36.900] [I] [7342] [10] [car] HOMEP: X2XXX22X22
[01:58:39.927] [I] [7342] [20] [car] HOMEP: O333O33O33
[01:58:41.637] [I] [7342] [23] [car] HOMEP: H000H00H00
[01:58:42.832] [I] [7342] [27] [car] HOMEP: P333P33P33
[01:58:43.725] [I] [7342] [29] [car] HOMEP: A000A00A00
[01:58:44.849] [I] [7342] [30] [car] HOMEP: E000E00E00
[01:58:46.330] [I] [7342] [34] [car] HOMEP: B777B77B77
[01:58:51.137] [I] [7342] [51] [car] HOMEP: K555K55K55
[01:59:01.337] [I] [7342] [63] [car] HOMEP: C1CCC11C11
[01:59:06.587] [I] [7342] [70] [car] HOMEP: O999O99O99
[01:59:08.301] [I] [7342] [74] [car] HOMEP: Y999Y99Y99
[01:59:11.775] [I] [7342] [80] [car] HOMEP: E1EEE11E11
[01:59:13.006] [I] [7342] [85] [car] HOMEP: M111M11M11

I can prepare a test project (source + model + image) for you. Can you share your email, please?

@debugmenot
Author

With the patch the behaviour is slightly different, though: the results after the first one differ a little from the no-patch results, but look roughly the same (still incorrect)...
Is a dirty fix possible, e.g. changing the supported ops in data_ops.cc back to the 1.14.1 version, or something like that? How would I do this properly? I can't use legacy ORT versions in the new build of our software because of API incompatibility.


debugmenot commented Aug 14, 2024

@ankitm3k I've finally found the issue, or at least WHERE it is EXACTLY. If the line

{"Unsqueeze", V_2020_4, {"CPU", "GPU"}},

is commented out in data_ops.cc, everything works as expected :) This needs investigation.

It's strange, because Unsqueeze is defined in exactly the same way as in the 1.14.1 and 1.13.1 versions...
