CANN error executing while running Conv node #18807

Closed
H3AlO3 opened this issue Dec 13, 2023 · 6 comments
Labels
ep:ACL issues related to ACL execution provider

H3AlO3 commented Dec 13, 2023

Describe the issue

I'm trying to run an ONNX model on a Huawei Cloud server with an Ascend 310, but it reports the following error during model.run:

2023-12-13 20:11:34.351330063 [E:onnxruntime:Default, cann_call.cc:139 CannCall] CANN failure 500001: ACL_ERROR_FAILURE ; NPU=0 ; hostname=ecs-b520 ; expr=aclopCompileAndExecute(opname.c_str(), prepare.inputDesc_.size(), prepare.inputDesc_.data(), prepare.inputBuffers_.data(), prepare.outputDesc_.size(), prepare.outputDesc_.data(), prepare.outputBuffers_.data(), prepare.opAttr_, ACL_ENGINE_SYS, ACL_COMPILE_SYS, __null, Stream(ctx));
2023-12-13 20:11:34.351379137 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running Conv node. Name:'/features/features.0/features.0.0/Conv' Status Message: CANN error executing aclopCompileAndExecute(opname.c_str(), prepare.inputDesc_.size(), prepare.inputDesc_.data(), prepare.inputBuffers_.data(), prepare.outputDesc_.size(), prepare.outputDesc_.data(), prepare.outputBuffers_.data(), prepare.opAttr_, ACL_ENGINE_SYS, ACL_COMPILE_SYS, NULL, Stream(ctx))
Traceback (most recent call last):
  File "error.py", line 27, in <module>
    print(predict())
  File "error.py", line 23, in predict
    preds = model.run(None, {"input.1": img})[0]
  File "/usr/local/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 217, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Conv node. Name:'/features/features.0/features.0.0/Conv' Status Message: CANN error executing aclopCompileAndExecute(opname.c_str(), prepare.inputDesc_.size(), prepare.inputDesc_.data(), prepare.inputBuffers_.data(), prepare.outputDesc_.size(), prepare.outputDesc_.data(), prepare.outputBuffers_.data(), prepare.opAttr_, ACL_ENGINE_SYS, ACL_COMPILE_SYS, NULL, Stream(ctx))

Specifically, I have the following configuration:

providers = [
    (
        "CANNExecutionProvider",
        {
            "device_id": 0,
            "arena_extend_strategy": "kNextPowerOfTwo",
            "enable_cann_graph": False,
        },
    ),
    "CPUExecutionProvider",
]

If I change the 'enable_cann_graph' setting to True, it reports the following error instead:

2023-12-13 20:10:22.459251611 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running torch_jit_9127997404873590405_0 node. Name:'CANNExecutionProvider_torch_jit_9127997404873590405_0_0' Status Message: /root/onnxruntime/onnxruntime/core/providers/cann/cann_call.cc:143 bool onnxruntime::CannCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = int; bool THRW = true] /root/onnxruntime/onnxruntime/core/providers/cann/cann_call.cc:137 bool onnxruntime::CannCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = int; bool THRW = true] CANN failure -1: (look for ACL_ERROR_xxx in acl.h) ; NPU=0 ; hostname=ecs-b520 ; expr=ge::aclgrphBuildInitialize(options);


Traceback (most recent call last):
  File "error.py", line 27, in <module>
    print(predict())
  File "error.py", line 23, in predict
    preds = model.run(None, {"input.1": img})[0]
  File "/usr/local/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 217, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running torch_jit_9127997404873590405_0 node. Name:'CANNExecutionProvider_torch_jit_9127997404873590405_0_0' Status Message: /root/onnxruntime/onnxruntime/core/providers/cann/cann_call.cc:143 bool onnxruntime::CannCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = int; bool THRW = true] /root/onnxruntime/onnxruntime/core/providers/cann/cann_call.cc:137 bool onnxruntime::CannCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = int; bool THRW = true] CANN failure -1: (look for ACL_ERROR_xxx in acl.h) ; NPU=0 ; hostname=ecs-b520 ; expr=ge::aclgrphBuildInitialize(options);

There is no error while using CPUExecutionProvider.
Please help me, thanks!

To reproduce

My code:

import onnxruntime as ort
import numpy as np

providers = [
    (
        "CANNExecutionProvider",
        {
            "device_id": 0,
            "arena_extend_strategy": "kNextPowerOfTwo",
            "enable_cann_graph": False,
        },
    ),
    "CPUExecutionProvider",
]
#providers = ["CPUExecutionProvider"]
model = ort.InferenceSession('model_0.onnx', providers=providers)


def predict():
    # fake image
    img = np.random.random((1, 3, 1024, 1024)).astype(np.float16)
    # inference
    preds = model.run(None, {"input.1": img})[0]
    return preds


print(predict())

and here is my model

Urgency

No response

Platform

Linux

OS Version

Ubuntu 18.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.15.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CANN

Execution Provider Library Version

CANN 7.0.0

github-actions bot added the ep:ACL label Dec 13, 2023

H3AlO3 commented Dec 13, 2023

I tried another model with a single Linear layer

import torch.nn as nn


class X(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 2)

    def forward(self, x):
        return self.linear(x)

and it works, but if you replace the Linear layer with Conv2d

class X(nn.Module):
    def __init__(self):
        super().__init__()
        # Conv2d(in_channels=2, out_channels=2, kernel_size=2)
        self.conv = nn.Conv2d(2, 2, 2)

    def forward(self, x):
        return self.conv(x)

the error occurs again. So I think the problem may be in the convolution operator?
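For anyone who wants to repeat this comparison, here is a minimal sketch of how the single-Conv2d model could be exported and run. The helper names, the file name `conv.onnx`, the input name `input`, and the input shape are assumptions for illustration; they are not taken from the original report, and the float32 input differs from the float16 input used in the full reproduction script.

```python
# Sketch only: helper names, file name, input name and shape are assumptions.
INPUT_SHAPE = (1, 2, 4, 4)  # NCHW input accepted by nn.Conv2d(2, 2, 2)


def export_conv_model(path="conv.onnx"):
    """Export a model with a single Conv2d layer to ONNX (requires torch)."""
    import torch
    import torch.nn as nn

    class X(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv = nn.Conv2d(2, 2, 2)

        def forward(self, x):
            return self.conv(x)

    dummy = torch.randn(*INPUT_SHAPE)
    torch.onnx.export(
        X().eval(), dummy, path, input_names=["input"], output_names=["output"]
    )
    return path


def run_on(path, providers):
    """Run the exported model once with the given providers (requires onnxruntime)."""
    import numpy as np
    import onnxruntime as ort

    sess = ort.InferenceSession(path, providers=providers)
    x = np.random.random(INPUT_SHAPE).astype(np.float32)
    return sess.run(None, {"input": x})[0]
```

Calling `run_on(export_conv_model(), ["CPUExecutionProvider"])` and then repeating the call with `"CANNExecutionProvider"` first in the list should show whether the failure is specific to the CANN provider.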

H3AlO3 commented Dec 15, 2023

It's a bug in the cann-opp installer. I have solved it.

H3AlO3 closed this as completed Dec 15, 2023

kingsley-gl commented

It's a bug in the cann-opp installer. I have solved it.

I have run into the same problem. How did you solve it?

H3AlO3 commented Mar 7, 2024

It's a bug in the cann-opp installer. I have solved it.

I have run into the same problem. How did you solve it?

I manually unzipped the opp installer and copied some of its files into place. That didn't actually solve the problem completely, though; it just turned into another error, so I eventually gave up.

kingsley-gl commented

It's a bug in the cann-opp installer. I have solved it.

I have run into the same problem. How did you solve it?

I manually unzipped the opp installer and copied some of its files into place. That didn't actually solve the problem completely, though; it just turned into another error, so I eventually gave up.

ok, thanks a lot

kingsley-gl commented

It's a bug in the cann-opp installer. I have solved it.

I have run into the same problem. How did you solve it?

I manually unzipped the opp installer and copied some of its files into place. That didn't actually solve the problem completely, though; it just turned into another error, so I eventually gave up.

I solved it. You can turn on the ACL log by setting export ASCEND_SLOG_PRINT_TO_STDOUT=1, and turn on the onnxruntime log by adding sess_opt.log_severity_level = 0 to your code. The logs will then show the real error, such as no module 'tbe' and so on. The code runs fine once those errors are fixed one by one.
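The two logging switches mentioned above could be wired into the reproduction script roughly as follows. This is a sketch under assumptions: the helper name `make_verbose_session` is made up for illustration, and setting the variable from Python relies on the CANN libraries reading it when they are loaded, which this thread does not confirm. Exporting it in the shell before starting Python, as described above, is the safer route.

```python
import os

# Assumption: the Ascend logging variable is read when the CANN libraries load,
# so it is set here before onnxruntime is imported.
os.environ["ASCEND_SLOG_PRINT_TO_STDOUT"] = "1"

# Same provider configuration as in the original report.
providers = [
    (
        "CANNExecutionProvider",
        {
            "device_id": 0,
            "arena_extend_strategy": "kNextPowerOfTwo",
            "enable_cann_graph": False,
        },
    ),
    "CPUExecutionProvider",
]


def make_verbose_session(model_path, providers):
    """Create an InferenceSession with onnxruntime logging set to VERBOSE."""
    import onnxruntime as ort  # imported late so the variable above is already set

    sess_opt = ort.SessionOptions()
    sess_opt.log_severity_level = 0  # 0 = VERBOSE
    return ort.InferenceSession(
        model_path, sess_options=sess_opt, providers=providers
    )
```

With this in place, `model = make_verbose_session("model_0.onnx", providers)` would replace the `ort.InferenceSession(...)` line in the original script, and the extra log output is where underlying errors such as the missing 'tbe' module would be expected to show up.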
