[Build] Encountered error executing OPs on RISCV platform #20030

Open
FZR95 opened this issue Mar 22, 2024 · 2 comments
Labels
contributions welcome lower priority issues for the core ORT teams

FZR95 commented Mar 22, 2024

Describe the issue

I am trying to build onnxruntime for a RISCV target platform (LicheeRV Nano). I successfully built libonnxruntime.so with the cross compiler (riscv64-unknown-linux-musl), and the build reported no errors.
The same test project and model run without errors on both a Raspberry Pi (ARM) and a PC, where the model accuracy is normal. On the RISCV target platform, however, accuracy drops to about 15%, which means the output is essentially wrong.

I have tried the following build options, but none of them fixed the problem:
-Donnxruntime_DISABLE_CONTRIB_OPS=ON
-Donnxruntime_DONT_VECTORIZE=ON
-Donnxruntime_DISABLE_ABSEIL=ON

The model is a simple 3-layer CNN consisting of conv1d, relu, and maxpool. I can at least confirm that the problem appears after the first convolutional block, although I can't be sure the conv OP itself is at fault.

I'm out of ideas at the moment, so I'd appreciate any suggestions!
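One way to narrow down which node first goes wrong is to dump the per-node outputs for the same input on both platforms and compare them in graph order. The following is only a minimal sketch of that comparison step, under stated assumptions: the node names (`conv1`, `relu1`, `pool1`), the sample values, and the helper `first_divergence` are hypothetical, and how the intermediate tensors are obtained on each platform is not shown here.

```python
import math

def first_divergence(reference, candidate, node_order, rel_tol=1e-3, abs_tol=1e-3):
    """Return (node, index, ref_value, cand_value) for the first mismatch, else None.

    reference / candidate map a node name to its flattened output values.
    node_order lists the node names in execution order.
    """
    for name in node_order:
        ref_vals = reference[name]
        cand_vals = candidate[name]
        if len(ref_vals) != len(cand_vals):
            # A size mismatch is reported with index None and the two lengths.
            return (name, None, len(ref_vals), len(cand_vals))
        for i, (r, c) in enumerate(zip(ref_vals, cand_vals)):
            if not math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol):
                return (name, i, r, c)
    return None

# Hypothetical flattened outputs for one input sample (not real measurements).
pc = {"conv1": [0.5, 1.25, -0.75], "relu1": [0.5, 1.25, 0.0], "pool1": [1.25]}
riscv = {"conv1": [0.5, 0.31, -0.75], "relu1": [0.5, 0.31, 0.0], "pool1": [0.5]}

order = ["conv1", "relu1", "pool1"]
result = first_divergence(pc, riscv, order)
print(result)
```

With real dumps, the first reported node is the earliest place where the two platforms disagree, which separates a faulty op from errors that merely propagate downstream.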

Urgency

No response

Target platform

RISCV (LicheeRV Nano)

Build script

cmake ../cmake \
  -Donnxruntime_GCC_STATIC_CPP_RUNTIME=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -Donnxruntime_BUILD_SHARED_LIB=ON \
  -Donnxruntime_BUILD_UNIT_TESTS=OFF \
  -Donnxruntime_ENABLE_CPUINFO=OFF \
  -DCMAKE_TOOLCHAIN_FILE=linux_lichee_crosscompile_toolchain.cmake

linux_lichee_crosscompile_toolchain.cmake

SET(CMAKE_SYSTEM_NAME Linux)
SET(CMAKE_SYSTEM_VERSION 1)
set(tools "/opt/host-tools/gcc/riscv64-linux-musl-x86_64")
set(CMAKE_C_COMPILER ${tools}/bin/riscv64-unknown-linux-musl-gcc)
set(CMAKE_CXX_COMPILER ${tools}/bin/riscv64-unknown-linux-musl-g++)
set(CMAKE_SYSTEM_PROCESSOR riscv64)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -mcpu=c906fdv -march=rv64imafdcv0p7xthead -mcmodel=medany -mabi=lp64d")
SET(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
SET(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
SET(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
SET(CMAKE_FIND_ROOT_PATH_MODE_PACKAGE ONLY)

Error / output

RISCV platform:
ONNX Runtime version: 1.18.0
Input Node Name/Shape (0):
input : 1x9x128
Output Node Name/Shape (0):
output : -1x6
Accuracy: 15.23 %

PC platform:
ONNX Runtime version: 1.18.0
Input Node Name/Shape (0):
input : 1x9x128
Output Node Name/Shape (0):
output : -1x6
Accuracy: 86.09 %
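
For context on the two accuracy figures, the output shape `-1x6` suggests a 6-class classifier. The small calculation below compares both results with the accuracy of uniform random guessing. It assumes the test set is roughly balanced across classes, which the report does not state.

```python
# Number of classes, taken from the reported output shape -1x6.
num_classes = 6

# Accuracy of uniform random guessing, assuming balanced classes (an assumption).
chance = 100.0 / num_classes

reported = {"RISCV": 15.23, "PC": 86.09}
for platform, accuracy in reported.items():
    delta = accuracy - chance
    print(f"{platform}: {accuracy:.2f} % (chance {chance:.2f} %, delta {delta:+.2f})")
```

Under that assumption, the RISCV result sits at roughly chance level, which is consistent with the description of the output as completely wrong.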

Visual Studio Version

No response

GCC / Compiler Version

C++ compiler version : 10.2.0

FZR95 added the "build" label (build issues; typically submitted using template) on Mar 22, 2024
snnn added the "contributions welcome" label (lower priority issues for the core ORT teams) and removed the "build" label on Mar 22, 2024
snnn (Member) commented Mar 22, 2024

Sorry, our team doesn't have access to this kind of hardware, so we cannot debug the issue. If you find out where the problem is, you are welcome to help us fix it.

@NobuoTsukamoto

I'm having what seems to be a similar issue (v1.18.0).
I have confirmed it on both VisionFive 2 and Yocto QEMU.

Target platform

  • VisionFive 2 (Debian - 202405 Release)
  • Yocto QEMU (5.0+snapshot-a41948e86dd6d03beea43827c4f2bf274dd767a3)

Error / output

$ python3 label_image.py --model mobilenetv2-10.onnx --label ./synset.txt --image kitten.jpg 

Result: VisionFive 2 and Yocto QEMU

Inference result:
  class=n03868863 oxygen mask ; probability=6.158878
  class=n03045698 cloak ; probability=5.869998
  class=n03196217 digital clock ; probability=5.623226
  class=n03770439 miniskirt, mini ; probability=5.581253
  class=n04254680 soccer ball ; probability=5.542237

Expected behavior
The top inference result should be "tabby, tabby cat".
I get the correct results on a Raspberry Pi and on an x86 PC.
When I ran onnx_test_runner on Yocto QEMU, the gemm_activation_fusion test failed, with results that differ from x86.
The same test does not fail on an x86 PC.

2024-06-12 06:59:47.394317600 [E:onnxruntime:Default, dataitem_request.cc:212 RunImpl] gemm_activation_fusion:output=z:expected 2.34025 (4015c6b4), got 0.389875 (3ec79db5), diff: 1.95038, tol=0.00334025 idx=4. 12 of 12 differ
2024-06-12 06:59:47.395344300 [E:onnxruntime:Default, testcase_request.cc:194 CalculateAndLogStats] gemm_activation_fusion: result differs. Dataset:/usr/share/onnxruntime/test/testdata/transform/gemm_activation_fusion/test_data_set_0
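
The hex values in parentheses in the first log line are the raw float32 bit patterns of the expected and actual outputs. The sketch below decodes them and recomputes the difference, to show that the mismatch is far larger than rounding noise. The tolerance formula in the code is an assumption: `atol + rtol * |expected|` with both set to 1e-3 happens to reproduce the logged `tol` value, but the source does not confirm that this is how the test runner computes it.

```python
import struct

def f32_from_hex(bits_hex):
    """Interpret 8 hex digits as a big-endian IEEE 754 binary32 value."""
    return struct.unpack(">f", bytes.fromhex(bits_hex))[0]

# Bit patterns copied from the log line above.
expected = f32_from_hex("4015c6b4")
got = f32_from_hex("3ec79db5")
diff = abs(expected - got)

# Assumed tolerance form (not confirmed by the source): atol + rtol * |expected|.
tol = 1e-3 + 1e-3 * abs(expected)

print(f"expected={expected:.6g} got={got:.6g} diff={diff:.6g} tol={tol:.6g}")
print("within tolerance" if diff <= tol else "outside tolerance")
```

The difference is hundreds of times larger than the tolerance, and the log reports that all 12 output elements differ, so this does not look like ordinary floating-point rounding variation between platforms.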
