[Build] AIX tests failures for Multi-Lora feature #22364

Closed
ranjitshs opened this issue Oct 9, 2024 · 5 comments · Fixed by #22375
@ranjitshs
Contributor

Describe the issue

As mentioned in #22046, the tests below fail on AIX. I expect similar failures in the Python bindings.
This issue tracks these test failures.

1: [ RUN      ] LoraAdapterTest.Load
1: unknown file: Failure
1: C++ exception with description "/home/buildusr/jenkins/workspace/onnxruntime-gcc/onnxruntime/onnxruntime/core/common/safeint.h:17 static void SafeIntExceptionHandler<onnxruntime::OnnxRuntimeException>::SafeIntOnOverflow() Integer overflow
1: " thrown in the test body.
1: 
1: [  FAILED  ] LoraAdapterTest.Load (27 ms)



4: [ RUN      ] CApiTest.RunWithLoraAdapterFromFile
4: unknown file: Failure
4: C++ exception with description "/home/buildusr/jenkins/workspace/onnxruntime-gcc/onnxruntime/onnxruntime/core/common/safeint.h:17 static void SafeIntExceptionHandler<onnxruntime::OnnxRuntimeException>::SafeIntOnOverflow() Integer overflow
4: " thrown in the test body.
4: 
4: [  FAILED  ] CApiTest.RunWithLoraAdapterFromFile (0 ms)
4: [ RUN      ] CApiTest.RunWithLoraAdapterFromArray
4: unknown file: Failure
4: C++ exception with description "/home/buildusr/jenkins/workspace/onnxruntime-gcc/onnxruntime/onnxruntime/core/common/safeint.h:17 static void SafeIntExceptionHandler<onnxruntime::OnnxRuntimeException>::SafeIntOnOverflow() Integer overflow
4: " thrown in the test body.
4: 
4: [  FAILED  ] CApiTest.RunWithLoraAdapterFromArray (0 ms)
4: [ RUN      ] CApiTest.RunBaseLoraModel

Urgency

No response

Target platform

AIX

Build script

AIX build instructions are available at https://onnxruntime.ai/docs/build/inferencing.html

Error / output

1: [ RUN      ] LoraAdapterTest.Load
1: unknown file: Failure
1: C++ exception with description "/home/buildusr/jenkins/workspace/onnxruntime-gcc/onnxruntime/onnxruntime/core/common/safeint.h:17 static void SafeIntExceptionHandler<onnxruntime::OnnxRuntimeException>::SafeIntOnOverflow() Integer overflow
1: " thrown in the test body.
1: 
1: [  FAILED  ] LoraAdapterTest.Load (27 ms)



4: [ RUN      ] CApiTest.RunWithLoraAdapterFromFile
4: unknown file: Failure
4: C++ exception with description "/home/buildusr/jenkins/workspace/onnxruntime-gcc/onnxruntime/onnxruntime/core/common/safeint.h:17 static void SafeIntExceptionHandler<onnxruntime::OnnxRuntimeException>::SafeIntOnOverflow() Integer overflow
4: " thrown in the test body.
4: 
4: [  FAILED  ] CApiTest.RunWithLoraAdapterFromFile (0 ms)
4: [ RUN      ] CApiTest.RunWithLoraAdapterFromArray
4: unknown file: Failure
4: C++ exception with description "/home/buildusr/jenkins/workspace/onnxruntime-gcc/onnxruntime/onnxruntime/core/common/safeint.h:17 static void SafeIntExceptionHandler<onnxruntime::OnnxRuntimeException>::SafeIntOnOverflow() Integer overflow
4: " thrown in the test body.
4: 
4: [  FAILED  ] CApiTest.RunWithLoraAdapterFromArray (0 ms)
4: [ RUN      ] CApiTest.RunBaseLoraModel

Visual Studio Version

No response

GCC / Compiler Version

10.3

@ranjitshs ranjitshs added the build build issues; typically submitted using template label Oct 9, 2024
@ranjitshs
Contributor Author

@snnn @tianleiwu @yuslepukhin
FYI.

I did some debugging on the failures above.

  • For the first test, LoraAdapterTest.Load, the adapter shape is created during execution by FlatBuffers CreateVector, which byte-swaps on a BE platform.
    As a result, the method CreateOrtValueOverLoraParameter receives the shape as very large values, which causes InitOrtValue to throw an exception and fail.
      shape size2
      0:576460752303423488
      1:288230376151711744

After swapping the shape, this test passes.
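
The two shape values printed above are consistent with small dimensions whose bytes were read in the wrong order. The following is a minimal sketch in Python, not the onnxruntime code: the helper names are hypothetical, and the dimensions [8, 4] are inferred from the arithmetic rather than stated in this thread.

```python
import struct


def misread_as_big_endian(value: int) -> int:
    """Store value as a little-endian int64, then reinterpret the bytes as big-endian."""
    return struct.unpack(">q", struct.pack("<q", value))[0]


def swap_int64(value: int) -> int:
    """Reverse the byte order of a signed 64-bit integer."""
    return int.from_bytes(value.to_bytes(8, "big", signed=True), "little", signed=True)


# Values reported in the debug output above.
reported = [576460752303423488, 288230376151711744]

# Swapping the bytes back recovers small, plausible dimensions.
recovered = [swap_int64(dim) for dim in reported]
print(recovered)  # [8, 4]

# The reverse direction reproduces the huge values seen on the BE machine.
print([misread_as_big_endian(dim) for dim in recovered])
```

Values of this size would explain the SafeInt overflow when the tensor size is computed from the shape.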

  • For CApiTest.RunWithLoraAdapterFromFile and CApiTest.RunWithLoraAdapterFromArray,
    it's again a BE platform issue: we are trying to parse an adapter file generated on an LE system.
    In this case, along with the shape, we also need to consider raw_data, which is read from the file.
    Both need to be changed on a BE system.

For now I assumed raw_data is float (it could be any other supported type) and swapped the raw_data buffer to see whether the test passes.
The exception is no longer thrown because the shape is now correct, but the test still fails:

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 126, which exceeds 0.06, where
expected_output[i] evaluates to 154,
data[i] evaluates to 28, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 144, which exceeds 0.06, where
expected_output[i] evaluates to 176,
data[i] evaluates to 32, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 162, which exceeds 0.06, where
expected_output[i] evaluates to 198,
data[i] evaluates to 36, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 180, which exceeds 0.06, where
expected_output[i] evaluates to 220,
data[i] evaluates to 40, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 126, which exceeds 0.06, where
expected_output[i] evaluates to 154,
data[i] evaluates to 28, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 144, which exceeds 0.06, where
expected_output[i] evaluates to 176,
data[i] evaluates to 32, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 162, which exceeds 0.06, where
expected_output[i] evaluates to 198,
data[i] evaluates to 36, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 180, which exceeds 0.06, where
expected_output[i] evaluates to 220,
data[i] evaluates to 40, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 126, which exceeds 0.06, where
expected_output[i] evaluates to 154,
data[i] evaluates to 28, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 144, which exceeds 0.06, where
expected_output[i] evaluates to 176,
data[i] evaluates to 32, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 162, which exceeds 0.06, where
expected_output[i] evaluates to 198,
data[i] evaluates to 36, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 180, which exceeds 0.06, where
expected_output[i] evaluates to 220,
data[i] evaluates to 40, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 126, which exceeds 0.06, where
expected_output[i] evaluates to 154,
data[i] evaluates to 28, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 144, which exceeds 0.06, where
expected_output[i] evaluates to 176,
data[i] evaluates to 32, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 162, which exceeds 0.06, where
expected_output[i] evaluates to 198,
data[i] evaluates to 36, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 180, which exceeds 0.06, where
expected_output[i] evaluates to 220,
data[i] evaluates to 40, and
0.06 evaluates to 0.059999999999999998.
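
Because raw_data can hold any supported element type, a byte swap has to be applied per element and with the correct element width. The sketch below illustrates only that point; `swap_elements` is a hypothetical helper, the sample values are made up, and it does not claim to explain the specific mismatches in the log above.

```python
import struct


def swap_elements(raw: bytes, elem_size: int) -> bytes:
    """Reverse the bytes of each elem_size-wide element in raw."""
    if len(raw) % elem_size != 0:
        raise ValueError("buffer length is not a multiple of the element size")
    return b"".join(
        raw[i:i + elem_size][::-1] for i in range(0, len(raw), elem_size)
    )


values = [1.0, 2.5, -3.0, 4.0]

# Bytes as they would appear in a file written on an LE system.
le_bytes = struct.pack("<4f", *values)

# Correct: swap with the real element width (4 bytes for float32).
be_bytes = swap_elements(le_bytes, 4)
print(struct.unpack(">4f", be_bytes))

# Incorrect: swapping with the wrong width yields unrelated numbers.
wrong = struct.unpack(">4f", swap_elements(le_bytes, 2))
print(wrong)
```

A real fix would take the element size from the tensor's data type instead of assuming float.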

So, as I understand it, we need to handle both of the cases below.

  • adapter content generated on BE at run time
  • handling of an adapter file generated on LE.

Let me know your thoughts.

@snnn snnn added core runtime issues related to core runtime and removed build build issues; typically submitted using template labels Oct 9, 2024
@yuslepukhin
Member

The fix is coming shortly.

@yuslepukhin
Member

Please, try the above branch and see if this works for you.

@ranjitshs
Contributor Author

@yuslepukhin
Thank you for the quick response and for providing a working solution for BE.
I see both tests are passing now. I have also verified the Python-related tests, and they look good.

# ./onnxruntime_test_all "--gtest_filter=LoraAdapterTest.Load"
Note: Google Test filter = LoraAdapterTest.Load
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from LoraAdapterTest
[ RUN      ] LoraAdapterTest.Load
[       OK ] LoraAdapterTest.Load (0 ms)
[----------] 1 test from LoraAdapterTest (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (0 ms total)
[  PASSED  ] 1 test.

(0) root @ aixoss1-lp6: /usr/onnxruntime/build/Linux/Release
# ./onnxruntime_shared_lib_test "--gtest_filter=CApiTest.RunWithLoraAdapterFromFile"
Note: Google Test filter = CApiTest.RunWithLoraAdapterFromFile
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from CApiTest
[ RUN      ] CApiTest.RunWithLoraAdapterFromFile
2024-10-10 02:21:49.241850000 [W:onnxruntime:, graph.cc:1348 Graph] Initializer lora_param_an appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py
2024-10-10 02:21:49.241937000 [W:onnxruntime:, graph.cc:1348 Graph] Initializer lora_param_b appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
[       OK ] CApiTest.RunWithLoraAdapterFromFile (371 ms)
[----------] 1 test from CApiTest (371 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (371 ms total)
[  PASSED  ] 1 test.

(0) root @ aixoss1-lp6: /usr/onnxruntime/build/Linux/Release
# ./onnxruntime_shared_lib_test "--gtest_filter=CApiTest.RunWithLoraAdapterFromArray"
Note: Google Test filter = CApiTest.RunWithLoraAdapterFromArray
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from CApiTest
[ RUN      ] CApiTest.RunWithLoraAdapterFromArray
2024-10-10 02:21:54.222042000 [W:onnxruntime:, graph.cc:1348 Graph] Initializer lora_param_an appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py
2024-10-10 02:21:54.222121000 [W:onnxruntime:, graph.cc:1348 Graph] Initializer lora_param_b appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
[       OK ] CApiTest.RunWithLoraAdapterFromArray (36 ms)
[----------] 1 test from CApiTest (36 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (36 ms total)
[  PASSED  ] 1 test.

@ranjitshs
Contributor Author

@skottmckay
FYI.

yuslepukhin added a commit that referenced this issue Oct 11, 2024
…2375)

### Description
<!-- Describe your changes. -->
FlatBuffers always writes data in LE and automatically translates it
to/from BE as needed,
but only if we use the proper accessors. This works for the shape.
However, we store parameters as bytes, so we need to swap bytes as
needed for BE.

### Motivation and Context
Address #22364
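
The distinction drawn in the description, typed accessors versus opaque bytes, can be sketched as follows. This is an illustration in Python under stated assumptions, not the FlatBuffers or onnxruntime API: the payload layout (two int64 dims followed by four float32 values) is invented for the example.

```python
import struct
import sys
from array import array

# Hypothetical payload, always serialized as little-endian.
payload = struct.pack("<2q", 8, 4) + struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)

# Accessor-style read: the byte order is stated explicitly, so the result
# is correct on any host without further work.
dims = list(struct.unpack_from("<2q", payload, 0))

# Raw-bytes read: the buffer is copied as-is into native float storage,
# so a big-endian host has to swap each element afterwards.
params = array("f")
params.frombytes(payload[16:])
if sys.byteorder == "big":
    params.byteswap()

print(dims, list(params))
```

On a little-endian host the swap branch is skipped, which is why the problem only showed up on AIX.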
guschmue pushed a commit that referenced this issue Oct 18, 2024
ishwar-raut1 pushed a commit to ishwar-raut1/onnxruntime that referenced this issue Nov 19, 2024