[Build] AIX tests failures for Multi-Lora feature #22364

ranjitshs · 2024-10-09T14:24:28Z

Describe the issue

As mentioned in #22046, the tests below fail on AIX. I expect similar failures in the Python bindings.
This issue tracks these test failures.

1: [ RUN      ] LoraAdapterTest.Load
1: unknown file: Failure
1: C++ exception with description "/home/buildusr/jenkins/workspace/onnxruntime-gcc/onnxruntime/onnxruntime/core/common/safeint.h:17 static void SafeIntExceptionHandler<onnxruntime::OnnxRuntimeException>::SafeIntOnOverflow() Integer overflow
1: " thrown in the test body.
1: 
1: [  FAILED  ] LoraAdapterTest.Load (27 ms)



4: [ RUN      ] CApiTest.RunWithLoraAdapterFromFile
4: unknown file: Failure
4: C++ exception with description "/home/buildusr/jenkins/workspace/onnxruntime-gcc/onnxruntime/onnxruntime/core/common/safeint.h:17 static void SafeIntExceptionHandler<onnxruntime::OnnxRuntimeException>::SafeIntOnOverflow() Integer overflow
4: " thrown in the test body.
4: 
4: [  FAILED  ] CApiTest.RunWithLoraAdapterFromFile (0 ms)
4: [ RUN      ] CApiTest.RunWithLoraAdapterFromArray
4: unknown file: Failure
4: C++ exception with description "/home/buildusr/jenkins/workspace/onnxruntime-gcc/onnxruntime/onnxruntime/core/common/safeint.h:17 static void SafeIntExceptionHandler<onnxruntime::OnnxRuntimeException>::SafeIntOnOverflow() Integer overflow
4: " thrown in the test body.
4: 
4: [  FAILED  ] CApiTest.RunWithLoraAdapterFromArray (0 ms)
4: [ RUN      ] CApiTest.RunBaseLoraModel

Urgency

No response

Target platform

AIX

Build script

AIX build instructions are available at https://onnxruntime.ai/docs/build/inferencing.html

Visual Studio Version

No response

GCC / Compiler Version

10.3

ranjitshs · 2024-10-09T14:46:23Z

@snnn @tianleiwu @yuslepukhin
FYI.

I did some debugging on the failures above.

For the first test, LoraAdapterTest.Load, the adapter shape is created at run time by FlatBuffers CreateVector, which byte-swaps on big-endian (BE) platforms.
As a result, CreateOrtValueOverLoraParameter receives the shape as very large values, which causes InitOrtValue to throw an exception and the test to fail.

      shape size2
      0:576460752303423488
      1:288230376151711744

After swapping the shape, this test passes.
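
As a quick sanity check of the byte-swapping explanation, the two reported shape values can be reversed as signed 64-bit integers. This is a minimal Python sketch, not code from onnxruntime; the helper name `byteswap_int64` is hypothetical, and the thread does not state the adapter's real shape, so the result below is only what swapping the reported numbers yields.

```python
import struct


def byteswap_int64(value: int) -> int:
    """Reverse the byte order of a signed 64-bit integer."""
    # Pack as big-endian, then reinterpret the same 8 bytes as little-endian.
    return struct.unpack("<q", struct.pack(">q", value))[0]


# Shape values exactly as reported in the debug output above.
observed_shape = [576460752303423488, 288230376151711744]

recovered_shape = [byteswap_int64(dim) for dim in observed_shape]
print(recovered_shape)  # [8, 4]
```

Both observed values are powers of two with only the top byte set (2^59 and 2^58), which is consistent with small dimensions whose bytes were read in the wrong order.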

CApiTest.RunWithLoraAdapterFromFile and CApiTest.RunWithLoraAdapterFromArray are again a BE platform issue: we are trying to parse an adapter file generated on a little-endian (LE) system.
In this case, along with the shape, we also need to consider raw_data, which is read from the file.
Both need to be byte-swapped on a BE system.

For now I assumed raw_data is float (it can be any other supported type) and swapped the raw_data buffer to see whether the test passes.
The exception is no longer thrown because the shape is now correct, but the test still fails:

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 126, which exceeds 0.06, where
expected_output[i] evaluates to 154,
data[i] evaluates to 28, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 144, which exceeds 0.06, where
expected_output[i] evaluates to 176,
data[i] evaluates to 32, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 162, which exceeds 0.06, where
expected_output[i] evaluates to 198,
data[i] evaluates to 36, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 180, which exceeds 0.06, where
expected_output[i] evaluates to 220,
data[i] evaluates to 40, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 126, which exceeds 0.06, where
expected_output[i] evaluates to 154,
data[i] evaluates to 28, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 144, which exceeds 0.06, where
expected_output[i] evaluates to 176,
data[i] evaluates to 32, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 162, which exceeds 0.06, where
expected_output[i] evaluates to 198,
data[i] evaluates to 36, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 180, which exceeds 0.06, where
expected_output[i] evaluates to 220,
data[i] evaluates to 40, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 126, which exceeds 0.06, where
expected_output[i] evaluates to 154,
data[i] evaluates to 28, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 144, which exceeds 0.06, where
expected_output[i] evaluates to 176,
data[i] evaluates to 32, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 162, which exceeds 0.06, where
expected_output[i] evaluates to 198,
data[i] evaluates to 36, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 180, which exceeds 0.06, where
expected_output[i] evaluates to 220,
data[i] evaluates to 40, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 126, which exceeds 0.06, where
expected_output[i] evaluates to 154,
data[i] evaluates to 28, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 144, which exceeds 0.06, where
expected_output[i] evaluates to 176,
data[i] evaluates to 32, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 162, which exceeds 0.06, where
expected_output[i] evaluates to 198,
data[i] evaluates to 36, and
0.06 evaluates to 0.059999999999999998.

/home/buildusr/onnxruntime/onnxruntime/test/shared_lib/test_inference.cc:4445: Failure
The difference between expected_output[i] and data[i] is 180, which exceeds 0.06, where
expected_output[i] evaluates to 220,
data[i] evaluates to 40, and
0.06 evaluates to 0.059999999999999998.

So, as per my understanding, we need to handle both of the cases below:

adapter content generated on BE at run time
handling of adapter file generated on LE.
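
For the second case, the raw parameter buffer has to be swapped element by element, using the element size of the tensor's data type. The following is a rough Python sketch of that idea under the float assumption mentioned earlier; `swap_elements` is a hypothetical helper and the sample values are illustrative, not data from the test adapter.

```python
import struct


def swap_elements(raw: bytes, element_size: int) -> bytes:
    """Reverse the byte order of every fixed-size element in a raw buffer."""
    if len(raw) % element_size != 0:
        raise ValueError("buffer length is not a multiple of the element size")
    out = bytearray(len(raw))
    for start in range(0, len(raw), element_size):
        end = start + element_size
        out[start:end] = raw[start:end][::-1]
    return bytes(out)


# Illustrative float32 parameters serialized in little-endian order,
# as they would appear in a file written on an LE machine.
le_bytes = struct.pack("<3f", 1.0, 2.5, -4.0)

# After swapping each 4-byte element, a big-endian read sees the same values.
be_bytes = swap_elements(le_bytes, 4)
values = struct.unpack(">3f", be_bytes)
print(values)  # (1.0, 2.5, -4.0)
```

The element size matters: swapping a float32 buffer as if it held 8-byte elements would scramble neighbouring values, so the swap has to follow the actual data type of raw_data.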

Let me know your thoughts.

yuslepukhin · 2024-10-09T23:17:18Z

The fix is coming shortly.

yuslepukhin · 2024-10-10T00:36:52Z

Please, try the above branch and see if this works for you.

ranjitshs · 2024-10-10T12:10:45Z

@yuslepukhin
Thank you for the quick response and for providing a working solution for BE.
The failing tests are passing now. I have also verified the Python-related tests, and they look good.

# ./onnxruntime_test_all "--gtest_filter=LoraAdapterTest.Load"
Note: Google Test filter = LoraAdapterTest.Load
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from LoraAdapterTest
[ RUN      ] LoraAdapterTest.Load
[       OK ] LoraAdapterTest.Load (0 ms)
[----------] 1 test from LoraAdapterTest (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (0 ms total)
[  PASSED  ] 1 test.

(0) root @ aixoss1-lp6: /usr/onnxruntime/build/Linux/Release
# ./onnxruntime_shared_lib_test "--gtest_filter=CApiTest.RunWithLoraAdapterFromFile"
Note: Google Test filter = CApiTest.RunWithLoraAdapterFromFile
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from CApiTest
[ RUN      ] CApiTest.RunWithLoraAdapterFromFile
2024-10-10 02:21:49.241850000 [W:onnxruntime:, graph.cc:1348 Graph] Initializer lora_param_an appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py
2024-10-10 02:21:49.241937000 [W:onnxruntime:, graph.cc:1348 Graph] Initializer lora_param_b appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
[       OK ] CApiTest.RunWithLoraAdapterFromFile (371 ms)
[----------] 1 test from CApiTest (371 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (371 ms total)
[  PASSED  ] 1 test.

(0) root @ aixoss1-lp6: /usr/onnxruntime/build/Linux/Release
# ./onnxruntime_shared_lib_test "--gtest_filter=CApiTest.RunWithLoraAdapterFromArray"
Note: Google Test filter = CApiTest.RunWithLoraAdapterFromArray
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from CApiTest
[ RUN      ] CApiTest.RunWithLoraAdapterFromArray
2024-10-10 02:21:54.222042000 [W:onnxruntime:, graph.cc:1348 Graph] Initializer lora_param_an appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py
2024-10-10 02:21:54.222121000 [W:onnxruntime:, graph.cc:1348 Graph] Initializer lora_param_b appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
[       OK ] CApiTest.RunWithLoraAdapterFromArray (36 ms)
[----------] 1 test from CApiTest (36 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (36 ms total)
[  PASSED  ] 1 test.

ranjitshs · 2024-10-10T12:11:03Z

@skottmckay
FYI.

…2375)
### Description
FlatBuffers always writes data in LE, and it is automatically translated to/from BE as needed, but only if we use the proper accessors. This works for the shape. However, we store parameters as bytes, so we need to swap bytes as needed for BE.
### Motivation and Context
Address #22364
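
The distinction drawn in this description can be illustrated outside FlatBuffers. In the Python sketch below, a typed read with an explicit little-endian format plays the role of a "proper accessor" and gives the same result on any host, while raw bytes interpreted in host order need a conditional swap on BE. The function name `load_float32_params` and the sample data are assumptions for illustration; this is not the code from the pull request.

```python
import array
import struct
import sys

# Illustrative little-endian payloads, as they would sit in a serialized buffer.
shape_bytes = struct.pack("<2q", 8, 4)
param_bytes = struct.pack("<2f", 1.5, -2.0)

# Typed read with an explicit LE format: host byte order does not matter.
shape = list(struct.unpack("<2q", shape_bytes))


def load_float32_params(raw_le: bytes) -> list:
    """Interpret LE float32 bytes correctly on both LE and BE hosts."""
    values = array.array("f")
    values.frombytes(raw_le)  # bytes are interpreted in host byte order
    if sys.byteorder == "big":
        values.byteswap()  # undo the mismatch on big-endian hosts
    return values.tolist()


print(shape)
print(load_float32_params(param_bytes))
```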

ranjitshs added the build build issues; typically submitted using template label Oct 9, 2024

snnn added core runtime issues related to core runtime and removed build build issues; typically submitted using template labels Oct 9, 2024

snnn assigned yuslepukhin Oct 9, 2024

yuslepukhin mentioned this issue Oct 10, 2024

Accomodate BE platforms. Make sure we always write flatbuffers LE #22375

Merged

yuslepukhin linked a pull request Oct 10, 2024 that will close this issue

Accomodate BE platforms. Make sure we always write flatbuffers LE #22375

Merged

yuslepukhin closed this as completed in #22375 Oct 11, 2024
