Replace T4 to A10 in Linux GPU workflow #19205

mszhanyi · 2024-01-19T14:25:58Z

Description

Update the Linux GPU CI machine from T4 to A10 (sm=8.6).
Update the test tolerance.

Motivation and Context

Free up more T4 machines and test with a higher compute capability.
ORT enables TF32 in GEMM on A10/A100. TF32 causes precision loss and makes some model tests fail.
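To illustrate why TF32 loses precision, here is a minimal sketch that emulates the format in pure Python. TF32 keeps 10 explicit mantissa bits where float32 keeps 23. The helper name `to_tf32` is hypothetical, and truncating the low bits is a simplifying assumption; real hardware may round differently.

```python
import struct


def to_tf32(x: float) -> float:
    """Emulate TF32 by clearing the low 13 of the 23 float32 mantissa bits.

    Assumption: plain truncation. This is an illustration, not the
    conversion that the GPU actually performs.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # keep sign, exponent, and the top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dot(a, b, convert=lambda v: v):
    """Dot product whose inputs are optionally converted before multiplying."""
    return sum(convert(x) * convert(y) for x, y in zip(a, b))


if __name__ == "__main__":
    a = [0.1 * (i + 1) for i in range(64)]
    b = [1.0 / (i + 3) for i in range(64)]
    print("full precision inputs:", dot(a, b))
    print("TF32-like inputs:     ", dot(a, b, to_tf32))
```

Each input loses at most about one part in 2^10 under this emulation, and such errors can add up across the long accumulations found in GEMM and convolution.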

2024-01-19T13:27:18.8302842Z [ RUN      ] ModelTests/ModelTest.Run/cuda__models_zoo_opset12_SSD_ssd12
2024-01-19T13:27:25.8438153Z /onnxruntime_src/onnxruntime/test/providers/cpu/model_tests.cc:347: Failure
2024-01-19T13:27:25.8438641Z Expected equality of these values:
2024-01-19T13:27:25.8438841Z   COMPARE_RESULT::SUCCESS
2024-01-19T13:27:25.8439276Z     Which is: 4-byte object <00-00 00-00>
2024-01-19T13:27:25.8439464Z   ret.first
2024-01-19T13:27:25.8445514Z     Which is: 4-byte object <01-00 00-00>
2024-01-19T13:27:25.8445962Z expected 0.145984 (3e157cc1), got 0.975133 (3f79a24b), diff: 0.829149, tol=0.0114598 idx=375. 20 of 388 differ
2024-01-19T13:27:25.8446198Z 
2024-01-19T13:27:25.8555736Z [  FAILED  ] ModelTests/ModelTest.Run/cuda__models_zoo_opset12_SSD_ssd12, where GetParam() = "cuda_../models/zoo/opset12/SSD/ssd-12.onnx" (7025 ms)
2024-01-19T13:27:25.8556077Z [ RUN      ] ModelTests/ModelTest.Run/cuda__models_zoo_opset12_YOLOv312_yolov312
2024-01-19T13:27:29.3174318Z /onnxruntime_src/onnxruntime/test/providers/cpu/model_tests.cc:347: Failure
2024-01-19T13:27:29.3175144Z Expected equality of these values:
2024-01-19T13:27:29.3175389Z   COMPARE_RESULT::SUCCESS
2024-01-19T13:27:29.3175812Z     Which is: 4-byte object <00-00 00-00>
2024-01-19T13:27:29.3176080Z   ret.first
2024-01-19T13:27:29.3176322Z     Which is: 4-byte object <01-00 00-00>
2024-01-19T13:27:29.3178431Z expected 4.34958 (408b2fb8), got 4.51324 (40906c80), diff: 0.16367, tol=0.0534958 idx=9929. 22 of 42588 differ
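The `tol` values in the two logs above are consistent with a combined absolute and relative tolerance, `tol = atol + rtol * |expected|`, with both terms equal to 0.01. That reading is inferred from the log lines only and is an assumption, as are the function name and defaults in this minimal sketch:

```python
def within_tolerance(expected: float, actual: float,
                     atol: float = 0.01, rtol: float = 0.01):
    """Return (passed, tol) for a combined absolute/relative tolerance check.

    atol and rtol are assumptions inferred from the logged tol values,
    not values taken from the test source.
    """
    tol = atol + rtol * abs(expected)
    return abs(expected - actual) <= tol, tol


if __name__ == "__main__":
    # (expected, got) pairs copied from the SSD and YOLOv3 failures above.
    for expected, got in [(0.145984, 0.975133), (4.34958, 4.51324)]:
        passed, tol = within_tolerance(expected, got)
        print(f"expected={expected} got={got} tol={tol:.7f} passed={passed}")
```

Under this formula, a small expected value is dominated by the absolute term and a large one by the relative term.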

Some other tests, such as SSD, throw a different exception, so they are skipped.
'''
2024-01-22T09:07:40.8446910Z [ RUN ] ModelTests/ModelTest.Run/cuda__models_zoo_opset12_SSD_ssd12
2024-01-22T09:07:51.5587571Z /onnxruntime_src/onnxruntime/test/providers/cpu/model_tests.cc:358: Failure
2024-01-22T09:07:51.5588512Z Expected equality of these values:
2024-01-22T09:07:51.5588870Z COMPARE_RESULT::SUCCESS
2024-01-22T09:07:51.5589467Z Which is: 4-byte object <00-00 00-00>
2024-01-22T09:07:51.5589953Z ret.first
2024-01-22T09:07:51.5590462Z Which is: 4-byte object <01-00 00-00>
2024-01-22T09:07:51.5590841Z expected 1, got 63
'''

Please fix the lint error

tianleiwu · 2024-01-23T19:46:12Z

Could we disable TF32 (which sometimes leads to parity check failures) during testing on A10? For example, by setting the environment variable NVIDIA_TF32_OVERRIDE=0.
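A minimal sketch of that suggestion, assuming the tests are launched from a Python wrapper. The helper names are hypothetical, and the command passed in is a stand-in for whatever test binary the pipeline actually runs:

```python
import os
import subprocess
import sys


def env_with_tf32_disabled(base=None):
    """Return a copy of the environment with NVIDIA_TF32_OVERRIDE set to 0."""
    env = dict(os.environ if base is None else base)
    env["NVIDIA_TF32_OVERRIDE"] = "0"
    return env


def run_with_tf32_disabled(cmd):
    """Run a command (hypothetical test launcher) with TF32 overridden off."""
    return subprocess.run(cmd, env=env_with_tf32_disabled(),
                          check=True, capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in command: a child process that prints the variable it inherited.
    child = [sys.executable, "-c",
             "import os; print(os.environ['NVIDIA_TF32_OVERRIDE'])"]
    print(run_with_tf32_disabled(child).stdout.strip())
```

The variable is set only for the child process, so the parent environment and any other jobs on the machine keep their default behavior.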

snnn · 2024-01-23T19:51:34Z

Yes, that's another option. But users of onnxruntime usually do not set this environment variable. I think in general we should make sure ONNX Runtime still works well in the default setting.

mszhanyi added 15 commits January 17, 2024 14:34

linux gpu

8844cf3

compatible with dual cuda version

b7276d3

cuda 12

319708f

T4

ad8ced7

update1

9776f9b

update2

f9a9219

sm=75

ab3e37e

update repo

b7c5ff1

try A10

3082fc1

12G

da2fd86

NVIDIA_TF_OVERRIDE

893c4da

update tolerance

8a8bea4

upate

2019403

upate

0bdeffe

upate

0e2b173

mszhanyi requested a review from a team as a code owner January 19, 2024 14:25

mszhanyi added 8 commits January 19, 2024 22:28

merge with main

1522ea4

Update linux-gpu-ci-pipeline.yml

122464a

update1

eef0acd

Merge branch 'zhanyi/linuxgpuA10' of https://github.com/microsoft/onn…

a85d746

…xruntime into zhanyi/linuxgpuA10

update tolerance

2dc1bd9

update tolerance

3a677d0

update

359ab07

more update for tolerance

361315c

mszhanyi force-pushed the zhanyi/linuxgpuA10 branch from a061490 to 361315c Compare January 22, 2024 01:59

mszhanyi added 5 commits January 22, 2024 11:41

update tolerance

a905d6c

update

320e96a

Skip it

1608bd4

update

aa19598

Merge branch 'main' of https://github.com/microsoft/onnxruntime into …

52b13d8

…zhanyi/linuxgpuA10

wording

178a893

snnn previously approved these changes Jan 22, 2024

mszhanyi added 3 commits January 23, 2024 07:59

lint

8f77791

upate test inference

93373c9

update global thread test

298347c

mszhanyi requested a review from snnn January 23, 2024 15:26

snnn approved these changes Jan 23, 2024

snnn merged commit 54871a2 into main Jan 23, 2024
95 of 98 checks passed

snnn deleted the zhanyi/linuxgpuA10 branch January 23, 2024 18:49
