Replace T4 to A10 in Linux GPU workflow #19205

Merged
merged 32 commits into from
Jan 23, 2024


mszhanyi
Contributor

@mszhanyi mszhanyi commented Jan 19, 2024

Description

  1. Update the Linux GPU machine from T4 to A10 (sm=8.6).
  2. Update the tolerance.

Motivation and Context

  1. Free up more T4 machines and test with a higher compute capability.
  2. ORT enables TF32 in GEMM on A10/A100. TF32 causes precision loss, which makes these tests fail:
2024-01-19T13:27:18.8302842Z [ RUN      ] ModelTests/ModelTest.Run/cuda__models_zoo_opset12_SSD_ssd12
2024-01-19T13:27:25.8438153Z /onnxruntime_src/onnxruntime/test/providers/cpu/model_tests.cc:347: Failure
2024-01-19T13:27:25.8438641Z Expected equality of these values:
2024-01-19T13:27:25.8438841Z   COMPARE_RESULT::SUCCESS
2024-01-19T13:27:25.8439276Z     Which is: 4-byte object <00-00 00-00>
2024-01-19T13:27:25.8439464Z   ret.first
2024-01-19T13:27:25.8445514Z     Which is: 4-byte object <01-00 00-00>
2024-01-19T13:27:25.8445962Z expected 0.145984 (3e157cc1), got 0.975133 (3f79a24b), diff: 0.829149, tol=0.0114598 idx=375. 20 of 388 differ
2024-01-19T13:27:25.8446198Z 
2024-01-19T13:27:25.8555736Z [  FAILED  ] ModelTests/ModelTest.Run/cuda__models_zoo_opset12_SSD_ssd12, where GetParam() = "cuda_../models/zoo/opset12/SSD/ssd-12.onnx" (7025 ms)
2024-01-19T13:27:25.8556077Z [ RUN      ] ModelTests/ModelTest.Run/cuda__models_zoo_opset12_YOLOv312_yolov312
2024-01-19T13:27:29.3174318Z /onnxruntime_src/onnxruntime/test/providers/cpu/model_tests.cc:347: Failure
2024-01-19T13:27:29.3175144Z Expected equality of these values:
2024-01-19T13:27:29.3175389Z   COMPARE_RESULT::SUCCESS
2024-01-19T13:27:29.3175812Z     Which is: 4-byte object <00-00 00-00>
2024-01-19T13:27:29.3176080Z   ret.first
2024-01-19T13:27:29.3176322Z     Which is: 4-byte object <01-00 00-00>
2024-01-19T13:27:29.3178431Z expected 4.34958 (408b2fb8), got 4.51324 (40906c80), diff: 0.16367, tol=0.0534958 idx=9929. 22 of 42588 differ

  3. Some other tests, such as SSD, throw a different exception, so skip them.
    '''
    2024-01-22T09:07:40.8446910Z [ RUN ] ModelTests/ModelTest.Run/cuda__models_zoo_opset12_SSD_ssd12
    2024-01-22T09:07:51.5587571Z /onnxruntime_src/onnxruntime/test/providers/cpu/model_tests.cc:358: Failure
    2024-01-22T09:07:51.5588512Z Expected equality of these values:
    2024-01-22T09:07:51.5588870Z COMPARE_RESULT::SUCCESS
    2024-01-22T09:07:51.5589467Z Which is: 4-byte object <00-00 00-00>
    2024-01-22T09:07:51.5589953Z ret.first
    2024-01-22T09:07:51.5590462Z Which is: 4-byte object <01-00 00-00>
    2024-01-22T09:07:51.5590841Z expected 1, got 63
    '''
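
For reference, the `tol` values printed in the logs above are consistent with a per-element tolerance of the form `atol + rtol * |expected|` with both terms set to 0.01. That form is inferred from the logged numbers only; it is an assumption, not a copy of the check in `model_tests.cc`. A minimal Python sketch under that assumption (the names `tolerance` and `within_tolerance` are hypothetical):

```python
def tolerance(expected, atol=1e-2, rtol=1e-2):
    # Assumed form: absolute term plus a term relative to the expected value.
    # atol and rtol are inferred from the logged "tol=" values, not from ORT source.
    return atol + rtol * abs(expected)


def within_tolerance(expected, actual, atol=1e-2, rtol=1e-2):
    # An element passes when its absolute difference does not exceed the tolerance.
    return abs(expected - actual) <= tolerance(expected, atol, rtol)


if __name__ == "__main__":
    # (expected, got) pairs copied from the two failing log lines above.
    logged = [(0.145984, 0.975133), (4.34958, 4.51324)]
    for expected, actual in logged:
        diff = abs(expected - actual)
        tol = tolerance(expected)
        print(f"expected={expected} got={actual} diff={diff:.6g} "
              f"tol={tol:.6g} pass={within_tolerance(expected, actual)}")
```

Under this reading, both logged elements exceed the tolerance by a wide margin, which matches the reported failures.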

@mszhanyi mszhanyi requested a review from a team as a code owner January 19, 2024 14:25
snnn
snnn previously approved these changes Jan 22, 2024
@snnn snnn dismissed their stale review January 22, 2024 20:02

Please fix the lint error

@mszhanyi mszhanyi requested a review from snnn January 23, 2024 15:26
@snnn snnn merged commit 54871a2 into main Jan 23, 2024
95 of 98 checks passed
@snnn snnn deleted the zhanyi/linuxgpuA10 branch January 23, 2024 18:49
@tianleiwu
Contributor

Could we disable TF32 (which sometimes leads to parity check failures) during testing on A10, for example by setting the environment variable NVIDIA_TF32_OVERRIDE=0?
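
A rough sketch of what that could look like from a test driver, assuming the tests are launched as a child process. The helper name `run_with_tf32_disabled` is hypothetical, and the child command here is only a stand-in that echoes the variable; a real pipeline would pass its own test command instead.

```python
import os
import subprocess
import sys


def run_with_tf32_disabled(cmd):
    """Run cmd with NVIDIA_TF32_OVERRIDE=0 set only for the child process."""
    env = dict(os.environ)  # copy, so the parent environment is left unchanged
    env["NVIDIA_TF32_OVERRIDE"] = "0"
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


if __name__ == "__main__":
    # Stand-in child process that prints the variable it received.
    # In CI this would be replaced by the actual test command.
    child = [sys.executable, "-c",
             "import os; print(os.environ.get('NVIDIA_TF32_OVERRIDE'))"]
    print(run_with_tf32_disabled(child).stdout.strip())
```

Scoping the variable to the child process keeps the override limited to the test run, so other jobs on the same machine still exercise the default setting.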

@snnn
Member

snnn commented Jan 23, 2024

Yes, that's another option. But users of onnxruntime usually set this environment variable. I think in general we should make sure ONNX Runtime still works well in the default setting.
