
[Build] MoE related unit tests fail for older architectures such as Pascal when building dev debug #20788

Open
yuslepukhin opened this issue May 23, 2024 · 4 comments
Labels
ep:CUDA issues related to the CUDA execution provider; platform:windows issues related to the Windows platform

@yuslepukhin
Member

Describe the issue

MoE unit tests fail on older GPU architectures such as Pascal. The tests require a minimum compute capability; when that requirement is not met, running them is pointless and they should be skipped.

Urgency

No response

Target platform

Windows

Build script

.\build.bat --build_dir .\cuda_build --config Debug --cmake_generator "Visual Studio 17 2022" --use_cuda --cuda_home c:\cuda\cuda --cudnn_home c:\cuda\cudnn --cuda_version 11.8 --build_wheel --build_shared_lib --parallel --skip_submodule_sync

Error / output

[ FAILED ] MoETest.MoETest_Gelu
[ FAILED ] MoETest.MoETest_Relu
[ FAILED ] MoETest.MoETest_Mixtral
Arch unsupported for MoE GEMM

Visual Studio Version

2022 64-bit 17.9.7

GCC / Compiler Version

No response

@yuslepukhin yuslepukhin added the build build issues; typically submitted using template label May 23, 2024
@github-actions github-actions bot added ep:CUDA issues related to the CUDA execution provider platform:windows issues related to the Windows platform labels May 23, 2024
@tianleiwu
Contributor

tianleiwu commented May 23, 2024

@wangyems, in MoeGemmRunner::dispatch_to_arch I see it dispatches only for SM 70 through 89. We should skip the MoE tests on other GPUs (SM < 70 and SM >= 90).

BTW, could you test MoE on V100 to see whether it runs on SM 70? I saw some features require SM >= 80:

"MmaTensorOpCvtBToA only supports Fp16 A or Bf16 A on Ampere+"

I think the requirements are:
float: SM 70 to 89 (in theory it should support all GPUs; this is a limitation of MoeGemmRunner::dispatch_to_arch)
float16: SM 70 to 89
bfloat16: SM 80 to 89

@snnn
Member

snnn commented May 23, 2024

I tested it on V100 on Linux with CUDA 11.8.

[==========] 4 tests from 1 test suite ran. (10758 ms total)
[  PASSED  ] 3 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] MoETest.QMoETest_Mixtral_Int4

@snnn
Member

snnn commented May 23, 2024

Our Linux training GPU machine pools use V100. Why didn't they catch this error?

@wangyems
Contributor

checking..

@snnn snnn removed the build build issues; typically submitted using template label May 29, 2024
wangyems added a commit that referenced this issue May 30, 2024
### Description

#20788

Will do sm70 validation separately.