
[Build] MoE related unit tests fail for older architectures such as Pascal when building dev debug #20788

Open
yuslepukhin opened this issue May 23, 2024 · 4 comments
Labels
ep:CUDA issues related to the CUDA execution provider; platform:windows issues related to the Windows platform

@yuslepukhin
Member

Describe the issue

MoE unit tests fail on older GPU architectures such as Pascal. The tests require a minimum compute capability; when that requirement is not met, running them is pointless and they should be skipped.

Urgency

No response

Target platform

Windows

Build script

.\build.bat --build_dir .\cuda_build --config Debug --cmake_generator "Visual Studio 17 2022" --use_cuda --cuda_home c:\cuda\cuda --cudnn_home c:\cuda\cudnn --cuda_version 11.8 --build_wheel --build_shared_lib --parallel --skip_submodule_sync

Error / output

[ FAILED ] MoETest.MoETest_Gelu
[ FAILED ] MoETest.MoETest_Relu
[ FAILED ] MoETest.MoETest_Mixtral
Arch unsupported for MoE GEMM

Visual Studio Version

2022 64-bit 17.9.7

GCC / Compiler Version

No response

@yuslepukhin yuslepukhin added the build build issues; typically submitted using template label May 23, 2024
@github-actions github-actions bot added ep:CUDA issues related to the CUDA execution provider platform:windows issues related to the Windows platform labels May 23, 2024
@tianleiwu
Contributor

tianleiwu commented May 23, 2024

@wangyems, in MoeGemmRunner::dispatch_to_arch I see it dispatches only for SM 70 through 89. We should skip the MoE tests on other GPUs (SM < 70 and SM >= 90).

BTW, could you test MoE on V100 to see whether it runs on SM 70? I saw some features require SM >= 80:

"MmaTensorOpCvtBToA only supports Fp16 A or Bf16 A on Ampere+"

I think the requirements are:
float: SM 70 to 89 (in theory it should support all GPUs; this is a limitation of MoeGemmRunner::dispatch_to_arch)
float16: SM 70 to 89
bfloat16: SM 80 to 89

@snnn
Member

snnn commented May 23, 2024

I tested it on V100 on Linux with CUDA 11.8.

[==========] 4 tests from 1 test suite ran. (10758 ms total)
[  PASSED  ] 3 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] MoETest.QMoETest_Mixtral_Int4

@snnn
Member

snnn commented May 23, 2024

Our Linux training GPU machine pools use V100. Why didn't they catch this error?

@wangyems
Contributor

checking..

@snnn snnn removed the build build issues; typically submitted using template label May 29, 2024
wangyems added a commit that referenced this issue May 30, 2024
### Description

#20788

Will do sm70 validation separately.