[Build] fail to build rel-1.19.0 vs CUDA 12.6 on Windows #21676

Closed
mc-nv opened this issue Aug 8, 2024 · 15 comments
Labels
build (build issues; typically submitted using template), ep:CUDA (issues related to the CUDA execution provider), platform:windows (issues related to the Windows platform)

Comments

@mc-nv
Contributor

mc-nv commented Aug 8, 2024

Describe the issue

Unable to build ONNX Runtime from the release candidate branch (rel-1.19.0) on Windows against CUDA 12.6.

Urgency

This issue is critical if the release plans to support CUDA 12.6.

Target platform

Windows

Build script

build.bat --cmake_generator "Visual Studio 17 2022" --config Release --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=60;61;70;75;80;86;90" --skip_submodule_sync --parallel --build_shared_lib --compile_no_warning_as_error --skip_tests --update --build --build_dir /workspace/build --use_cuda --cuda_version "12.6" --cuda_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6" --cudnn_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6" --use_tensorrt --tensorrt_home "/tensorrt"

Error / output

This could be dependency-related.


       "C:\tmp\tritonbuild\onnxruntime\build\install.vcxproj" (default target) (1) ->
       "C:\tmp\tritonbuild\onnxruntime\build\ALL_BUILD.vcxproj" (default target) (3) ->
       "C:\tmp\tritonbuild\onnxruntime\build\triton-onnxruntime-backend.vcxproj" (default target) (16) ->
       "C:\tmp\tritonbuild\onnxruntime\build\ort_target.vcxproj" (default target) (18) ->
       (CustomBuild target) -> 
         C:\workspace\build\Release\_deps\cutlass-src\include\cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp(136): error C2061: syntax error: identifier 'SharedStorage' [C:\workspace\build\Release\onnxruntime_providers_cuda.vcxproj] [C:\tmp\tritonbuild\onnxruntime\build\ort_target.vcxproj]
         C:\workspace\build\Release\_deps\cutlass-src\include\cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp(140): error C3646: 'math_wg_order': unknown override specifier [C:\workspace\build\Release\onnxruntime_providers_cuda.vcxproj] [C:\tmp\tritonbuild\onnxruntime\build\ort_target.vcxproj]
         C:\BuildTools\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 12.6.targets(799,9): error MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin\nvcc.exe"  --use-local-env -ccbin "C:\BuildTools\VC\Tools\MSVC\14.39.33519\bin\HostX64\x64" -x cu   -I"C:\workspace\build\Release\_deps\utf8_range-src" -IC:\workspace\onnxruntime\include\onnxruntime -IC:\workspace\onnxruntime\include\onnxruntime\core\session -I"C:\workspace\build\Release\_deps\pytorch_cpuinfo-src\include" -IC:\workspace\build\Release -IC:\workspace\onnxruntime\onnxruntime -I"C:\workspace\build\Release\_deps\abseil_cpp-src" -I"C:\workspace\build\Release\_deps\safeint-src" -I"C:\workspace\build\Release\_deps\gsl-src\include" -I"C:\workspace\build\Release\_deps\date-src\include" -I"C:\workspace\build\Release\_deps\onnx-src" -I"C:\workspace\build\Release\_deps\onnx-build" -I"C:\workspace\build\Release\_deps\protobuf-src\src" -I"C:\workspace\build\Release\_deps\flatbuffers-src\include" -I"C:\workspace\build\Release\_deps\cutlass-src\include" -I"C:\workspace\build\Release\_deps\cutlass-src\examples" -I"C:\workspace\build\Release\_deps\cutlass-src\tools\util\include" -I"C:\workspace\build\Release\_deps\eigen-src" -I\TensorRT\include -I"C:\workspace\build\Release\_deps\mp11-src\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\include"     --keep-dir onnxrunt.4B28B068\x64\Release  -maxrregcount=0   --machine 64 --compile -cudart shared -allow-unsupported-compiler --expt-relaxed-constexpr --Werror default-stream-launch -Xcudafe --diag_suppress=bad_friend_decl -Xcudafe --diag_suppress=unsigned_compare_with_zero -Xcudafe --diag_suppress=expr_has_no_effect -include algorithm -std=c++17 --generate-code=arch=compute_60,code=[compute_60,sm_60] --generate-code=arch=compute_61,code=[compute_61,sm_61] --generate-code=arch=compute_70,code=[compute_70,sm_70] 
--generate-code=arch=compute_75,code=[compute_75,sm_75] --generate-code=arch=compute_80,code=[compute_80,sm_80] --generate-code=arch=compute_86,code=[compute_86,sm_86] --generate-code=arch=compute_90,code=[compute_90,sm_90] -Xcudafe --diag_suppress=conversion_function_not_usable --threads 1 -Xcompiler="/EHsc -Ob2 -Zi /utf-8 /sdl /experimental:external /external:W0 /external:templates- /external:IC:/workspace/onnxruntime/cmake /external:IC:/workspace/build/Release /wd4251 /wd4201 /wd4324 /wd5054 /w15038 /wd4251 /wd4201 /wd4324 /wd5054 /w15038 /wd4834 /wd4127 /Zc:__cplusplus"   -D_WINDOWS -DNDEBUG -DVER_MAJOR=1 -DVER_MINOR=19 -DVER_BUILD=0 -DVER_PRIVATE=0 -D"VER_STRING=\"ORT_VERSION\"" -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_USE_THREADS -DDISABLE_CUSPARSE_DEPRECATED -DPLATFORM_WINDOWS -DNOGDI -DNOMINMAX -D_USE_MATH_DEFINES -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS -DUSE_CUDA=1 -DUSE_FLASH_ATTENTION=1 -DUSE_MEMORY_EFFICIENT_ATTENTION=1 -DUSE_TENSORRT=1 -DONLY_C_LOCALE=0 -DONNX_NAMESPACE=onnx -DONNX_ML=1 -DONNX_USE_LITE_PROTO=1 -D__ONNX_NO_DOC_STRINGS -DWIN32_LEAN_AND_MEAN -DORT_ENABLE_STREAM -DEIGEN_MPL2_ONLY -DEIGEN_HAS_CONSTEXPR -DEIGEN_HAS_VARIADIC_TEMPLATES -DEIGEN_HAS_CXX11_MATH -DEIGEN_HAS_CXX11_ATOMIC -DEIGEN_STRONG_INLINE=inline -D"CMAKE_INTDIR=\"Release\"" -Donnxruntime_providers_cuda_EXPORTS -D_WINDLL -D_MBCS -DEIGEN_HAS_C99_MATH -DCPUINFO_SUPPORTED -DNDEBUG -DVER_MAJOR=1 -DVER_MINOR=19 -DVER_BUILD=0 -DVER_PRIVATE=0 -D"VER_STRING=\"ORT_VERSION\"" -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_USE_THREADS -DDISABLE_CUSPARSE_DEPRECATED -DPLATFORM_WINDOWS -DNOGDI -DNOMINMAX -D_USE_MATH_DEFINES -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS -DUSE_CUDA=1 -DUSE_FLASH_ATTENTION=1 -DUSE_MEMORY_EFFICIENT_ATTENTION=1 -DUSE_TENSORRT=1 -DONLY_C_LOCALE=0 -DONNX_NAMESPACE=onnx -DONNX_ML=1 -DONNX_USE_LITE_PROTO=1 -D__ONNX_NO_DOC_STRINGS -DWIN32_LEAN_AND_MEAN -DORT_ENABLE_STREAM -DEIGEN_MPL2_ONLY -DEIGEN_HAS_CONSTEXPR -DEIGEN_HAS_VARIADIC_TEMPLATES -DEIGEN_HAS_CXX11_MATH 
-DEIGEN_HAS_CXX11_ATOMIC -DEIGEN_STRONG_INLINE=inline -D"CMAKE_INTDIR=\"Release\"" -Donnxruntime_providers_cuda_EXPORTS -Xcompiler "/EHsc /W4 /nologo /O2 /FS   /MD /GR" -Xcompiler "/Fdonnxruntime_providers_cuda.dir\Release\vc143.pdb" -o onnxruntime_providers_cuda.dir\Release\moe_gemm_kernels_fp16_fp16.obj "C:\workspace\onnxruntime\onnxruntime\contrib_ops\cuda\moe\ft_moe\moe_gemm_kernels_fp16_fp16.cu"" exited with code 2. [C:\workspace\build\Release\onnxruntime_providers_cuda.vcxproj] [C:\tmp\tritonbuild\onnxruntime\build\ort_target.vcxproj]
         C:\workspace\build\Release\_deps\cutlass-src\include\cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp(136): error C2061: syntax error: identifier 'SharedStorage' [C:\workspace\build\Release\onnxruntime_providers_cuda.vcxproj] [C:\tmp\tritonbuild\onnxruntime\build\ort_target.vcxproj]
         C:\workspace\build\Release\_deps\cutlass-src\include\cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp(140): error C3646: 'math_wg_order': unknown override specifier [C:\workspace\build\Release\onnxruntime_providers_cuda.vcxproj] [C:\tmp\tritonbuild\onnxruntime\build\ort_target.vcxproj]
         C:\BuildTools\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 12.6.targets(799,9): error MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin\nvcc.exe"  --use-local-env -ccbin "C:\BuildTools\VC\Tools\MSVC\14.39.33519\bin\HostX64\x64" -x cu   -I"C:\workspace\build\Release\_deps\utf8_range-src" -IC:\workspace\onnxruntime\include\onnxruntime -IC:\workspace\onnxruntime\include\onnxruntime\core\session -I"C:\workspace\build\Release\_deps\pytorch_cpuinfo-src\include" -IC:\workspace\build\Release -IC:\workspace\onnxruntime\onnxruntime -I"C:\workspace\build\Release\_deps\abseil_cpp-src" -I"C:\workspace\build\Release\_deps\safeint-src" -I"C:\workspace\build\Release\_deps\gsl-src\include" -I"C:\workspace\build\Release\_deps\date-src\include" -I"C:\workspace\build\Release\_deps\onnx-src" -I"C:\workspace\build\Release\_deps\onnx-build" -I"C:\workspace\build\Release\_deps\protobuf-src\src" -I"C:\workspace\build\Release\_deps\flatbuffers-src\include" -I"C:\workspace\build\Release\_deps\cutlass-src\include" -I"C:\workspace\build\Release\_deps\cutlass-src\examples" -I"C:\workspace\build\Release\_deps\cutlass-src\tools\util\include" -I"C:\workspace\build\Release\_deps\eigen-src" -I\TensorRT\include -I"C:\workspace\build\Release\_deps\mp11-src\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\include"     --keep-dir onnxrunt.4B28B068\x64\Release  -maxrregcount=0   --machine 64 --compile -cudart shared -allow-unsupported-compiler --expt-relaxed-constexpr --Werror default-stream-launch -Xcudafe --diag_suppress=bad_friend_decl -Xcudafe --diag_suppress=unsigned_compare_with_zero -Xcudafe --diag_suppress=expr_has_no_effect -include algorithm -std=c++17 --generate-code=arch=compute_60,code=[compute_60,sm_60] --generate-code=arch=compute_61,code=[compute_61,sm_61] --generate-code=arch=compute_70,code=[compute_70,sm_70] 
--generate-code=arch=compute_75,code=[compute_75,sm_75] --generate-code=arch=compute_80,code=[compute_80,sm_80] --generate-code=arch=compute_86,code=[compute_86,sm_86] --generate-code=arch=compute_90,code=[compute_90,sm_90] -Xcudafe --diag_suppress=conversion_function_not_usable --threads 1 -Xcompiler="/EHsc -Ob2 -Zi /utf-8 /sdl /experimental:external /external:W0 /external:templates- /external:IC:/workspace/onnxruntime/cmake /external:IC:/workspace/build/Release /wd4251 /wd4201 /wd4324 /wd5054 /w15038 /wd4251 /wd4201 /wd4324 /wd5054 /w15038 /wd4834 /wd4127 /Zc:__cplusplus"   -D_WINDOWS -DNDEBUG -DVER_MAJOR=1 -DVER_MINOR=19 -DVER_BUILD=0 -DVER_PRIVATE=0 -D"VER_STRING=\"ORT_VERSION\"" -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_USE_THREADS -DDISABLE_CUSPARSE_DEPRECATED -DPLATFORM_WINDOWS -DNOGDI -DNOMINMAX -D_USE_MATH_DEFINES -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS -DUSE_CUDA=1 -DUSE_FLASH_ATTENTION=1 -DUSE_MEMORY_EFFICIENT_ATTENTION=1 -DUSE_TENSORRT=1 -DONLY_C_LOCALE=0 -DONNX_NAMESPACE=onnx -DONNX_ML=1 -DONNX_USE_LITE_PROTO=1 -D__ONNX_NO_DOC_STRINGS -DWIN32_LEAN_AND_MEAN -DORT_ENABLE_STREAM -DEIGEN_MPL2_ONLY -DEIGEN_HAS_CONSTEXPR -DEIGEN_HAS_VARIADIC_TEMPLATES -DEIGEN_HAS_CXX11_MATH -DEIGEN_HAS_CXX11_ATOMIC -DEIGEN_STRONG_INLINE=inline -D"CMAKE_INTDIR=\"Release\"" -Donnxruntime_providers_cuda_EXPORTS -D_WINDLL -D_MBCS -DEIGEN_HAS_C99_MATH -DCPUINFO_SUPPORTED -DNDEBUG -DVER_MAJOR=1 -DVER_MINOR=19 -DVER_BUILD=0 -DVER_PRIVATE=0 -D"VER_STRING=\"ORT_VERSION\"" -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_USE_THREADS -DDISABLE_CUSPARSE_DEPRECATED -DPLATFORM_WINDOWS -DNOGDI -DNOMINMAX -D_USE_MATH_DEFINES -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS -DUSE_CUDA=1 -DUSE_FLASH_ATTENTION=1 -DUSE_MEMORY_EFFICIENT_ATTENTION=1 -DUSE_TENSORRT=1 -DONLY_C_LOCALE=0 -DONNX_NAMESPACE=onnx -DONNX_ML=1 -DONNX_USE_LITE_PROTO=1 -D__ONNX_NO_DOC_STRINGS -DWIN32_LEAN_AND_MEAN -DORT_ENABLE_STREAM -DEIGEN_MPL2_ONLY -DEIGEN_HAS_CONSTEXPR -DEIGEN_HAS_VARIADIC_TEMPLATES -DEIGEN_HAS_CXX11_MATH 
-DEIGEN_HAS_CXX11_ATOMIC -DEIGEN_STRONG_INLINE=inline -D"CMAKE_INTDIR=\"Release\"" -Donnxruntime_providers_cuda_EXPORTS -Xcompiler "/EHsc /W4 /nologo /O2 /FS   /MD /GR" -Xcompiler "/Fdonnxruntime_providers_cuda.dir\Release\vc143.pdb" -o onnxruntime_providers_cuda.dir\Release\moe_gemm_kernels_fp16_uint4.obj "C:\workspace\onnxruntime\onnxruntime\contrib_ops\cuda\moe\ft_moe\moe_gemm_kernels_fp16_uint4.cu"" exited with code 2. [C:\workspace\build\Release\onnxruntime_providers_cuda.vcxproj] [C:\tmp\tritonbuild\onnxruntime\build\ort_target.vcxproj]
         C:\workspace\build\Release\_deps\cutlass-src\include\cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp(136): error C2061: syntax error: identifier 'SharedStorage' [C:\workspace\build\Release\onnxruntime_providers_cuda.vcxproj] [C:\tmp\tritonbuild\onnxruntime\build\ort_target.vcxproj]
         C:\workspace\build\Release\_deps\cutlass-src\include\cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp(140): error C3646: 'math_wg_order': unknown override specifier [C:\workspace\build\Release\onnxruntime_providers_cuda.vcxproj] [C:\tmp\tritonbuild\onnxruntime\build\ort_target.vcxproj]
         C:\BuildTools\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 12.6.targets(799,9): error MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin\nvcc.exe"  --use-local-env -ccbin "C:\BuildTools\VC\Tools\MSVC\14.39.33519\bin\HostX64\x64" -x cu   -I"C:\workspace\build\Release\_deps\utf8_range-src" -IC:\workspace\onnxruntime\include\onnxruntime -IC:\workspace\onnxruntime\include\onnxruntime\core\session -I"C:\workspace\build\Release\_deps\pytorch_cpuinfo-src\include" -IC:\workspace\build\Release -IC:\workspace\onnxruntime\onnxruntime -I"C:\workspace\build\Release\_deps\abseil_cpp-src" -I"C:\workspace\build\Release\_deps\safeint-src" -I"C:\workspace\build\Release\_deps\gsl-src\include" -I"C:\workspace\build\Release\_deps\date-src\include" -I"C:\workspace\build\Release\_deps\onnx-src" -I"C:\workspace\build\Release\_deps\onnx-build" -I"C:\workspace\build\Release\_deps\protobuf-src\src" -I"C:\workspace\build\Release\_deps\flatbuffers-src\include" -I"C:\workspace\build\Release\_deps\cutlass-src\include" -I"C:\workspace\build\Release\_deps\cutlass-src\examples" -I"C:\workspace\build\Release\_deps\cutlass-src\tools\util\include" -I"C:\workspace\build\Release\_deps\eigen-src" -I\TensorRT\include -I"C:\workspace\build\Release\_deps\mp11-src\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\include"     --keep-dir onnxrunt.4B28B068\x64\Release  -maxrregcount=0   --machine 64 --compile -cudart shared -allow-unsupported-compiler --expt-relaxed-constexpr --Werror default-stream-launch -Xcudafe --diag_suppress=bad_friend_decl -Xcudafe --diag_suppress=unsigned_compare_with_zero -Xcudafe --diag_suppress=expr_has_no_effect -include algorithm -std=c++17 --generate-code=arch=compute_60,code=[compute_60,sm_60] --generate-code=arch=compute_61,code=[compute_61,sm_61] --generate-code=arch=compute_70,code=[compute_70,sm_70] 
--generate-code=arch=compute_75,code=[compute_75,sm_75] --generate-code=arch=compute_80,code=[compute_80,sm_80] --generate-code=arch=compute_86,code=[compute_86,sm_86] --generate-code=arch=compute_90,code=[compute_90,sm_90] -Xcudafe --diag_suppress=conversion_function_not_usable --threads 1 -Xcompiler="/EHsc -Ob2 -Zi /utf-8 /sdl /experimental:external /external:W0 /external:templates- /external:IC:/workspace/onnxruntime/cmake /external:IC:/workspace/build/Release /wd4251 /wd4201 /wd4324 /wd5054 /w15038 /wd4251 /wd4201 /wd4324 /wd5054 /w15038 /wd4834 /wd4127 /Zc:__cplusplus"   -D_WINDOWS -DNDEBUG -DVER_MAJOR=1 -DVER_MINOR=19 -DVER_BUILD=0 -DVER_PRIVATE=0 -D"VER_STRING=\"ORT_VERSION\"" -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_USE_THREADS -DDISABLE_CUSPARSE_DEPRECATED -DPLATFORM_WINDOWS -DNOGDI -DNOMINMAX -D_USE_MATH_DEFINES -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS -DUSE_CUDA=1 -DUSE_FLASH_ATTENTION=1 -DUSE_MEMORY_EFFICIENT_ATTENTION=1 -DUSE_TENSORRT=1 -DONLY_C_LOCALE=0 -DONNX_NAMESPACE=onnx -DONNX_ML=1 -DONNX_USE_LITE_PROTO=1 -D__ONNX_NO_DOC_STRINGS -DWIN32_LEAN_AND_MEAN -DORT_ENABLE_STREAM -DEIGEN_MPL2_ONLY -DEIGEN_HAS_CONSTEXPR -DEIGEN_HAS_VARIADIC_TEMPLATES -DEIGEN_HAS_CXX11_MATH -DEIGEN_HAS_CXX11_ATOMIC -DEIGEN_STRONG_INLINE=inline -D"CMAKE_INTDIR=\"Release\"" -Donnxruntime_providers_cuda_EXPORTS -D_WINDLL -D_MBCS -DEIGEN_HAS_C99_MATH -DCPUINFO_SUPPORTED -DNDEBUG -DVER_MAJOR=1 -DVER_MINOR=19 -DVER_BUILD=0 -DVER_PRIVATE=0 -D"VER_STRING=\"ORT_VERSION\"" -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_USE_THREADS -DDISABLE_CUSPARSE_DEPRECATED -DPLATFORM_WINDOWS -DNOGDI -DNOMINMAX -D_USE_MATH_DEFINES -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS -DUSE_CUDA=1 -DUSE_FLASH_ATTENTION=1 -DUSE_MEMORY_EFFICIENT_ATTENTION=1 -DUSE_TENSORRT=1 -DONLY_C_LOCALE=0 -DONNX_NAMESPACE=onnx -DONNX_ML=1 -DONNX_USE_LITE_PROTO=1 -D__ONNX_NO_DOC_STRINGS -DWIN32_LEAN_AND_MEAN -DORT_ENABLE_STREAM -DEIGEN_MPL2_ONLY -DEIGEN_HAS_CONSTEXPR -DEIGEN_HAS_VARIADIC_TEMPLATES -DEIGEN_HAS_CXX11_MATH 
-DEIGEN_HAS_CXX11_ATOMIC -DEIGEN_STRONG_INLINE=inline -D"CMAKE_INTDIR=\"Release\"" -Donnxruntime_providers_cuda_EXPORTS -Xcompiler "/EHsc /W4 /nologo /O2 /FS   /MD /GR" -Xcompiler "/Fdonnxruntime_providers_cuda.dir\Release\vc143.pdb" -o onnxruntime_providers_cuda.dir\Release\moe_gemm_kernels_fp32_fp32.obj "C:\workspace\onnxruntime\onnxruntime\contrib_ops\cuda\moe\ft_moe\moe_gemm_kernels_fp32_fp32.cu"" exited with code 2. [C:\workspace\build\Release\onnxruntime_providers_cuda.vcxproj] [C:\tmp\tritonbuild\onnxruntime\build\ort_target.vcxproj]
         C:\BuildTools\MSBuild\Microsoft\VC\v170\Microsoft.CppCommon.targets(254,5): error MSB8066: Custom build for 'C:\tmp\tritonbuild\onnxruntime\build\CMakeFiles\1391fbda87be57075fb5bba7a38c2954\onnxruntime.rule;C:\tmp\tritonbuild\onnxruntime\build\CMakeFiles\c0b7ec8ce4dc22ca22ac8622f7a49e15\ort_target.rule;C:\tmp\tritonbuild\onnxruntime\CMakeLists.txt' exited with code 1. [C:\tmp\tritonbuild\onnxruntime\build\ort_target.vcxproj]
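
The log above repeats the same two compiler errors once per failing `.cu` file, buried in very long nvcc command lines. A minimal sketch for collapsing such a log into distinct diagnostics follows. It assumes the `file(line): error CODE: message [project]` layout seen in this output; the regular expression and the `summarize_errors` helper are illustrative, not part of the ONNX Runtime build tooling.

```python
import re
from collections import Counter

# Matches "file(line): error C2061: message" and
# "file(line): catastrophic error : message" style diagnostics.
ERROR_RE = re.compile(
    r"\((\d+)(?:,\d+)?\):\s*(?:catastrophic )?error\s*([A-Z]+\d+)?\s*:\s*(.*)"
)


def summarize_errors(log_text):
    """Count distinct (code, message) pairs in an MSBuild/nvcc log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ERROR_RE.search(line)
        if match is None:
            continue
        _line_no, code, message = match.groups()
        # Drop the trailing "[...vcxproj]" project annotations.
        message = re.split(r"\s+\[", message, maxsplit=1)[0].strip()
        # Wrapper errors echo the whole command line; keep a short prefix only.
        counts[(code or "", message[:80])] += 1
    return counts
```

Calling `summarize_errors(open("build.log").read())` (the file name is a placeholder) returns a `Counter`, so `most_common()` lists the root errors such as C2061 and C3646 ahead of the MSB3721 wrapper noise.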

Visual Studio Version

BUILDTOOLS_VERSION:17.9.34622.214

GCC / Compiler Version

No response

mc-nv added the build label Aug 8, 2024
github-actions bot added the ep:CUDA and platform:windows labels Aug 8, 2024
@mc-nv
Contributor Author

mc-nv commented Aug 8, 2024

cc: @pranavsharma

@yf711
Contributor

yf711 commented Aug 8, 2024

@tianleiwu have you tried building the CUDA EP with CUDA 12.6?
I wonder if CUTLASS needs to be updated to support CUDA 12.6.

@tianleiwu
Contributor

tianleiwu commented Aug 9, 2024

@wangyems, it seems there is a build error in the MoE GEMM code with CUDA 12.6. Please take a look:

tmpxft_000010f0_00000000-7_moe_gemm_kernels_fp16_fp16.cudafe1.cpp
D:\git\onnxruntime\build\cuda12\Release\_deps\cutlass-src\include\cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp(136): error C2061: syntax error: identifier 'SharedStorage' [D:\git\onnxruntime\build\cuda12\Release\onnxruntime_providers_cuda_obj.vcxproj]
D:\git\onnxruntime\build\cuda12\Release\_deps\cutlass-src\include\cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp(136): note: the template instantiation context (the oldest one first) is
D:\git\onnxruntime\build\cuda12\Release\_deps\cutlass-src\include\cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp(60): note: while compiling class template partial specialization 'cutlass::gemm::kernel::GemmUniversal<ProblemShape_,CollectiveMainloop_,CollectiveEpilogue_,TileScheduler_,enable_if<std::is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong,CollectiveMainloop_::DispatchPolicy::Schedule>,void>::type>'
D:\git\onnxruntime\build\cuda12\Release\_deps\cutlass-src\include\cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp(124): note: while compiling class 'cutlass::gemm::kernel::GemmUniversal<ProblemShape_,CollectiveMainloop_,CollectiveEpilogue_,TileScheduler_,enable_if<std::is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong,CollectiveMainloop_::DispatchPolicy::Schedule>,void>::type>::SharedStorage'
D:\git\onnxruntime\build\cuda12\Release\_deps\cutlass-src\include\cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp(133): note: while compiling class 'cutlass::gemm::kernel::GemmUniversal<ProblemShape_,CollectiveMainloop_,CollectiveEpilogue_,TileScheduler_,enable_if<std::is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong,CollectiveMainloop_::DispatchPolicy::Schedule>,void>::type>::SharedStorage::PipelineStorage'
D:\git\onnxruntime\build\cuda12\Release\_deps\cutlass-src\include\cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp(140): error C3646: 'math_wg_order': unknown override specifier [D:\git\onnxruntime\build\cuda12\Release\onnxruntime_providers_cuda_obj.vcxproj]
tmpxft_0000160c_00000000-7_image_scaler_impl.cudafe1.cpp
C:\Program Files\Microsoft Visual Studio\2022\Enterprise\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 12.5.targets(799,9): error MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin\nvcc.exe" --use-local-env -ccbin "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\14.40.33807\bin\HostX64\x64" -x cu -I"D:\git\onnxruntime\build\cuda12\Release\_deps\utf8_range-src" -ID:\git\onnxruntime\include\onnxruntime -ID:\git\onnxruntime\include\onnxruntime\core\session -I"D:\git\onnxruntime\build\cuda12\Release\_deps\pytorch_cpuinfo-src\include" -ID:\git\onnxruntime\build\cuda12\Release -ID:\git\onnxruntime\onnxruntime -I"D:\git\onnxruntime\build\cuda12\Release\_deps\abseil_cpp-src" -I"D:\git\onnxruntime\build\cuda12\Release\_deps\safeint-src" -I"D:\git\onnxruntime\build\cuda12\Release\_deps\gsl-src\include" -I"D:\git\onnxruntime\build\cuda12\Release\_deps\date-src\include" -I"D:\git\onnxruntime\build\cuda12\Release\_deps\onnx-src" -I"D:\git\onnxruntime\build\cuda12\Release\_deps\onnx-build" -I"D:\git\onnxruntime\build\cuda12\Release\_deps\protobuf-src\src" -I"D:\git\onnxruntime\build\cuda12\Release\_deps\flatbuffers-src\include" -I"D:\git\onnxruntime\build\cuda12\Release\_deps\cutlass-src\include" -I"D:\git\onnxruntime\build\cuda12\Release\_deps\cutlass-src\examples" -I"D:\git\onnxruntime\build\cuda12\Release\_deps\cutlass-src\tools\util\include" -I"D:\git\onnxruntime\build\cuda12\Release\_deps\eigen-src" -I"C:\nvidia\TensorRT-10.0.1.6.Windows10.win10.cuda-12.4\TensorRT-10.0.1.6\include" -I"D:\git\onnxruntime\build\cuda12\Release\_deps\cudnn_frontend-src\include" -I"D:\git\onnxruntime\build\cuda12\Release\_deps\mp11-src\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\include" -I"C:\nvidia\cudnn-windows-x86_64-9.1.1.17_cuda12-archive\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\include" --keep-dir onnxrunt.7C32413E\x64\Release -maxrregcount=0 --machine 64 --compile -cudart shared -allow-unsupported-compiler --expt-relaxed-constexpr --Werror default-stream-launch -Xcudafe --diag_suppress=bad_friend_decl -Xcudafe --diag_suppress=unsigned_compare_with_zero -Xcudafe --diag_suppress=expr_has_no_effect -include algorithm -std=c++17 --generate-code=arch=compute_89,code=[compute_89,sm_89] -Xcudafe --diag_suppress=conversion_function_not_usable --threads 1 -Werror all-warnings -Xcompiler="/MP4 /guard:cf /Qspectre /Ob2 /EHsc -Ob2 -Zi /utf-8 /sdl /experimental:external /external:W0 /external:templates- /external:ID:/git/onnxruntime/cmake /external:ID:/git/onnxruntime/build/cuda12/Release /wd4251 /wd4201 /wd4324 /wd5054 /w15038 /wd4251 /wd4201 /wd4324 /wd5054 /w15038 /wd4834 /wd4127 /Zc:__cplusplus" -DWIN32 -D_WINDOWS -D_DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR -DWINAPI_FAMILY=100 -DWINVER=0x0A00 -D_WIN32_WINNT=0x0A00 -DNTDDI_VERSION=0x0A000000 -DONNXRUNTIME_ENABLE_INTEL_METEOR_LAKE_MOBILE_PLATFORM_PERF_PATCH -DNDEBUG -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_USE_THREADS -DDISABLE_CUSPARSE_DEPRECATED -DPLATFORM_WINDOWS -DNOGDI -DNOMINMAX -D_USE_MATH_DEFINES -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS -DUSE_CUDA=1 -DUSE_FLASH_ATTENTION=1 -DUSE_MEMORY_EFFICIENT_ATTENTION=1 -DUSE_TENSORRT=1 -DONLY_C_LOCALE=0 -DONNX_NAMESPACE=onnx -DONNX_ML=1 -DONNX_USE_LITE_PROTO=1 -D__ONNX_NO_DOC_STRINGS -DWIN32_LEAN_AND_MEAN -DORT_ENABLE_STREAM -DEIGEN_MPL2_ONLY -DEIGEN_HAS_CONSTEXPR -DEIGEN_HAS_VARIADIC_TEMPLATES -DEIGEN_HAS_CXX11_MATH -DEIGEN_HAS_CXX11_ATOMIC -DEIGEN_STRONG_INLINE=inline -D"CMAKE_INTDIR=\"Release\"" -D_MBCS -DWIN32 -D_WINDOWS -D_DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR -DWINAPI_FAMILY=100 -DWINVER=0x0A00 -D_WIN32_WINNT=0x0A00 -DNTDDI_VERSION=0x0A000000 -DONNXRUNTIME_ENABLE_INTEL_METEOR_LAKE_MOBILE_PLATFORM_PERF_PATCH -DNDEBUG -DEIGEN_HAS_C99_MATH -DCPUINFO_SUPPORTED -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_USE_THREADS -DDISABLE_CUSPARSE_DEPRECATED -DPLATFORM_WINDOWS -DNOGDI -DNOMINMAX -D_USE_MATH_DEFINES -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS -DUSE_CUDA=1 -DUSE_FLASH_ATTENTION=1 -DUSE_MEMORY_EFFICIENT_ATTENTION=1 -DUSE_TENSORRT=1 -DONLY_C_LOCALE=0 -DONNX_NAMESPACE=onnx -DONNX_ML=1 -DONNX_USE_LITE_PROTO=1 -D__ONNX_NO_DOC_STRINGS -DWIN32_LEAN_AND_MEAN -DORT_ENABLE_STREAM -DEIGEN_MPL2_ONLY -DEIGEN_HAS_CONSTEXPR -DEIGEN_HAS_VARIADIC_TEMPLATES -DEIGEN_HAS_CXX11_MATH -DEIGEN_HAS_CXX11_ATOMIC -DEIGEN_STRONG_INLINE=inline -D"CMAKE_INTDIR=\"Release\"" -Xcompiler "/EHsc /W4 /nologo /O2 /FS /MD /GR" -Xcompiler "/Fdonnxruntime_providers_cuda_obj.dir\Release\onnxruntime_providers_cuda_obj.pdb" -o onnxruntime_providers_cuda_obj.dir\Release\moe_gemm_kernels_fp16_fp16.obj "D:\git\onnxruntime\onnxruntime\contrib_ops\cuda\moe\ft_moe\moe_gemm_kernels_fp16_fp16.cu"" exited with code 2. [D:\git\onnxruntime\build\cuda12\Release\onnxruntime_providers_cuda_obj.vcxproj]

@mc-nv
Contributor Author

mc-nv commented Aug 13, 2024

I tried downgrading CUDA to 12.5.1 and building the rel-1.19.0 branch on Windows.
Every attempt fails with an out-of-memory error in CUTLASS.

       (CustomBuild target) -> 
         C:\workspace\build\Release\_deps\cutlass-src\include\cute/int_tuple.hpp(51): catastrophic error : out of memory [C:\workspace\build\Release\onnxruntime_providers_cuda.vcxproj] [C:\tmp\tritonbuild\onnxruntime\build\ort_target.vcxproj]

@tianleiwu
Contributor

out of memory message against cutlass

@mc-nv, on a machine with 32GB of memory, try limiting parallelism with --parallel 4 --nvcc_threads 1, as in the following command, to avoid OOM:

build.bat --cmake_generator "Visual Studio 17 2022" --config Release ^
      --build_wheel --parallel --build_shared_lib ^
      --use_cuda --cuda_version "12.5" --cuda_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5" ^
      --cudnn_home "C:\nvidia\CuDNN\9.1.1.17_cuda12" ^
      --use_tensorrt --tensorrt_home "C:\nvidia\TensorRT\10.0.1.6.cuda-12.4" ^
      --parallel 4 --nvcc_threads 1 ^
      --skip_tests ^
      --use_binskim_compliant_compile_flags ^
      --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=89
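
The suggestion above works out to roughly 8 GB of RAM per parallel compile job (32 GB / 4 jobs). A minimal sketch of that rule of thumb follows, for picking a `--parallel` value on machines with other memory sizes. The 8 GB figure is only inferred from this comment, not documented guidance, and `suggest_parallel_jobs` is a hypothetical helper.

```python
import os


def suggest_parallel_jobs(total_mem_gb, gb_per_job=8.0, cpu_count=None):
    """Return a conservative --parallel value for a given amount of RAM.

    gb_per_job=8.0 is an assumption derived from "32GB -> --parallel 4".
    """
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    jobs_by_memory = int(total_mem_gb // gb_per_job)
    # Never exceed the core count, and always allow at least one job.
    return max(1, min(cpu_count, jobs_by_memory))
```

For example, `suggest_parallel_jobs(32, cpu_count=16)` gives 4, matching the command above. Heavier translation units (such as the CUTLASS and flash attention kernels) may need a larger `gb_per_job`.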

@mc-nv
Contributor Author

mc-nv commented Aug 16, 2024

I tried the change proposed above in a Docker image configured with:

Tool                Version
BUILDTOOLS_VERSION  17.9.34622.214
CUDA_VERSION        12.5.1
CUDNN_VERSION       9.3.0.75
PYTHON_VERSION      3.10.11
TENSORRT_VERSION    10.3.0.26
VCPGK_VERSION       2024.03.19

The command below:

RUN build.bat --cmake_generator "Visual Studio 17 2022" --config Release --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=60;61;70;75;80;86;90" --skip_submodule_sync --parallel 4 --nvcc_threads 1 --build_shared_lib --compile_no_warning_as_error --skip_tests --update --build --build_dir /workspace/build --use_cuda --cuda_version "12.5" --cuda_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5" --cudnn_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5" --use_tensorrt --tensorrt_home "/tensorrt"

produces the following error, catastrophic error : out of memory:

 18>C:\workspace\build\Release\_deps\cutlass-src\include\cute/layout_composed.hpp(478): catastrophic error : out of memory [C:\workspace\build\Release\onnxruntime_providers_cuda.vcxproj] [C:\tmp\tritonbuild\onnxruntime\build\ort_target.vcxproj]
               return composition(a.layout_a(), a.offset(), zipped_divide(a.layout_b(), b));
                      ^
                     detected during:
                       instantiation of "auto cute::zipped_divide(const cute::ComposedLayout<A, O, B> &, const Tiler &) [with A=cute::Swizzle<3, 3, 3>, O=cute::_0, B=cute::Layout<cute::tuple<cute::_128, cute::tuple<cute::_64, cute::_2>>, cute::tuple<cute::_64, cute::tuple<cute::C<1>, cute::C<8192>>>>, Tiler=cute::tuple<cute::C<64>, cute::C<16>>]" at line 179 of C:\workspace\build\Release\_deps\cutlass-src\include\cute/atom/copy_atom.hpp
                       instantiation of "auto cute::TiledCopy<Copy_Atom, LayoutCopy_TV, ShapeTiler_MN>::tidfrg_S(STensor &&) [with Copy_Atom=cute::Copy_Atom<cute::SM75_U32x4_LDSM_N, cutlass::half_t>, LayoutCopy_TV=cute::Layout<cute::tuple<cute::tuple<cute::_4, cute::_8, cute::_4>, cute::tuple<cute::tuple<cute::_2, cute::_2, cute::_2>, cute::tuple<cute::_1, cute::_1>>>, cute::tuple<cute::tuple<cute::_128, cute::_1, cute::_16>, cute::tuple<cute::tuple<cute::_64, cute::_8, cute::_512>, cute::tuple<cute::_0, cute::_0>>>>, ShapeTiler_MN=cute::tuple<cute::C<64>, cute::C<16>>, STensor=cute::ComposedLayout<cute::Swizzle<3, 3, 3>, cute::_0, cute::Layout<cute::tuple<cute::_128, cute::tuple<cute::_64, cute::_2>>, cute::tuple<cute::_64, cute::tuple<cute::C<1>, cute::C<8192>>>>>]" at line 354 of C:\workspace\build\Release\_deps\cutlass-src\include\cute/atom/copy_atom.hpp
                       instantiation of "auto cute::ThrCopy<TiledCopy, ThrIdx>::partition_S(STensor &&) const [with TiledCopy=cute::TiledCopy<cute::Copy_Atom<cute::SM75_U32x4_LDSM_N, cutlass::half_t>, cute::Layout<cute::tuple<cute::tuple<cute::_4, cute::_8, cute::_4>, cute::tuple<cute::tuple<cute::_2, cute::_2, cute::_2>, cute::tuple<cute::_1, cute::_1>>>, cute::tuple<cute::tuple<cute::_128, cute::_1, cute::_16>, cute::tuple<cute::tuple<cute::_64, cute::_8, cute::_512>, cute::tuple<cute::_0, cute::_0>>>>, cute::tuple<cute::C<64>, cute::C<16>>>, ThrIdx=int, STensor=cute::Tensor<cute::ViewEngine<cute::smem_ptr<cutlass::half_t *>>, cute::ComposedLayout<cute::Swizzle<3, 3, 3>, cute::_0, cute::Layout<cute::tuple<cute::_128, cute::tuple<cute::_64, cute::_2>>, cute::tuple<cute::_64, cute::tuple<cute::C<1>, cute::C<8192>>>>>> &]" at line 168 of C:\workspace\onnxruntime\onnxruntime\contrib_ops/cuda/bert/flash_attention/flash_fwd_kernel.h
                       instantiation of "void onnxruntime::flash::compute_attn_1rowblock<Kernel_traits,Is_causal,Is_local,Has_alibi,Is_even_MN,Is_even_K,Return_softmax,Params>(const Params &, int, int, int) [with Kernel_traits=onnxruntime::flash::Flash_fwd_kernel_traits<128, 128, 64, 4, false, false, cutlass::half_t, onnxruntime::flash::Flash_kernel_traits<128, 128, 64, 4, cutlass::half_t>>, Is_causal=true, Is_local=false, Has_alibi=false, Is_even_MN=false, Is_even_K=true, Return_softmax=false, Params=onnxruntime::flash::Flash_fwd_params]" at line 998 of C:\workspace\onnxruntime\onnxruntime\contrib_ops/cuda/bert/flash_attention/flash_fwd_kernel.h
                       instantiation of "void onnxruntime::flash::compute_attn<Kernel_traits,Is_causal,Is_local,Has_alibi,Is_even_MN,Is_even_K,Return_softmax,Params>(const Params &) [with Kernel_traits=onnxruntime::flash::Flash_fwd_kernel_traits<128, 128, 64, 4, false, false, cutlass::half_t, onnxruntime::flash::Flash_kernel_traits<128, 128, 64, 4, cutlass::half_t>>, Is_causal=true, Is_local=false, Has_alibi=false, Is_even_MN=false, Is_even_K=true, Return_softmax=false, Params=onnxruntime::flash::Flash_fwd_params]" at line 32 of C:\workspace\onnxruntime\onnxruntime\contrib_ops/cuda/bert/flash_attention/flash_fwd_launch_template.h
                       instantiation of "void onnxruntime::flash::flash_fwd_kernel<Kernel_traits,Is_causal,Is_local,Has_alibi,Is_even_MN,Is_even_K,Return_softmax>(onnxruntime::flash::Flash_fwd_params) [with Kernel_traits=onnxruntime::flash::Flash_fwd_kernel_traits<128, 128, 64, 4, false, false, cutlass::half_t, onnxruntime::flash::Flash_kernel_traits<128, 128, 64, 4, cutlass::half_t>>, Is_causal=true, Is_local=false, Has_alibi=false, Is_even_MN=false, Is_even_K=true, Return_softmax=false]" at line 63 of C:\workspace\onnxruntime\onnxruntime\contrib_ops/cuda/bert/flash_attention/flash_fwd_launch_template.h
                       instantiation of "void onnxruntime::flash::run_flash_fwd<Kernel_traits,Is_causal>(onnxruntime::flash::Flash_fwd_params &, cudaStream_t) [with Kernel_traits=onnxruntime::flash::Flash_fwd_kernel_traits<128, 128, 64, 4, false, false, cutlass::half_t, onnxruntime::flash::Flash_kernel_traits<128, 128, 64, 4, cutlass::half_t>>, Is_causal=true]" at line 210 of C:\workspace\onnxruntime\onnxruntime\contrib_ops/cuda/bert/flash_attention/flash_fwd_launch_template.h
                       instantiation of "void onnxruntime::flash::run_mha_fwd_hdim128<T>(onnxruntime::flash::Flash_fwd_params &, cudaStream_t) [with T=cutlass::half_t]" at line 13 of C:\workspace\onnxruntime\onnxruntime\contrib_ops\cuda\bert\flash_attention\flash_fwd_hdim128_fp16_sm80.cu
           
           1 catastrophic error detected in the compilation of "C:/workspace/onnxruntime/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_hdim128_fp16_sm80.cu".
           Compilation terminated.
           flash_fwd_hdim128_fp16_sm80.cu
    18>C:\BuildTools\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 12.5.targets(799,9): error MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\bin\nvcc.exe"  --use-local-env -ccbin "C:\BuildTools\VC\Tools\MSVC\14.39.33519\bin\HostX64\x64" -x cu   -I"C:\workspace\build\Release\_deps\utf8_range-src" -IC:\workspace\onnxruntime\include\onnxruntime -IC:\workspace\onnxruntime\include\onnxruntime\core\session -I"C:\workspace\build\Release\_deps\pytorch_cpuinfo-src\include" -IC:\workspace\build\Release -IC:\workspace\onnxruntime\onnxruntime -I"C:\workspace\build\Release\_deps\abseil_cpp-src" -I"C:\workspace\build\Release\_deps\safeint-src" -I"C:\workspace\build\Release\_deps\gsl-src\include" -I"C:\workspace\build\Release\_deps\date-src\include" -I"C:\workspace\build\Release\_deps\onnx-src" -I"C:\workspace\build\Release\_deps\onnx-build" -I"C:\workspace\build\Release\_deps\protobuf-src\src" -I"C:\workspace\build\Release\_deps\flatbuffers-src\include" -I"C:\workspace\build\Release\_deps\cutlass-src\include" -I"C:\workspace\build\Release\_deps\cutlass-src\examples" -I"C:\workspace\build\Release\_deps\cutlass-src\tools\util\include" -I"C:\workspace\build\Release\_deps\eigen-src" -I\TensorRT\include -I"C:\workspace\build\Release\_deps\mp11-src\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\include"     --keep-dir onnxrunt.4B28B068\x64\Release  -maxrregcount=0   --machine 64 --compile -cudart shared -allow-unsupported-compiler --expt-relaxed-constexpr --Werror default-stream-launch -Xcudafe --diag_suppress=bad_friend_decl -Xcudafe --diag_suppress=unsigned_compare_with_zero -Xcudafe --diag_suppress=expr_has_no_effect -include algorithm -std=c++17 --generate-code=arch=compute_60,code=[compute_60,sm_60] --generate-code=arch=compute_61,code=[compute_61,sm_61] --generate-code=arch=compute_70,code=[compute_70,sm_70] 
--generate-code=arch=compute_75,code=[compute_75,sm_75] --generate-code=arch=compute_80,code=[compute_80,sm_80] --generate-code=arch=compute_86,code=[compute_86,sm_86] --generate-code=arch=compute_90,code=[compute_90,sm_90] -Xcudafe --diag_suppress=conversion_function_not_usable --threads 1 -Xcompiler="/EHsc -Ob2 -Zi /utf-8 /sdl /experimental:external /external:W0 /external:templates- /external:IC:/workspace/onnxruntime/cmake /external:IC:/workspace/build/Release /wd4251 /wd4201 /wd4324 /wd5054 /w15038 /wd4251 /wd4201 /wd4324 /wd5054 /w15038 /wd4834 /wd4127 /Zc:__cplusplus"   -D_WINDOWS -DNDEBUG -DVER_MAJOR=1 -DVER_MINOR=19 -DVER_BUILD=0 -DVER_PRIVATE=0 -D"VER_STRING=\"ORT_VERSION\"" -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_USE_THREADS -DDISABLE_CUSPARSE_DEPRECATED -DPLATFORM_WINDOWS -DNOGDI -DNOMINMAX -D_USE_MATH_DEFINES -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS -DUSE_CUDA=1 -DUSE_FLASH_ATTENTION=1 -DUSE_MEMORY_EFFICIENT_ATTENTION=1 -DUSE_TENSORRT=1 -DONLY_C_LOCALE=0 -DONNX_NAMESPACE=onnx -DONNX_ML=1 -DONNX_USE_LITE_PROTO=1 -D__ONNX_NO_DOC_STRINGS -DWIN32_LEAN_AND_MEAN -DORT_ENABLE_STREAM -DEIGEN_MPL2_ONLY -DEIGEN_HAS_CONSTEXPR -DEIGEN_HAS_VARIADIC_TEMPLATES -DEIGEN_HAS_CXX11_MATH -DEIGEN_HAS_CXX11_ATOMIC -DEIGEN_STRONG_INLINE=inline -D"CMAKE_INTDIR=\"Release\"" -Donnxruntime_providers_cuda_EXPORTS -D_WINDLL -D_MBCS -DEIGEN_HAS_C99_MATH -DCPUINFO_SUPPORTED -DNDEBUG -DVER_MAJOR=1 -DVER_MINOR=19 -DVER_BUILD=0 -DVER_PRIVATE=0 -D"VER_STRING=\"ORT_VERSION\"" -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_USE_THREADS -DDISABLE_CUSPARSE_DEPRECATED -DPLATFORM_WINDOWS -DNOGDI -DNOMINMAX -D_USE_MATH_DEFINES -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS -DUSE_CUDA=1 -DUSE_FLASH_ATTENTION=1 -DUSE_MEMORY_EFFICIENT_ATTENTION=1 -DUSE_TENSORRT=1 -DONLY_C_LOCALE=0 -DONNX_NAMESPACE=onnx -DONNX_ML=1 -DONNX_USE_LITE_PROTO=1 -D__ONNX_NO_DOC_STRINGS -DWIN32_LEAN_AND_MEAN -DORT_ENABLE_STREAM -DEIGEN_MPL2_ONLY -DEIGEN_HAS_CONSTEXPR -DEIGEN_HAS_VARIADIC_TEMPLATES -DEIGEN_HAS_CXX11_MATH 
-DEIGEN_HAS_CXX11_ATOMIC -DEIGEN_STRONG_INLINE=inline -D"CMAKE_INTDIR=\"Release\"" -Donnxruntime_providers_cuda_EXPORTS -Xcompiler "/EHsc /W4 /nologo /O2 /FS   /MD /GR" -Xcompiler "/Fdonnxruntime_providers_cuda.dir\Release\vc143.pdb" -o onnxruntime_providers_cuda.dir\Release\flash_fwd_hdim128_fp16_sm80.obj "C:\workspace\onnxruntime\onnxruntime\contrib_ops\cuda\bert\flash_attention\flash_fwd_hdim128_fp16_sm80.cu"" exited with code 1. [C:\workspace\build\Release\onnxruntime_providers_cuda.vcxproj] [C:\tmp\tritonbuild\onnxruntime\build\ort_target.vcxproj]
           Compiling CUDA source file ..\..\onnxruntime\onnxruntime\contrib_ops\cuda\bert\flash_attention\flash_fwd_hdim192_bf16_sm80.cu...
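
When triaging a log like the one above, it can help to pull out which nvcc toolkit was actually invoked and which source file failed. The MSB3721 line above references `CUDA\v12.5\bin\nvcc.exe`, so it is worth confirming the toolkit version before comparing results. A minimal sketch, assuming the MSBuild error line keeps the shape shown in this log; the function name and the abbreviated sample string are illustrative, not part of any ONNX Runtime tooling:

```python
import re

# Abbreviated copy of the MSB3721 line from the log; "..." marks omitted nvcc flags.
SAMPLE = (
    r'18>C:\BuildTools\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 12.5.targets(799,9): '
    r'error MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\bin\nvcc.exe" ... '
    r'"C:\workspace\onnxruntime\onnxruntime\contrib_ops\cuda\bert\flash_attention\flash_fwd_hdim128_fp16_sm80.cu"" '
    r'exited with code 1.'
)

# Toolkit version taken from the nvcc.exe install path.
NVCC_RE = re.compile(r'CUDA\\v(\d+\.\d+)\\bin\\nvcc\.exe', re.IGNORECASE)
# Quoted .cu path directly before the "exited with code N" suffix.
SOURCE_RE = re.compile(r'"([^"]+\.cu)""\s+exited with code (\d+)')


def summarize_msb3721(line):
    """Return the nvcc version, failing source file, and exit code, or None."""
    nvcc = NVCC_RE.search(line)
    source = SOURCE_RE.search(line)
    if not (nvcc and source):
        return None
    return {
        "nvcc_version": nvcc.group(1),
        "source": source.group(1),
        "exit_code": int(source.group(2)),
    }


info = summarize_msb3721(SAMPLE)
print(info)
```

The regular expressions are tied to the path layout seen in this thread and would need adjusting for a non-default CUDA install location.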

@tianleiwu
Contributor

tianleiwu commented Aug 17, 2024

@mc-nv,

Could you try upgrading Visual Studio to the latest version?

I tried Visual Studio Enterprise 2022 version 17.11.0 with the latest MSVC v143 build tools, and there is no problem on my machine.

tool                 version
BUILDTOOLS_VERSION   Visual Studio Enterprise 17.11.0 and MSVC v143 (latest)
CUDA_VERSION         12.5.1
CUDNN_VERSION        9.3.0.75
PYTHON_VERSION       3.10.13 (from Anaconda)
TENSORRT_VERSION     10.3.0.26

Select all the build tools that are marked as latest in the Visual Studio Installer:

[two screenshots]

My build script:

pip install cmake numpy --upgrade
call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" amd64
build.bat --cmake_generator "Visual Studio 17 2022" --config Release --build_dir build\cuda12 --build_wheel ^
          --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=60;61;70;75;80;86;90"  --parallel 4 --nvcc_threads 1 ^
          --build_shared_lib --skip_tests --compile_no_warning_as_error ^
           --use_cuda --cuda_version "12.5" --cuda_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5" ^
          --cudnn_home "C:\nvidia\cudnn\9.3.0.75_cuda12" --use_tensorrt --tensorrt_home "C:\nvidia\tensorrt\10.3.0.26_cuda12.5"

The build output shows:

-- Selecting Windows SDK version 10.0.22621.0 to target Windows 10.0.22631.
-- The C compiler identification is MSVC 19.41.34120.0
-- The CXX compiler identification is MSVC 19.41.34120.0
-- The ASM compiler identification is MSVC
-- Found assembler: C:/Program Files/Microsoft Visual Studio/2022/Enterprise/VC/Tools/MSVC/14.41.34120/bin/Hostx64/x64/cl.exe
-- Found Python: C:\Users\.conda\envs\py310\python.exe (found suitable version "3.10.13", minimum required is "3.8") 
MSBuild version 17.11.2+c078802d4 for .NET Framework
-- The CUDA compiler identification is NVIDIA 12.5.82
-- CMAKE_CUDA_COMPILER_VERSION: 12.5.82

Peak memory usage is about 31 GB during compilation. I used a machine with 32 GB of RAM plus an additional 16 GB of virtual memory (page file).

@tianleiwu
Contributor

Related issue: NVIDIA/cutlass#1732

@mc-nv
Contributor Author

mc-nv commented Aug 21, 2024

I was not able to compile against rel-1.19.0 with the following configuration:

BUILDTOOLS_VERSION:17.10.35201.131 
CMAKE_VERSION:3.30.1 
CUDA_VERSION:12.5.1 
CUDNN_VERSION:9.3.0.75 
PYTHON_VERSION:3.10.11 
TENSORRT_VERSION:10.3.0.26 
VCPGK_VERSION:2024.03.19

However, I was able to build against rel-1.18.1 successfully, although I didn't use the suggested latest BuildTools version (17.11) and chose 17.10 LTSC instead.

@pvijayakrish

Team,
We are facing the same issue with the latest rel-1.19.2 as well. Please suggest a resolution.

@abysslover

I can confirm that the following settings build 1.19.2 successfully on Windows:

Cuda: 12.5.1
CUDNN: 9.4.0.
Visual Studio 17 2022
Specifying --compile_no_warning_as_error
TensorRT: 10.4.0.26

I additionally installed protobuf and zlib, and added their binaries to the PATH environment variable.

@mc-nv
Contributor Author

mc-nv commented Sep 26, 2024

Same for me: we can build with CUDA 12.5 👍 but not with CUDA 12.6 👎.

The build succeeds with the following configuration:

BUILDTOOLS_VERSION:17.12.35309.182
CMAKE_VERSION:3.30.1 
CUDA_VERSION:12.5.1 
CUDNN_VERSION:9.3.0.75 
PYTHON_VERSION:3.10.11 
TENSORRT_VERSION:10.3.0.26 
VCPGK_VERSION:2024.03.19
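
To see exactly what changed between the configuration that failed on Aug 21 and the one that succeeded here, the two `KEY:value` blocks can be diffed mechanically. A minimal sketch; the helper names are illustrative, and the values are copied from the two comments in this thread:

```python
def parse_config(text: str) -> dict[str, str]:
    """Parse 'KEY:value' lines into a dict, ignoring blank lines."""
    config = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            config[key.strip()] = value.strip()
    return config


def diff_configs(a: dict[str, str], b: dict[str, str]) -> dict[str, tuple]:
    """Return {key: (value_in_a, value_in_b)} for every key whose value differs."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


FAILED_AUG_21 = """
BUILDTOOLS_VERSION:17.10.35201.131
CMAKE_VERSION:3.30.1
CUDA_VERSION:12.5.1
CUDNN_VERSION:9.3.0.75
PYTHON_VERSION:3.10.11
TENSORRT_VERSION:10.3.0.26
VCPGK_VERSION:2024.03.19
"""

SUCCEEDED_SEP_26 = """
BUILDTOOLS_VERSION:17.12.35309.182
CMAKE_VERSION:3.30.1
CUDA_VERSION:12.5.1
CUDNN_VERSION:9.3.0.75
PYTHON_VERSION:3.10.11
TENSORRT_VERSION:10.3.0.26
VCPGK_VERSION:2024.03.19
"""

changed = diff_configs(parse_config(FAILED_AUG_21), parse_config(SUCCEEDED_SEP_26))
print(changed)
# {'BUILDTOOLS_VERSION': ('17.10.35201.131', '17.12.35309.182')}
```

Only the build tools version differs between the two lists. The thread does not establish whether that upgrade alone accounts for the different outcome, so treat the diff as a starting point rather than a diagnosis.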

@mc-nv
Contributor Author

mc-nv commented Sep 30, 2024

Hi @snnn ,
Is it possible to update the cutlass version to 3.5.1 in deps.txt, in response to NVIDIA/cutlass#1732?

@tianleiwu
Contributor

tianleiwu commented Sep 30, 2024

@mc-nv, #21939 includes cutlass 3.5.1. In my test, the build succeeds with CUDA 12.6 Update 1 on Windows using 3.5.1.
There is a performance regression in flash attention on H100 with 3.5.1, which is still under investigation.
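
To check whether a given checkout already pins cutlass 3.5.1 or newer, something like the following could be run against the contents of deps.txt. This is a sketch under assumptions: it presumes each deps.txt entry is a semicolon-separated `name;url;hash` line with the version in the URL's tag, and the sample lines below are hypothetical (the hash field is a placeholder), not copied from the repository.

```python
import re

MIN_CUTLASS = (3, 5, 1)  # version reported in this thread to build with CUDA 12.6


def cutlass_version(deps_text):
    """Return the cutlass version as a tuple of ints, or None if not found."""
    for line in deps_text.splitlines():
        fields = line.strip().split(";")
        # Assumed layout: name;url;hash
        if len(fields) >= 2 and fields[0] == "cutlass":
            match = re.search(r"v(\d+)\.(\d+)\.(\d+)", fields[1])
            if match:
                return tuple(int(part) for part in match.groups())
    return None


def has_cuda_12_6_fix(deps_text):
    """True if the pinned cutlass version is at least MIN_CUTLASS."""
    version = cutlass_version(deps_text)
    return version is not None and version >= MIN_CUTLASS


# Hypothetical sample entries for illustration only.
OLDER = "cutlass;https://github.com/NVIDIA/cutlass/archive/refs/tags/v3.5.0.zip;<sha1-placeholder>"
NEWER = "cutlass;https://github.com/NVIDIA/cutlass/archive/refs/tags/v3.5.1.zip;<sha1-placeholder>"

print(cutlass_version(OLDER), has_cuda_12_6_fix(OLDER))
print(cutlass_version(NEWER), has_cuda_12_6_fix(NEWER))
```

A passing check only says the pinned version matches the one tested above; the H100 flash attention regression mentioned in this comment is a separate concern.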

@snnn
Member

snnn commented Oct 4, 2024

This should have been resolved in #22316. If not, please reopen this issue.

@snnn snnn closed this as completed Oct 4, 2024