Adding CUDA kernel (optimized for sm80) for block-wise 4-bit quantized float16 GEMM. #18619
Azure Pipelines / orttraining-ortmodule-distributed (DistributedInferenceTest Onnxruntime_Linux_GPU_Inference_Distributed_Test)
succeeded
Feb 29, 2024 in 27m 41s