Can you share the hyperparameter settings you used for the various problem sizes? With the defaults in the repo, I get about 350us for the M=1, N=K=8192 case, almost 10x slower than reported.
(This is after adding CUDA graphs to the example for benchmarking purposes, and moving the creation of the C matrix outside the gemm_split_k function call.)
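For reference, this is roughly how I'm benchmarking. It's a minimal sketch: the import path is hypothetical (point it at wherever the repo defines gemm_split_k), and I'm assuming the wrapper takes a preallocated c after the change described above.

```python
import torch

# Hypothetical import; adjust to the repo layout (or the sketch later in this thread)
from gemm_split_k import gemm_split_k

M, N, K = 1, 8192, 8192
a = torch.randn(M, K, device="cuda", dtype=torch.float16)
b = torch.randn(K, N, device="cuda", dtype=torch.float16)
c = torch.zeros(M, N, device="cuda", dtype=torch.float32)  # created once, outside the call

# Warm up on a side stream so graph capture doesn't record one-time init work
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(10):
        gemm_split_k(a, b, c)
torch.cuda.current_stream().wait_stream(s)

# Capture one call into a CUDA graph, then time replays with CUDA events
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    gemm_split_k(a, b, c)

n_iters = 100
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(n_iters):
    g.replay()
end.record()
torch.cuda.synchronize()
us = start.elapsed_time(end) * 1e3 / n_iters  # elapsed_time is in ms
print(f"{us:.1f} us per call")
```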
block_m: 16, block_n: 512, block_k: 32, num_stages: 2, num_warps: 4, split_k: 16, group_m: 4
Is that a reasonable guess? It's still significantly slower than reported: I get ~60us and ~1100 GB/s of bandwidth, while the reported numbers are close to 40us and 1600 GB/s.
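To be explicit about how I'm applying those settings, here's a minimal split-K sketch. This is not the repo's kernel: the group_m swizzle is omitted, the output is fp32 to keep the atomic-add epilogue simple, and all names are illustrative.

```python
import triton
import triton.language as tl

@triton.jit
def _gemm_split_k_kernel(
    a_ptr, b_ptr, c_ptr, M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
    BLOCK_K: tl.constexpr, SPLIT_K: tl.constexpr,
):
    pid = tl.program_id(0)    # which (M, N) output tile
    pid_k = tl.program_id(1)  # which of the SPLIT_K partial-K slices
    num_pid_n = tl.cdiv(N, BLOCK_N)
    pid_m = pid // num_pid_n
    pid_n = pid % num_pid_n

    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = pid_k * BLOCK_K + tl.arange(0, BLOCK_K)

    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    # Each pid_k strides through K in steps of BLOCK_K * SPLIT_K
    for k in range(0, tl.cdiv(K, BLOCK_K * SPLIT_K)):
        k_cur = offs_k + k * BLOCK_K * SPLIT_K
        a = tl.load(a_ptrs, mask=(offs_m[:, None] < M) & (k_cur[None, :] < K), other=0.0)
        b = tl.load(b_ptrs, mask=(k_cur[:, None] < K) & (offs_n[None, :] < N), other=0.0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * SPLIT_K * stride_ak
        b_ptrs += BLOCK_K * SPLIT_K * stride_bk

    # The SPLIT_K partial sums are combined with atomic adds, so c must be zeroed
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    c_mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.atomic_add(c_ptrs, acc, mask=c_mask)

def gemm_split_k(a, b, c, block_m=16, block_n=512, block_k=32,
                 split_k=16, num_stages=2, num_warps=4):
    M, K = a.shape
    _, N = b.shape
    c.zero_()  # required by the atomic-add epilogue
    grid = (triton.cdiv(M, block_m) * triton.cdiv(N, block_n), split_k)
    _gemm_split_k_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1), b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=block_m, BLOCK_N=block_n, BLOCK_K=block_k, SPLIT_K=split_k,
        num_stages=num_stages, num_warps=num_warps,
    )
```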
Upgrading to near-HEAD of triton (the nightly from 4/24) seems to have made things slower. Can you reproduce the numbers from the blog at the current HEAD of triton?
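To make the comparison apples-to-apples, it would probably help to pin down both environments. I'm recording mine like this (nothing repo-specific here):

```python
import torch
import triton

print(torch.__version__)              # PyTorch build
print(triton.__version__)             # Triton build (the 4/24 nightly in my case)
print(torch.cuda.get_device_name(0))  # GPU model matters a lot for these numbers
```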