Reproduce post numbers #23

Open
ericheatspersimmons opened this issue May 7, 2024 · 2 comments

@ericheatspersimmons

Can you share the hyperparameter settings you used for the various problem sizes? With the defaults in the repo, I get about 350us for the M=1, N=K=8192 case, almost 10x slower than reported.

(This is after adding CUDA graphs to the example for benchmarking purposes, and after moving the creation of the C matrix outside the gemm_split_k function call.)
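For context, a minimal sketch of the kind of CUDA-graph benchmark described above, using PyTorch's `torch.cuda.CUDAGraph` API. The `gemm_split_k(a, b, c)` call and its out-argument signature are assumptions about the repo's wrapper; adapt to the actual function. Requires a CUDA GPU.

```python
import torch

def bench_with_cuda_graph(fn, warmup=10, iters=100):
    """Capture fn() into a CUDA graph and time replays, returning us per call."""
    # Warm up on a side stream (the documented capture pattern).
    stream = torch.cuda.Stream()
    stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(stream):
        for _ in range(warmup):
            fn()
    torch.cuda.current_stream().wait_stream(stream)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        fn()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        graph.replay()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) * 1e3 / iters  # elapsed_time is ms; convert to us

# Example usage (gemm_split_k and its signature are assumed):
# a = torch.randn(1, 8192, device="cuda", dtype=torch.float16)
# b = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)
# c = torch.empty(1, 8192, device="cuda", dtype=torch.float16)  # allocated once, outside the call
# print(bench_with_cuda_graph(lambda: gemm_split_k(a, b, c)), "us")
```

Graph capture requires that all tensors (including the output C) be allocated before capture, which is why the C allocation has to move outside the benchmarked call.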

@ericheatspersimmons (Author)

A preliminary search suggests that

block_m: 16, block_n: 512, block_k: 32, num_stages: 2, num_warps: 4, split_k: 16, group_m: 4

is a reasonable guess, but it's still significantly slower than reported: I get ~60us and 1100 GB/s of bandwidth, while the reported numbers are close to 40us and 1600 GB/s.
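As a sanity check on that config, here is the tile arithmetic it implies for the M=1, N=K=8192 case, assuming the usual Triton split-K launch convention of one program per (M-tile, N-tile, K-slice); the grid shape is an assumption about the kernel, not taken from the repo.

```python
import math

def splitk_grid(M, N, K, block_m, block_n, block_k, split_k):
    """Return (total programs launched, K-loop iterations per program)."""
    tiles_mn = math.ceil(M / block_m) * math.ceil(N / block_n)
    programs = tiles_mn * split_k            # one program per (tile, K-slice)
    k_per_slice = math.ceil(K / split_k)     # K range each slice reduces over
    iters = math.ceil(k_per_slice / block_k) # block_k steps within that range
    return programs, iters

programs, iters = splitk_grid(M=1, N=8192, K=8192,
                              block_m=16, block_n=512, block_k=32, split_k=16)
print(programs, iters)  # 256 programs, each doing 16 block_k steps
```

With M=1, only one M-tile exists, so split_k=16 is what provides enough parallel programs (256) to occupy the GPU; the split-K partial results then need a cross-slice reduction.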

@ericheatspersimmons (Author)

Upgrading to near HEAD of Triton (the nightly from 4/24) seems to have made things slower. Can you reproduce the numbers in the blog at the current HEAD of Triton?
