Reproduce post numbers #23

Open
ericheatspersimmons opened this issue May 7, 2024 · 2 comments

@ericheatspersimmons

Can you share the hyperparameter settings you used for the various problem sizes? With the defaults in the repo, I get about 350us for the M=1, N=K=8192 case, almost 10x slower than reported.

(This is after adding CUDA graphs to the example for benchmarking purposes, and after moving the creation of the C matrix outside the gemm_split_k function call.)
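For context, a minimal sketch of the kind of CUDA-graph benchmark described above, using PyTorch's `torch.cuda.CUDAGraph` API. The `gemm_split_k(a, b, c)` call and its out-argument signature are assumptions about the repo's wrapper; adapt to the actual function. Requires a CUDA GPU.

```python
import torch

def bench_with_cuda_graph(fn, warmup=10, iters=100):
    """Capture fn() into a CUDA graph and time replays, returning us per call."""
    # Warm up on a side stream (the documented capture pattern).
    stream = torch.cuda.Stream()
    stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(stream):
        for _ in range(warmup):
            fn()
    torch.cuda.current_stream().wait_stream(stream)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        fn()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        graph.replay()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) * 1e3 / iters  # elapsed_time is ms; convert to us

# Example usage (gemm_split_k and its signature are assumed):
# a = torch.randn(1, 8192, device="cuda", dtype=torch.float16)
# b = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)
# c = torch.empty(1, 8192, device="cuda", dtype=torch.float16)  # allocated once, outside the call
# print(bench_with_cuda_graph(lambda: gemm_split_k(a, b, c)), "us")
```

Graph capture requires that all tensors (including the output C) be allocated before capture, which is why the C allocation has to move outside the benchmarked call.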

@ericheatspersimmons (Author)

A preliminary search suggests that

block_m: 16, block_n: 512, block_k: 32, num_stages: 2, num_warps: 4, split_k: 16, group_m: 4

is a reasonable guess, but it's still significantly slower than reported: I get ~60us and 1100 GB/s of bandwidth, while the reported numbers are close to 40us and 1600 GB/s.
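As a sanity check on that config, here is the tile arithmetic it implies for the M=1, N=K=8192 case, assuming the usual Triton split-K launch convention of one program per (M-tile, N-tile, K-slice); the grid shape is an assumption about the kernel, not taken from the repo.

```python
import math

def splitk_grid(M, N, K, block_m, block_n, block_k, split_k):
    """Return (total programs launched, K-loop iterations per program)."""
    tiles_mn = math.ceil(M / block_m) * math.ceil(N / block_n)
    programs = tiles_mn * split_k            # one program per (tile, K-slice)
    k_per_slice = math.ceil(K / split_k)     # K range each slice reduces over
    iters = math.ceil(k_per_slice / block_k) # block_k steps within that range
    return programs, iters

programs, iters = splitk_grid(M=1, N=8192, K=8192,
                              block_m=16, block_n=512, block_k=32, split_k=16)
print(programs, iters)  # 256 programs, each doing 16 block_k steps
```

With M=1, only one M-tile exists, so split_k=16 is what provides enough parallel programs (256) to occupy the GPU; the split-K partial results then need a cross-slice reduction.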

@ericheatspersimmons (Author)

Upgrading to near HEAD of Triton (the nightly from 4/24) seems to have made things slower. Can you reproduce the numbers in the blog at the current HEAD of Triton?
