Increase warmup and rep for FA benchmark #2256

Open
wants to merge 17 commits into
base: main

@anmyachev anmyachev marked this pull request as ready for review September 30, 2024 13:29
@anmyachev
Contributor Author

anmyachev commented Sep 30, 2024

@ESI-SYD @chengjunlu the geomean diff will most likely be smaller; I will post the exact figures here once https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11106696646/job/30855590713 has finished:

  • Triton ADV: -5%
  • Triton DFT: -2%
  • Xetla: -8%

Are you aware of this effect, where the average time gets noticeably worse as the number of runs increases? I don't know what to do about this slowdown, but I still think running multiple times is a good idea for calculating the average (only with >3 runs will the "*-CV" column not be NaN).

cc @whitneywhtsang @etiotto
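
To illustrate why a "*-CV" column can come out as NaN, here is a minimal sketch. It assumes CV means the coefficient of variation (sample standard deviation divided by the mean); the helper name `coefficient_of_variation` and the timings are hypothetical and not taken from the benchmark code. With too few timed runs there is no spread to estimate, so the result is NaN. The run count at which the benchmark's own column becomes defined may differ from this sketch.

```python
import math


def coefficient_of_variation(times_ms):
    """Return sample std / mean, or NaN when there are fewer than two samples."""
    n = len(times_ms)
    if n < 2:
        # A single measurement has no sample variance (division by n - 1 = 0).
        return float("nan")
    mean = sum(times_ms) / n
    variance = sum((t - mean) ** 2 for t in times_ms) / (n - 1)
    return math.sqrt(variance) / mean


# Hypothetical timings in milliseconds.
print(coefficient_of_variation([101.3]))
print(coefficient_of_variation([101.3, 99.8, 100.4, 102.1]))
```

The first call has only one sample and yields NaN; the second has enough samples to produce a finite ratio.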

@etiotto
Contributor

etiotto commented Oct 1, 2024

> @ESI-SYD @chengjunlu the geomean diff will most likely be smaller; I will post the exact figures here once https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11106696646/job/30855590713 has finished:
>
>   • Triton ADV: -5%
>   • Triton DFT: -2%
>   • Xetla: -8%
>
> Are you aware of this effect, where the average time gets noticeably worse as the number of runs increases? I don't know what to do about this slowdown, but I still think running multiple times is a good idea for calculating the average (only with >3 runs will the "*-CV" column not be NaN).
>
> cc @whitneywhtsang @etiotto

I think that when the warmup runs "too many times", the GPU may start heating up and then throttle its frequency down, so performance is already reduced when the timed runs start. That means we are better off not increasing rep/warmup to the point where we see performance degradation in the benchmarks.

Contributor

@etiotto etiotto left a comment

I do not think we should increase the number of repetitions too much. Going from 10 to 600 repetitions is a huge increase.

The kernel timing distribution should be a normal (Gaussian) curve. We only need to run the benchmark enough times to approximate a Gaussian "bell" curve. From https://www.scribbr.com/statistics/central-limit-theorem/#:~:text=By%20convention%2C%20we%20consider%20a,if%20the%20population%20is%20normal. it looks like 30 is the number of reps we should use.
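
As a rough illustration of the sample-size argument, the sketch below simulates how much the mean of n timed runs varies from one benchmark invocation to the next. The timing model (a 1 ms floor plus exponential jitter) and the helper name `spread_of_mean` are assumptions made for this example, not measurements from this benchmark. Note that the central limit theorem describes the distribution of the sample mean, so the sketch looks at the spread of the mean, not at the shape of the individual timings.

```python
import random
import statistics


def spread_of_mean(n_reps, n_trials=2000, seed=0):
    """Standard deviation of the mean of n_reps simulated timings, over many trials."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_trials):
        # Hypothetical skewed timing model: 1 ms floor plus exponential jitter.
        samples = [1.0 + rng.expovariate(10.0) for _ in range(n_reps)]
        means.append(statistics.fmean(samples))
    return statistics.stdev(means)


for n in (3, 30, 300):
    print(n, round(spread_of_mean(n), 4))
```

Under this model the spread shrinks roughly as 1/sqrt(n), so moving from 3 to 30 reps reduces the noise in the mean much more, in absolute terms, than moving from 30 to 300.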

@@ -234,10 +234,11 @@ def benchmark(Z, H, N_CTX, D_HEAD, CAUSAL, provider):
v = torch.randn((Z, H, N_CTX, D_HEAD), device='xpu', dtype=dtype)
sm_scale = 0.125
quantiles = [0.5, 0.0, 1.0]
warmup, rep = 10, 600
Contributor

From 10 to 600 times? Way too many repetitions. It is going to increase the time it takes to run the benchmarks too much.

Contributor Author

> From 10 to 600 times? Way too many repetitions. It is going to increase the time it takes to run the benchmarks too much.

This value is measured in milliseconds and is needed for some test combinations where one run takes more than 100 ms.
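
A minimal sketch of how a millisecond budget turns into an iteration count may help here. It is modeled on how upstream Triton's `do_bench` derives its counts from `warmup` and `rep`, as far as I understand it; the helper name `iterations_for_budget` and the per-run estimates are hypothetical.

```python
def iterations_for_budget(budget_ms, estimate_ms):
    """Convert a time budget in ms into a whole number of runs (at least one)."""
    return max(1, int(budget_ms / estimate_ms))


# Hypothetical per-run estimates in milliseconds.
for rep_ms in (10, 600):
    for estimate_ms in (0.5, 20.0, 120.0):
        n = iterations_for_budget(rep_ms, estimate_ms)
        print(f"rep={rep_ms} ms, one run ~{estimate_ms} ms -> {n} timed runs")
```

Under this reading, a 10 ms budget gives exactly one timed run for any configuration that takes longer than 10 ms per run, which leaves nothing to compute a variation from, while a 600 ms budget still gives several runs at just over 100 ms each.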

@whitneywhtsang
Contributor

If we revert #2142 so that rep is the number of iterations, is the problem of NaNs in CV gone?

@anmyachev
Contributor Author

> If we revert #2142 so that rep is the number of iterations, is the problem of NaNs in CV gone?

@whitneywhtsang Most likely yes. However, I made that change to make the do_bench function more similar to the one used in upstream Triton. If this is not necessary, I can revert some of the changes.

@whitneywhtsang
Contributor

> > If we revert #2142 so that rep is the number of iterations, is the problem of NaNs in CV gone?
>
> @whitneywhtsang Most likely yes. However, I made that change to make the do_bench function more similar to the one used in upstream Triton. If this is not necessary, I can revert some of the changes.

I also see the benefit of being more similar to upstream triton, but rep meaning the number of iterations is more intuitive IMO.
