
Update flash_attention_fwd_benchmark.py #2265

Closed
wants to merge 8 commits into from

Conversation

@anmyachev (Contributor) commented Sep 17, 2024

CI:

Error:

torch.OutOfMemoryError: XPU out of memory. Tried to allocate 32.00 GiB. GPU 0 has a total capacity of 64.00 GiB. Of the allocated memory 32.81 GiB is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. Please use `empty_cache` to release all unoccupied cached memory.

It's strange that the total capacity is 64.00 GiB; I need to understand why (I would expect the capacity to be higher).

UPD: Maybe it's related to ZE_FLAT_DEVICE_HIERARCHY (https://spec.oneapi.io/level-zero/latest/core/PROG.html#environment-variables).

ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE can help with this, but for now it has been decided to leave the setting as is.
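
As a quick sanity check (a sketch, not part of this PR, assuming a PyTorch build with XPU support), one could query what the runtime actually exposes for device 0; under the default ZE_FLAT_DEVICE_HIERARCHY=FLAT each stack of a multi-stack GPU shows up as its own device, which would explain a 64.00 GiB capacity:

```python
import torch

# Hypothetical check, not from the PR: inspect what the runtime exposes.
props = torch.xpu.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1024**3:.2f} GiB total, "
      f"{torch.xpu.device_count()} device(s) visible")
```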

Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
This reverts commit de9335c.
Signed-off-by: Anatoly Myachev <[email protected]>
torch_fn = lambda: torch.nn.functional.scaled_dot_product_attention(
    q, k, v, attn_mask=None, dropout_p=0.0, is_causal=False, scale=sm_scale).to(torch.float32)
atol = 1e-1 if N_CTX == 16384 else 1e-2
benchmark_suit.assert_close(triton_fn(), torch_fn(), atol=atol, rtol=1e-3, err_msg='triton to torch')
@anmyachev (Contributor Author) commented Sep 19, 2024

With ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE the available memory is doubled and the out-of-memory error no longer occurs for upstream PyTorch (however, this affects performance).
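
For reference, a minimal way to try this in a script (a sketch, not the PR's change; it assumes the variable is read when the Level Zero driver initializes, so it has to be set before the first torch.xpu call or exported in the shell beforehand):

```python
import os

# Assumption: must be set before the XPU runtime is initialized in this process.
os.environ["ZE_FLAT_DEVICE_HIERARCHY"] = "COMPOSITE"

import torch  # noqa: E402

print(f"{torch.xpu.get_device_properties(0).total_memory / 1024**3:.2f} GiB now visible on device 0")
```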

@anmyachev marked this pull request as ready for review on September 19, 2024 at 18:32
Signed-off-by: Anatoly Myachev <[email protected]>
@anmyachev (Contributor Author)

@@ -64,7 +64,10 @@ def do_bench_ipex(fn, warmup=25, rep=100, grad_to_none=None, quantiles=None, fas
     # We maintain a buffer of 256 MB that we clear
     # before each kernel call to make sure that the L2
     # doesn't contain any input data before the run
     cache_size = 256 * 1024 * 1024
     factor = 1
     if os.getenv("ZE_FLAT_DEVICE_HIERARCHY", "FLAT") == "COMPOSITE":

By increasing the size of the cache-clearing buffer accordingly, performance becomes roughly the same in both cases.
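
A minimal sketch of such scaling (illustrative only; the exact factor and the buffer allocation below are assumptions, not the verbatim diff):

```python
import os
import torch

# Flush a proportionally larger L2 buffer when both stacks appear as one device.
cache_size = 256 * 1024 * 1024  # 256 MB cleared before each kernel call
factor = 1
if os.getenv("ZE_FLAT_DEVICE_HIERARCHY", "FLAT") == "COMPOSITE":
    factor = 2  # assumed: twice the cache to clear when stacks are combined
cache = torch.empty(cache_size * factor // 4, dtype=torch.int, device='xpu')
cache.zero_()
```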
