
Update flash_attention_fwd_benchmark.py #2265

Closed
wants to merge 8 commits into from

Conversation

@anmyachev (Contributor) commented Sep 17, 2024

CI:

Error:

torch.OutOfMemoryError: XPU out of memory. Tried to allocate 32.00 GiB. GPU 0 has a total capacity of 64.00 GiB. Of the allocated memory 32.81 GiB is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. Please use `empty_cache` to release all unoccupied cached memory.

It's strange that the total capacity is 64.00 GiB; I need to understand why (I would expect the capacity to be higher).

UPD: Maybe it's related to ZE_FLAT_DEVICE_HIERARCHY (https://spec.oneapi.io/level-zero/latest/core/PROG.html#environment-variables).

ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE can help with this, but for now it has been decided to leave the setting as is.
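
As a quick sanity check (a sketch, not part of this PR, assuming a PyTorch build with XPU support), one could query what the runtime actually exposes for device 0; under the default ZE_FLAT_DEVICE_HIERARCHY=FLAT each stack of a multi-stack GPU shows up as its own device, which would explain a 64.00 GiB capacity:

```python
import torch

# Hypothetical check, not from the PR: inspect what the runtime exposes.
props = torch.xpu.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1024**3:.2f} GiB total, "
      f"{torch.xpu.device_count()} device(s) visible")
```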

Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
This reverts commit de9335c.
Signed-off-by: Anatoly Myachev <[email protected]>
torch_fn = lambda: torch.nn.functional.scaled_dot_product_attention(
    q, k, v, attn_mask=None, dropout_p=0.0, is_causal=False, scale=sm_scale).to(torch.float32)
atol = 1e-1 if N_CTX == 16384 else 1e-2
benchmark_suit.assert_close(triton_fn(), torch_fn(), atol=atol, rtol=1e-3, err_msg='triton to torch')
@anmyachev (Contributor Author) commented Sep 19, 2024

With ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE the available memory is doubled and the out-of-memory error no longer occurs for upstream PyTorch (however, this affects performance).
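
For reference, a minimal way to try this in a script (a sketch, not the PR's change; it assumes the variable is read when the Level Zero driver initializes, so it has to be set before the first torch.xpu call or exported in the shell beforehand):

```python
import os

# Assumption: must be set before the XPU runtime is initialized in this process.
os.environ["ZE_FLAT_DEVICE_HIERARCHY"] = "COMPOSITE"

import torch  # noqa: E402

print(f"{torch.xpu.get_device_properties(0).total_memory / 1024**3:.2f} GiB now visible on device 0")
```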

@anmyachev marked this pull request as ready for review on September 19, 2024 at 18:32
Signed-off-by: Anatoly Myachev <[email protected]>
@anmyachev (Contributor Author)

@@ -64,7 +64,10 @@ def do_bench_ipex(fn, warmup=25, rep=100, grad_to_none=None, quantiles=None, fas
     # We maintain a buffer of 256 MB that we clear
     # before each kernel call to make sure that the L2
     # doesn't contain any input data before the run
     cache_size = 256 * 1024 * 1024
     factor = 1
     if os.getenv("ZE_FLAT_DEVICE_HIERARCHY", "FLAT") == "COMPOSITE":

By increasing the size of the cache-clearing buffer accordingly, performance becomes roughly the same in both cases.
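
A minimal sketch of such scaling (illustrative only; the exact factor and the buffer allocation below are assumptions, not the verbatim diff):

```python
import os
import torch

# Flush a proportionally larger L2 buffer when both stacks appear as one device.
cache_size = 256 * 1024 * 1024  # 256 MB cleared before each kernel call
factor = 1
if os.getenv("ZE_FLAT_DEVICE_HIERARCHY", "FLAT") == "COMPOSITE":
    factor = 2  # assumed: twice the cache to clear when stacks are combined
cache = torch.empty(cache_size * factor // 4, dtype=torch.int, device='xpu')
cache.zero_()
```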
