
02-fused-softmax: PyTorch faster than Triton on Max 1550 in composite mode #2363

Status: Open
pbchekin opened this issue Sep 26, 2024 · 2 comments
@pbchekin (Contributor):

ZE_FLAT_DEVICE_HIERARCHY=FLAT:
[chart: softmax-performance, Triton vs PyTorch]

ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE:
[chart: softmax-performance, Triton vs PyTorch]
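
A minimal reproduction sketch (not from the issue itself; the tutorial path and the idea of setting the variable from Python are assumptions, and the repo's own benchmark harness may differ):

```python
# Hedged reproduction sketch. ZE_FLAT_DEVICE_HIERARCHY is read by the Level
# Zero driver at initialization, so it must be set before torch/triton are
# imported (or exported in the shell before launching Python).
import os
os.environ["ZE_FLAT_DEVICE_HIERARCHY"] = "COMPOSITE"  # or "FLAT"

import runpy

# Assumed path of the tutorial benchmark; adjust to this repo's layout.
runpy.run_path("python/tutorials/02-fused-softmax.py", run_name="__main__")
```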

@chengjunlu (Contributor):

I think it may be caused by register spills.
In composite mode, the private memory used for spilling may always be allocated on one tile, so all the threads on the other tile have to access that private memory across the EMIB bus.

We may need to implement a new softmax kernel that is closer to torch's implementation and re-test the performance.
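
A minimal sketch of what such a lower-register-pressure kernel might look like, assuming the spills come from the tutorial kernel holding an entire row in registers: loop over the row in fixed-size chunks so that only small running accumulators stay live. The kernel name, the three-pass structure, and the CHUNK parameter are illustrative assumptions, not code from this issue.

```python
import triton
import triton.language as tl

@triton.jit
def chunked_softmax_kernel(out_ptr, in_ptr, row_stride, n_cols,
                           CHUNK: tl.constexpr):  # CHUNK must be a power of 2
    # One program per row; fp32 input assumed for simplicity.
    row = tl.program_id(0)
    in_row = in_ptr + row * row_stride
    out_row = out_ptr + row * row_stride

    # Pass 1: per-lane running max over the row, then reduce to a scalar.
    m = tl.full([CHUNK], -float("inf"), tl.float32)
    for start in range(0, n_cols, CHUNK):
        offs = start + tl.arange(0, CHUNK)
        x = tl.load(in_row + offs, mask=offs < n_cols, other=-float("inf"))
        m = tl.maximum(m, x)
    row_max = tl.max(m, axis=0)

    # Pass 2: per-lane running sum of exp(x - max), then reduce.
    # Masked lanes load -inf, so exp(-inf - row_max) contributes 0.
    s = tl.zeros([CHUNK], tl.float32)
    for start in range(0, n_cols, CHUNK):
        offs = start + tl.arange(0, CHUNK)
        x = tl.load(in_row + offs, mask=offs < n_cols, other=-float("inf"))
        s += tl.exp(x - row_max)
    row_sum = tl.sum(s, axis=0)

    # Pass 3: normalize and store, chunk by chunk.
    for start in range(0, n_cols, CHUNK):
        offs = start + tl.arange(0, CHUNK)
        mask = offs < n_cols
        x = tl.load(in_row + offs, mask=mask, other=-float("inf"))
        tl.store(out_row + offs, tl.exp(x - row_max) / row_sum, mask=mask)
```

A launch would look like `chunked_softmax_kernel[(n_rows,)](y, x, x.stride(0), n_cols, CHUNK=256)`. The trade-off is reading each row three times instead of once; whether that beats spilling to (possibly cross-tile) private memory in composite mode is exactly what would need re-testing.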

@anmyachev (Contributor):

@pbchekin in composite mode we need to adjust the cache size (double it), see #2265 (comment). I wonder how the chart will change after that.
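
For illustration, the kind of adjustment this could mean (the constant, the environment check, and the factor of two are assumptions; see #2265 for the actual change):

```python
import os

# Hypothetical sketch: if the benchmark sizes its cache-flush or warm-up
# buffers from an assumed per-device cache size, a composite device (two
# tiles presented as one) would need roughly twice the value.
CACHE_SIZE_BYTES = 256 * 2**20  # placeholder per-tile figure, not measured
if os.getenv("ZE_FLAT_DEVICE_HIERARCHY", "").upper() == "COMPOSITE":
    CACHE_SIZE_BYTES *= 2  # one composite device aggregates two tiles' caches
```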
