[llama] Generate causal mask better #832

Groverkss · 2025-01-16T11:47:57Z

Previously, the causal mask was generated as:

causal_mask = triu(context_len, context_len)
causal_mask = causal_mask[:batch_seqlen, :batch_seqlen]

This makes it harder for the compiler to fuse this "fill-like" computation into a dispatch, because a slice with a dynamic dimension is harder to move around.

Instead, it's better to generate the causal mask directly at the required size:

causal_mask = triu(batch_seqlen, batch_seqlen)
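
To make the equivalence of the two forms concrete, here is a minimal, library-free sketch. It is not the code from this PR: `triu_mask`, `sliced_mask`, and `direct_mask` are hypothetical helpers, the sizes are made up, and the mask is assumed to be True strictly above the diagonal (the usual causal convention). It shows only that slicing a context-sized mask yields the same values as building the mask at `batch_seqlen` directly.

```python
def triu_mask(rows, cols):
    """Boolean mask that is True strictly above the diagonal (assumed convention)."""
    return [[j > i for j in range(cols)] for i in range(rows)]


def sliced_mask(context_len, batch_seqlen):
    # Old approach: build the full context-sized mask, then slice it down.
    full = triu_mask(context_len, context_len)
    return [row[:batch_seqlen] for row in full[:batch_seqlen]]


def direct_mask(batch_seqlen):
    # New approach: build the mask at the size that is actually needed.
    return triu_mask(batch_seqlen, batch_seqlen)


# Hypothetical sizes, chosen only for illustration.
context_len, batch_seqlen = 8, 3
assert sliced_mask(context_len, batch_seqlen) == direct_mask(batch_seqlen)
```

Since each entry depends only on its own indices, the direct form never needs the larger `context_len` mask as an intermediate.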

This PR also removes the ability to put the mask in a buffer. Storing the mask in a buffer is a bad idea as well: the causal mask computation should never actually materialize outside a dispatch. If the compiler fails to keep it inside a dispatch, we should see memory usage spikes and fix the compiler.
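
As a rough illustration of what "never materialize" means, the sketch below applies the causal condition while producing the masked scores, without ever building a separate mask. It is plain Python, not compiler output or the model's code; `mask_scores_inline` is a hypothetical name, and masking with negative infinity where `j > i` is an assumed convention.

```python
import math


def mask_scores_inline(scores):
    """Return a copy of `scores` with future positions (j > i) set to -inf.

    The causal condition is evaluated per element from the indices alone,
    so no standalone mask is allocated or stored.
    """
    return [
        [-math.inf if j > i else s for j, s in enumerate(row)]
        for i, row in enumerate(scores)
    ]


# Hypothetical 2x2 attention scores.
masked = mask_scores_inline([[1.0, 2.0], [3.0, 4.0]])
assert masked == [[1.0, -math.inf], [3.0, 4.0]]
```

This is the shape of computation a fused dispatch would ideally have: the mask exists only as a predicate on the indices.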

Groverkss · 2025-01-16T16:21:40Z

Turning into draft since I want to see if we can make the compiler propagate the slice.

[llama] Generate causal mask better

911e157

Groverkss requested review from rsuderman, archana-ramalingam and stellaraccident and removed request for rsuderman and archana-ramalingam January 16, 2025 11:51

Groverkss marked this pull request as draft January 16, 2025 16:22
