This PR is another attempt to improve performance for large contexts; see #25.
Basically, when we want to process a very long context, the KQ mask, which is stored as `f32` (or `f16`, if using flash attention), becomes quite significant in size. If running on the GPU, the cost of copying the KQ mask to the GPU (the mask is created on the host CPU) becomes non-negligible. If running on a CPU with limited memory bandwidth (basically all `x86` or `x86_64`), the KQ mask may not fit in the cache, or if it does fit, it reduces the cache available for other data by a significant amount, which results in a measurable impact on the performance of the `SOFT_MAX` (or the new fused `SOFT_CAP_MAX`) operation. Hence, it is desirable to reduce the size of the KQ mask.
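To put rough numbers on this, here is a back-of-the-envelope sketch. The `n_kv × n_batch` mask shape, the 32k KV size, and the 512-token micro-batch are illustrative assumptions, and padding is ignored:

```cpp
#include <cstddef>

// Back-of-the-envelope mask sizes for one graph evaluation.
// Assumed (illustrative) shape: n_kv x n_batch mask entries.
constexpr size_t n_kv    = 32768; // tokens in the KV cache
constexpr size_t n_batch = 512;   // tokens processed per micro-batch

constexpr size_t f32_bytes = n_kv*n_batch*4; // 64 MiB as f32
constexpr size_t f16_bytes = n_kv*n_batch*2; // 32 MiB as f16 (flash attention)
constexpr size_t bit_bytes = n_kv*n_batch/8; //  2 MiB as a 1-bit binary mask

static_assert(f32_bytes == 64u*1024*1024, "f32 mask: 64 MiB");
static_assert(f16_bytes == 32u*1024*1024, "f16 mask: 32 MiB");
static_assert(bit_bytes ==  2u*1024*1024, "binary mask: 2 MiB");
```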
If not using ALiBi (which is almost always the case these days), the KQ mask stores only 2 distinct values: `0` and `-INFINITY`. It can therefore be represented as a binary mask, reducing its size by a factor of 32.
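To make the representation concrete, here is a minimal, hedged sketch in plain scalar C++ of how a 1-bit mask can replace the additive `0 / -INFINITY` mask in `SOFT_MAX`. The function names are made up for this example and this is not the code in the PR:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// bit = 1: position visible (mask value 0.0f); bit = 0: masked (-INFINITY).
static std::vector<uint32_t> pack_mask_row(const float * mask, int n_kv) {
    std::vector<uint32_t> packed((n_kv + 31)/32, 0u);
    for (int j = 0; j < n_kv; ++j) {
        if (mask[j] == 0.0f) {
            packed[j/32] |= 1u << (j%32);
        }
    }
    return packed;
}

// Softmax over one row of KQ scores, consuming the packed 1-bit mask:
// a cleared bit simply yields probability 0 instead of adding -INFINITY
// to the score before exponentiation.
static void soft_max_row(float * scores, const uint32_t * packed, int n_kv) {
    float max_val = -INFINITY;
    for (int j = 0; j < n_kv; ++j) {
        if ((packed[j/32] >> (j%32)) & 1u) {
            max_val = std::max(max_val, scores[j]);
        }
    }
    float sum = 0.0f;
    for (int j = 0; j < n_kv; ++j) {
        const bool visible = (packed[j/32] >> (j%32)) & 1u;
        const float e = visible ? std::exp(scores[j] - max_val) : 0.0f;
        scores[j] = e;
        sum += e;
    }
    const float inv_sum = sum > 0.0f ? 1.0f/sum : 0.0f;
    for (int j = 0; j < n_kv; ++j) {
        scores[j] *= inv_sum;
    }
}
```

In the actual implementation the bit test would of course live in the vectorized `SOFT_MAX` / `SOFT_CAP_MAX` kernels (and the CUDA/Metal equivalents) rather than in scalar loops.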
This PR adds an option to use a binary KQ mask. It is off by default, as not all platforms are implemented, but it can be turned on with `-bkq` or `--binary-kq` on the command line. The option has no effect if flash attention is used (the KQ mask remains `f16` as before). If it is turned on but not supported by the back-end (non-`AVX512` CPUs), the program will assert and terminate.
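The assert-and-terminate behavior amounts to a guard along the lines of the fragment below. This is only an illustration of the check described above; `binary_kq` and `backend_supports_binary_kq` are hypothetical names, not the actual code in this PR:

```cpp
// Hypothetical guard, for illustration only: refuse to run with a binary
// KQ mask on a back-end that cannot consume it.
if (params.binary_kq && !backend_supports_binary_kq(backend)) {
    GGML_ASSERT(false && "binary KQ mask is not supported by this back-end");
}
```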
I see 3-5% performance gains on CUDA and on a Ryzen-7950X CPU for a context of 32k tokens, and about 2-3% on Metal for a context of 16k. So, nothing earth-shattering, and hence I'm not quite convinced to merge it.