Skip to content

Latest commit

 

History

History
229 lines (188 loc) · 13.2 KB

CHANGELOG.md

File metadata and controls

229 lines (188 loc) · 13.2 KB

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[0.0.24] - TBD

[0.0.23] - 2023-12-05

Pre-built binary wheels require PyTorch 2.1.1

Fixed

  • fMHA: Fixed a bug in cutlass backend forward pass where the logsumexp was not correctly calculated, resulting in wrong results in the BW pass. This would happen with MQA when one sequence has a query with length%64 == 1
  • fMHA: Updated Flash-Attention to v2.3.6 - this fixes a performance regression in causal backward passes, and now supports BlockDiagonalCausalWithOffsetPaddedKeysMask

Added

  • fMHA: Added LocalAttentionFromBottomRightMask (local)
  • fMHA: Added LowerTriangularFromBottomRightMask (causal)
  • fMHA: Added LowerTriangularFromBottomRightLocalAttentionMask (local + causal)

Removed

  • Removed xformers.triton.sum_strided

[0.0.22] - 2023-09-27

Fixed

  • fMHA: Backward pass now works in PyTorch deterministic mode (although slower)

Added

  • fMHA: Added experimental support for Multi-Query Attention and Grouped-Query Attention. This is handled by passing 5-dimensional inputs to memory_efficient_attention, see the documentation for more details
  • fMHA: Added experimental support for Local Attention biases to memory_efficient_attention
  • Added an example of efficient LLaMa decoding using xformers operators
  • Added Flash-Decoding for faster attention during Large Language Model (LLM) decoding - up to 50x faster for long sequences (token decoding up to 8x faster end-to-end)
  • Added an efficient rope implementation in triton, to be used in LLM decoding
  • Added selective activation checkpointing, which gives fine-grained control of which activations to keep and which activations to recompute
  • xformers.info now indicates the Flash-Attention version used

Removed

  • fMHA: Removed smallK backend support for CPU. memory_efficient_attention only works for CUDA/GPU tensors now
  • DEPRECATION: Many classes in xformers.factory, xformers.triton and xformers.components have been or will be deprecated soon (see tracking issue facebookresearch/xformers#848)

[0.0.21] - 2023-08-18

Improved

  • fMHA: Updated flash-attention to v2, with massive performance improvements for both the forward pass and backward pass. This implementation is now used by default when it's available

Bug fixes

  • fMHA/cutlass: Fix potential race condition in the FW/BW passes
  • fMHA/cutlass: Fix attn_bias stride overflow for very long sequences (>32k)
  • LowerTriangularMask is now backward compatible with older xformers versions

Breaking changes

  • memory_efficient_attention now expects the attn_bias argument to have a head dimension
  • memory_efficient_attention no longer broadcasts the batch/head dimensions of attn_bias. Please use .expand if you need to broadcast the bias
  • Remove causal_diagonal argument from BlockDiagonalCausalWithOffsetPaddedKeysMask

Added

  • Binary wheels on pypi/conda now contain H100 kernels
  • fMHA: Added backend specialized for decoding that does not use TensorCores - useful when not using multiquery

NOTE: Binary wheels are now provided only for PyTorch 2 with cuda 11.8. It is still possible to use xFormers with older versions of PyTorch by building from source or using conda.

[0.0.20] - 2023-05-23

Improved

  • fMHA/cutlass (backward): Massive performance improvements when batch_size * num_heads is low (10x+)
  • fMHA/cutlass: Further performance improvements for both the forward & backward kernels
  • fMHA (backward): Now dispatching to cutlass when embed_dim>64
  • fMHA: Updated Flash-Attention to v1.0.5

Added

  • fMHA now runs on H100 (support is experimental)

[0.0.19] - 2023-04-28

Added

  • Display nvcc version used to compile xformers in python -m xformers.info

Fixed

[0.0.18] - 2023-03-31

Added

  • Added xformers.ops.index_select_cat and xformers.ops.scaled_index_add - those are experimental functions that only work with a few shapes, and can be used to write efficient stochastic depth in transformer architectures for instance

Fixed

  • fMHA: memory_efficient_attention now accepts torch.Tensor as attention bias for any seqlen, although there are still requirements on the alignment of the bias tensor (see facebookresearch/xformers#683)

[0.0.17] - 2023-03-28

Fixed

Added

[0.0.16] - 2023-01-31

Fixed

Added

[0.0.15] - Skipped

[0.0.14] - 2022-11-10

Fixed

Added

[0.0.13] - 2022-09-26

Added

  • fMHA: Added CUTLASS-based kernel for xformers.ops.memory_efficient_attention. This kernel is automatically depending on the inputs, and works on any GPU after P100 [facebookresearch/xformers#362]

[0.0.12] - 2022-08-08

Fixed

Added

[0.0.11] - 2022-05-30

Fixed

Added

[0.0.10] - 2022-03-14

Fixed

Added

[0.0.9] - 2022-02-09

Added

Fixed

[0.0.8] - 2022-01-07

Fixed

Added

[0.0.7] - 2021-11-30

Fixed

[0.0.6] - 2021-11-24

Fixed

Added

[0.0.4] - 2021-11-16

Fixed

Added

[0.0.3] - 2021-11-01

Fixed

[0.0.2] - 2021-11-01

Fixed

Added