
0.4.0

Released by @kyegomez on 14 Jul at 20:15 · 147 commits to master since this release

Changelog
Bug Fixes
Issue: ValueError: too many values to unpack (expected 3)

Root Cause: The attention function returned more than three values, but the call site unpacked them into only three variables.
Resolution: Updated the line where the attention function is called to collect the extra return values with the * unpacking operator.
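A minimal sketch of the star-unpacking fix, using a stand-in attention function; the extra return values here are illustrative, not the project's actual ones:

```python
import torch

def attention(q, k, v):
    # Stand-in attention that returns extra diagnostics beyond the
    # three values the caller originally expected (illustrative only).
    scores = torch.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
    return scores @ v, scores, None, {"note": "extra value"}

q = k = v = torch.randn(2, 4, 8)

# Before: ValueError, four return values cannot fill three names.
# out, weights, bias = attention(q, k, v)

# After: the starred target absorbs any additional return values.
out, weights, *rest = attention(q, k, v)
```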
Issue: RuntimeError: The size of tensor a (64) must match the size of tensor b (2) at non-singleton dimension 1

Root Cause: The forward method of the DynamicDilatedAttention class attempted to add two tensors of different sizes.
Resolution: Resized attn_output to match the corresponding slice of outputs before the addition.
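A minimal sketch of the resize-before-add fix; the shapes, slice bookkeeping, and padding strategy are assumptions, not the exact DynamicDilatedAttention code:

```python
import torch
import torch.nn.functional as F

outputs = torch.zeros(2, 64, 512)       # (batch, seq_len, dim), pre-allocated
attn_output = torch.randn(2, 48, 512)   # segment output, possibly shorter

seg_start, seg_len = 0, 64
# Pad or truncate attn_output along the sequence axis so it matches
# the slice of outputs it is added into.
if attn_output.size(1) < seg_len:
    attn_output = F.pad(attn_output, (0, 0, 0, seg_len - attn_output.size(1)))
else:
    attn_output = attn_output[:, :seg_len, :]
outputs[:, seg_start:seg_start + seg_len, :] += attn_output
```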
Issue: ValueError: not enough values to unpack (expected 7, got 6)

Root Cause: The flash_attn function in the FlashAttention class unpacked the shape of the q tensor into seven variables, but q had only six dimensions.
Resolution: Reshaped the x tensor in the forward method of the DilatedAttention class before passing it to the attention function.
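A minimal sketch of the reshape fix, with a toy kernel standing in for flash_attn; the expected rank and layout here are assumptions:

```python
import torch

def flash_attn(q, k, v):
    # Toy kernel: this unpacking fails unless q has exactly this rank.
    b, h, n, d = q.shape
    scores = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    return scores @ v

batch, seq_len, heads, head_dim = 2, 64, 8, 32
x = torch.randn(batch, seq_len, heads * head_dim)

# Split the flat hidden dimension into (heads, head_dim) and move the
# head axis forward so the kernel sees the layout it expects.
q = k = v = x.view(batch, seq_len, heads, head_dim).permute(0, 2, 1, 3)
out = flash_attn(q, k, v)  # (batch, heads, seq_len, head_dim)
```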
Improvements
Improvement: Added assertions to check the types and values of the parameters in the __init__ method of the DilatedAttention class to prevent incorrect usage.
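A sketch of what such checks might look like; the constructor arguments are assumed, not the class's actual signature:

```python
import torch.nn as nn

class DilatedAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int,
                 dilation_rate: int, segment_size: int):
        super().__init__()
        # Fail fast with a clear message instead of a cryptic shape error.
        assert isinstance(d_model, int) and d_model > 0, "d_model must be a positive int"
        assert isinstance(num_heads, int) and num_heads > 0, "num_heads must be a positive int"
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        assert isinstance(dilation_rate, int) and dilation_rate >= 1, "dilation_rate must be an int >= 1"
        assert isinstance(segment_size, int) and segment_size >= 1, "segment_size must be an int >= 1"
        self.d_model, self.num_heads = d_model, num_heads
        self.dilation_rate, self.segment_size = dilation_rate, segment_size
```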

Improvement: Added a check for the distributed parameter in the __init__ method of the DilatedAttention class to decide whether to wrap the FlashAttention modules in DataParallel.
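A sketch of the conditional wrapping, with a stub standing in for FlashAttention; the parameter name and module layout are assumptions:

```python
import torch.nn as nn

class FlashAttentionStub(nn.Module):
    # Placeholder for the real FlashAttention module.
    def forward(self, x):
        return x

class DilatedAttention(nn.Module):
    def __init__(self, num_heads: int = 8, distributed: bool = False):
        super().__init__()
        heads = [FlashAttentionStub() for _ in range(num_heads)]
        if distributed:
            # Split each head's batch across the available GPUs.
            heads = [nn.DataParallel(h) for h in heads]
        self.heads = nn.ModuleList(heads)
```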

Improvement: Modified the forward method of the DilatedAttention class to process each segment of the input separately for each attention head, letting the heads share information across segments.
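A sketch of the per-segment, per-head loop; the segmentation and combination logic is illustrative, with nn.Identity standing in for the head modules:

```python
import torch
import torch.nn as nn

def forward_segments(x, heads, segment_size):
    # x: (batch, seq_len, dim); each head processes every segment, so
    # what a head learns on one segment influences how it handles the rest.
    batch, seq_len, dim = x.shape
    outputs = torch.zeros_like(x)
    for head in heads:
        for start in range(0, seq_len, segment_size):
            seg = x[:, start:start + segment_size, :]
            outputs[:, start:start + segment_size, :] += head(seg)
    return outputs / len(heads)

x = torch.randn(2, 64, 512)
out = forward_segments(x, [nn.Identity() for _ in range(8)], segment_size=16)
```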

Improvement: Modified the forward method of the DilatedAttention class to store the attn_output_resized tensor in a reusable buffer instead of allocating a new tensor of zeros on every forward pass, avoiding a per-call allocation.
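A sketch of the buffer reuse; the shapes, names, and reallocation policy here are assumptions:

```python
import torch
import torch.nn as nn

class DilatedAttention(nn.Module):
    def __init__(self, seq_len: int = 64, dim: int = 512):
        super().__init__()
        self.seq_len = seq_len
        # Reusable scratch tensor, registered once instead of allocating
        # fresh zeros on every forward pass.
        self.register_buffer("attn_output_resized", torch.zeros(1, seq_len, dim))

    def forward(self, attn_output):
        b, n, d = attn_output.shape
        if self.attn_output_resized.shape != (b, self.seq_len, d):
            # Reallocate only when the batch size or width changes.
            self.attn_output_resized = attn_output.new_zeros(b, self.seq_len, d)
        else:
            self.attn_output_resized.zero_()
        self.attn_output_resized[:, :n, :] = attn_output
        return self.attn_output_resized
```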