Update LLaMA attention fusions #19200

Merged

Conversation

kunal-vaishnavi
Contributor

Description

This PR updates the LLaMA-2 attention fusions by adding the following.

  • Loading the PyTorch model from Hugging Face with the `LlamaAttention` class before exporting (see the sketch below)
  • Updating the attention mask pattern matching to support another case

This PR also fixes [this issue](#19040).
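For reference, here is a minimal sketch of the first bullet: loading the Hugging Face model so that the eager `LlamaAttention` class is used, then exporting to ONNX. The model id, dummy input shapes, and opset below are placeholder assumptions, not the exact values used by the conversion script.

```python
import torch
from transformers import AutoModelForCausalLM

# Force the eager LlamaAttention implementation instead of
# LlamaSdpaAttention / LlamaFlashAttention2 (assumes transformers >= 4.36).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # placeholder model id
    attn_implementation="eager",
    torch_dtype=torch.float32,
)
model.config.use_cache = False    # keep this example to a single logits output
model.eval()

# Dummy inputs; the real conversion script also wires up past_key_values.
input_ids = torch.randint(0, model.config.vocab_size, (1, 8), dtype=torch.int64)
attention_mask = torch.ones((1, 8), dtype=torch.int64)

torch.onnx.export(
    model,
    (input_ids, attention_mask),
    "llama2_decoder.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch", 1: "sequence"},
    },
    opset_version=17,
)
```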

Motivation and Context

Recent changes to Hugging Face's `transformers` library break the existing pattern matching. Since the attention fusions aim to change the graph from `LayerNorm Op --> Set of Attention Nodes --> LayerNorm Op` to `LayerNorm Op --> Attention Op --> LayerNorm Op` per layer, it ultimately does not matter which nodes comprise the `Set of Attention Nodes`: they are all removed and replaced by the `Attention Op`.

Therefore, it also does not matter whether the `LlamaAttention` class or a different attention class is used to load the PyTorch model before exporting, because the expected graph after the attention fusions looks identical no matter which attention class is chosen. By loading the PyTorch model with the `LlamaAttention` class instead of the other attention classes (e.g. `LlamaFlashAttention2` or `LlamaSdpaAttention`) and then exporting it to ONNX, the existing pattern matching continues to work.
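As a hedged sketch of the fusion step itself, assuming ONNX Runtime's transformers optimizer exposes a `llama` model type for these fusions (the file names, head count, and hidden size are illustrative; the official conversion script wires this up on its own):

```python
from onnxruntime.transformers.optimizer import optimize_model

# Replace each layer's Set of Attention Nodes between the two LayerNorm ops
# with a single fused attention op. The "llama" model_type string and the
# head/hidden sizes are assumptions for illustration.
optimized = optimize_model(
    "llama2_decoder.onnx",
    model_type="llama",
    num_heads=32,
    hidden_size=4096,
)
optimized.save_model_to_file("llama2_decoder_fused.onnx")
```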

RyanUnderhill previously approved these changes Jan 19, 2024
@kunal-vaishnavi kunal-vaishnavi merged commit a3ecb63 into microsoft:main Jan 19, 2024
81 of 88 checks passed
YUNQIUGUO pushed a commit that referenced this pull request Jan 23, 2024
YUNQIUGUO pushed a commit that referenced this pull request Jan 30, 2024
### Description
This PR updates the Whisper export with beam search as follows.

- Fixes a bug when running `DecoderMaskedMultiHeadAttention` in the
Whisper with beam search model
- Sets the default PyTorch attention implementation to `eager` so that
the existing attention fusions continue to work (see the sketch below)
- Re-uses the cache directory when loading the PyTorch model to reduce
the disk space used
- Adds `--disable_auto_mixed_precision` to the example FP16 export
command

### Motivation and Context
- [This PR](#19112) added
the `is_unidirectional` parameter to `CheckInputs`, but it was not
provided when checking the inputs in `DecoderMaskedMultiHeadAttention`.
- [This PR](#19200)
explains why `eager` is used to load the `WhisperAttention` class.
- By re-using the cache directory when loading the PyTorch model, only
one copy of the PyTorch model is saved on disk instead of two.
- By providing this flag, there are fewer Cast nodes in the Whisper
with beam search model for switching between FP16 and FP32 precision.
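A minimal sketch of the loading-related points above, assuming the `attn_implementation` argument from recent `transformers` releases; the model id and cache path are placeholders, not the exact values used by the export script.

```python
from transformers import WhisperForConditionalGeneration

# One cache directory is re-used for every load, so only a single copy of the
# PyTorch weights is kept on disk.
cache_dir = "./whisper_cache"

# "eager" keeps the attention subgraph in the form the existing attention
# fusions expect (instead of the SDPA / flash-attention variants).
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-tiny",        # placeholder model id
    attn_implementation="eager",
    cache_dir=cache_dir,
)
model.eval()
```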
Development

Successfully merging this pull request may close these issues.

[Documentation] Both new LLama-7B examples are now broken