phi2 contrib ops changes #19112

Merged: wangyems merged 16 commits from wangye/phi2_ops into main on Jan 22, 2024
Conversation

@wangyems (Contributor) commented on Jan 12, 2024

Description

  1. Support a causal attention mask in the MultiHeadAttention (MHA) CPU kernel
  2. Support a custom rotary_dim in the RotaryEmbedding contrib op (see the sketch after this list)
  3. Add bf16 support to RotaryEmbedding
  4. Fix a bug in the Attention op's rotary embedding path
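As a rough illustration of item 2, the sketch below builds a standalone `RotaryEmbedding` node in the `com.microsoft` domain with a rotary dimension smaller than the head size, which is what phi-2's partial rotary embedding needs. The attribute names (`rotary_embedding_dim`, `num_heads`, `interleaved`) and the cache shapes are assumptions based on the contrib-op schema, not code taken from this PR.

```python
import onnx
from onnx import TensorProto, helper

batch, seq_len, num_heads, head_size = 1, 8, 4, 64
hidden = num_heads * head_size      # 256
rotary_dim = 32                     # rotate only part of each head (partial rotary)

# Assumed attribute names for the com.microsoft RotaryEmbedding contrib op.
node = helper.make_node(
    "RotaryEmbedding",
    inputs=["input", "position_ids", "cos_cache", "sin_cache"],
    outputs=["output"],
    domain="com.microsoft",
    interleaved=0,                      # GPT-NeoX-style (non-interleaved) rotation
    num_heads=num_heads,
    rotary_embedding_dim=rotary_dim,    # custom rotary dimension, item 2 above
)

graph = helper.make_graph(
    [node],
    "partial_rotary_example",
    inputs=[
        helper.make_tensor_value_info("input", TensorProto.FLOAT, [batch, seq_len, hidden]),
        helper.make_tensor_value_info("position_ids", TensorProto.INT64, [batch, seq_len]),
        helper.make_tensor_value_info("cos_cache", TensorProto.FLOAT, [seq_len, rotary_dim // 2]),
        helper.make_tensor_value_info("sin_cache", TensorProto.FLOAT, [seq_len, rotary_dim // 2]),
    ],
    outputs=[helper.make_tensor_value_info("output", TensorProto.FLOAT, [batch, seq_len, hidden])],
)

model = helper.make_model(
    graph,
    opset_imports=[helper.make_opsetid("", 17), helper.make_opsetid("com.microsoft", 1)],
)
onnx.save(model, "partial_rotary_example.onnx")  # ORT validates the contrib op at session load time
```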

Motivation and Context

@yufenglee (Member) left a comment:

:shipit:

@wangyems merged commit 21034a2 into main on Jan 22, 2024
91 of 94 checks passed
@wangyems deleted the wangye/phi2_ops branch on January 22, 2024 at 18:17
mszhanyi added a commit that referenced this pull request Jan 23, 2024
YUNQIUGUO pushed a commit that referenced this pull request Jan 23, 2024
YUNQIUGUO pushed a commit that referenced this pull request Jan 30, 2024
### Description
This PR updates the Whisper export with beam search by adding the
following.

- Fixes a bug when running `DecoderMaskedMultiHeadAttention` in the
Whisper with beam search model
- Sets the default PyTorch attention implementation to `eager` to allow
existing attention fusions to continue working
- Re-uses the cache directory when loading the PyTorch model to reduce
memory used on disk
- Adds `--disable_auto_mixed_precision` to the example FP16 export
command

### Motivation and Context
- [This PR](#19112) added
the `is_unidirectional` parameter to `CheckInputs`, but it was not
provided when checking the inputs in `DecoderMaskedMultiHeadAttention`.
- [This PR](#19200) explains why `eager` is used to load the `WhisperAttention` class (see the sketch below).
- By re-using the cache directory for loading the PyTorch model, only
one copy of the PyTorch model is saved on disk instead of two copies.
- By providing this flag, there will be fewer Cast nodes in the Whisper with beam search model to switch between FP16 and FP32 precision.
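As a hedged illustration of the `eager` and cache-directory points above, this is roughly how the PyTorch model might be loaded before export. It is a minimal sketch, not an excerpt from the export script: the checkpoint id and cache path are placeholders, and passing `attn_implementation="eager"` assumes a transformers version that accepts that argument.

```python
# Minimal sketch: load Whisper with eager attention so the exported graph keeps
# the attention pattern that ORT's fusions recognize, and point the load at one
# shared cache directory so only one copy of the PyTorch weights lands on disk.
from transformers import WhisperForConditionalGeneration

cache_dir = "./whisper_cache"  # placeholder; re-used instead of a second temporary copy

model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-tiny",           # placeholder checkpoint
    cache_dir=cache_dir,
    attn_implementation="eager",     # avoid SDPA so existing attention fusions still match
)
```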