[DML EP] Support partial rotary embedding (#22417) · axinging/onnxruntime@f610605

Commit

[DML EP] Support partial rotary embedding (microsoft#22417)

### Description
This adds support for partial RotaryEmbedding to DML. Essentially,
partial RotaryEmbedding simply consists of doing the rotary embedding
calculation on a subregion of the input tensor of as if its head size
was `rotary_embedding_dim`, while leaving the second part of the tensor
(i.e. `head_size - rotary_embedding_dim`) alone.

To achieve this, all we need to do is follow the following steps:

1. Split the tensor into 2 parts
2. Run the rotary embedding algorithm on the first part, just like we
were doing before on the entire tensor
3. Join the 2 parts back together

Since we're leaving the middle part intact, the RotaryEmbedding fusion
will still be done within DML. Also, the concat at the end is
essentially free because DML optimizes it out and directly allocate the
result of RotaryEmbedding at the right place. The only overhead here is
the splitting of the tensor at the beginning, which we should eventually
make part of the RotaryEmbedding fusion within DML.



### Motivation and Context
This fix allows us to correctly run models that have a
`partial_rotary_factor` setting in huggingface, including Nvidia's
Nemotron: https://huggingface.co/nvidia/Nemotron-Mini-4B-Instruct

Loading branch information

PatriceVignola authored Oct 16, 2024

1 parent a164228 commit f610605

0 comments on commit `f610605`

Please sign in to comment.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Commit

There are no files selected for viewing

0 comments on commit `f610605`

Commit

There are no files selected for viewing

0 comments on commit f610605

0 comments on commit `f610605`