
Use tilized operators for Mistral AI speedup on inference #3812

Closed
muthutt opened this issue Nov 15, 2023 · 8 comments

Labels: mistral (Mistral AI bringup), models (Models that run in tt-metal)

Comments

@muthutt
Contributor

muthutt commented Nov 15, 2023

  • Use tilized operators for Mistral AI; migrate from the row-major layout.
  • Tilized tensors are faster on GS/WH TT Tensix cores.
  • Use tt_lib.tensor.tilize_with_val_padding where padding is necessary (see the sketch below the reference link).

Reference: https://github.com/tenstorrent-metal/tt-metal/tree/a3740def58c3b8672b7e3279261506ae70b97810/models/demos/resnet/tt/metalResnetBlock50.py
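
As a rough illustration of what that conversion looks like, here is a minimal sketch; the helper name is illustrative, and the exact tilize_with_val_padding signature (including the tensor-start argument) should be verified against the tt_lib docs.

import tt_lib

def to_tile_layout(x, pad_value=0.0):
    # Tile layout requires the last two dims to be multiples of 32,
    # so pad up to the next tile boundary before tilizing.
    shape = x.shape()
    padded_shape = shape[:2] + [((d + 31) // 32) * 32 for d in shape[2:]]
    # Assumed signature: (input, padded output shape, input start, pad value).
    return tt_lib.tensor.tilize_with_val_padding(x, padded_shape, [0, 0, 0, 0], pad_value)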

@muthutt muthutt added the mistral Mistral AI bringup label Nov 15, 2023
@muthutt muthutt changed the title from “use tilized operators for Mistral AI migrate from row-major” to “use tilized operators for Mistral AI speedup on inference” Nov 15, 2023
@boris-drazic boris-drazic added the models Models that run in tt-metal label Nov 16, 2023
@saichandax saichandax changed the title from “use tilized operators for Mistral AI speedup on inference” to “Use tilized operators for Mistral AI speedup on inference” Nov 28, 2023
@Sudharsan-V
Contributor

The tensors in the Mistral model are now converted to TILE layout so that the tilized operators can be used, and the commit has been updated accordingly.
Corresponding PR: #4029

@boris-drazic
Contributor

Further performance improvements for tiles...

Our goal is to have tensors only in tile layout from input to output.
There are some OPs in the model that will not work with tilized tensors, but they seem to be confined to rotary embedding and cache updates.
Let's have tiles everywhere else.

Starting in models/experimental/mistral/tt/mistral_transformer.py, where we loop over the layers:

h = torch_to_tt_tensor_rm(h, self.device, put_on_device=False)
for layer in self.layers:
    h = layer(h, freqs_cis, positions, mask)

Put h in tile layout here instead of row-major (the sequence-length dimension will need to be padded to a multiple of 32); h will then go into each layer as tile and come out as tile.
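
A minimal sketch of that change (the import path and the plain tilize call are assumptions to verify; torch_to_tt_tensor_rm is the helper already used above):

import torch
import tt_lib
from models.utility_functions import torch_to_tt_tensor_rm  # path assumed

seq_len = h.shape[-2]
padded_len = ((seq_len + 31) // 32) * 32
# Zero-pad the sequence dimension up to a multiple of 32 for tile layout.
h = torch.nn.functional.pad(h, (0, 0, 0, padded_len - seq_len))
h = torch_to_tt_tensor_rm(h, self.device, put_on_device=True)
h = tt_lib.tensor.tilize(h)  # h now enters and leaves each layer as tile
for layer in self.layers:
    h = layer(h, freqs_cis, positions, mask)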

Then in models/experimental/mistral/tt/mistral_transformer_block.py x is tilized and padded when sent to RMSNorm

r = self.attention.forward(self.attention_norm(x), freqs_cis, positions, mask)

Update RMSNorm so that it just applies tt_lib.tensor.rmsnorm to the input tile tensor. Since we have full rows of data followed by full rows of padding, RMSNorm should work fine: no data row contains any padding, so each row is divided by the correct count inside tt_lib.tensor.rmsnorm and produces correct results.
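
A quick torch check of that claim: because RMSNorm's mean is taken per row, all-zero padding rows leave the data rows untouched.

import torch

def rms_norm(x, eps=1e-6):
    # Row-wise RMSNorm over the last dimension (no learned weight, for brevity).
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

data = torch.randn(11, 4096)                        # 11 real rows
padded = torch.cat([data, torch.zeros(21, 4096)])   # zero-pad rows up to 32

# The first 11 output rows are identical with or without padding.
assert torch.allclose(rms_norm(data), rms_norm(padded)[:11])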

Next up is models/experimental/mistral/tt/mistral_attention.py
The first thing here is the linears, which should be fine with zero padding:

xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)

Then let's skip all padding and reshaping until the rotary embedding.
At that point, convert to torch and slice out the padding in torch, so that xq has shape [batch, sequence, 32, 128] and xk has shape [batch, sequence, 8, 128].

Then, after key and value are computed with key, value = self.repeat_kv,
we have shape [batch, sequence, 32, 128] for query, key, and value.
Can we just zero-pad the second dim from sequence up to 32, convert to tile layout, and then do the transpose and the score computation entirely in tile, with no more conversions?
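
In torch terms, the proposal amounts to the shape bookkeeping below; the commented tt_lib calls are assumptions to check against the op list, and the padded key columns would still need masking before softmax.

import torch

batch, seq, heads, head_dim = 1, 11, 32, 128        # shapes quoted above
query = torch.randn(batch, seq, heads, head_dim)
key = torch.randn(batch, seq, heads, head_dim)

# Zero-pad the sequence dim (dim 1) up to a multiple of 32 for tile layout.
pad = (32 - seq % 32) % 32
query = torch.nn.functional.pad(query, (0, 0, 0, 0, 0, pad))
key = torch.nn.functional.pad(key, (0, 0, 0, 0, 0, pad))

# On device this would stay in tile layout throughout, e.g.
#   key_t = tt_lib.tensor.transpose(key_tt, -2, -1)   # signature assumed
#   scores = tt_lib.tensor.bmm(query_tt, key_t)       # op name assumed
q = query.transpose(1, 2)                             # [batch, heads, seq_p, head_dim]
k = key.transpose(1, 2)
scores = q @ k.transpose(-2, -1) / head_dim ** 0.5    # [batch, heads, seq_p, seq_p]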

@Sudharsan-V
Contributor

Sudharsan-V commented Nov 30, 2023

The commit has been revised based on the provided suggestions.
In models/experimental/mistral/tt/mistral_transformer.py, the input h is now converted to TILE layout, so it enters and exits the layer loop in TILE layout.

In models/experimental/mistral/tt/mistral_transformer_block.py, rms_norm performs well when its input is in TILE layout.

In models/experimental/mistral/tt/mistral_attention.py, I have removed the padding entirely up to the rotary embedding. However, I couldn't eliminate the reshape completely. The linear operation produces xq with shape [1, 1, 32, 4096], xk with shape [1, 1, 32, 1024], and xv with shape [1, 1, 32, 1024], which need to be transformed into [1, 11, 32, 128] (xq), [1, 11, 8, 128] (xk), and [1, 11, 8, 128] (xv) before feeding into the rotary embedding. Therefore, after the linear operation, I converted to a torch tensor, sliced out the padding, and reshaped to the desired shapes.
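
The slice-and-reshape being described, as a standalone torch sketch (seq_len = 11 matches the shapes quoted above):

import torch

seq_len = 11

# Linear outputs, padded to 32 rows for tile layout:
xq = torch.randn(1, 1, 32, 4096)
xk = torch.randn(1, 1, 32, 1024)

# Slice off the pad rows, then split the hidden dim into heads:
xq = xq[:, :, :seq_len, :].reshape(1, seq_len, 32, 128)  # 32 query heads
xk = xk[:, :, :seq_len, :].reshape(1, seq_len, 8, 128)   # 8 key/value heads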

Corresponding PR: #4029

@Sudharsan-V
Contributor

Sudharsan-V commented Dec 7, 2023

The commit for the mistral model is updated by optimizing the rotary_embed method.
Previously, the conversion of freqs_cis (a PyTorch complex tensor) to a tt_lib complex tensor happened
#transformer_block + (max_tokens * #transformer_block) times:

Prefill stage: #transformer_block
Decode stage: max_tokens * #transformer_block
Total: #transformer_block + (max_tokens * #transformer_block)

In the optimized version of the mistral model, the conversion of freqs_cis (a PyTorch complex tensor) to a tt_lib complex tensor happens only 1 + max_tokens times:

Prefill stage: 1
Decode stage: max_tokens
Total: 1 + max_tokens

Note: #transformer_block = 32 (the number of transformer blocks)
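
In outline, the optimization hoists the conversion out of the per-block loop (to_tt_complex stands in for whatever conversion helper the model uses; the name is hypothetical):

# Before: every one of the 32 transformer blocks converted freqs_cis itself,
# so the torch-complex -> tt_lib-complex conversion ran once per block per token.
for layer in self.layers:
    h = layer(h, freqs_cis, positions, mask)          # conversion inside layer()

# After: convert once per forward pass and hand the tt_lib tensor to every block.
tt_freqs_cis = to_tt_complex(freqs_cis, self.device)  # hypothetical helper
for layer in self.layers:
    h = layer(h, tt_freqs_cis, positions, mask)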

Corresponding PR: #4029

@muthutt
Contributor Author

muthutt commented Dec 7, 2023

def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0) -> torch.Tensor:
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
    t = torch.arange(end, device=freqs.device)  # type: ignore
    freqs = torch.outer(t, freqs).float()  # type: ignore
    return torch.polar(torch.ones_like(freqs), freqs)  # complex64

@vigneshkeerthivasanx you can also attempt to move this code on-device, since we have tt_lib.tensor.arange and the other tt_lib.tensor ops; torch.polar can be implemented as shown below:

def tt_lib_polar(abs, angle):
    # torch.polar(abs, angle) builds abs * (cos(angle) + i*sin(angle))
    s = tt_lib.tensor.sin(angle)
    c = tt_lib.tensor.cos(angle)
    r = tt_lib.tensor.mul(abs, c)
    i = tt_lib.tensor.mul(abs, s)
    return tt_lib.tensor.complex_tensor(r, i)

This should create the tensors on device and keep them there, saving the time otherwise spent moving data back and forth.
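
Putting the pieces together, an on-device version might look like the sketch below. arange, exp, log, sin, cos, mul, and complex_tensor are ops already named in this thread; mul_unary, outer, and ones_like are assumptions to verify against the tt_lib op list.

import math
import tt_lib

def tt_precompute_freqs_cis(dim, end, device, theta=10000.0):
    # freqs = theta ** (-idx / dim) == exp(-(idx / dim) * log(theta)),
    # avoiding a direct pow op (see the workaround noted further below).
    idx = tt_lib.tensor.arange(0, dim, 2, device)              # signature assumed
    freqs = tt_lib.tensor.exp(
        tt_lib.tensor.mul_unary(idx, -math.log(theta) / dim)   # op name assumed
    )
    t = tt_lib.tensor.arange(0, end, 1, device)
    angles = tt_lib.tensor.outer(t, freqs)                     # op name assumed
    ones = tt_lib.tensor.ones_like(angles)                     # op name assumed
    return tt_lib_polar(ones, angles)                          # from the snippet above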

@Sudharsan-V
Contributor

@muthutt, the commit has been updated to incorporate the comments above.

After modifying precompute_freqs_cis, there is a very slight degradation in the pcc.
Previously the pcc was 0.9928005641025528; after the modification it is 0.9921967026489718.

Even though the variation in the pcc looks insignificant, this modification has noticeably degraded the gs-demo output.

Input Prompt: ['A man is sitting on a roof ']
Output Prompt before modifying the `precompute_freqs_cis`: 'A man is sitting on a roof 100 meters above the ground.\n\nA man is sitting on'
Output Prompt after modification: 'A man is sitting on a roof 100 feet of a a.10.10.1'

Note: I was not able to use tt_lib's power op directly, so I have used the following identity:
pow(base, exponent) = exp(exponent * log(base))

I have used tt_lib.tensor.exp, tt_lib.tensor.mul, and tt_lib.tensor.log accordingly.
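
That identity, checked in torch for clarity (it holds for positive bases):

import torch

base = torch.full((4,), 10000.0)
exponent = torch.rand(4)

direct = torch.pow(base, exponent)
via_exp_log = torch.exp(exponent * torch.log(base))  # pow without a pow op

assert torch.allclose(direct, via_exp_log, rtol=1e-4)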

Corresponding PR: #4029

@muthutt
Contributor Author

muthutt commented Dec 11, 2023 via email

@Sudharsan-V
Contributor

Sudharsan-V commented Dec 12, 2023

The commit has been reverted to preserve the sensible output.
Corresponding PR: #4029
Issue for TT-Scatter op: #4294
