Granite support #1218

Open · Datta0 wants to merge 7 commits into base: nightly
Conversation

@Datta0 (Contributor) commented Oct 29, 2024

No description provided.

@Datta0 Datta0 marked this pull request as ready for review October 31, 2024 14:15
@danielhanchen (Contributor) left a comment:

Great work again! Just some comments :)

Q = Q.transpose(1, 2)
K = K.transpose(1, 2)
V = V.transpose(1, 2)
sw = getattr(self.config, "sliding_window", None)
@danielhanchen (Contributor):
@Datta0 Is sliding window attention necessary for Granite?

@Datta0 (Contributor, Author):

Um, not necessary. Removing it.

if past_key_value is not None:
kv_seq_len += past_key_value[0].shape[-2]

assert position_embeddings is not None
@danielhanchen (Contributor):
I remember you said we must pass in the position embeddings - did we calculate the cos and sine matrices in RoPE incorrectly?

@Datta0 (Contributor, Author):

Oh, this is just a validation. We calculate the sin and cos and pass them in from here.
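For context, here is a minimal sketch of what "calculating the sin and cos and passing them in" looks like, assuming standard RoPE with the default base of 10000; the function name and shapes are illustrative, not the PR's actual code:

import torch

def build_rope_cos_sin(seq_len, head_dim, base=10000.0, device="cpu", dtype=torch.float32):
    # Standard RoPE frequencies: one per pair of head dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, device=device).float() / head_dim))
    t = torch.arange(seq_len, device=device).float()
    freqs = torch.outer(t, inv_freq)          # (seq_len, head_dim // 2)
    emb = torch.cat((freqs, freqs), dim=-1)   # (seq_len, head_dim)
    return emb.cos().to(dtype), emb.sin().to(dtype)

# Computed once at the model level, then threaded into every attention call,
# which is why the attention forward only needs to assert that it was passed.
cos, sin = build_rope_cos_sin(seq_len=128, head_dim=64)
position_embeddings = (cos, sin)
assert position_embeddings is not None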

pass


def GraniteDecoderLayer_fast_forward(
@danielhanchen (Contributor):

Can we inherit from LlamaDecoderLayer_fast_forward? [Actually scratch that - I forgot Granite has a residual multiplier]

I'm assuming it's because of position_embeddings

@Datta0 (Contributor, Author):

Yep
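For readers following along, a toy sketch of why this layer forward cannot reuse Llama's verbatim: every residual add in Granite is scaled by config.residual_multiplier. The modules and multiplier value below are placeholders, not the real implementation:

import torch
import torch.nn as nn

class ToyGraniteDecoderLayer(nn.Module):
    def __init__(self, hidden=64, residual_multiplier=0.22):
        super().__init__()
        self.residual_multiplier = residual_multiplier
        self.self_attn = nn.Linear(hidden, hidden)   # stand-in for attention
        self.mlp = nn.Linear(hidden, hidden)         # stand-in for the MLP

    def forward(self, hidden_states):
        residual = hidden_states
        hidden_states = self.self_attn(hidden_states)
        hidden_states = residual + hidden_states * self.residual_multiplier

        residual = hidden_states
        hidden_states = self.mlp(hidden_states)
        hidden_states = residual + hidden_states * self.residual_multiplier
        return hidden_states

x = torch.randn(2, 8, 64)
print(ToyGraniteDecoderLayer()(x).shape)   # torch.Size([2, 8, 64])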

use_cache=use_cache,
padding_mask=padding_mask,
position_embeddings = position_embeddings,
_flag_for_generation=True,
@danielhanchen (Contributor):

I don't think flagging it for generation is a good idea - we have to set this dynamically

@Datta0 (Contributor, Author):

Oh, this is inspired by gemma2. Should we set it to what we see in the config?
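As a purely illustrative sketch of "setting this dynamically" (the heuristic below is an assumption, not Unsloth's actual mechanism): derive the flag from the runtime state at call time instead of hard-coding it in the patched call.

import torch.nn as nn

def should_flag_for_generation(module: nn.Module, use_cache: bool) -> bool:
    # Assumption for illustration: "generation" roughly means eval mode with a KV cache.
    return (not module.training) and bool(use_cache)

layer = nn.Linear(4, 4).eval()
print(should_flag_for_generation(layer, use_cache=True))   # True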

Vn = self.paged_attention_V[:kv_seq_len].permute(1, 2, 0, 3)

# Handle sliding windows
sliding_window = self.config.sliding_window if hasattr(self.config, "sliding_window") else self.config.max_position_embeddings
@danielhanchen (Contributor):

Is SWA necessary in Granite?
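For reference, a small sketch of what the fallback on the quoted line does when the config has no sliding_window (Granite's case): the window collapses to max_position_embeddings, i.e. full attention. The config object below is a mock:

from types import SimpleNamespace

config = SimpleNamespace(max_position_embeddings=4096)  # no sliding_window attribute
sliding_window = config.sliding_window if hasattr(config, "sliding_window") else config.max_position_embeddings
print(sliding_window)  # 4096 -> effectively no sliding window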

do_prefill = not hasattr(decoder_layer.self_attn, "paged_attention"),
position_embeddings = position_embeddings,
)
hidden_states = residual + hidden_states * self.config.residual_multiplier
@danielhanchen (Contributor):

Technically we could use addmm to fuse this entirely into 1 op

@Datta0 (Contributor, Author):

I resorted to using torch.add because we don't have any matmul here. Thanks for the suggestion :)
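A minimal sketch of the fused version being discussed, assuming torch.add's alpha argument is what was meant by fusing the scale-and-add into one op (addmm does not apply since there is no matmul; the multiplier value is illustrative):

import torch

residual      = torch.randn(2, 16, 512)
hidden_states = torch.randn(2, 16, 512)
residual_multiplier = 0.22   # illustrative, not Granite's actual config value

# torch.add(input, other, alpha=s) computes input + s * other in one elementwise kernel.
fused   = torch.add(residual, hidden_states, alpha=residual_multiplier)
unfused = residual + hidden_states * residual_multiplier
assert torch.allclose(fused, unfused)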



@staticmethod
def post_patch(model):
@danielhanchen (Contributor):

I think we can ignore this (if it's a copy from Llama) - it should be auto-inherited (I think)

@Datta0 (Contributor, Author):

Correct me if I'm wrong, but wouldn't tied word embeddings mandate handling this separately?
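A hedged illustration of the tie_word_embeddings concern (toy sizes, not Granite's real dimensions): when tied, lm_head.weight is the same tensor as embed_tokens.weight, so a post_patch that replaces or re-casts one has to keep the other in sync.

import torch.nn as nn

vocab_size, hidden = 32000, 1024
embed_tokens = nn.Embedding(vocab_size, hidden)
lm_head = nn.Linear(hidden, vocab_size, bias=False)
lm_head.weight = embed_tokens.weight   # weight tying

# Same underlying storage: patching one silently changes the other.
assert lm_head.weight.data_ptr() == embed_tokens.weight.data_ptr()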

@@ -617,6 +617,7 @@ def LlamaModel_fast_forward(
IS_GEMMA = self.config.model_type.startswith("gemma")
IS_GEMMA2 = self.config.model_type.startswith("gemma2")
IS_COHERE = self.config.model_type.startswith("cohere")
IS_GRANITE = self.config.model_type.startswith("granite")
@danielhanchen (Contributor):

Fix up spacing to make all the equal signs spaced evenly :)

@@ -763,6 +766,12 @@ def LlamaModel_fast_forward(
pass
pass


if IS_GRANITE:
@danielhanchen (Contributor):

So this is strictly necessary?

@Datta0 (Contributor, Author):

Yeah, IIRC Granite's forward calculates it here and passes it on, and without this it throws an error (I don't remember exactly which error, unfortunately).

@@ -974,6 +986,9 @@ def _CausalLM_fast_forward(
loss = None
logit_softcapping = getattr(self.config, "final_logit_softcapping", 0)
logit_scaling = getattr(self.config, "logit_scale", 0)
if self.config.model_type == "granite":
# granite uses logits_scaling as the key and divides by the scale, unlike cohere
logit_scaling = 1 / getattr(self.config, "logits_scaling", 1)
@danielhanchen (Contributor):

Oh interesting - can you confirm it's not Cohere-type logit scaling? Thanks :)

@Datta0 (Contributor, Author):

  • Granite uses logits_scaling as the config key and divides by the scale, unlike Cohere.
  • Note that for Granite, logits_scaling is 16, while for Cohere, logit_scale is 0.125 (i.e. 1/8) in their respective configs.
  • granite config
  • cohere config
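To make the difference concrete, a small sketch using the values quoted above (Granite's logits_scaling = 16, Cohere's logit_scale = 0.125); the tensor shape is arbitrary:

import torch

logits = torch.randn(2, 8, 32000)

granite_logits_scaling = 16.0    # Granite divides by its scale
cohere_logit_scale     = 0.125   # Cohere multiplies by its scale

granite_logits = logits / granite_logits_scaling
cohere_logits  = logits * cohere_logit_scale

# Reusing the Cohere-style multiply path only needs the reciprocal, which is
# what logit_scaling = 1 / config.logits_scaling achieves in the diff above.
assert torch.allclose(granite_logits, logits * (1.0 / granite_logits_scaling))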
