Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

load GPT-J from HF #39

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Fireblossom
Copy link

load the GPT-J checkpoint newer version of transformers

@@ -89,6 +91,21 @@ def __init__(self, config, device=None):
**attn_config,
)

#check weights contiguous
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific reason why we need to check this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The weights for GPT-J's attention layers are not contiguous, which will raise Tensors must be contiguous in deepspeed.
I find a similar issue here

@CoEich
Copy link
Collaborator

CoEich commented Nov 28, 2022

Hi,

thx again for another PR. It would be nice to use the official HF version for MAGMA. However, last time we tried to implement this we noticed a slight difference in model outputs which we could not really get to the bottom of.

I'm very careful with these kinds of changes, so it would be great if you could compare the logits for some example inputs before/after your change.

Best,

Constantin

@Fireblossom
Copy link
Author

Fireblossom commented Nov 28, 2022

Hi,

thx again for another PR. It would be nice to use the official HF version for MAGMA. However, last time we tried to implement this we noticed a slight difference in model outputs which we could not really get to the bottom of.

I'm very careful with these kinds of changes, so it would be great if you could compare the logits for some example inputs before/after your change.

Best,

Constantin

Hi Constantin,

Thank you for reading my changes. Regarding the slight difference, I think it is hard for me to explain something from the example inputs/outputs right now.

However, a change in the structure of the model may be the cause.
In the old version HF, you used GPT-Neo to simulate GPT-J, GPTNeoMLP has no activation function.
in the new version HF, there is an activation function NewGELUActivation() added.

before:

(0): GPTNeoBlock(
      (ln_1): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
      (attn): GPTNeoAttention(
        (attention): GPTNeoSelfAttention(
          (attn_dropout): Dropout(p=0, inplace=False)
          (resid_dropout): Dropout(p=0, inplace=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (out_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
      )
      (mlp): Sequential(
        (0): GPTNeoMLP(
          (c_fc): Linear(in_features=4096, out_features=16384, bias=True)
          (c_proj): Linear(in_features=16384, out_features=4096, bias=True)
          (dropout): Dropout(p=0, inplace=False)
        )
        (1): Adapter(
          (adapter): Sequential(
            (0): Linear(in_features=4096, out_features=1024, bias=True)
            (1): ReLU()
            (2): Linear(in_features=1024, out_features=4096, bias=True)
          )
        )
      )
    )

at present (without adding adapters)
GPT-Neo:

(0): GPTNeoBlock(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPTNeoAttention(
          (attention): GPTNeoSelfAttention(
            (attn_dropout): Dropout(p=0.0, inplace=False)
            (resid_dropout): Dropout(p=0.0, inplace=False)
            (k_proj): Linear(in_features=768, out_features=768, bias=False)
            (v_proj): Linear(in_features=768, out_features=768, bias=False)
            (q_proj): Linear(in_features=768, out_features=768, bias=False)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPTNeoMLP(
          (c_fc): Linear(in_features=768, out_features=3072, bias=True)
          (c_proj): Linear(in_features=3072, out_features=768, bias=True)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.0, inplace=False)
        )
      )

GPT-J:

(0): GPTJBlock(
        (ln_1): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
        (attn): GPTJAttention(
          (attn_dropout): Dropout(p=0.0, inplace=False)
          (resid_dropout): Dropout(p=0.0, inplace=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (out_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): GPTJMLP(
          (fc_in): Linear(in_features=4096, out_features=16384, bias=True)
          (fc_out): Linear(in_features=16384, out_features=4096, bias=True)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.0, inplace=False)
        )
      )

Hope this helps you!

Best,

Changxu

@CoEich
Copy link
Collaborator

CoEich commented Dec 6, 2022

Hmm this puzzles me a bit, but in any case, unless consistent behavior with the old version is ensured (e.g. by checking that all the hidden states are the same for a couple of example inputs) I will not merge these changes.

Let me know if you manage to do it and thx for the effort.

Best,

Constantin

@nikisalli
Copy link

Hello, did you manage to make this work?

@Fireblossom
Copy link
Author

Hello, did you manage to make this work?

Hi,
I can confirm that the modification itself is runnable and can be fine-tuned by the same method as the MAGMA paper.
As discussed above, the modified branch may have inconsistent output with the checkpoint provided by the repo, so this PR will not be merged.
But I don't have a lot of time to dive into this right now. Maybe I will do it later.

@nikisalli
Copy link

Hi, thank you for the fast answer!
do you have a working checkpoint? the default one has some dimensionality differences and I'd avoid to copy and paste the tensors by hand. Can you upload it somewhere?

@Fireblossom
Copy link
Author

Hi, thank you for the fast answer! do you have a working checkpoint? the default one has some dimensionality differences and I'd avoid to copy and paste the tensors by hand. Can you upload it somewhere?

The data I use for fine-tuning is in a completely different domain so I'm afraid my checkpoint can't meet your needs right now.

@nikisalli
Copy link

ah, ok, thank you anyway

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants