the code of Differential Transformers training #1663
Comments
Hi @mucunxie, the basic training code for DIFF is similar to the code provided at https://aka.ms/yoco; you can make a few changes and merge the DIFF code into it. You can also use other open-source training frameworks and plug DIFF into them by changing a few lines.
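For readers plugging DIFF into an existing attention module, here is a minimal sketch of the core differential attention computation, assuming the formulation from the DIFF Transformer paper. Tensor names are illustrative, and the causal mask, per-head normalization, and output scaling are omitted, so treat this as an illustration rather than the reference implementation.

```python
# Minimal sketch of differential attention: two softmax attention maps are
# computed from two groups of queries/keys and subtracted with a weight lambda.
import math
import torch
import torch.nn.functional as F

def diff_attention(q1, q2, k1, k2, v, lam):
    # q1, q2, k1, k2: (batch, heads, seq, head_dim)
    # v:              (batch, heads, seq, 2 * head_dim)
    # lam:            scalar lambda (learned in the real model)
    scale = 1.0 / math.sqrt(q1.size(-1))
    a1 = F.softmax(q1 @ k1.transpose(-1, -2) * scale, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-1, -2) * scale, dim=-1)
    # Differential attention: subtract the two maps, then attend to v.
    # The full model additionally applies per-head RMSNorm and a (1 - lambda_init) scale.
    return (a1 - lam * a2) @ v
```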
Hello @YTianZHU, I tried implementing a Diff Transformer using HuggingFace, but I can't reproduce your experimental results. In my experiment, the ordinary Transformer performed better than the Diff Transformer after training on about 10B tokens, and adding RMSNorm doesn't seem to bring much benefit. I see that Kaiming initialization was used in the repository history, while HuggingFace modeling code is mostly initialized with normal_. I'm not sure whether the gap is caused by initialization. Is there anything I have overlooked that could explain the poor performance? Here is the code I have modified.
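For context, the two initialization schemes mentioned above differ roughly as sketched below; the layer and constants are hypothetical and only illustrate the distinction between Kaiming and normal_ initialization.

```python
# Hypothetical projection layer used only to show the two init styles.
import torch.nn as nn

proj = nn.Linear(1024, 1024, bias=False)

# Kaiming initialization (as seen in the repository history):
nn.init.kaiming_normal_(proj.weight, nonlinearity="linear")

# Typical HuggingFace-style initialization (e.g. initializer_range = 0.02):
nn.init.normal_(proj.weight, mean=0.0, std=0.02)
```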
@Adamyangs Hi, it seems you use half the head dimension for Diff. We suggest using the same head dimension but half the number of heads. For example, if a Transformer has 16 heads with a head dimension of 128, the corresponding Diff Transformer has 8 heads, with a head dimension of 128 for q/k and 256 for v.
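The arithmetic behind the suggested configuration, written out as a small sketch (variable names are illustrative):

```python
# Baseline Transformer: 16 heads, head dimension 128.
d_model = 2048
baseline_heads, baseline_head_dim = 16, 128
assert baseline_heads * baseline_head_dim == d_model

# Corresponding Diff Transformer: half the heads, same q/k head dimension,
# doubled v head dimension.
diff_heads = baseline_heads // 2          # 8
diff_qk_head_dim = baseline_head_dim      # 128
diff_v_head_dim = 2 * baseline_head_dim   # 256
assert diff_heads * diff_v_head_dim == d_model  # v projection still maps to d_model
```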
Thank you for your response. My setting still seems like a fair way to compare. If the Diff model only surpasses a regular Transformer when both have the same number of heads, could it be that this hyperparameter setting is simply less suited to the regular Transformer and more favorable to the Diff Transformer? I would be interested to hear your thoughts on this. Additionally, I have a question about learning lambda. You designed a fairly involved lambda update rule, but updating lambda does not appear to affect the loss much. Have you run any ablation experiments on this? Finally, I found an issue when implementing the setting you described: with GQA, if the number of kv heads stays the same, the Diff Transformer ends up with a larger KV cache. Should the number of kv heads therefore be halved in the experiment?
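For reference, the "lambda update rule" discussed here corresponds to the reparameterization described in the DIFF Transformer paper, where lambda is re-derived from learnable vectors at every step rather than learned as a raw scalar. The sketch below assumes that formulation; parameter names and the initialization constants should be checked against the paper and the official code.

```python
# Sketch of the lambda reparameterization: lambda = exp(lq1·lk1) - exp(lq2·lk2) + lambda_init
import math
import torch
import torch.nn as nn

def lambda_init_fn(depth):
    # Depth-dependent initialization reported in the paper: 0.8 - 0.6 * exp(-0.3 * depth)
    return 0.8 - 0.6 * math.exp(-0.3 * depth)

class DiffLambda(nn.Module):
    def __init__(self, head_dim, depth):
        super().__init__()
        self.lambda_init = lambda_init_fn(depth)
        # Four learnable vectors; the scalar lambda is recomputed from them each forward pass.
        self.lambda_q1 = nn.Parameter(torch.randn(head_dim) * 0.1)
        self.lambda_k1 = nn.Parameter(torch.randn(head_dim) * 0.1)
        self.lambda_q2 = nn.Parameter(torch.randn(head_dim) * 0.1)
        self.lambda_k2 = nn.Parameter(torch.randn(head_dim) * 0.1)

    def forward(self):
        lam_1 = torch.exp(torch.sum(self.lambda_q1 * self.lambda_k1))
        lam_2 = torch.exp(torch.sum(self.lambda_q2 * self.lambda_k2))
        return lam_1 - lam_2 + self.lambda_init
```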
@Adamyangs, Hi,
Hi, @YTianZHU |
Model I am using (UniLM, MiniLM, LayoutLM ...): Differential Transformers
Thank you for your work! Can you provide the code for Differential Transformers training?