Error when converting NMT model with ALiBi or RoPe #1657
Comments
Can you provide more detail on how to run the converter?
ct2-opennmt-py-converter --model_path test_step_100.pt --output_dir test
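For reference, the same conversion can also be driven from Python; a minimal sketch, assuming the standard CTranslate2 converter API (class name and arguments may differ slightly depending on your version):

```python
from ctranslate2.converters import OpenNMTPyConverter

# Same conversion as the CLI call above, via the Python API.
converter = OpenNMTPyConverter("test_step_100.pt")
converter.convert("test", force=True)  # force=True overwrites an existing output directory
```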
I faced the same problem. Unfortunately, python/ctranslate2/converters/opennmt_py.py currently only supports ALiBi and RoPE for decoder_type == "transformer_lm" (LLMs), not for seq2seq models. There is also no support for the gated-gelu activation (that part is literally a two-line addition, whereas ALiBi or RoPE is much harder to add, in my opinion). In my experiments, a seq2seq model combining RoPE and gated-gelu converges much faster when trained with OpenNMT-py, and the training speed (tok/sec) is also about 10% higher than with relative positional embeddings. If someone could add seq2seq support for ALiBi, RoPE and gated-gelu to python/ctranslate2/converters/opennmt_py.py, it would make it possible to speed up inference for noticeably higher-quality translation models.
It should not be that difficult to add RoPE / ALiBi to the encoder, but do you have numbers to support the benefit? If you use "silu" it uses a gated FFN, so it should be supported out of the box, no? I'll add them.
Here are more specific numbers and observations:
Here is a description of the models in the screenshots:
Based on the above experiments, I think the optimal combination is RoPE + gated-gelu, in terms of both convergence and training/inference speed (tok/s). Also, as far as I understand, RoPE embeddings and GLU-style activations are now effectively a standard and are used in the most modern models. Edit: fixed mistakes. I'll also add that I tried the silu activation, but it turned out worse than gelu or relu; I assume there is also a beta parameter that needs tuning, and silu in this case is more suitable for converting existing LLMs that already use a silu-type activation (SwiGLU).
There is no beta for silu. There is a RoPE theta, but it should not impact performance in NMT since it is length related.
I apologize, I got a little mixed up here. By beta I meant the coefficient of the Swish function; with beta = 1 it is equal to the SiLU function.
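Just to make the relationship concrete, a tiny sketch of the two definitions (plain Python, for illustration only):

```python
import math

def swish(x, beta=1.0):
    # Swish(x) = x * sigmoid(beta * x)
    return x / (1.0 + math.exp(-beta * x))

def silu(x):
    # SiLU is exactly Swish with beta = 1
    return swish(x, beta=1.0)

assert abs(silu(0.7) - swish(0.7, beta=1.0)) < 1e-12
```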
Everything is correct here: GLU increases the number of parameters of the FF layer by 50%. I took this into account and reduced the FF layer size to compensate (see the quick arithmetic below). I relied on this paper: GLU Variants Improve Transformer.
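To make the compensation concrete, here is the parameter-count arithmetic (a quick check assuming bias-free FFN weight matrices, as in the GLU Variants paper):

```python
d_model = 512

# Standard FFN: two weight matrices (d_model x d_ff and d_ff x d_model).
standard_ffn_params = 2 * d_model * 3072   # 3,145,728

# Gated (GLU) FFN: three matrices, i.e. ~50% more parameters at equal d_ff,
# so shrinking d_ff to 2/3 of its size (3072 -> 2048) keeps the total unchanged.
gated_ffn_params = 3 * d_model * 2048      # 3,145,728

assert standard_ffn_params == gated_ffn_params
```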
Can you test converting your onmt-py model with #1687 (the RoPE one with gated-gelu) and tell me if it works?
So did you try "silu" from the onmt-py repo (hence gated silu) or basic silu?
I tried to convert the micro model, but got an error; maybe I need to add gated-gelu to SUPPORTED_ACTIVATIONS.
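For reference, a rough sketch of the kind of change I mean; the dictionary name and the Activation enum members are assumptions based on the current converter, not the actual code of the PR:

```python
# python/ctranslate2/converters/opennmt_py.py (sketch only)
from ctranslate2.specs import common_spec

_SUPPORTED_ACTIVATIONS = {
    "gelu": common_spec.Activation.GELU,
    "relu": common_spec.Activation.RELU,
    "silu": common_spec.Activation.SWISH,
    # Possible addition: map OpenNMT-py's gated-gelu onto GELU here and
    # enable the gated-FFN (GLU) flag when building the model spec.
    "gated-gelu": common_spec.Activation.GELU,
}
```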
Now a new error has appeared:
I try "silu" of onmt-py repo (hence gated silu) |
This one does not make sense. Check your /mnt/DeepLearning/Locomotive/venv/lib/python3.11/site-packages/ctranslate2/specs/attention_spec.py file.
I think I got it: I just added the two files from the PR to my venv, but my version of CTranslate2 is pretty far behind.
I updated CTranslate2 and the conversion went through without errors. However, I cannot yet confirm that a fully converted model works, since I only tested it on a "100-step toy model." I removed the weights of the previous model with RoPE + gated-gelu and put the model with RPE + gated-gelu (purple in the graphs above) up for long-term training, since I did not expect this feature to be implemented so quickly. Thank you!
I trained a Gelu and a Gated-Gelu "base transformer" and I am not getting the same training curve as yours. As for validation, there is a small improvement. Here are some logs: Gated-Gelu (with RoPE), ffn 512-2048
And the non-gated Gelu (512-3072)
Yes, I think you're right! The models on which I saw such a difference differ quite significantly from the standard Transformer base.
For myself, I discovered that deepening the model while simultaneously widening the FF layer improves translation (there are many preprints on arXiv supporting this). For the Transformer base, the maximum I was able to get was about 28 BLEU for the EN-RU pair at 150,000 steps, while the new deep model with a wide gated-gelu FF layer already reaches about 31.5 BLEU at 18,000 steps. A few notes about the BLEU chart: the Mar-12 and Apr-17 models have 330M parameters each; the rest have 457M parameters.
Hello, thank you for a great project!
I am getting this error when using ALiBi or RoPE positional encoding in a transformer NMT model from OpenNMT-py:
KeyError: 'encoder.embeddings.make_embedding.pe.pe'
Absolute and relative positional encodings are working with the converter script.
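For context, the KeyError seems to come from the converter looking up a sinusoidal position-encoding tensor that simply does not exist in a RoPE/ALiBi checkpoint. A minimal, purely illustrative sketch of the situation (the dict and keys below are examples, not the actual converter code):

```python
# "variables" stands in for the flattened OpenNMT-py checkpoint; with RoPE or
# ALiBi enabled there is no sinusoidal PE tensor stored under the ".pe.pe" key.
variables = {"encoder.embeddings.make_embedding.emb_luts.0.weight": "..."}  # example key only

pe_key = "encoder.embeddings.make_embedding.pe.pe"
if pe_key in variables:
    print("absolute/sinusoidal position encodings found; copy them into the model spec")
else:
    print("no PE table (RoPE/ALiBi); the converter must skip this lookup and set the spec option instead")
```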