"▁" character can be separated when using BPE-dropout #67
Comments
Second this.
Hi! Could you describe in more detail why these symbols should be merged with higher priority?
@xbelonogov I think @TIXFeniks refers to the special token '▁' that marks word boundaries, not the underscore '_'.
Yes, I also meant this special token '▁'. (Edited the previous comment.)
@xbelonogov I think '▁' should not be a token on its own but should always be attached to another token to indicate that it's a subword, no?
It is not obvious to me.
I'm not 100% clear on how BPE is implemented in YTTM, but let's take subword-nmt as an example. In subword-nmt, the word separator character (usually the space " ") is not considered part of the training data when learning BPE; only pairs of symbols within a word are examined. If a word is split into BPE subwords, the special marker "@@" is appended to the end of each non-final subword. For example, "hello" --> "he@@ llo". When applying BPE-dropout, it could become something like "h@@ e@@ ll@@ o". The marker "@@" itself can never be a token on its own; it is always appended to real tokens to indicate how to merge the subwords back together later. In this case, '▁' should behave similarly. If '▁' is a separate token, an NMT model can mistakenly learn to generate '▁' on its own, and even after merging subwords, stray spaces remain inside words. For example, my Slovak-English model generates this sentence: When I was 11 , I remember one mor ning w ak ing up the j oy ful sound s in my house .
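To make the failure mode concrete, here is a minimal sketch (mine, not from the thread) of '▁'-style detokenization; it shows how a standalone '▁' token turns into a stray space inside a word:

```python
# Detokenization for '▁'-style BPE: concatenate the tokens, then
# replace every '▁' with a space.

def detokenize(tokens):
    """Join '▁'-marked subwords back into plain text."""
    return "".join(tokens).replace("▁", " ").strip()

# Normal BPE-dropout output: '▁' stays attached to the first subword.
print(detokenize(["▁he", "ll", "o", "▁wor", "ld"]))  # -> "hello world"

# Buggy output: '▁' was split off as its own token, so an extra space
# appears inside the word, like "mor ning" in the example above.
print(detokenize(["▁mor", "▁", "ning"]))             # -> "mor ning"
```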
YTTM is very similar to subword-nmt. Subword-nmt creates two tokens for each character of the alphabet: the original one and one carrying the word-boundary marker. The way word splitting works is equivalent.
Regarding your problem: can you check how often this special token occurs alone in your training data? I just checked on an English dataset, and it occurs on average once in 50 sentences, so I think this should not affect performance. You can also explore the YTTM model's vocabulary directly; it's easy to see that tokens like this one have low frequency.
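The exact command was lost in extraction, but the same check can be run from Python. A sketch assuming the youtokentome package and its documented `BPE.encode` signature (`output_type`, `dropout_prob`); the file paths and the dropout value of 0.1 are placeholders:

```python
# Count how often '▁' appears as a standalone token when encoding
# the training data with BPE-dropout enabled.

import youtokentome as yttm

bpe = yttm.BPE(model="model.bpe")  # placeholder model path

standalone = total = 0
with open("train.txt") as f:       # placeholder training corpus
    for line in f:
        tokens = bpe.encode([line.strip()],
                            output_type=yttm.OutputType.SUBWORD,
                            dropout_prob=0.1)[0]
        standalone += tokens.count("▁")
        total += len(tokens)

print(f"standalone '▁': {standalone} of {total} tokens")
```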
The solution could be an ability to disable dropout for such merges.
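Until such an option exists, one workaround (my sketch, not part of YTTM) is to post-process the encoded output and re-attach any standalone '▁' to the token that follows it:

```python
# Re-merge every standalone word-boundary marker with the next token.

def remerge_boundary(tokens, marker="▁"):
    """Attach every standalone marker token to the token after it."""
    out = []
    carry = ""
    for tok in tokens:
        if tok == marker:
            carry += marker          # hold the marker for the next token
        else:
            out.append(carry + tok)
            carry = ""
    if carry:                        # trailing marker with nothing after it
        out.append(carry)
    return out

print(remerge_boundary(["▁", "mor", "ning"]))  # -> ['▁mor', 'ning']
```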
@xbelonogov, what do you think about my suggestion from the previous message?
Hi, @TIXFeniks. I asked Ivan Provilkov, but he isn't sure that this improves performance. If you have experiments that prove the effectiveness of this, I will change the default behaviour.
When BPE-dropout is enabled, the word-boundary character ('▁') can be separated from the first character of a word. This scenario is untested and could be harmful for model training.
Steps to reproduce:
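The original reproduction snippet did not survive extraction. A minimal sketch of what it likely looked like, assuming the youtokentome Python API (`BPE.train`, `BPE.encode`); paths and vocab size are placeholders:

```python
import youtokentome as yttm

# Train a small model on a placeholder corpus.
bpe = yttm.BPE.train(data="train.txt", model="model.bpe", vocab_size=2000)

# With dropout_prob=1 every merge is dropped, so the output is fully
# split into characters and '▁' reliably shows up as its own token.
print(bpe.encode(["hello world"],
                 output_type=yttm.OutputType.SUBWORD,
                 dropout_prob=1.0))
# e.g. [['▁', 'h', 'e', 'l', 'l', 'o', '▁', 'w', 'o', 'r', 'l', 'd']]
```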
(This also happens with dropout_prob < 1; I used 1 just to make it reproducible.)
Perhaps this behavior could be controlled by a flag to always merge '▁' and the next token?