Hi everyone, thanks for all of your amazing work on Marian :).
I'm opening this discussion (and hope it's in the right place) because I'm a little confused by the alignment functionality of the Marian transformer.
So far, my understanding is that if you want a transformer model that produces alignments, you need to precompute alignments on your training corpus and feed them in with the --guided-alignment parameter.
However, because the transformer is fed subtokens (produced by SentencePiece or another tokenizer), I am assuming that the precomputed alignments NEED to be based on those subtokens.
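For instance, here is an illustrative (entirely made-up) triple of what I imagine one training example looks like at the subtoken level, with the alignment in the Pharaoh "i-j" format that fast_align emits:

```python
# Illustrative only: a source/target pair after SentencePiece encoding,
# plus a Pharaoh-format alignment over SUBTOKEN positions; the tokens
# here are made up for the example, not real model output.
src = "▁The ▁cat s ▁sleep"       # "The cats sleep"
tgt = "▁Les ▁chat s ▁dorment"    # "Les chats dorment"
alignment = "0-0 1-1 2-2 3-3"    # 0-based "srcIdx-tgtIdx" pairs
```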
Therefore, the training pipeline would also NEED to be the following (see the sketch after this list):
Train SentencePiece models on your source and target training corpora
Pre-tokenize these two corpora
Get subtoken alignments (with fast_align, for example)
Create the vocab.yml files from the spm.vocab files (produced when training the SPM models) for both source and target
Finally, train a model using:
the pre-tokenized corpora
the pre-built vocab.yml files for both source and target
the source-target alignment file
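To make that concrete, here is a rough sketch of the pipeline I have in mind, wired together in Python. All file names and most flag values are placeholders, I'm assuming spm_train, spm_encode, fast_align and marian are on PATH, and the vocab conversion in step 4 is my guess at the format, not a tested converter:

```python
import subprocess

# 1. Train SentencePiece models on the raw source and target corpora.
for side in ("src", "tgt"):
    subprocess.run(
        ["spm_train", f"--input=corpus.{side}",
         f"--model_prefix=spm.{side}", "--vocab_size=32000"],
        check=True)

# 2. Pre-tokenize both sides with the trained models.
for side in ("src", "tgt"):
    with open(f"corpus.{side}") as fi, \
            open(f"corpus.spm.{side}", "w") as fo:
        subprocess.run(["spm_encode", f"--model=spm.{side}.model"],
                       stdin=fi, stdout=fo, check=True)

# 3. Build fast_align's "src ||| tgt" input and align at the subtoken level.
with open("corpus.spm.src") as fs, open("corpus.spm.tgt") as ft, \
        open("corpus.fa-input", "w") as fo:
    for s, t in zip(fs, ft):
        fo.write(f"{s.rstrip()} ||| {t.rstrip()}\n")
with open("corpus.align", "w") as fo:
    subprocess.run(["fast_align", "-i", "corpus.fa-input", "-d", "-o", "-v"],
                   stdout=fo, check=True)

# 4. Turn each spm .vocab file ("piece<TAB>score" per line) into a Marian
#    vocab.yml ("token: id" per line). Quoting is simplified; a real
#    converter would need proper YAML escaping for special pieces.
for side in ("src", "tgt"):
    with open(f"spm.{side}.vocab") as fi, \
            open(f"vocab.{side}.yml", "w") as fo:
        for idx, line in enumerate(fi):
            piece = line.split("\t")[0]
            fo.write(f'"{piece}": {idx}\n')

# 5. Train with guided alignment on the pre-tokenized corpus
#    (other required marian options omitted for brevity).
subprocess.run(
    ["marian",
     "--train-sets", "corpus.spm.src", "corpus.spm.tgt",
     "--vocabs", "vocab.src.yml", "vocab.tgt.yml",
     "--guided-alignment", "corpus.align"],
    check=True)
```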
Am I correct? Or is there a less cumbersome pipeline, or one that would work with word-level alignments only?
As for word-level alignment, I guess it is simply impossible to get, for the following reasons:
"Word" is a difficult, language-dependent notion. (SentencePiece subtokens, on the other hand, just represent sequences of characters, with or without spacing, which is language-independent.)
The transformer is trained to encode and decode subtokens, so how could we get back to a word-level alignment without a complicated, language-dependent script?
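To show what I mean, here is the kind of script I'm picturing (the helper names are mine, purely hypothetical): it collapses a subtoken alignment to "word" level by treating every "▁"-initial piece as the start of a new word. Even this toy version bakes in the assumption that "▁" coincides with word boundaries, i.e. that whitespace delimits words, which is exactly the language dependence I'm worried about:

```python
# Hypothetical sketch: project a subtoken-level alignment to word level,
# assuming every "▁"-initial SentencePiece piece starts a new word.
def to_word_ids(pieces):
    """Map each subtoken position to the index of the word containing it."""
    word_ids, word = [], -1
    for piece in pieces:
        if piece.startswith("▁"):
            word += 1
        word_ids.append(max(word, 0))  # guard: first piece without "▁"
    return word_ids

def project(src_pieces, tgt_pieces, subtok_align):
    """subtok_align: iterable of (src_idx, tgt_idx) pairs over subtokens."""
    s_ids, t_ids = to_word_ids(src_pieces), to_word_ids(tgt_pieces)
    return sorted({(s_ids[i], t_ids[j]) for i, j in subtok_align})

# With the illustrative tokens from above:
src = "▁The ▁cat s ▁sleep".split()
tgt = "▁Les ▁chat s ▁dorment".split()
print(project(src, tgt, [(0, 0), (1, 1), (2, 2), (3, 3)]))
# -> [(0, 0), (1, 1), (2, 2)]
```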
Thanks in advance for your help!