Hi everyone, thanks for all of your amazing work on Marian :).
I'm opening this discussion (and hope it's in the right place) because I'm a little confused by the alignment functionality of the Marian transformer.
So far, my understanding is that if you want a transformer model that produces alignments, you need to precompute alignments on your training corpus and feed them in with the --guided-alignment parameter.
However, because the transformer is fed subtokens (produced by SentencePiece or another tokenizer), I am assuming that the precomputed alignments NEED to be based on those subtokens.
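For instance, here is an illustrative (entirely made-up) triple of what I imagine one training example looks like at the subtoken level, with the alignment in the Pharaoh "i-j" format that fast_align emits:

```python
# Illustrative only: a source/target pair after SentencePiece encoding,
# plus a Pharaoh-format alignment over SUBTOKEN positions; the tokens
# here are made up for the example, not real model output.
src = "▁The ▁cat s ▁sleep"       # "The cats sleep"
tgt = "▁Les ▁chat s ▁dorment"    # "Les chats dorment"
alignment = "0-0 1-1 2-2 3-3"    # 0-based "srcIdx-tgtIdx" pairs
```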
Therefore, the training pipeline would also NEED to be the following (see the sketch after this list):
Train SentencePiece models on your source and target training corpora
Pre-tokenize these two corpora
Get subtoken alignments (with fast_align, for example)
Create the vocab.yml files from the spm.vocab files (produced when training the SPM models) for both source and target
Finally, train a model using:
the pre-tokenized corpora
the pre-built vocab.yml files for both source and target
the source-target alignment file
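To make that concrete, here is a rough sketch of the pipeline I have in mind, wired together in Python. All file names and most flag values are placeholders, I'm assuming spm_train, spm_encode, fast_align and marian are on PATH, and the vocab conversion in step 4 is my guess at the format, not a tested converter:

```python
import subprocess

# 1. Train SentencePiece models on the raw source and target corpora.
for side in ("src", "tgt"):
    subprocess.run(
        ["spm_train", f"--input=corpus.{side}",
         f"--model_prefix=spm.{side}", "--vocab_size=32000"],
        check=True)

# 2. Pre-tokenize both sides with the trained models.
for side in ("src", "tgt"):
    with open(f"corpus.{side}") as fi, \
            open(f"corpus.spm.{side}", "w") as fo:
        subprocess.run(["spm_encode", f"--model=spm.{side}.model"],
                       stdin=fi, stdout=fo, check=True)

# 3. Build fast_align's "src ||| tgt" input and align at the subtoken level.
with open("corpus.spm.src") as fs, open("corpus.spm.tgt") as ft, \
        open("corpus.fa-input", "w") as fo:
    for s, t in zip(fs, ft):
        fo.write(f"{s.rstrip()} ||| {t.rstrip()}\n")
with open("corpus.align", "w") as fo:
    subprocess.run(["fast_align", "-i", "corpus.fa-input", "-d", "-o", "-v"],
                   stdout=fo, check=True)

# 4. Turn each spm .vocab file ("piece<TAB>score" per line) into a Marian
#    vocab.yml ("token: id" per line). Quoting is simplified; a real
#    converter would need proper YAML escaping for special pieces.
for side in ("src", "tgt"):
    with open(f"spm.{side}.vocab") as fi, \
            open(f"vocab.{side}.yml", "w") as fo:
        for idx, line in enumerate(fi):
            piece = line.split("\t")[0]
            fo.write(f'"{piece}": {idx}\n')

# 5. Train with guided alignment on the pre-tokenized corpus
#    (other required marian options omitted for brevity).
subprocess.run(
    ["marian",
     "--train-sets", "corpus.spm.src", "corpus.spm.tgt",
     "--vocabs", "vocab.src.yml", "vocab.tgt.yml",
     "--guided-alignment", "corpus.align"],
    check=True)
```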
Am I correct? Or is there a less cumbersome pipeline, or one that would work with word-level alignments only?
As for word-level alignment, I guess it is simply impossible to get, for the following reasons:
"Word" is a difficult, language-dependent notion. (SentencePiece subtokens, on the other hand, just represent sequences of characters, with or without spacing, which is language-independent.)
The transformer is trained to encode and decode subtokens, so how could we get back to a word-level alignment without a complicated, language-dependent script?
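To show what I mean, here is the kind of script I'm picturing (the helper names are mine, purely hypothetical): it collapses a subtoken alignment to "word" level by treating every "▁"-initial piece as the start of a new word. Even this toy version bakes in the assumption that "▁" coincides with word boundaries, i.e. that whitespace delimits words, which is exactly the language dependence I'm worried about:

```python
# Hypothetical sketch: project a subtoken-level alignment to word level,
# assuming every "▁"-initial SentencePiece piece starts a new word.
def to_word_ids(pieces):
    """Map each subtoken position to the index of the word containing it."""
    word_ids, word = [], -1
    for piece in pieces:
        if piece.startswith("▁"):
            word += 1
        word_ids.append(max(word, 0))  # guard: first piece without "▁"
    return word_ids

def project(src_pieces, tgt_pieces, subtok_align):
    """subtok_align: iterable of (src_idx, tgt_idx) pairs over subtokens."""
    s_ids, t_ids = to_word_ids(src_pieces), to_word_ids(tgt_pieces)
    return sorted({(s_ids[i], t_ids[j]) for i, j in subtok_align})

# With the illustrative tokens from above:
src = "▁The ▁cat s ▁sleep".split()
tgt = "▁Les ▁chat s ▁dorment".split()
print(project(src, tgt, [(0, 0), (1, 1), (2, 2), (3, 3)]))
# -> [(0, 0), (1, 1), (2, 2)]
```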
Thanks in advance for your help!