
OpenKiwi always downloads the tokenizer files for XLMRoberta even if a local path is configured. #102

Open
yym6472 opened this issue Jun 17, 2021 · 2 comments
Labels: bug (Something isn't working), enhancement (New feature or request)


yym6472 commented Jun 17, 2021

While training the XLM-Roberta-based QE system, I pre-downloaded the pretrained XLM-Roberta model from Hugging Face and changed the field system.model.encoder.model_name in xlmroberta.yaml from the default xlm-roberta-base to the local path containing the pre-downloaded model. However, when running the code, I found that OpenKiwi always downloads the files config.json and sentencepiece.bpe.model instead of using the pre-downloaded ones directly.

I eventually found that this is caused by the following code in kiwi/systems/encoders/xlmroberta.py (lines 48–49):

        if tokenizer_name not in XLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST:
            tokenizer_name = 'xlm-roberta-base'

which means that if model_name is configured as a local path, it is rewritten to xlm-roberta-base. However, Bert and XLM do not have this issue. Is this a bug, or is it intentional?
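For context, a guard along these lines would keep the fallback for unrecognized names while letting local paths through. This is only a sketch, not OpenKiwi's actual code: resolve_tokenizer_name and KNOWN_CHECKPOINTS are illustrative names standing in for the logic around XLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST.

```python
import os

# Stand-in for XLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST (illustrative only).
KNOWN_CHECKPOINTS = ['xlm-roberta-base', 'xlm-roberta-large']


def resolve_tokenizer_name(tokenizer_name: str) -> str:
    """Return the name unchanged if it is a known hub checkpoint or an
    existing local directory; otherwise fall back to the default."""
    if tokenizer_name in KNOWN_CHECKPOINTS or os.path.isdir(tokenizer_name):
        return tokenizer_name  # use the checkpoint or local files as-is
    return 'xlm-roberta-base'  # unknown name: keep the old fallback
```

With a check like this, a configured local directory would no longer be silently replaced, while typos in checkpoint names would still fall back to the default.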

@captainvera captainvera self-assigned this Jul 20, 2021
@captainvera captainvera added the enhancement New feature or request label Jul 20, 2021
captainvera (Contributor) commented

Hello @yym6472,

Sorry it took me a while to respond.

I tested what you described, and you are indeed correct. This version of OpenKiwi was initially built when transformers was still around version 2 or 3, and the library still had a lot of quirks when working with models other than BERT and a couple of others.

This check was probably introduced to avoid bugs when loading fine-tuned versions of these models (like you're trying to do here), since we assumed the tokenizer wouldn't change.
(Could it be that the tokenizer didn't use to get saved with the model? I really can't recall anymore.)

So, with transformers now in a much more mature state, I'd classify this as a bug.
We would really appreciate it if you could open a PR to fix it :)


yym6472 commented Jul 21, 2021

Thanks for your reply! I have opened PR #105, which simply removes the two lines mentioned above.
