[BUG] tokenizer raises an error when text contains "<" #21

uloveqian2021 · 2023-05-30T11:17:24Z

Running the code below raises an error. Testing shows the cause is the "<" character in the text.
"""
from cpm_live.models import CPMBeeTorch, CPMBeeConfig
from cpm_live.tokenizers import CPMBeeTokenizer
config = CPMBeeConfig.from_json_file("config/cpm-bee-10b.json")
tokenizer = CPMBeeTokenizer()
print(tokenizer._special_tokens)
text = "if 成绩 < 60"
tokens = tokenizer.tokenize(text)
"""

File "text_generation.py", line 28, in
tokens = tokenizer.tokenize(text)
File "/root/CPM-Bee/src/cpm_live/tokenizers/bee.py", line 143, in tokenize
raise ValueError("Unexpected end of text {}".format(text))
ValueError: Unexpected end of text if 成绩 < 60

zh-zheng · 2023-05-30T12:56:03Z

Every < in the input needs to be replaced with << (except in special tokens such as <mask_0> and <sep>). We will update the README shortly.
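
The escaping rule above can be sketched as a small helper. This is a minimal illustration, not part of cpm_live: the function name `escape_cpm_bee_text` and the default `special_tokens` tuple are assumptions, and in practice the set of tokens to leave untouched would have to come from the tokenizer itself (the report above prints `tokenizer._special_tokens`).

```python
import re


def escape_cpm_bee_text(text, special_tokens=("<mask_0>", "<sep>")):
    """Double every '<' in text, leaving the listed special tokens untouched.

    Hypothetical helper: the default token list is only an example.
    """
    if not special_tokens:
        return text.replace("<", "<<")
    # Try longer tokens first so overlapping tokens match greedily.
    ordered = sorted(special_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(tok) for tok in ordered) + ")"
    # With one capturing group, re.split puts matched tokens at odd indices.
    parts = re.split(pattern, text)
    return "".join(
        part if index % 2 == 1 else part.replace("<", "<<")
        for index, part in enumerate(parts)
    )


print(escape_cpm_bee_text("if 成绩 < 60"))      # if 成绩 << 60
print(escape_cpm_bee_text("a <mask_0> b < c"))  # a <mask_0> b << c
```

The escaped string is what would then be passed to `tokenizer.tokenize`.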

yssAI · 2023-06-06T06:52:46Z

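Is this approach reasonable? < appears very often in code-related text, and escaping it changes the meaning. There is also HTML-like content such as "<body> </a></li>", which would require many rules to decide whether to escape.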

zh-zheng · 2023-06-07T02:37:09Z

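HTML like this can be escaped directly, with no case-by-case checks. What the model actually sees is the original <, not <<, so the meaning does not change.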
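
To illustrate why escaping does not change what the model sees, here is a toy scanner that follows the convention described in this thread. It is an assumption-laden sketch and NOT the actual code in cpm_live/tokenizers/bee.py: the name `toy_split` and the piece labels are made up for illustration. It treats << as one literal <, treats <name> as a special token, and raises the same kind of error as in the report when a lone < is never closed.

```python
def toy_split(text):
    """Toy scanner (not the real CPMBeeTokenizer).

    Returns a list of ("text", s) and ("special", s) pieces.
    """
    pieces = []
    buf = []
    i = 0
    while i < len(text):
        if text[i] != "<":
            buf.append(text[i])
            i += 1
        elif text.startswith("<<", i):
            # Escaped '<': the literal text gets a single '<'.
            buf.append("<")
            i += 2
        else:
            # A lone '<' must open a special token that ends with '>'.
            end = text.find(">", i)
            if end == -1:
                raise ValueError("Unexpected end of text {}".format(text))
            if buf:
                pieces.append(("text", "".join(buf)))
                buf = []
            pieces.append(("special", text[i:end + 1]))
            i = end + 1
    if buf:
        pieces.append(("text", "".join(buf)))
    return pieces


print(toy_split("if 成绩 << 60"))  # [('text', 'if 成绩 < 60')]
print(toy_split("<<body>"))        # [('text', '<body>')]
```

Under these assumptions, the unescaped input "if 成绩 < 60" fails because the scanner looks for a closing > that never comes, while the escaped form yields the original text.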

yqli2420 · 2023-06-12T02:14:17Z

If a left-shift operator << appears in code, does it need to be replaced with <<<<?

zh-zheng · 2023-06-13T05:04:43Z

Yes.
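
A short round-trip check for the left-shift case, assuming the input contains no special tokens so that a plain string replace is enough. The variable names are illustrative only.

```python
code = "y = x << 2"

# Each '<' is doubled, so the two-character operator becomes four characters.
escaped = code.replace("<", "<<")
print(escaped)  # y = x <<<< 2

# Collapsing each '<<' pair back to '<' recovers the original source.
restored = escaped.replace("<<", "<")
print(restored == code)  # True
```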

zh-zheng pinned this issue May 31, 2023

zh-zheng closed this as completed May 31, 2023

zh-zheng added the "question" label (Further information is requested) May 31, 2023
