[BUG] tokenizer raises an error when text contains "<" #21
Comments
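In the input, the |
Is this handling reasonable? "<" appears very frequently in code-related text, and escaping it changes the meaning. There is also HTML-like content such as "<body> </a></li>", which would need many rules to decide whether to escape. |
This HTML can simply be escaped without any checks; what the model actually sees is not |
If the left-shift operator << appears in code, should it be replaced with <<<<? |
Yes. |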
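The escaping rule discussed in the comments above can be sketched as follows. This is only a minimal illustration, assuming that doubling every "<" is all that is needed before tokenization; `escape_angle_brackets` is a hypothetical helper name, not a function confirmed to exist in cpm_live.

```python
def escape_angle_brackets(text: str) -> str:
    """Double every '<' so it is read as a literal character.

    Hypothetical helper: the name and the assumption that this is the
    only escaping required are illustrative, not taken from cpm_live.
    """
    return text.replace("<", "<<")


# The failing input from this issue, escaped before tokenization.
escaped = escape_angle_brackets("if 成绩 < 60")
print(escaped)

# A left-shift operator gets doubled as well, as confirmed in the thread.
shifted = escape_angle_brackets("x = a << 2")
print(shifted)
```

The escaped string, rather than the raw text, would then be passed to `tokenizer.tokenize`.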
Running the code below raises an error. Testing shows the cause is that the text contains "<".
"""
from cpm_live.models import CPMBeeTorch, CPMBeeConfig
from cpm_live.tokenizers import CPMBeeTokenizer
config = CPMBeeConfig.from_json_file("config/cpm-bee-10b.json")
tokenizer = CPMBeeTokenizer()
print(tokenizer._special_tokens)
text = "if 成绩 < 60"
tokens = tokenizer.tokenize(text)
"""
File "text_generation.py", line 28, in <module>
  tokens = tokenizer.tokenize(text)
File "/root/CPM-Bee/src/cpm_live/tokenizers/bee.py", line 143, in tokenize
  raise ValueError("Unexpected end of text {}".format(text))
ValueError: Unexpected end of text if 成绩 < 60