[BUG] Tokenizer raises an error when text contains "<" #21

Closed
uloveqian2021 opened this issue May 30, 2023 · 5 comments
Labels
question Further information is requested

Comments

@uloveqian2021

uloveqian2021 commented May 30, 2023

Running the code below raises an error. Testing shows the cause is that the text contains "<".
"""
from cpm_live.models import CPMBeeTorch, CPMBeeConfig
from cpm_live.tokenizers import CPMBeeTokenizer
config = CPMBeeConfig.from_json_file("config/cpm-bee-10b.json")
tokenizer = CPMBeeTokenizer()
print(tokenizer._special_tokens)
text = "if 成绩 < 60"
tokens = tokenizer.tokenize(text)
"""

  File "text_generation.py", line 28, in <module>
    tokens = tokenizer.tokenize(text)
  File "/root/CPM-Bee/src/cpm_live/tokenizers/bee.py", line 143, in tokenize
    raise ValueError("Unexpected end of text {}".format(text))
ValueError: Unexpected end of text if 成绩 < 60

@zh-zheng
Collaborator

Any < in the input needs to be replaced with << (except in special tokens such as <mask_0> and <sep>). We will update the README shortly.
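For illustration, the escaping rule described above could be applied with a small helper before calling `tokenizer.tokenize`. This is a minimal sketch, not code from the CPM-Bee repository: the name `escape_lt` is hypothetical, and it assumes the text being escaped is raw user content with no special tokens in it.

```python
def escape_lt(text: str) -> str:
    """Double every "<" so it is treated as a literal character.

    Hypothetical helper (not part of cpm_live). Assumes `text` is raw
    content that contains no special tokens such as <mask_0> or <sep>.
    """
    return text.replace("<", "<<")


raw_text = "if 成绩 < 60"
escaped = escape_lt(raw_text)
print(escaped)  # if 成绩 << 60
```

The escaped string is what would be passed to `tokenizer.tokenize`. Special tokens should be added after escaping, so that they are not doubled.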

@zh-zheng zh-zheng pinned this issue May 31, 2023
@zh-zheng zh-zheng added the question Further information is requested label May 31, 2023
@yssAI

yssAI commented Jun 6, 2023

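Is this approach reasonable? "<" appears very often in code, and escaping it changes the meaning. There is also HTML-like content such as "<body> </a></li>", which would require writing many rules to decide whether to escape.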

@zh-zheng
Collaborator

zh-zheng commented Jun 7, 2023

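> Is this approach reasonable? "<" appears very often in code, and escaping it changes the meaning. There is also HTML-like content such as "<body> </a></li>", which would require writing many rules to decide whether to escape.

This HTML can simply be escaped as-is, with no need for any checks. What the model actually sees is not << but the original <, so the meaning does not change.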

@yqli2420

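> > Is this approach reasonable? "<" appears very often in code, and escaping it changes the meaning. There is also HTML-like content such as "<body> </a></li>", which would require writing many rules to decide whether to escape.
>
> This HTML can simply be escaped as-is, with no need for any checks. What the model actually sees is not << but the original <, so the meaning does not change.

If a left-shift operator << appears in code, should it be replaced with <<<<?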

@zh-zheng
Collaborator

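> > > Is this approach reasonable? "<" appears very often in code, and escaping it changes the meaning. There is also HTML-like content such as "<body> </a></li>", which would require writing many rules to decide whether to escape.
> >
> > This HTML can simply be escaped as-is, with no need for any checks. What the model actually sees is not << but the original <, so the meaning does not change.
>
> If a left-shift operator << appears in code, should it be replaced with <<<<?

Yes.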
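As a sanity check on the rule discussed in this thread, the sketch below doubles every "<" in a few code and HTML samples and confirms that halving recovers the original text. Both `escape_lt` and `unescape_lt` are hypothetical helpers written for this illustration. They are not the tokenizer's implementation, and the round trip assumes the text contains no special tokens.

```python
def escape_lt(text: str) -> str:
    """Double every "<" (hypothetical helper, not part of cpm_live)."""
    return text.replace("<", "<<")


def unescape_lt(text: str) -> str:
    """Undo escape_lt by halving each run of "<".

    Only valid for output of escape_lt, where every run has even length.
    """
    return text.replace("<<", "<")


samples = [
    "x = 1 << 3",        # left shift becomes <<<<
    "<body> </a></li>",  # HTML is escaped without any extra rules
    "if a < b",
]

for sample in samples:
    escaped = escape_lt(sample)
    # The doubling is lossless: the original text is always recoverable.
    assert unescape_lt(escaped) == sample
    print(repr(sample), "->", repr(escaped))
```

Because the doubling is applied uniformly, no per-case rules are needed for operators or tags.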
