Loading fails after converting to sentencepiece #13

Open
yzlnew opened this issue Feb 4, 2024 · 4 comments


yzlnew commented Feb 4, 2024

After converting the model to an sp model with the class's convert_to_sentencepiece method, loading it fails with an error:

import sentencepiece as spm

sp_model = spm.SentencePieceProcessor()
sp_model.Load("sp.model")
libc++abi: terminating due to uncaught exception of type Darts::Details::Exception: /Users/runner/work/sentencepiece/sentencepiece/third_party/darts_clone/darts.h:1143: exception: failed to insert key: zero-length key

Related issue: google/sentencepiece#156

The model contains "\0". Should it be removed during conversion, and would removing it have any side effects?
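To confirm which pieces are involved, the converted file can be inspected without going through SentencePieceProcessor.Load, which is where the crash happens. The following is a minimal sketch, assuming the failure comes from a piece that is empty or contains a NUL character; that is a guess based on the "zero-length key" message, not something confirmed in this thread. The helper names find_suspect_pieces and load_pieces are hypothetical. load_pieces uses the sentencepiece_model_pb2 module shipped with the sentencepiece package and needs protobuf installed.

```python
import sys


def find_suspect_pieces(pieces):
    """Return (index, piece) pairs that are empty or contain a NUL character."""
    return [(i, p) for i, p in enumerate(pieces) if p == "" or "\0" in p]


def load_pieces(model_path):
    """Read the piece strings from a SentencePiece model file.

    This only parses the protobuf; it does not build the trie, so it should
    not hit the Darts exception raised by SentencePieceProcessor.Load.
    """
    # Imported lazily so the pure-Python check above works without sentencepiece.
    from sentencepiece import sentencepiece_model_pb2 as sp_pb2

    proto = sp_pb2.ModelProto()
    with open(model_path, "rb") as f:
        proto.ParseFromString(f.read())
    return [p.piece for p in proto.pieces]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python inspect_sp.py sp.model
    for index, piece in find_suspect_pieces(load_pieces(sys.argv[1])):
        print(index, repr(piece))
```

If this prints any entries, those are candidates for the key that the trie builder rejects. Whether dropping them at conversion time is safe depends on how the tokenizer uses them, which this sketch does not answer.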

Owner

bojone commented Feb 6, 2024

Could you share the model from before the conversion? Or provide a minimal reproduction?

Author

yzlnew commented Feb 7, 2024

@bojone Reproduced by following the README example. The model is here: https://microbin.yzlnew.com/upload/sloth-worm-falcon

from bytepiece import Tokenizer

tokenizer1 = Tokenizer('tokenizer_80k_small_isolated.model')
tokenizer1.convert_to_sentencepiece('sp.model')

import sentencepiece as spm
tokenizer2 = spm.SentencePieceProcessor("sp.model")

Owner

bojone commented Feb 9, 2024

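@yzlnew It looks like your model is not an ensure_unicode model? Only ensure_unicode models are guaranteed to convert to sentencepiece cleanly (in newer versions ensure_unicode is enabled by default; you can check).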

Author

yzlnew commented Feb 9, 2024

@bojone That's odd. This model was trained with 0.6.3, and it is an ensure_unicode model.
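One possible way to reconcile the two observations: a token can be valid UTF-8 and still contain a NUL byte, because b"\x00" decodes to U+0000. The sketch below illustrates that distinction on raw byte tokens. It assumes ensure_unicode means "every token is valid UTF-8", which is an interpretation rather than something stated in this thread, and the function names are hypothetical. How to extract the byte tokens from a bytepiece model is not shown.

```python
def is_valid_utf8(token: bytes) -> bool:
    """Return True if the byte string decodes as UTF-8."""
    try:
        token.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def classify(token: bytes) -> str:
    """Label a token as 'invalid-utf8', 'contains-nul', or 'ok'."""
    if not is_valid_utf8(token):
        return "invalid-utf8"
    if b"\x00" in token:
        # Valid UTF-8, but a C-style NUL-terminated string ends here.
        return "contains-nul"
    return "ok"


if __name__ == "__main__":
    for tok in [b"\x00", b"\xff", "你好".encode("utf-8")]:
        print(repr(tok), classify(tok))
```

Under that assumption, a model could pass a Unicode-validity check and still carry a "\0" token, which would be consistent with both comments above.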
