Error "Should encode value 65536 in one byte" #118

Open
svetlana21 opened this issue Nov 19, 2019 · 2 comments


svetlana21 commented Nov 19, 2019

Hello!
I ran into this error while training a tagger on part of the Taiga corpus of Russian (~1 GB of text): "An error occurred during model training: Should encode value 65536 in one byte!"

The quick question is: does UDPipe have any vocabulary size limitations?

The full story is:
I know about issue #53 and have tried everything suggested there: I have no tokens longer than 255 bytes; I have no dubious lemmas (the maximum number of forms for one lemma in my corpus is 158, which reflects the rich morphology of the language); and I set guesser_enrich_dictionary to 1. I also removed all sentences longer than 255 tokens. I still get the error.
The only thing that helped was reducing the corpus size to ~750 MB; at ~800 MB the error still occurs. I guessed that the problem lay in specific sentences (those in the difference between the two corpora), so I trained the tagger on just that difference and did not get the error. So, does UDPipe have vocabulary size limitations? Or is there some less obvious cause of the problem?
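
For anyone who wants to repeat the length checks described above, here is a minimal sketch. It assumes the training data is in CoNLL-U format (tab-separated columns, blank line between sentences). The function name `scan_conllu` is hypothetical, and the two 255 thresholds are simply the numbers quoted in this comment, not limits confirmed from the UDPipe source.

```python
def scan_conllu(lines, max_token_bytes=255, max_sentence_tokens=255):
    """Report over-long forms/lemmas and over-long sentences in CoNLL-U lines.

    The default thresholds are the values mentioned in this thread; they are
    assumptions, not documented UDPipe limits.
    """
    long_tokens = []     # (line_number, text, utf8_byte_length)
    long_sentences = []  # (line_number_of_first_token, token_count)
    count, start = 0, None

    for n, line in enumerate(lines, 1):
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if count > max_sentence_tokens:
                long_sentences.append((start, count))
            count, start = 0, None
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        if len(cols) < 3 or not cols[0].isdigit():
            continue  # skip multiword ranges ("1-2") and empty nodes ("1.1")
        if start is None:
            start = n
        count += 1
        # Measure length in UTF-8 bytes, not characters: Cyrillic letters
        # take two bytes each.
        for field in (cols[1], cols[2]):  # FORM and LEMMA columns
            size = len(field.encode("utf-8"))
            if size > max_token_bytes:
                long_tokens.append((n, field, size))

    # Handle a final sentence that is not followed by a blank line.
    if count > max_sentence_tokens:
        long_sentences.append((start, count))
    return long_tokens, long_sentences
```

To run it on a file, something like `scan_conllu(open("train.conllu", encoding="utf-8"))` should work, where `train.conllu` is a placeholder path. Because the file object is read line by line, the whole corpus does not need to fit in memory.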

Member

foxik commented Nov 24, 2019

Unfortunately, UDPipe has several limitations, stemming from the fact that we use a morphological library to represent the morphological vocabulary, and when one of these limits is hit, you do not get an informative message. However, you have already checked most of the known causes, so I am not sure what the problem could be. If the texts are public, you can send me a download link and I can take a look.

The new version being prepared will not have any such limitations, if it is any consolation ;-)

@jwijffels
Collaborator

@svetlana21 This might be the same error I had in #130.
After some debugging, it turned out that I had a lemma with too many possible word forms.
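
A rough way to look for such lemmas is sketched below. It assumes CoNLL-U input and counts distinct surface forms per lemma; the helper names `forms_per_lemma` and `top_lemmas` are hypothetical. How many forms count as "too many" is not stated in this thread, so the script only ranks lemmas and leaves the judgment to the reader.

```python
from collections import defaultdict


def forms_per_lemma(lines):
    """Map each lemma to the set of distinct word forms seen with it."""
    forms = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank separator or sentence-level comment
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 3 or not cols[0].isdigit():
            continue  # skip multiword ranges ("1-2") and empty nodes ("1.1")
        form, lemma = cols[1], cols[2]
        forms[lemma].add(form)
    return forms


def top_lemmas(lines, n=10):
    """Return up to n (form_count, lemma) pairs, largest counts first."""
    counts = forms_per_lemma(lines)
    ranked = sorted(((len(f), lemma) for lemma, f in counts.items()), reverse=True)
    return ranked[:n]
```

A lemma near the top with an implausibly large number of forms (for example a placeholder lemma such as `_` shared by many unrelated tokens) would be the first thing to inspect.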
