UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte #219

Open
Step2Web opened this issue Aug 21, 2022 · 3 comments


@Step2Web

Hi all, we need some help with an issue we've been seeing in our lupa-related code:

  File "lupa/_lupa.pyx", line 308, in lupa._lupa.LuaRuntime.execute
  File "lupa/_lupa.pyx", line 1324, in lupa._lupa.run_lua
  File "lupa/_lupa.pyx", line 1333, in lupa._lupa.call_lua
  File "lupa/_lupa.pyx", line 1358, in lupa._lupa.execute_lua_call
  File "lupa/_lupa.pyx", line 281, in lupa._lupa.LuaRuntime.reraise_on_exception
  File "lupa/_lupa.pyx", line 1496, in lupa._lupa.py_call_with_gil
  File "lupa/_lupa.pyx", line 1459, in lupa._lupa.call_python
  File "lupa/_lupa.pyx", line 1144, in lupa._lupa.py_from_lua
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

From what we have seen, this happens when the Lua code calls a Python function after having used string.sub on a string containing (byte 153) or "”" (byte 226).

I've built a minimal repo that reproduces the issue: https://github.com/Step2Web/lupa-encoding-issue

We'd greatly appreciate some help in how to resolve this. Thank you in advance and please let me know if there's any other information you'll need.

@Step2Web
Author

Quick update: I'm now convinced that this is actually a bug in how Lua handles Unicode characters. By using sub, we're splitting a multi-byte Unicode character, which breaks decoding in Python later on.

@Le0Developer
Contributor

Minimal reproduction: lua.eval('("’"):sub(1,1)')

@Le0Developer
Contributor

´ is a multi-byte character: in UTF-8 it is represented by several bytes.

string.sub works on bytes, not characters, so it returns a single literal byte, which by itself is not valid UTF-8 (neither byte 153 nor byte 226 is valid on its own).
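
To illustrate the point above without involving Lua at all, here is a minimal pure-Python sketch. It assumes the character in question is ’ (U+2019), as in the minimal reproduction; the helper name `decodes_as_utf8` is made up for this example. It shows that none of the three bytes of the encoded character can be decoded in isolation:

```python
def decodes_as_utf8(raw: bytes) -> bool:
    """Return True if raw is a complete, valid UTF-8 sequence."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


quote = "\u2019"                 # the right single quotation mark
encoded = quote.encode("utf-8")  # three bytes in UTF-8
byte_values = list(encoded)      # [226, 128, 153]

# Slice the encoded string one byte at a time, as a byte-oriented
# substring function would, and check each piece on its own.
single_bytes = [encoded[i:i + 1] for i in range(len(encoded))]
results = [decodes_as_utf8(b) for b in single_bytes]

print(byte_values)               # [226, 128, 153]
print(results)                   # [False, False, False]
print(decodes_as_utf8(encoded))  # True
```

Byte 128 (0x80) is the middle byte of this sequence, which matches the byte reported in the traceback.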

You can set encoding=None when creating the LuaRuntime to disable decoding and get bytes instead.
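
With encoding=None, strings coming from Lua arrive in Python as bytes, so the decoding step moves into your own code. A minimal sketch of one way to handle that, assuming the data is meant to be UTF-8 and that replacing broken fragments is acceptable (the helper name `decode_lua_bytes` is hypothetical, not part of lupa):

```python
def decode_lua_bytes(raw: bytes, errors: str = "replace") -> str:
    """Decode bytes received from Lua as UTF-8.

    With errors="replace", any byte that does not form a valid UTF-8
    sequence becomes U+FFFD instead of raising UnicodeDecodeError.
    """
    return raw.decode("utf-8", errors=errors)


# A complete character decodes normally.
whole = decode_lua_bytes("\u2019".encode("utf-8"))

# A lone byte from the middle of that character is replaced, not raised.
partial = decode_lua_bytes(b"\x80")

print(whole == "\u2019")    # True
print(partial == "\ufffd")  # True
```

Passing errors="strict" instead restores the original failing behaviour, which can be useful if silently replaced characters would hide a real problem.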
