-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pp_ser: handle unicode characters in the source file with Python 3 #249
base: master
Are you sure you want to change the base?
Conversation
Hi there, this is jenkins continuous integration... |
It turns out that both of the suggested approaches fail with older versions of Python3 (e.g. As a recap. Currently, we get the following when a source file contains Unicode characters: Traceback (most recent call last):
File "./serialbox/src/serialbox-python/pp_ser/pp_ser.py", line 1108, in <module>
ser.preprocess()
File "./serialbox/src/serialbox-python/pp_ser/pp_ser.py", line 950, in preprocess
output_file.write(to_ascii(self.__outputBuffer))
File "./serialbox/src/serialbox-python/pp_ser/pp_ser.py", line 53, in to_ascii
return bytes(text, 'ascii')
UnicodeEncodeError: 'ascii' codec can't encode character '\u2013' in position 9104: ordinal not in range(128) The original solution of this PR with Python 3.5.2 and Traceback (most recent call last):
File "./serialbox/src/serialbox-python/pp_ser/pp_ser.py", line 1101, in <module>
ser.preprocess()
File "./serialbox/src/serialbox-python/pp_ser/pp_ser.py", line 936, in preprocess
self.parse() # first pass, analyse only
File "./serialbox/src/serialbox-python/pp_ser/pp_ser.py", line 901, in parse
for line in input_file:
File "/sw/rhel6-x64/python/python-3.5.2-gcc49/lib/python3.5/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 887: ordinal not in range(128) The updated solution is to convert the input file to |
Hi @skosukhin, thanks for your contribution. And sorry that I didn't reply earlier. For contributions to a GridTools project we need you or your organization to sign a contributor license agreement with ETH. Probably the latter makes more sense, assuming we have more contributions in the future from different people. I'll check and come back to you. |
@havogt we need that for the os upgrade on Tsa (testing ongoing now) , would it be possible to go for an individual agreement first ? We were going to do it but jsut realised Sergey did it also. |
Sure, but not sure if that's actually faster, because @skosukhin probably needs approval from his employer anyway. |
Currently,
pp_ser
fails to process source files containing Unicode character with Python 3. For example:UnicodeEncodeError: 'ascii' codec can't encode character '\u2013' in position 604: ordinal not in range(128)
This is normally not a problem if we use Python 2. The reason is that in the case of Python 2 files are read and written using the same encoding
locale.getpreferredencoding()
(usuallyUTF-8
) but in the case of Python 3, strings that are read with thelocale.getpreferredencoding()
encoding are written usingascii
.One way to fix the problem is to replace
ascii
withlocale.getpreferredencoding()
inside theto_ascii
function (the name of the function is misleading in the case of Python2, therefore I would rename it toto_bytes
).The second approach,
which is implemented in this PR, is to eliminate the explicit conversion of strings. All we need to do is to open the temporary file in the text mode (the default on is'w+b'
).