Explicitly encode output as UTF-8 #237

nickjwhite · 2017-07-28T12:17:35Z

I think I've covered all the cases we need to, but it's possible there could be more.

This should address issues #197 and #10.

kba · 2017-09-29T14:27:01Z

These changes only affect the debug output. Python 3 should already do this, but it won't break with the change. LGTM. @zuphilip?

QuLogic · 2017-09-29T19:30:10Z

str + bytes does not work; this is broken on Python 3. print also takes str as input; if you encode it, it's going to call str() on it, meaning you get b'...'. I don't see why this is needed.
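To illustrate the two failure modes described above, here is a minimal Python 3 sketch. The variable names are made up for the example and are not taken from the pull request.

```python
# Python 3 only: illustrates the two problems with encoding before printing.
message = "line: "
text = "\u0113"  # LATIN SMALL LETTER E WITH MACRON

# 1. Concatenating str and bytes raises TypeError on Python 3.
try:
    combined = message + text.encode("utf-8")
except TypeError as exc:
    concat_error = exc
else:
    concat_error = None

# 2. print() calls str() on a bytes object, so the output is the
#    bytes repr (b'...') rather than the decoded character.
printed_form = str(text.encode("utf-8"))
print(printed_form)
```

On Python 2 neither problem shows up in the same way, because `str` there is already a byte string.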

nickjwhite · 2017-10-24T15:52:36Z

Thanks @QuLogic, I don't write a lot of Python, so I hadn't spotted the mixing of str + bytes.

I just updated the pull request, with a version that ensures nothing but str is in the print() arguments. And to force UTF-8, I switched to using codecs.getwriter('utf8') for stdout and stderr for all cases where they could output it. I also improved the commit message, so it's clearer what the problem is and why it's worth fixing.

I think I've covered all the cases we need to, but there could be more.

This is needed because, if the locale isn't set to UTF-8, an error of this form will result:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u0113' in position 60: ordinal not in range(128)

While the ideal solution is for the user to set their locale to UTF-8, it is better that we print debug output which may not be displayed correctly than that we output a fatal (and non-obvious) error, potentially some time into processing.

This also fixes some cases of implicitly combining str and *obj together when printing debug output, which fails with some Python versions, by explicitly using str.join(obj).
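The codecs.getwriter('utf8') approach can be sketched roughly as follows. This is an illustration under assumptions, not the code from the pull request: the helper name `utf8_writer` is hypothetical, and the `.buffer` lookup is added here so the same wrapper also works on Python 3, where sys.stdout is a text stream.

```python
import codecs
import io


def utf8_writer(stream):
    """Wrap a byte stream so that text written to it is encoded as UTF-8.

    Hypothetical helper for illustration only.
    """
    # On Python 3, sys.stdout/sys.stderr are text streams and the underlying
    # byte stream is exposed as .buffer; plain byte streams are used as-is.
    raw = getattr(stream, "buffer", stream)
    return codecs.getwriter("utf8")(raw)


# Demonstration with an in-memory byte stream instead of the real stdout.
sink = io.BytesIO()
out = utf8_writer(sink)
out.write(u"\u0113\n")  # encoded as UTF-8 regardless of the locale
encoded = sink.getvalue()
```

In a script this would be applied once near the top, for example `sys.stdout = utf8_writer(sys.stdout)`, and likewise for sys.stderr.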

nickjwhite · 2017-10-31T18:11:29Z

Just as an additional justification / use-case for this being necessary, I'm currently using Ocropus on a system (that I don't manage) which uses slurm, and no matter what I try I cannot persuade the slurm job manager on this system to not strip locale information. And again, to reiterate, UTF-8 is always the correct thing to output here, regardless of locale, and no fatal errors should be emitted just because of a weird locale.

QuLogic · 2017-10-31T19:10:36Z

Does setting PYTHONIOENCODING not help?
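As a rough sketch of what PYTHONIOENCODING does, the snippet below runs a child Python process with the variable set and a C locale, then captures its raw output. The helper name `run_child` is made up for this example; in practice the variable would simply be exported in the shell or job script.

```python
import os
import subprocess
import sys


def run_child(extra_env):
    """Run a child interpreter that prints one non-ASCII character."""
    env = dict(os.environ)
    env.update(extra_env)
    code = "print(u'\\u0113')"
    return subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )


# LC_ALL=C imitates an environment where the locale has been stripped.
result = run_child({"PYTHONIOENCODING": "utf-8", "LC_ALL": "C"})
child_output = result.stdout.strip()  # raw bytes written by the child
```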

nickjwhite · 2017-11-02T09:40:06Z

@QuLogic, yes, that should do the job as well. I had not found that environment variable before; thanks for the suggestion.

However, I think ocropy should automatically and always output UTF-8, as that's what it's dealing with for input and output, and to do otherwise risks unnecessary runtime crashes. Given the diverse environments Ocropus can run in, I think it's best to just enforce this in the program. After all, we have had bug reports from users (#10, #197), and frankly it took me ages to figure out where the issue was too, with the weird HFS system I need to use.

kba · 2017-12-08T16:38:19Z

As laid out in #197 there are a few options to achieve this. Revisiting the UTF8/debug output issues, I'm not keen on explicitly adding boilerplate to every file with print statements. Changing stdout/stderr encoding in a single place (like common.py) probably achieves the same goal but is error-prone.

I would prefer to replace print/print_info/print_error with a dedicated logging mechanism, such as Python's bundled logging module or something smaller in the codebase. That would make it easier to override behavior (such as output stream, encoding, logging level).
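One way the logging-based approach might look, as a minimal Python 3 sketch using the standard logging module. The function name `make_logger`, the logger name, and the message format are assumptions for illustration; nothing here is existing ocropy code.

```python
import io
import logging


def make_logger(name, stream, level=logging.INFO):
    """Create a logger that writes to the given text stream (hypothetical helper)."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False  # keep output confined to our handler
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    logger.addHandler(handler)
    return logger


# Demonstration: a UTF-8 text stream over an in-memory byte buffer.
raw = io.BytesIO()
text_stream = io.TextIOWrapper(raw, encoding="utf-8", newline="\n")
log = make_logger("ocropus-demo", text_stream)
log.info("recognized: %s", "\u0113")
log.debug("not shown at INFO level")
text_stream.flush()
logged = raw.getvalue()
```

In a real script the stream could instead be `io.TextIOWrapper(sys.stderr.buffer, encoding="utf-8")`, so the encoding, output stream, and level are all set in one place.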

nickjwhite force-pushed the safeprint branch 2 times, most recently from 4eddbff to 47d86c4 · October 24, 2017 15:49

nickjwhite force-pushed the safeprint branch from 47d86c4 to 2b97ad9 · October 24, 2017 16:05

nickjwhite force-pushed the safeprint branch from 2b97ad9 to c4ae4b2 · October 24, 2017 16:07

Explicitly encode output as UTF-8 for ocropus-errs and ocropus-econf

6c5784e

I accidentally missed these from the original commit (c4ae4b).

zuphilip added 🗣️ discussion 👾 invalid labels Dec 8, 2017
