Python3 compatibility #319

kba · 2019-01-03T17:56:38Z

This is another go to make most of the engine compatible with python3, because ocropy is the last python2 holdout in the open source ecosphere. Even with development effectively ceased, it's still widely used and should at least be forward-compatible with python 3.4+.

Basically, we don't want to keep supporting Python2 in our stack just for ocropy :-)

Uses six as a compatibility library to handle things like unicode/byte strings, urlopen, pickle.
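For readers unfamiliar with six, the kind of difference it papers over can be sketched with the standard library alone. This is not code from the PR: `text_type` and `to_text` are illustrative names, and the six spellings of the same ideas are `six.text_type` and `six.moves.urllib.request.urlopen`.

```python
import sys

# Pick the text type and the urlopen location per interpreter.
if sys.version_info[0] >= 3:
    text_type = str
    from urllib.request import urlopen  # Python 3 location
else:
    text_type = unicode  # noqa: F821 (name only exists on Python 2)
    from urllib2 import urlopen  # Python 2 location


def to_text(value, encoding="utf-8"):
    """Return `value` as a text string, decoding byte strings if needed."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return text_type(value)


print(to_text(b"caf\xc3\xa9") == u"caf\xe9")
```

A compatibility library mainly saves repeating this kind of version switch in every script.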

Updates the CircleCI configuration to the current format and tests across 2.7 and 3.4-3.7.

Since the test suite isn't that intuitive, I cannot guarantee I didn't break stuff.

I'm somewhat baffled that the default model (pickled with python2) would work for the python3 variant of rpred. I suspect an error in the setup, but see e.g. the output in https://circleci.com/gh/OCR-D/ocropy/38 (ctrl-f for # loading object ./models/en-default.pyrnn.gz). I remember @mittagessen predicting this to fail because of some of the old-style classes used, but it doesn't...
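As a rough illustration of loading a gzipped pickle on Python 3, here is a minimal sketch. It is an assumption, not the PR's actual loader: `load_gzipped_pickle` is a hypothetical helper, the dictionary stands in for a real model, and whether ocropy needs `encoding="latin1"` is not confirmed by this thread. The `encoding` argument only exists on Python 3.

```python
import gzip
import io
import pickle


def load_gzipped_pickle(fileobj):
    """Unpickle a gzip-compressed stream, tolerating Python 2 byte strings."""
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as gz:
        # encoding="latin1" maps Python 2 `str` bytes 1:1 onto code points,
        # so 8-bit strings inside old pickles do not raise UnicodeDecodeError.
        return pickle.load(gz, encoding="latin1")


# Build a stand-in for a model file; protocol 2 is the highest Python 2 knows.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    pickle.dump({"name": "demo", "weights": [0.1, 0.2]}, gz, protocol=2)
buf.seek(0)

model = load_gzipped_pickle(buf)
print(model["name"])
```

Using the gzip module on a file object also avoids shelling out to an external gunzip process.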

This also updates scipy, numpy, matplotlib and introduces imageio to work against some of the DeprecationWarnings that have turned into Errors in more recent versions of the libraries.

Some hacky type conversions (like converting boolean image arrays to float32) need further scrutiny.

Feedback is appreciated.

@tmbdev @zuphilip @syedsaqibbukhari @QuLogic @amitdo @mittagessen @wrznr @finkf


kba · 2019-01-04T08:46:23Z

Published this PR as a fork to pypi to test integrations. If anyone's interested, it can be installed with

pip install ocrd-fork-ocropy==1.4.0a3

mittagessen · 2019-01-07T10:07:42Z

I only ran a cursory test but it seems to work.

I'm fairly certain that unpickling old-style classes on python 3 didn't work at some point in the past but it evidently does now. You might want to change all the class definitions in lstm.py to new style as python 3 classes auto-inherit from object now.

amitdo · 2019-01-07T10:41:10Z

I'm fairly certain that unpickling old-style classes on python 3 didn't work at some point in the past but it evidently does now

https://bugs.python.org/issue5180

You might want to change all the class definitions in lstm.py to new style as python 3 classes auto-inherit from object now.

https://portingguide.readthedocs.io/en/latest/classes.html

mittagessen · 2019-01-07T15:21:05Z

https://portingguide.readthedocs.io/en/latest/classes.html

Sorry, I might've been a bit unclear. Old-style and new-style definitions for these classes are equivalent on py3 and updating these to the now common class foo(object) syntax is little more than a nice hint that the code has been touched in the last ten years. I doubt anybody has ever done multiple inheritance on these classes, so it shouldn't break anything on py2.7
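The equivalence on Python 3 is easy to check. The snippet below is only an illustration with made-up class names, not code from lstm.py; on Python 2 the first check would print False, because an old-style class has no base classes.

```python
class OldStyleSyntax:
    pass


class NewStyleSyntax(object):
    pass


# On Python 3 both spellings produce an ordinary class deriving from object.
print(OldStyleSyntax.__bases__ == (object,))
print(NewStyleSyntax.__bases__ == (object,))
print(type(OldStyleSyntax) is type(NewStyleSyntax))
```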

zuphilip · 2019-01-13T16:17:32Z

Hey @kba this looks very interesting! I agree that it would be good to make sure that ocropy also runs on python3. It seems that it is now easier to achieve python2 and python3 compatibility with libraries like six. Let me ask some general questions before looking deeper at the code:

Is it okay to look into this PR first, although it will create merge conflicts with possibly every other PR? (I think: yes)
Should we be concerned that @tmbdev might not want this PR? (I don't think so.)
Is the pip command still the same after your last commits?
Did you see any difference in your tests for the new code or under Python 3?
Is this PR ready to review or are you still working on it (e.g. hacky type conversions)?

kba · 2019-01-14T09:05:45Z

Is it okay to look into this PR first, although it will create merge conflicts with possibly every other PR? (I think: yes)

The conflicts will be substantial but trivial since structurally nothing should change, just syntax and some API calls in a lot of places.

Should we be concerned that @tmbdev might not want this PR? (I don't think so.)

Since this makes no essential changes except future-proofing the code, I wouldn't think so.

Is the pip command still the same after your last commits?

No, just updated it to c773dd2

pip install ocrd-fork-ocropy==1.4.0a3

Did you see any difference in your tests for the new code or under Python 3?

I did not; it works surprisingly (suspiciously :D) well across versions. I've slightly extended run-test-ci, but the codebase being what it is, there is no guarantee that nothing broke. If something broke through my changes, it broke for both py2 and py3, since the log files from the test suite look the same to me.

Is this PR ready to review or are you still working on it (e.g. hacky type conversions)?

It is ready for review and I'm esp. happy for tips on how to do proper type conversions with numpy and PIL :) If you have input on what should be tested to avoid breaking stuff, also appreciated.

mittagessen · 2019-01-14T10:33:36Z

happy for tips on how to do proper type conversions with numpy and PIL

The "proper" way for PIL -> np.array conversion is to explicitly set the image mode to 'L' or 'RGB' to ensure getting a uint8_t array with channel depth 1/3 returned. For the other way around the nested switches in array2pil can probably be replaced by a simple Image.fromarray(ar) as it automatically determines image mode from shape and dtype of the array.
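The two directions described above can be sketched as follows. This is a minimal example assuming Pillow and numpy are installed; the tiny in-memory image is a stand-in for a file that real code would load with `Image.open`.

```python
import numpy as np
from PIL import Image

# A tiny stand-in image: 1-bit mode, width 4, height 3.
img = Image.new("1", (4, 3), color=1)

# PIL -> numpy: fix the mode first so dtype and shape are predictable.
gray = np.array(img.convert("L"))    # uint8, shape (height, width)
rgb = np.array(img.convert("RGB"))   # uint8, shape (height, width, 3)

# numpy -> PIL: the mode is inferred from the array's shape and dtype.
back = Image.fromarray(gray)         # 2-D uint8 array -> mode "L"

print(gray.dtype, gray.shape, rgb.shape, back.mode)
```

Note that PIL reports size as (width, height) while numpy uses (rows, columns), so the two orders are swapped.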

zuphilip

I looked through the code and here are some comments as well as questions. Please have a look at it. I just started some testing and will write more about this later.

zuphilip · 2019-01-20T16:04:01Z

.travis.yml

+  - "3.4"
+  - "3.5"
+  - "3.6"
+  - "3.7-dev"


Why does this work now without this whole miniconda stuff? This makes it much easier...

It always worked without miniconda (see circle config and install instructions) but since @QuLogic went through the effort of setting it up with conda, we retained it in travis. I don't know enough about how to change the conda setup to test the various versions, so this seemed the simplest solution.

The conda environment is created with python=$TRAVIS_PYTHON_VERSION, which comes from this key; you didn't really need to re-write everything to get it to work.

But originally, miniconda was only necessary because SciPy took forever to compile from source; I assume there are wheels now.

Also, 3.7 is GA, you should use it and not 3.7-dev which is a very old snapshot.

zuphilip · 2019-01-20T16:05:57Z

.travis.yml

-  - cd ../test_folder
-  - ../ocropy/tests/run-unit
-  - ../ocropy/run-test-ci
+  - ./run-test-ci


What happened with run-unit?

Will fix. Just an oversight.

zuphilip · 2019-01-20T16:07:58Z

.circleci/config.yml

+
+jobs:
+
+  build-python27: &job-template


This was previously circleci.yml. What caused the renaming? Unfortunately git did not recognize this as a renaming...

CircleCI switched from 1.0 to 2.0 in August last year, and the two are very different (.circleci/config.yml instead of circle.yml, job/workflow-based semantics, containers based directly on Docker, etc.).

zuphilip · 2019-01-20T16:28:07Z

ocrolib/__init__.py

-    "hocr",
-    "lang",
-    "default",
-    "lineest",


Could these changes possibly break things for someone who relies on importing from ocrolib? And even if so, is that a realistic scenario? For whom?

If I understand this correctly, the __all__ variable is only used to support from ocrolib import *, which we currently don't use in this project. Should we support that anyway? Which principles should we follow then? Or could we perhaps delete even more in this file?

This might indeed break stuff, because default was renamed to defaults. I would argue that from ocrolib import * really is bad practice. common.py has some ~70 functions and a few classes. If we want to make absolutely sure that nobody using the code as a library (which few people do, I suppose) experiences breakage from wildcard imports, it would be better to list all those exports explicitly in __all__.
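To make the role of `__all__` concrete, here is a small self-contained sketch. The module `demo_lib` and its functions are hypothetical and built in memory purely for illustration; nothing here is taken from ocrolib.

```python
import sys
import types

# Build a throwaway module in memory (hypothetical name `demo_lib`).
source = (
    "__all__ = ['exported']\n"
    "def exported():\n"
    "    return 'public'\n"
    "def helper():\n"
    "    return 'not listed'\n"
)
demo_lib = types.ModuleType("demo_lib")
exec(source, demo_lib.__dict__)
sys.modules["demo_lib"] = demo_lib

# A wildcard import copies only the names listed in __all__ ...
namespace = {}
exec("from demo_lib import *", namespace)
wildcard_names = sorted(n for n in namespace if not n.startswith("__"))

# ... while an explicit import can still reach anything in the module.
from demo_lib import helper

print(wildcard_names, helper())
```

So trimming `__all__` only affects callers who use the wildcard form; explicit imports keep working as long as the names themselves still exist.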

zuphilip · 2019-01-20T16:29:21Z

ocrolib/__init__.py

-from common import *
-from default import traceback as trace
+from .defaults import traceback as trace
+from .common import *


For what are these import statements needed at all?

See above. They are imported to be exported, so users can write

from ocrolib import allsplitext

instead of

from ocrolib.common import allsplitext

zuphilip · 2019-01-20T18:55:01Z

ocropus-linegen

@@ -17,11 +23,16 @@ import matplotlib.pyplot as plt
 from PIL import Image
 from PIL import ImageFont,ImageDraw
 from scipy.ndimage import filters,measurements,interpolation
-from scipy.misc import imsave
+from imageio import imwrite


Okay, that is a new library we need. I guess that we could not continue to use scipy for image I/O for some reason?

🔴 This leads to an error in one test case from run-test-ci. An easier way to reproduce it is the direct call (with an existing image):

root@b9b634c48ea6:/ocropy# python ./ocropus-dewarp 'temp/0001/010011.bin.png'
# inputs 1
# CenterNormalizer
# temp/0001/010011.bin.png
ERROR Imageio Pillow plugin requires Pillow, not PIL!

It seems that ocrolib.read_image_gray uses PIL, which is then incompatible with imwrite from imageio.

Oh, wait... The problem was that Pillow was just too old (3.1.2). After updating to 5.1.4 it seems to work now. Should we update our requirements for Pillow, or would this better be something for imageio itself?

Now, there is a warning about conversion:

root@b9b634c48ea6:/ocropy# python ./ocropus-dewarp 'temp/0001/010011.bin.png'
# inputs 1
# CenterNormalizer
# temp/0001/010011.bin.png
WARNING:root:Lossy conversion from float32 to uint8. Range [0, 1]. Convert image to uint8 prior to saving to suppress this warning.

Is this new and should we do something about it?

These are the "hacky conversions" I was talking about in the original PR comment. They are not lossy, though: it's just that converting boolean values to uint8 directly (as the warning message recommends) thresholds the values in a way that the result is just plain black.

As it is, it's inefficient and should be fixed but not erroneous behavior AFAIK.
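The "plain black" effect is easy to reproduce with numpy alone. The sketch below is not what the PR does; it only shows, under the assumption that the input is a boolean mask, why a direct cast looks black and one possible way to avoid the float32 detour.

```python
import numpy as np

# A tiny boolean "image" standing in for a binarized page.
binary = np.array([[True, False], [False, True]])

# Direct cast: True becomes 1, which is almost black on a 0..255 scale.
naive = binary.astype(np.uint8)

# Scaling after the cast uses the full range: 0 for False, 255 for True.
scaled = binary.astype(np.uint8) * 255

print(naive.max(), scaled.max(), scaled.dtype)
```

An array like `scaled` is already uint8, so a writer has nothing left to convert.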

zuphilip · 2019-01-20T18:57:35Z

ocropus-hocr

-else:    
-    lfiles = python.sum([glob.glob(d+"/??????.bin.png") for d in dirs],[])
+else:
+    lfiles = sum([glob.glob(d+"/??????.bin.png") for d in dirs],[])


I guess this was needed before, because we imported numpy differently?
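For context, the built-in `sum` with a list as its start value concatenates lists, which numpy's `sum` does not do; presumably the qualified name guarded against a wildcard import shadowing the built-in, though that is an assumption. The file names below are made up, and `itertools.chain` is shown only as an alternative that cannot be shadowed that way.

```python
import itertools

# Stand-in for the per-directory glob results (hypothetical file names).
per_dir = [["a/000001.bin.png"], [], ["b/000001.bin.png", "b/000002.bin.png"]]

# Built-in sum with [] as the start value concatenates the sublists.
flat = sum(per_dir, [])

# itertools.chain gives the same result without repeated list copying.
flat_chain = list(itertools.chain.from_iterable(per_dir))

print(flat == flat_chain, len(flat))
```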

zuphilip · 2019-01-20T19:03:02Z

ocropus-gtedit

    else:
-        data = urllib2.urlopen(image).read()
+        data = urlopen(unicode(image)).read()


Are all these unicode transformations needed? It looks to me as if you are transforming twice: once the image on line 223, and then the data again, which is just part of the image.

You're right, they are redundant, an artifact of an earlier iteration of the code. I'll remove them.

zuphilip · 2019-01-20T19:33:27Z

ocrolib/toplevel.py

 import functools
 import linecache
 import os
 import sys
 import warnings
-from types import NoneType
-# FIXME from ... import wrap


What happened here? I cannot say much about the other changes in this file...

Decorator tracing theoretically could wrap a function for debugging. It's not used in the code and from what I can see, has been broken for many years.

NoneType was an unused import.
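Should the name ever be needed again, a portable spelling exists: `types.NoneType` is missing on Python 3.0 through 3.9, but `type(None)` works everywhere. The helper `is_none` below is a hypothetical illustration, not part of toplevel.py.

```python
# Derive the type from the None singleton instead of importing it from types.
NoneType = type(None)


def is_none(value):
    """Return True only for the None singleton."""
    return isinstance(value, NoneType)


print(is_none(None), is_none(0))
```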

zuphilip · 2019-01-20T20:07:30Z

ocrolib/chars.py

    if germanic:
        # germanic quoting style reverses the shapes
        # straight double quotes
-        s = re.sub(ur"\s+''",u"”",s)
-        s = re.sub(u"''\s+",u"“",s)


Why is this r?

See above:

r"foo" is a string variant where escape sequences with backslash are not treated as such, useful in regexes, so as not to have to escape the backslash itself.
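A short sketch of the point about raw strings follows. The function name `close_quotes` is made up for illustration; the pattern is the one from the diff above, and the `ur` prefix used there is a syntax error on Python 3, which is why a plain `r` prefix is shown.

```python
import re

# In a raw string the backslash is kept, so both spellings are the same text.
same_pattern = r"\s+''" == "\\s+''"


def close_quotes(s):
    """Replace whitespace followed by two apostrophes with a closing quote."""
    # r"..." is enough here because the pattern itself is pure ASCII.
    return re.sub(r"\s+''", u"\u201d", s)


print(same_pattern)
print(close_quotes(u"word ''") == u"word\u201d")
```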

kba added 11 commits January 2, 2019 15:39

WIP: py3 compat with six

c4d3823

require six

c62ebc1

circleci: update to 2.0

51d908c

simplify unpickling, py3 compatible, use GZipfile instead of gunzip c…

d5db950

…allout

py3-forward compatible unicode decoding in normalize_text

77ceabe

bump scipy, numpy, matplotlib, replace scipy.misc.imsave with imageio…

90ac1ff

….imwrite, fix (?) a few conversions, reenable all tests

run-test-ci: Overrideable TRAIN_ITERATIONS

6bfa1ef

run-test-ci: smoke test to make sure ocropus-* --help works

7aa8367

sixify various py2-isms

949588d

don't smoke test lpred, ltrain (requires clstm)

6651063

👷 Test pythion 3.4 - 3.7

3ff4cd6

kba force-pushed the py3-again branch 3 times, most recently from 51b75e0 to 3c36720 Compare January 4, 2019 08:07

💚 test with python3 in travis, with non-conda setup

581e225

kba force-pushed the py3-again branch from 3c36720 to 581e225 Compare January 4, 2019 08:10

kba added 2 commits January 9, 2019 11:56

run-test-ci: test saving model

bb76524

Use gzip module instead of os.popoen callout for gzipping pickle

7db9cbc

kba force-pushed the py3-again branch from a3fbc1e to 7db9cbc Compare January 9, 2019 14:10

kba mentioned this pull request Jan 14, 2019

complement of readme file #300

Closed

Merge remote-tracking branch 'upstream/master' into py3-again

c773dd2

Conflicts: ocropus-gpageseg

kba added 2 commits January 14, 2019 13:34

invert requirements.txt / setup.py install_requires logic (sigh)

ac5f5f7

Conflicts: setup.py

make old-style classes inherit from object

21b1aa8

kba force-pushed the py3-again branch from 5f3df75 to 21b1aa8 Compare January 14, 2019 13:16

kba added 2 commits January 15, 2019 14:20

relax six requirement to be compatible with coremltools/kraken

89d982b

Add a --version flag to all commands (except lpred/ltrain)

332ff6f

zuphilip reviewed Jan 20, 2019

kba added 4 commits January 21, 2019 08:16

💚 travis: re-enable unit tests

babbd42

💚 run-unit with generic python, not python3

3b424e8

ocropus-gtedit: remove redundant unicode statements

8d9b990

change back defaults.py to default.py

d890047

worldofpeace mentioned this pull request Nov 27, 2019

Remove lots of pygtk using software NixOS/nixpkgs#74295

Merged

10 tasks
