Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python3 compatibility #319

Open
wants to merge 23 commits into
base: master
Choose a base branch
from
Open

Python3 compatibility #319

wants to merge 23 commits into from

Conversation

kba
Copy link
Collaborator

@kba kba commented Jan 3, 2019

This is another go to make most of the engine compatible with python3, because ocropy is the last python2 holdout in the open source ecosphere. Even with development effectively ceased, it's still widely used and should at least be forward-compatible with python 3.4+.

Basically, we don't want to keep supporting Python2 in our stack just for ocropy :-)

Uses six as a compatibility library to handle things like unicode/byte strings, urlopen, pickle.

Updates the CircleCI configuration to the current format and tests across 2.7 and 3.4-3.7.

Since the test suite isn't that intuitive, I cannot guarantee I didn't break stuff.

I'm somewhat baffled that the default model (pickled with python2) would work for the python3 variant of rpred. I suspect an error in the setup, but see e.g the output in https://circleci.com/gh/OCR-D/ocropy/38, ctrl-f for # loading object ./models/en-default.pyrnn.gz. I remember @mittagessen predicting this to fail because of some of the old-style classes used but it doesn't...

This also updates scipy, numpy, matplotlib and introduces imageio to work against some of the DeprecationWarnings that have turned into Errors in more recent versions of the libraries.

Some hacky type conversions (like converting boolean image arrays to float32) need further scrutiny.

Feedback is appreciated.

@tmbdev @zuphilip @syedsaqibbukhari @QuLogic @amitdo @mittagessen @wrznr @finkf

@kba kba force-pushed the py3-again branch 3 times, most recently from 51b75e0 to 3c36720 Compare January 4, 2019 08:07
@kba
Copy link
Collaborator Author

kba commented Jan 4, 2019

Published this PR as a fork to pypi to test integrations. If anyone's interested, it can be installed with

pip install ocrd-fork-ocropy==1.4.0a3

@mittagessen
Copy link

I only ran a cursory test but it seems to work.

I'm fairly certain that unpickling old-style classes on python 3 didn't work at some point in the past but it eminently does now. You might want to change all the class definitions in lstm.py to new style as python 3 classes auto-inherit from object now.

@amitdo
Copy link
Contributor

amitdo commented Jan 7, 2019

I'm fairly certain that unpickling old-style classes on python 3 didn't work at some point in the past but it eminently does now

https://bugs.python.org/issue5180

You might want to change all the class definitions in lstm.py to new style as python 3 classes auto-inherit from object now.

https://portingguide.readthedocs.io/en/latest/classes.html

@mittagessen
Copy link

https://portingguide.readthedocs.io/en/latest/classes.html

Sorry, I might've been a bit unclear. Old-style and new-style definitions for these classes are equivalent on py3 and updating these to the now common class foo(object) syntax is little more than a nice hint that the code has been touched in the last ten years. I doubt anybody has ever done multiple inheritance on these classes, so it shouldn't break anything on py2.7

@zuphilip
Copy link
Collaborator

Hey @kba this looks very interesting! I agree that it would be good to make sure that ocropy also runs on python3. It seems that it is now easier to achieve python2 and python3 compatibility with libraries like six. Let me ask some general questions before looking deeper at the code:

  1. Is it okay to first look into this PR also it will create merge conflicts with possibly every other PR? (I think: yes)
  2. Should we be concerned about @tmbdev that he could not want this PR? (I don't think so.)
  3. Is the pip command still the same after your last commits?
  4. Did you see any difference in your tests for the new code or under Python 3?
  5. Is this PR ready to review or are you still working on it (e.g. hacky type conversions)?

@kba kba mentioned this pull request Jan 14, 2019
@kba
Copy link
Collaborator Author

kba commented Jan 14, 2019

Is it okay to first look into this PR also it will create merge conflicts with possibly every other PR? (I think: yes)

The conflicts will be substantial but trivial since structurally nothing should change, just syntax and some API calls in a lot of places.

Should we be concerned about @tmbdev that he could not want this PR? (I don't think so.)

Since this makes no essential changes except future-proofing the code, I wouldn't think so.

Is the pip command still the same after your last commits?

No, just updated it to c773dd2

pip install ocrd-fork-ocropy==1.4.0a3

Did you see any difference in your tests for the new code or under Python 3?

I did not, it works surprisingly (suspiciously :D) well across versions. I've slightly extended run-test-ci but the codebase being what it is, there is no guarantee nothing did break. If something broke through my changes, it broke for py2 and py3 since the log files from the test suite look the same to me.

Is this PR ready to review or are you still working on it (e.g. hacky type conversions)?

It is ready for review and I'm esp. happy for tips on how to do proper type conversions with numpy and PIL :) If you have input on what should be tested to avoid breaking stuff, also appreciated.

@mittagessen
Copy link

happy for tips on how to do proper type conversions with numpy and PIL

The "proper" way for PIL -> np.array conversion is to explicitly set the image mode to 'L' or 'RGB' to ensure getting a uint8_t array with channel depth 1/3 returned. For the other way around the nested switches in array2pil can probably be replaced by a simple Image.fromarray(ar) as it automatically determines image mode from shape and dtype of the array.

Copy link
Collaborator

@zuphilip zuphilip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked through the code and here are some comments as well as questions. Please have a look at it. I just started some testing and will write more about this later.

- "3.4"
- "3.5"
- "3.6"
- "3.7-dev"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this work now without this whole miniconda stuff? This makes it much easier...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It always worked without miniconda (see circle config and install instructions) but since @QuLogic went through the effort of setting it up with conda, we retained it in travis. I don't know enough about how to change the conda setup to test the various versions, so this seemed the simplest solution.

Copy link
Contributor

@QuLogic QuLogic Jan 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The conda environment is created with python=$TRAVIS_PYTHON_VERSION, which comes from this key; you didn't really need to re-write everything to get it to work.

But originally, miniconda was only necessary because SciPy took forever to compile from source; I assume there are wheels now.

Also, 3.7 is GA, you should use it and not 3.7-dev which is a very old snapshot.

- cd ../test_folder
- ../ocropy/tests/run-unit
- ../ocropy/run-test-ci
- ./run-test-ci
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happened with run-unit?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix. Just an oversight.


jobs:

build-python27: &job-template
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was previously circleci.yml. What caused the renaming? Unfortunately git did not recognize this as a renaming...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Circle CI switched from 1.0 to 2.0 in August last year which is very different (.circle/config.yml instead of .circleci.yml, job/workflow based semantics, containers based directly on docker etc.)

"hocr",
"lang",
"default",
"lineest",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can these changes break possibly for someone who is relying on importing from ocrolib? But even if yes, is that a realistic scenario? For whom?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand this correctly, then the __all__ variable is just used for supporting from ocrolib import * which we currently don't use in this project. Should we support that then anyway? What principles to follow then? Or could probably also delete more in this file?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might indeed break stuff, because default was renamed default. I would argue that from ocrolib import * really is bad practice. common.py has some ~70 functions and a few classes. If we want to make absolutely sure, nobody using the code as a library (which few people do I suppose) will experience breaks from wildcard imports, it would be better to list all those exports explicitly in __all__.

from common import *
from default import traceback as trace
from .defaults import traceback as trace
from .common import *
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what are these import statements needed at all?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above. They are imported to be exported, so users can write

from ocrolib import allsplitext

instead of

from ocrolib.common import allsplitext

@@ -17,11 +23,16 @@ import matplotlib.pyplot as plt
from PIL import Image
from PIL import ImageFont,ImageDraw
from scipy.ndimage import filters,measurements,interpolation
from scipy.misc import imsave
from imageio import imwrite
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, that is a new library we need. I guess that we could not continue to use scipy or image io for some reason?

Copy link
Collaborator

@zuphilip zuphilip Jan 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 This leads to an error for in one test case from run-test-ci or easier try the direct call (with an existing image):

root@b9b634c48ea6:/ocropy# python ./ocropus-dewarp 'temp/0001/010011.bin.png'
# inputs 1
# CenterNormalizer
# temp/0001/010011.bin.png
ERROR Imageio Pillow plugin requires Pillow, not PIL!

It seems that in ocrolib.read_image_gray uses PIL which is incompatible then with imwrite from imageio.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, wait... The problem was that Pillow was just too old (3.1.2). After updateing to 5.1.4 it seems to work now. Should we update our requirements for Pillow, or would this better be something for imageio itself?

Copy link
Collaborator

@zuphilip zuphilip Jan 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, there is a warning about conversion:

root@b9b634c48ea6:/ocropy# python ./ocropus-dewarp 'temp/0001/010011.bin.png'
# inputs 1
# CenterNormalizer
# temp/0001/010011.bin.png
WARNING:root:Lossy conversion from float32 to uint8. Range [0, 1]. Convert image
 to uint8 prior to saving to suppress this warning.

Is this new and should we do something about it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the "hacky conversions" I was talking about in the original PR comment. They are not lossy though, it's just that boolean values converted to uint8 directly (as the error message recommends), thresholds values in a way that the result is just plain black.

As it is, it's inefficient and should be fixed but not erroneous behavior AFAIK.

else:
lfiles = python.sum([glob.glob(d+"/??????.bin.png") for d in dirs],[])
else:
lfiles = sum([glob.glob(d+"/??????.bin.png") for d in dirs],[])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this was needed before, because we imported numpy differently?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

ocropus-gtedit Outdated
else:
data = urllib2.urlopen(image).read()
data = urlopen(unicode(image)).read()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are all these unicode transformation needed? It looks for me that you are transforming twice: once the image on line 223 and the data again which is just part of the image.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, they are redundant, artifact of earlier iteration of the code, I'll remove them.

import functools
import linecache
import os
import sys
import warnings
from types import NoneType
# FIXME from ... import wrap
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happened here? I cannot say much about the other changes in this file...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decorator tracing theoretically could wrap a function for debugging. It's not used in the code and from what I can see, has been broken for many years.

NoneType was an unused import.

if germanic:
# germanic quoting style reverses the shapes
# straight double quotes
s = re.sub(ur"\s+''",u"”",s)
s = re.sub(u"''\s+",u"“",s)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this r?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above:

r"foo" is a string variant where escape sequences with backslash are not treated as such, useful in regexes, so as not to have to escape the backslash itself.

dbsdat

This comment was marked as spam.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants