-
Notifications
You must be signed in to change notification settings - Fork 592
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python3 compatibility #319
base: master
Are you sure you want to change the base?
Conversation
….imwrite, fix (?) a few conversions, reenable all tests
51b75e0
to
3c36720
Compare
Published this PR as a fork to pypi to test integrations. If anyone's interested, it can be installed with pip install ocrd-fork-ocropy==1.4.0a3 |
I only ran a cursory test but it seems to work. I'm fairly certain that unpickling old-style classes on python 3 didn't work at some point in the past but it eminently does now. You might want to change all the class definitions in |
https://bugs.python.org/issue5180
|
Sorry, I might've been a bit unclear. Old-style and new-style definitions for these classes are equivalent on py3 and updating these to the now common |
Hey @kba this looks very interesting! I agree that it would be good to make sure that ocropy also runs on python3. It seems that it is now easier to achieve python2 and python3 compatibility with libraries like six. Let me ask some general questions before looking deeper at the code:
|
Conflicts: ocropus-gpageseg
The conflicts will be substantial but trivial since structurally nothing should change, just syntax and some API calls in a lot of places.
Since this makes no essential changes except future-proofing the code, I wouldn't think so.
No, just updated it to c773dd2
I did not, it works surprisingly (suspiciously :D) well across versions. I've slightly extended run-test-ci but the codebase being what it is, there is no guarantee nothing did break. If something broke through my changes, it broke for py2 and py3 since the log files from the test suite look the same to me.
It is ready for review and I'm esp. happy for tips on how to do proper type conversions with numpy and PIL :) If you have input on what should be tested to avoid breaking stuff, also appreciated. |
The "proper" way for PIL -> np.array conversion is to explicitly set the image mode to 'L' or 'RGB' to ensure getting a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked through the code and here are some comments as well as questions. Please have a look at it. I just started some testing and will write more about this later.
- "3.4" | ||
- "3.5" | ||
- "3.6" | ||
- "3.7-dev" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this work now without this whole miniconda stuff? This makes it much easier...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It always worked without miniconda (see circle config and install instructions) but since @QuLogic went through the effort of setting it up with conda, we retained it in travis. I don't know enough about how to change the conda setup to test the various versions, so this seemed the simplest solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The conda environment is created with python=$TRAVIS_PYTHON_VERSION
, which comes from this key; you didn't really need to re-write everything to get it to work.
But originally, miniconda was only necessary because SciPy took forever to compile from source; I assume there are wheels now.
Also, 3.7 is GA, you should use it and not 3.7-dev which is a very old snapshot.
- cd ../test_folder | ||
- ../ocropy/tests/run-unit | ||
- ../ocropy/run-test-ci | ||
- ./run-test-ci |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happened with run-unit
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix. Just an oversight.
|
||
jobs: | ||
|
||
build-python27: &job-template |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was previously circleci.yml
. What caused the renaming? Unfortunately git
did not recognize this as a renaming...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Circle CI switched from 1.0 to 2.0 in August last year which is very different (.circle/config.yml
instead of .circleci.yml
, job/workflow based semantics, containers based directly on docker etc.)
"hocr", | ||
"lang", | ||
"default", | ||
"lineest", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can these changes break possibly for someone who is relying on importing from ocrolib
? But even if yes, is that a realistic scenario? For whom?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand this correctly, then the __all__
variable is just used for supporting from ocrolib import *
which we currently don't use in this project. Should we support that then anyway? What principles to follow then? Or could probably also delete more in this file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might indeed break stuff, because default
was renamed default
. I would argue that from ocrolib import *
really is bad practice. common.py
has some ~70 functions and a few classes. If we want to make absolutely sure, nobody using the code as a library (which few people do I suppose) will experience breaks from wildcard imports, it would be better to list all those exports explicitly in __all__
.
from common import * | ||
from default import traceback as trace | ||
from .defaults import traceback as trace | ||
from .common import * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For what are these import statements needed at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above. They are imported to be exported, so users can write
from ocrolib import allsplitext
instead of
from ocrolib.common import allsplitext
@@ -17,11 +23,16 @@ import matplotlib.pyplot as plt | |||
from PIL import Image | |||
from PIL import ImageFont,ImageDraw | |||
from scipy.ndimage import filters,measurements,interpolation | |||
from scipy.misc import imsave | |||
from imageio import imwrite |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, that is a new library we need. I guess that we could not continue to use scipy or image io for some reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔴 This leads to an error for in one test case from run-test-ci
or easier try the direct call (with an existing image):
root@b9b634c48ea6:/ocropy# python ./ocropus-dewarp 'temp/0001/010011.bin.png'
# inputs 1
# CenterNormalizer
# temp/0001/010011.bin.png
ERROR Imageio Pillow plugin requires Pillow, not PIL!
It seems that in ocrolib.read_image_gray
uses PIL which is incompatible then with imwrite
from imageio
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, wait... The problem was that Pillow was just too old (3.1.2). After updateing to 5.1.4 it seems to work now. Should we update our requirements for Pillow, or would this better be something for imageio itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now, there is a warning about conversion:
root@b9b634c48ea6:/ocropy# python ./ocropus-dewarp 'temp/0001/010011.bin.png'
# inputs 1
# CenterNormalizer
# temp/0001/010011.bin.png
WARNING:root:Lossy conversion from float32 to uint8. Range [0, 1]. Convert image
to uint8 prior to saving to suppress this warning.
Is this new and should we do something about it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the "hacky conversions" I was talking about in the original PR comment. They are not lossy though, it's just that boolean values converted to uint8 directly (as the error message recommends), thresholds values in a way that the result is just plain black.
As it is, it's inefficient and should be fixed but not erroneous behavior AFAIK.
else: | ||
lfiles = python.sum([glob.glob(d+"/??????.bin.png") for d in dirs],[]) | ||
else: | ||
lfiles = sum([glob.glob(d+"/??????.bin.png") for d in dirs],[]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this was needed before, because we imported numpy differently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
ocropus-gtedit
Outdated
else: | ||
data = urllib2.urlopen(image).read() | ||
data = urlopen(unicode(image)).read() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are all these unicode transformation needed? It looks for me that you are transforming twice: once the image
on line 223 and the data
again which is just part of the image
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, they are redundant, artifact of earlier iteration of the code, I'll remove them.
import functools | ||
import linecache | ||
import os | ||
import sys | ||
import warnings | ||
from types import NoneType | ||
# FIXME from ... import wrap |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happened here? I cannot say much about the other changes in this file...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Decorator tracing
theoretically could wrap a function for debugging. It's not used in the code and from what I can see, has been broken for many years.
NoneType
was an unused import.
if germanic: | ||
# germanic quoting style reverses the shapes | ||
# straight double quotes | ||
s = re.sub(ur"\s+''",u"”",s) | ||
s = re.sub(u"''\s+",u"“",s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this r
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above:
r"foo" is a string variant where escape sequences with backslash are not treated as such, useful in regexes, so as not to have to escape the backslash itself.
This is another go to make most of the engine compatible with python3, because ocropy is the last python2 holdout in the open source ecosphere. Even with development effectively ceased, it's still widely used and should at least be forward-compatible with python 3.4+.
Basically, we don't want to keep supporting Python2 in our stack just for ocropy :-)
Uses six as a compatibility library to handle things like unicode/byte strings, urlopen, pickle.
Updates the CircleCI configuration to the current format and tests across 2.7 and 3.4-3.7.
Since the test suite isn't that intuitive, I cannot guarantee I didn't break stuff.
I'm somewhat baffled that the default model (pickled with python2) would work for the python3 variant of rpred. I suspect an error in the setup, but see e.g the output in https://circleci.com/gh/OCR-D/ocropy/38, ctrl-f for
# loading object ./models/en-default.pyrnn.gz
. I remember @mittagessen predicting this to fail because of some of the old-style classes used but it doesn't...This also updates scipy, numpy, matplotlib and introduces imageio to work against some of the DeprecationWarnings that have turned into Errors in more recent versions of the libraries.
Some hacky type conversions (like converting boolean image arrays to float32) need further scrutiny.
Feedback is appreciated.
@tmbdev @zuphilip @syedsaqibbukhari @QuLogic @amitdo @mittagessen @wrznr @finkf