New high-level Python API #94

Open
wants to merge 38 commits into
base: master

@jbaiter jbaiter commented Oct 24, 2016

Since the original Python bindings are not working anymore and are unlikely to be fixed/maintained in the future, I created new high-level bindings using Cython. The module is compatible with both Python 2 and 3 and can be installed by running `pip install .` in the root directory of the repository.

For both training and prediction, images loaded via PIL/Pillow can be used, as well as numpy arrays.

Currently only the OCR functionality is exposed, but I plan on adding a wrapper around ClstmText in the future.

The API documentation can be found at https://jbaiter.github.io/clstm.

An example of how the training and prediction API is used can be found in run_uw3_500.py. This script is very close to what the run-uw3-500 application does, only through Python, so it can be used to compare performance. In my tests, the performance of the Python and C++ versions was pretty much indistinguishable.

@jbaiter
Author

jbaiter commented Oct 25, 2016

> I failed to get it to run in Debian Jessie with either Python2/3 but that is probably an include path problem. Cython either refused to import shared_ptr from libcpp.memory or segfaulted :|

Did you install Cython from the Jessie package? I just tried it with a pip installed Cython on Jessie and it works fine with both 2 and 3 (after fixing the bytes/str bug).

edit: Can confirm, this is due to Jessie shipping with Cython 0.21.1. Smart pointers like shared_ptr were only added in 0.23 :-/ I updated the requirements accordingly.
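
To illustrate the version requirement mentioned above, here is a minimal sketch of a check that could be run before building. The helper names `version_tuple` and `supports_shared_ptr` are made up for this example; only the 0.23 threshold comes from the comment above.

```python
import re

# Minimum Cython version named in the comment above (shared_ptr support).
MIN_CYTHON = (0, 23)


def version_tuple(version):
    """Turn a version string such as "0.21.1" into a tuple like (0, 21, 1).

    Only the first three numeric components are kept; local version
    suffixes after "+" are ignored.
    """
    numbers = re.findall(r"\d+", version.split("+")[0])
    return tuple(int(part) for part in numbers[:3])


def supports_shared_ptr(version):
    """Return True if the given Cython version is at least 0.23."""
    return version_tuple(version) >= MIN_CYTHON


try:
    import Cython
    print(Cython.__version__, supports_shared_ptr(Cython.__version__))
except ImportError:
    print("Cython is not installed")
```

With the version shipped in Jessie, `supports_shared_ptr("0.21.1")` is false, which matches the import failure described above.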

@kba
Collaborator

kba commented Oct 25, 2016

I tried the version shipped in Jessie stable first, then pip install, but it seemed to fall back to the Jessie bundled path at some point. As I said, I guess it's just a path issue.

I'll try the fixed bytes/str commit later.

@kba
Collaborator

kba commented Oct 25, 2016

After removing cython3, it works with Python3 in Jessie, no more unicode/str/bytes related exceptions 🎉 It's weird that /usr/lib cython takes precedence over /usr/local/lib or $HOME/.local/lib but apparently that's either an issue with Debian or my setup.
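
When two installations compete like this, it can help to ask the interpreter which copy it would actually import. The following is a small sketch for Python 3.4 or newer; `locate_module` is a hypothetical helper name, not part of any of the packages discussed here.

```python
import importlib.util
import sys


def locate_module(name):
    """Return the file a top-level module would be loaded from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Which Cython wins on this interpreter's search path?
print("Cython resolves to:", locate_module("Cython"))

# The search order that decides it: earlier entries take precedence.
for entry in sys.path:
    print("  ", entry)
```

If the printed path points into /usr/lib rather than /usr/local/lib or the user site directory, the distribution package is shadowing the pip-installed one.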

sudo pip2 install Cython; sudo pip2 install . work fine, but python2 run_uw3_500.py immediately segfaults.

@jbaiter
Author

jbaiter commented Oct 25, 2016

Hm, that's weird :-) Can you make a core dump and check out the trace with gdb?

$ ulimit -c unlimited
$ python run_uw3_500.py
$ gdb $(which python) core
# Then enter `bt` to get the backtrace

@jbaiter
Author

jbaiter commented Oct 26, 2016

The segfault was due to a compatibility problem with older versions of Pillow: Jessie ships 2.6.1, while I used 3.4.2 for development. Pillow 2.9.0 added the width and height attributes, which I used to differentiate between Pillow.Image and numpy.ndarray in the image loading logic. Since images loaded with 2.6.1 did not have either of these attributes, they were interpreted as numpy arrays. Funnily, all the interfaces on Pillow.Image I accessed during image loading were also present on numpy.ndarray, but returned different things, which led to segfaults pretty deep into the stack.
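
The failure mode described above can be reproduced without either library. This is only a sketch: `LegacyImage` and `ArrayLike` are hypothetical stand-ins for an old image object and an array, and the two `classify_*` functions are illustrative, not the code from the bindings.

```python
class LegacyImage(object):
    """Stand-in for an image from an old library version: it exposes
    `size` as a (width, height) tuple but has no `width`/`height`."""

    def __init__(self, width, height):
        self.size = (width, height)


class ArrayLike(object):
    """Stand-in for an array: here `size` is the total element count."""

    def __init__(self, rows, cols):
        self.shape = (rows, cols)
        self.size = rows * cols


def classify_by_attribute(obj):
    # Fragile: anything lacking `width` and `height` is assumed to be an array.
    if hasattr(obj, "width") and hasattr(obj, "height"):
        return "image"
    return "array"


def classify_by_type(obj):
    # Explicit: unknown types are rejected instead of being guessed at.
    if isinstance(obj, LegacyImage):
        return "image"
    if isinstance(obj, ArrayLike):
        return "array"
    raise TypeError("unsupported input type: %s" % type(obj).__name__)


old_image = LegacyImage(4, 3)
print(classify_by_attribute(old_image))  # misclassified as "array"
print(classify_by_type(old_image))       # correctly "image"
```

In real code the type check would presumably be written against `PIL.Image.Image` and `numpy.ndarray`, which keeps working regardless of which attributes a given Pillow release provides.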

@mittagessen
Contributor

Awesome. Are you already working on interfacing the lower-level INetwork interface? If not, I'll put something together, as I'm currently working on a new training subcommand for kraken and the old swig bindings are not complete enough for that purpose.

@jbaiter
Author

jbaiter commented Oct 26, 2016

Nope, I played around with it for a while, but gave up on it pretty quickly. My main aim was to make accessing the high-level OCR stuff from clstmhl.h available from Python, which is what >90% of all clstm users are currently using (via the CLI). I don't know if it's really worth the effort, since there are already a number of really good ML libraries with LSTM support available for Python.
Why do you need access to the lower-level APIs?

@mittagessen
Contributor

My main need is having access to the output matrix for running a slightly modified label extractor producing bounding boxes, as the label locations are just the point of the maximum value in the softmax layer in a thresholded region. Explicit codec access is also rather useful.
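
A rough sketch of that kind of label extraction, working on a plain list of per-timestep probability rows. It assumes column 0 is the blank class and uses a threshold of 0.5; both are assumptions for illustration, and `extract_labels` is a hypothetical name rather than a function from kraken or clstm.

```python
def extract_labels(outputs, threshold=0.5):
    """Return (class_index, timestep) pairs, one per thresholded region.

    outputs:   list of rows, one per timestep, each a list of class
               probabilities; column 0 is assumed to be the blank class.
    threshold: a timestep belongs to a region while its blank
               probability stays below this value.
    """
    labels = []
    best = None  # (probability, class_index, timestep) of the current region
    for t, row in enumerate(outputs):
        if row[0] < threshold:
            # Strongest non-blank class at this timestep.
            cls = max(range(1, len(row)), key=lambda c: row[c])
            if best is None or row[cls] > best[0]:
                best = (row[cls], cls, t)
        elif best is not None:
            # The region just ended: emit its maximum.
            labels.append((best[1], best[2]))
            best = None
    if best is not None:
        labels.append((best[1], best[2]))
    return labels


outputs = [
    [0.90, 0.05, 0.05],
    [0.20, 0.70, 0.10],
    [0.10, 0.85, 0.05],
    [0.95, 0.03, 0.02],
    [0.30, 0.10, 0.60],
    [0.90, 0.05, 0.05],
]
print(extract_labels(outputs))  # [(1, 2), (2, 4)]
```

The timestep of each maximum gives a horizontal position for the label; turning that into a bounding box needs the extent of the region as well, which is why direct access to the output matrix matters here.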

I'd quite like to switch to a more widely used ML library, but I haven't found one yet that doesn't use incredibly annoying serialization (pickles, pickles everywhere, though that is somewhat easy to fix) and, more importantly, has reasonably performant model instantiation. With CLSTM I'm able to instantiate/deserialize models instantaneously, while tensorflow and theano always run compilation (and by default optimization) steps which take at least a minute even on a modern machine. As far as I know this is rather inherent in their design, so there's no way around it.

@amitdo
Contributor

amitdo commented Oct 26, 2016

@mittagessen
what about this one:
https://github.com/baidu-research/warp-ctc
?

@amitdo
Contributor

amitdo commented Oct 26, 2016

warp-ctc used with LSTM
https://github.com/dmlc/mxnet/tree/master/example/warpctc

"""
graphemes_str = u"".join(sorted(graphemes))
cdef vector[int] codec
cdef Py_ssize_t length = len(graphemes_str.encode("UTF-16")) // 2

UTF-16 is a variable-length encoding, so dividing its byte length by two counts code units rather than code points and may produce an incorrect codec. Just replace the length calculation with len(graphemes_str) and everything should be fine.
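
The difference between the two length calculations can be seen in a few lines. This sketch assumes Python 3 string semantics; `utf16_units` and `code_points` are names made up for the comparison. Note that Python's "UTF-16" codec also prepends a byte order mark, which adds one more unit to the first count.

```python
def utf16_units(text):
    # Mirrors the calculation under review: UTF-16 byte length divided by two.
    return len(text.encode("UTF-16")) // 2


def code_points(text):
    # The suggested replacement: the length of the unicode string itself.
    return len(text)


bmp_only = u"abc"            # three characters from the Basic Multilingual Plane
with_astral = u"a\U0001D11E"  # "a" plus MUSICAL SYMBOL G CLEF (needs a surrogate pair)

print(utf16_units(bmp_only), code_points(bmp_only))        # 4 3
print(utf16_units(with_astral), code_points(with_astral))  # 4 2
```

Even for plain ASCII input the UTF-16 based count is off by one because of the byte order mark, and it grows further with every character outside the Basic Multilingual Plane.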

@mittagessen
Contributor

I had a short look at mxnet as it seemed promising and I prefer its interface to theano's; initialization still takes quite a bit of time and warp-ctc is prone to crashes (so no drop-in replacement), although I'll probably work more with it for the layout analysis thingy once I get around to it.

@mittagessen
Contributor

Sorry for spamming, but there's one major reason for using the lower-level interface. By preloading the entire training set into memory and doing all the normalization, encoding, etc. once, I've just now decreased training time by ~2/3. While I'm fairly sure the main reason is just having everything in main memory, rerunning the codec and line normalization over and over again seems needlessly wasteful.
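
The idea can be sketched independently of clstm. Everything below is hypothetical: `preload`, `train`, and the toy `normalize`, `encode` and `train_step` callables only stand in for line normalization, codec encoding and a network update.

```python
def preload(samples, normalize, encode):
    """Run the per-sample preprocessing once and keep the results in memory."""
    return [(normalize(image), encode(text)) for image, text in samples]


def train(cache, epochs, train_step):
    """Iterate over the cached samples; no preprocessing happens in this loop."""
    for _ in range(epochs):
        for features, labels in cache:
            train_step(features, labels)


# Toy stand-ins that count how often each stage runs.
calls = {"normalize": 0, "encode": 0, "train_step": 0}


def normalize(image):
    calls["normalize"] += 1
    return [pixel / 255.0 for pixel in image]


def encode(text):
    calls["encode"] += 1
    return [ord(char) for char in text]


def train_step(features, labels):
    calls["train_step"] += 1


samples = [([0, 128, 255], "ab"), ([255, 255, 0], "c")]
cache = preload(samples, normalize, encode)
train(cache, epochs=3, train_step=train_step)
print(calls)  # {'normalize': 2, 'encode': 2, 'train_step': 6}
```

Preprocessing runs once per sample instead of once per sample per epoch, at the cost of holding the whole preprocessed set in memory.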

@jbaiter
Author

jbaiter commented Oct 26, 2016

That's a really good point. I'll see what I can do about exposing the lower-level interfaces :-)

@wanghaisheng

wanghaisheng commented Oct 26, 2016

@mittagessen

> By preloading the entire training set into memory and doing all the normalization, encoding, etc. once, I've just now decreased training time by ~2/3.

How? Can I do that through your kraken training API?

@mittagessen
Contributor

@wanghaisheng: You can't really, as the old swig interface is broken, so it isn't quite possible to instantiate a network. What is working (since last night) is continuing to train a model with the separate_derivs branch and some minor bug fixes to the swig interface. Wait a few days until we've sorted out some of the parallel development.

@jbaiter
Author

jbaiter commented Oct 26, 2016

@mittagessen I've started work on exposing the INetwork interface, but am now stuck on creating wrappers around the Eigen tensor types (Eigen::Tensor<T, N>, Eigen::TensorMap<T>). It would be great if we could create an adapter so we can instantiate those types from numpy arrays (and vice versa) without having to copy the data. There's eigency, which claims to offer just that, but it's only for the regular Eigen types, not the (still officially unsupported) tensor types used by clstm :-/ Any ideas?

@amitdo
Contributor

amitdo commented Oct 26, 2016

@jbaiter
What about basing your cython binding on the older matrix based code?

@mittagessen
Contributor

The eigency code for eigen->numpy is just:

@cython.boundscheck(False)
cdef np.ndarray[float, ndim=2] ndarray_float_C(float *data, long rows, long cols, long row_stride, long col_stride):
    cdef float[:,:] mem_view = <float[:rows,:cols]>data
    dtype = 'float'
    cdef int itemsize = np.dtype(dtype).itemsize
    return as_strided(np.asarray(mem_view, dtype=dtype, order="C"), strides=[row_stride*itemsize, col_stride*itemsize])

repeated for a bazillion combinations of storage orders and data types. I haven't looked at the memory layout of a tensor object, but it should work for 2nd-order tensors without adaptation (ugly but workable for now).

The other way around is in eigency_cpp.h and will probably work for 2nd order tensors, too. For higher orders I'd have to take a look at how strides are implemented in both ndarray and eigen tensors.
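
For the higher-order case, the strides of a densely packed tensor follow directly from its shape and storage order. The sketch below computes them in plain Python; `byte_strides` is a hypothetical helper, and it assumes contiguous storage with no padding. As far as I know Eigen tensors default to column-major storage, which corresponds to order "F" here, while numpy defaults to row-major ("C").

```python
def byte_strides(shape, itemsize, order="C"):
    """Byte strides of a densely packed array with the given shape.

    order "C": row-major, the last axis varies fastest.
    order "F": column-major, the first axis varies fastest.
    """
    if order not in ("C", "F"):
        raise ValueError("order must be 'C' or 'F'")
    strides = [0] * len(shape)
    step = itemsize
    if order == "C":
        axes = range(len(shape) - 1, -1, -1)
    else:
        axes = range(len(shape))
    for axis in axes:
        strides[axis] = step
        step *= shape[axis]
    return tuple(strides)


# A 2 x 3 x 4 tensor of 4-byte floats in both layouts.
print(byte_strides((2, 3, 4), 4, "C"))  # (48, 16, 4)
print(byte_strides((2, 3, 4), 4, "F"))  # (4, 8, 24)
```

Strides in this form are what `numpy.lib.stride_tricks.as_strided` expects, so a column-major buffer could in principle be viewed from numpy without copying, provided the buffer outlives the view.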

Basic CLI for training OCR using python bindings
@kba
Collaborator

kba commented Dec 8, 2017

I've merged this with current master in cython-2017 branch, so as not to interfere with any changes you may not have pushed.

5 participants