New high-level Python API #94

jbaiter · 2016-10-24T14:46:29Z

Since the original Python bindings are not working anymore and are unlikely to be fixed/maintained in the future, I created new high-level bindings using Cython. The module is compatible with both Python 2 and 3 and can be installed by running `pip install .` in the root directory of the repository.

For both training and prediction, images loaded via PIL/Pillow can be used, as well as numpy arrays.

Currently only the OCR functionality is exposed, but I plan on adding a wrapper around ClstmText in the future.

The API documentation can be found at https://jbaiter.github.io/clstm.

An example of how the training and prediction API is used can be found in `run_uw3_500.py`. This script is very close to what the run-uw3-500 application does, only through Python, so it can be used to compare performance. In my tests, the performance of the Python and C++ versions was pretty much indistinguishable.

jbaiter · 2016-10-25T14:40:58Z

> I failed to get it to run in Debian Jessie with either Python2/3 but that is probably an include path problem. Cython either refused to import shared_ptr from libcpp.memory or segfaulted :|

Did you install Cython from the Jessie package? I just tried it with a pip installed Cython on Jessie and it works fine with both 2 and 3 (after fixing the bytes/str bug).

edit: Can confirm, this is due to Jessie shipping with Cython 0.21.1. Smart pointers like shared_ptr were only added in 0.23 :-/ I updated the requirements accordingly.

kba · 2016-10-25T14:54:24Z

I tried the version shipped in Jessie stable first, then pip install, but it seemed to fall back to the Jessie bundled path at some point. As I said, I guess it's just a path issue.

I'll try the fixed bytes/str commit later.

kba · 2016-10-25T17:22:31Z

After removing cython3, it works with Python3 in Jessie, no more unicode/str/bytes related exceptions 🎉 It's weird that /usr/lib cython takes precedence over /usr/local/lib or $HOME/.local/lib but apparently that's either an issue with Debian or my setup.

`sudo pip2 install Cython; sudo pip2 install .` works fine, but `python2 run_uw3_500.py` immediately segfaults.
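The two Cython problems above (which installation wins on the import path, and whether it is at least 0.23) can be checked without building anything. This is a minimal Python 3 sketch using only the standard library; `module_origin`, `version_tuple` and `new_enough` are hypothetical helper names, not part of this PR.

```python
import importlib.util
import re


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


def version_tuple(version):
    """Turn '0.21.1' into (0, 21, 1); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def new_enough(version, minimum=(0, 23)):
    """True if `version` is at least `minimum` (0.23 added shared_ptr support)."""
    return version_tuple(version) >= minimum


if __name__ == "__main__":
    # Shows which Cython the interpreter would pick up, if any.
    print("Cython resolves to:", module_origin("Cython"))
    print("0.21.1 new enough:", new_enough("0.21.1"))
```

If the printed path points into `/usr/lib` although a newer copy was installed with pip, the distribution package is shadowing it, which matches what kba observed.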

jbaiter · 2016-10-25T19:29:15Z

Hm, that's weird :-) Can you make a core dump and check out the trace with gdb?

$ ulimit -c unlimited
$ python run_uw3_500.py
$ gdb $(which python) core
# Then enter `bt` to get the backtrace

jbaiter · 2016-10-26T09:14:39Z

The segfault was due to a compatibility problem with older versions of Pillow: Jessie ships 2.6.1, while I developed against 3.4.2. Pillow 2.9.0 added the `width` and `height` attributes, which I used to differentiate between Pillow.Image and numpy.ndarray in the image loading logic. Since images loaded with 2.6.1 had neither attribute, they were interpreted as numpy arrays. Funnily, all the interfaces on Pillow.Image that I accessed during image loading were also present on numpy.ndarray, but returned different things, which led to segfaults pretty deep into the stack.
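The failure mode can be reproduced in miniature. The sketch below uses two stand-in classes rather than Pillow or numpy, so every name in it is hypothetical; the `size` attribute is chosen for illustration because it really does differ between the two libraries (a `(width, height)` tuple on a Pillow image, an element count on an `ndarray`). Which attributes the binding actually touched is not stated in this thread.

```python
class LegacyImage(object):
    """Stand-in for an image from an old release: `size` is (width, height), no `width`."""

    def __init__(self, width, height):
        self.size = (width, height)


class ArrayLike(object):
    """Stand-in for an array: `size` is the total number of elements."""

    def __init__(self, rows, cols):
        self.shape = (rows, cols)
        self.size = rows * cols


def pixel_count_by_attribute(obj):
    """Fragile: anything without `width` is assumed to be an array."""
    if hasattr(obj, "width"):
        return obj.width * obj.height
    return obj.size  # silently a tuple, not a count, for LegacyImage


def pixel_count_by_type(obj):
    """Sturdier: decide by type, then use that type's own semantics."""
    if isinstance(obj, LegacyImage):
        width, height = obj.size
        return width * height
    if isinstance(obj, ArrayLike):
        return obj.size
    raise TypeError("unsupported image type: %r" % type(obj))
```

The attribute check returns a tuple where a number is expected and raises nothing, which is how such a mix-up can travel far down the stack before anything breaks.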

mittagessen · 2016-10-26T11:59:20Z

Awesome. Are you already working on interfacing the lower-level INetwork interface? If not, I'll put something together, as I'm currently working on a new training subcommand for kraken and the old swig bindings are not complete enough for that purpose.

jbaiter · 2016-10-26T12:30:07Z

Nope, I played around with it for a while, but gave up on it pretty quickly. My main aim was to make accessing the high-level OCR stuff from clstmhl.h available from Python, which is what >90% of all clstm users are currently using (via the CLI). I don't know if it's really worth the effort, since there are already a number of really good ML libraries with LSTM support available for Python.
Why do you need access to the lower-level APIs?

mittagessen · 2016-10-26T13:00:02Z

My main need is access to the output matrix, so that I can run a slightly modified label extractor that produces bounding boxes: the label locations are just the points of maximum value in the softmax layer within a thresholded region. Explicit codec access is also rather useful.
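One possible reading of that extractor, as a rough sketch only: the modified extractor itself is not shown in this thread, so the function name, the blank class index of 0, and the threshold of 0.5 are all assumptions. The output matrix is taken to be a list of per-timestep probability rows.

```python
def extract_labels(outputs, threshold=0.5, blank=0):
    """Return (class, start, end, peak) for each thresholded run of one class.

    `start` and `end` are the first and last timestep of the run (the
    horizontal extent of a box); `peak` is the timestep of the maximum.
    """
    labels = []
    current = None  # [class, start, end, peak, peak probability]
    for t, row in enumerate(outputs):
        cls = max(range(len(row)), key=row.__getitem__)
        prob = row[cls]
        active = cls != blank and prob > threshold
        if active and current is not None and current[0] == cls:
            # Same label continues: extend the run, track its maximum.
            current[2] = t
            if prob > current[4]:
                current[3], current[4] = t, prob
        else:
            # Run ended or a different label started: flush the old run.
            if current is not None:
                labels.append(tuple(current[:4]))
            current = [cls, t, t, t, prob] if active else None
    if current is not None:
        labels.append(tuple(current[:4]))
    return labels
```

With the high-level API only the decoded string is available, so the `start` and `end` columns here are exactly the information that gets lost.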

I'd quite like to switch to a ML library more widely used but I haven't found one yet that doesn't use incredibly annoying serialization (pickles, pickles everywhere and somewhat easy to fix) and more importantly has reasonably performant model instantiation. With CLSTM I'm able to instantiate/deserialize models instantaneously while tensorflow and theano always run compilation (and per default optimization) steps which take at least a minute even on a modern machine. As far as I know it is also rather inherent in their design so there's no way around it.

amitdo · 2016-10-26T14:27:23Z

@mittagessen
what about this one:
https://github.com/baidu-research/warp-ctc
?

amitdo · 2016-10-26T14:37:00Z

warp-ctc used with LSTM
https://github.com/dmlc/mxnet/tree/master/example/warpctc

mittagessen · 2016-10-26T14:38:58Z

pyclstm.pyx

+        """
+        graphemes_str = u"".join(sorted(graphemes))
+        cdef vector[int] codec
+        cdef Py_ssize_t length = len(graphemes_str.encode("UTF-16")) // 2


UTF-16 is a variable-length encoding, so the number of 16-bit code units can exceed the number of code points, which produces an incorrect codec. Just replace the length calculation with len(graphemes_str) and everything should be fine.
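The mismatch is easy to see in plain Python 3, where `len()` on a string counts code points. The helper names below are made up for this illustration. Note also that Python's `"UTF-16"` codec prepends a byte-order mark when encoding, so the expression from the diff counts one extra unit even for ASCII input; whether that extra slot was intended is not stated in this thread.

```python
def code_points(text):
    """Number of Unicode code points (what len() reports on Python 3)."""
    return len(text)


def utf16_code_units(text):
    """Number of 16-bit code units, without a byte-order mark."""
    return len(text.encode("utf-16-le")) // 2


def length_as_in_diff(text):
    """The expression from the reviewed line; 'UTF-16' adds a BOM."""
    return len(text.encode("UTF-16")) // 2


if __name__ == "__main__":
    # U+1F600 lies outside the Basic Multilingual Plane and needs a surrogate pair.
    for sample in (u"abc", u"a\U0001F600"):
        print(ascii(sample), code_points(sample),
              utf16_code_units(sample), length_as_in_diff(sample))
```

On narrow Python 2 builds `len()` itself counts UTF-16 code units, so the numbers differ there.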

mittagessen · 2016-10-26T14:47:10Z

I had a short look at mxnet as it seemed promising and I prefer its interface to theano's; initialization still takes quite a bit of time and warp-ctc is prone to crashes (so no drop-in replacement), although I'll probably work more with it for the layout analysis thingy once I get around to it.

mittagessen · 2016-10-26T14:57:06Z

Sorry for spamming, but there's one major reason for using the lower-level interface. By preloading the entire training set into memory and doing all the normalization, encoding, etc. once, I've just now decreased training time by ~2/3. While I'm fairly sure the main reason is just having everything in main memory, rerunning the codec and line normalization over and over again seems needlessly wasteful.
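The idea can be sketched independently of clstm. Everything here is hypothetical: `normalize`, `encode` and `train_step` stand in for whatever the lower-level interface would expose, and the sketch only shows the caching structure, not real training.

```python
def preload(samples, normalize, encode):
    """Run the per-sample preprocessing once and keep the results in memory.

    `samples` is an iterable of (image, text) pairs.
    """
    return [(normalize(image), encode(text)) for image, text in samples]


def run_epochs(cache, train_step, epochs):
    """Iterate over the cached, already-preprocessed samples every epoch."""
    for _ in range(epochs):
        for features, targets in cache:
            train_step(features, targets)
```

With N samples and E epochs, `normalize` and `encode` run N times instead of N × E times; the price is keeping the whole preprocessed set in RAM.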

jbaiter · 2016-10-26T15:03:19Z

That's a really good point. I'll see what I can do about exposing the lower-level interfaces :-)

wanghaisheng · 2016-10-26T17:09:45Z

@mittagessen

> By preloading the entire training set into memory and doing all the normalization, encoding, etc. once, I've just now decreased training time by ~2/3.

How? Can I do that through your kraken training API?

mittagessen · 2016-10-26T18:46:58Z

@wanghaisheng: You can't really, as the old swig interface is broken, so it isn't quite possible to instantiate a network. What is working (since last night) is continuing to train a model with the separate_derivs branch and some minor bug fixes to the swig interface. Wait a few days until we've sorted out some of the parallel development.

jbaiter · 2016-10-26T19:58:43Z

@mittagessen I've started work on exposing the INetwork interface, but am now stuck on creating wrappers around the Eigen tensor types (Eigen::Tensor<T, N>, Eigen::TensorMap<T>). It would be great if we could create an adapter so we can instantiate those types from numpy arrays (and vice versa) without having to copy the data. There's eigency, which claims to offer just that, but it's only for the regular Eigen types, not the (still officially unsupported) tensor types used by clstm :-/ Any ideas?

amitdo · 2016-10-26T20:07:46Z

@jbaiter
What about basing your cython binding on the older matrix based code?

mittagessen · 2016-10-26T20:31:54Z

The eigency code for eigen->numpy is just:

@cython.boundscheck(False)
cdef np.ndarray[float, ndim=2] ndarray_float_C(float *data, long rows, long cols, long row_stride, long col_stride):
    cdef float[:,:] mem_view = <float[:rows,:cols]>data
    dtype = 'float'
    cdef int itemsize = np.dtype(dtype).itemsize
    return as_strided(np.asarray(mem_view, dtype=dtype, order="C"), strides=[row_stride*itemsize, col_stride*itemsize])

for a bazillion combinations of orders and data types. I haven't looked at the memory layout of a tensor object, but it should work for 2nd order tensors without adaptation (ugly but workable for now).

The other way around is in eigency_cpp.h and will probably work for 2nd order tensors, too. For higher orders I'd have to take a look at how strides are implemented in both ndarray and eigen tensors.
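For higher orders, the stride question mostly comes down to storage order: NumPy arrays default to row-major (C) order, while Eigen's `Tensor` class defaults to column-major. A minimal sketch of the byte strides for a dense buffer without padding follows; `byte_strides` is a hypothetical helper and not part of eigency or clstm.

```python
def byte_strides(shape, itemsize, order="C"):
    """Byte strides of a dense N-dimensional buffer.

    order="C": row-major, last axis is contiguous (NumPy default).
    order="F": column-major, first axis is contiguous.
    """
    if order not in ("C", "F"):
        raise ValueError("order must be 'C' or 'F'")
    strides = []
    step = itemsize
    # Walk the axes from the contiguous one outwards.
    dims = reversed(shape) if order == "C" else shape
    for extent in dims:
        strides.append(step)
        step *= extent
    if order == "C":
        strides.reverse()
    return tuple(strides)
```

Passing strides like these to `as_strided`, as the eigency snippet above does for two dimensions, is what would let one buffer be viewed in either order without copying.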

kba · 2017-12-08T22:34:05Z

I've merged this with current master in cython-2017 branch, so as not to interfere with any changes you may not have pushed.

jbaiter and others added 30 commits August 4, 2016 04:25

Fix custom eigen path in SConstruct

390fb1a

Fix isnan

b9b0769

Fix test error reporting

bdd149c

First running cythonized bindings

c85894e

Implement bindings for recognition

22570e3

Add levenshtein binding

b8fe1be

Minor stuff

ec56f6d

- Use PIL.PyAccess when filling Tensor2 from image
- Return unicode string from `ClstmOcr.aligned`
- Disable warnings during compilation

Add example Python script for UW3 training

fdb0fc8

Remove SWIG bindings

2aed97c

Include compiled protobuf files

f1466fb

Fix bug in protobuf stale check

56ca566

python: allow loading of model in constructor

4e2a7aa

python: Add option to use numpy array as image data

d17cea1

Merge branch 'master' of https://github.com/tmbdev/clstm into cython

90b336a

Combine create_bidi/set_learning_rate into a higher-level `prepare_training` method

368e6af

More docstrings

5e02b0f

Embed function signatures into Python extension

1b66443

Add docs for Python extension

7b32e65

Update README

9aecfe0

Rename to

3c9b5ad

Don't track generated protobuf code

02f14b2

Fix typo in Cython code

4701ccb

Adapt run_uw3_500.py script

fa6b00e

Update docs

eed8b55

Fix typo in docstring

d96195a

Python 3 compatibility

002c9ce

Merge branch 'master' into cython

e9be7cd

Merge remote-tracking branch 'upstream' into cython

92245d8

Add requirements.txt

262b691

Fix std=c++11 flag in setup.py

7050e57

Allow all possible string types for fname in save/load

62dadea

Update required Cython version

d3ee308

Remove unused import

c690c99

Make image loading compatible with Pillow<2.9.0

1c89c0a

jbaiter force-pushed the cython branch from 4db4bb0 to 1c89c0a Compare October 26, 2016 09:06

mittagessen reviewed Oct 26, 2016

Fix length calculation (thanks @mittagessen)

47653f3

wanghaisheng mentioned this pull request Oct 26, 2016

CLSTM installation test (安装测试) project 4 notes wanghaisheng/awesome-ocr#14

Closed

Basic training CLI 'pyclstm-train'

40cd277

kba mentioned this pull request Oct 28, 2016

ltrain/lpred: Using CSLTM backend ocropus-archive/DUP-ocropy#130

Open

Merge pull request #1 from kba/cython-trainer-cli

dbd07e4

Basic CLI for training OCR using python bindings
