Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Tesseract 4.1.0 Planning #2249

Closed
Shreeshrii opened this issue Feb 16, 2019 · 37 comments
Closed

RFC: Tesseract 4.1.0 Planning #2249

Shreeshrii opened this issue Feb 16, 2019 · 37 comments
Labels
Milestone

Comments

@Shreeshrii
Copy link
Collaborator

Planning page on wiki and RFC: Tesseract 4.0.0 – open tasks were used for discussion regarding 4.0.0 release.

Please prioritize and discuss here the tasks to be tackled for next release - 4.1.0.

@Shreeshrii
Copy link
Collaborator Author

Shreeshrii commented Feb 16, 2019

Here is the link to issues marked for 4.1.0 milestone.

This next link is to a longer list of issues marked as bugs.

@zdenop zdenop added this to the 4.1.0 milestone Feb 16, 2019
@Shreeshrii Shreeshrii changed the title RFC: Tesseract 4.1.0 – open tasks RFC: Tesseract 4.1.0 Planning Feb 16, 2019
@Shreeshrii
Copy link
Collaborator Author

Suggestions:

  1. Add a new repo for tesstutorial

People have a hard time trying to follow Ray's tutorial page on wiki for training LSTM engine with all its different options. A new repo which has a minimal set of files for running the tutorial and bash scripts with commands for the same will be helpful. It could also be extended to show a sample of finetuning for a different language.

  1. Add bash scripts (similar to tesstrain.sh) for creating 'strarter traineddataforLSTM trainingusingbox/tiff pairs`.

@Shreeshrii
Copy link
Collaborator Author

An alternative approach could be to include ocrd-train project as a submodule in tesseract for LSTM training. @kba @wrznr @stweil @zdenop - What do you think of the idea?

@zdenop
Copy link
Contributor

zdenop commented Feb 17, 2019

I tagged current code and 4.1.0-rc1 so we can start preparing release. It would be great if somebody could update Changelog, release notes etc...

@kba
Copy link

kba commented Feb 22, 2019

An alternative approach could be to include ocrd-train project as a submodule in tesseract for LSTM training.

@wrznr and me like the idea, we're also open to directly including this in tesseract and/or changing the name of the project to better reflect that it's about tesseract training.

@Shreeshrii
Copy link
Collaborator Author

@kba and @wrznr Thank you for your response.

Currently there are three different options for LSTM training:

  1. ocrd-train - using line images and matching ground truth.

  2. tesstrain.sh - using box/tiff pairs generated from training text and fonts.

  3. The new options of lstmbox and wordstrbox generate box files suitable LSTM training from scanned page images using tesseract. These are suitable for training from scanned images where no ground truth is initially available. For RTL languages, however, they are not directly editable. See comment regarding needed helper scripts by @amito - Add renderer to create WordStr box files from images #2231 (comment). The downside with this approach is that the box files need to be reviewed and corrected before being used for further training. Currently there is no defined path for using these for LSTM training - modified versions of ocrd-train or tesstrain.sh need to be used.

There were earlier suggestions of changing from bash based tesstrain.sh to make based solution.

It would be great if there can be unified and simpler approach to tesseract LSTM training. Would you be open to enhancing ocrd-train to support all?

You will have more independence and control with a separate project. But I am sure it could be directly included in tesseract, if that is your preference. The decision though will be up to the project owners and maintainers.

@jbarlow83
Copy link

I already created tesstrain.py which is a conversion of the bash script to Python with many improvements and better progress and error reporting. It's in master now. It only supports LSTM, however.

A make frontend could call portions of it if someone really wanted to go with make.

@Shreeshrii
Copy link
Collaborator Author

@jbarlow83 I haven't tried the python script yet. Which version of python do I need for it?

@jbarlow83
Copy link

jbarlow83 commented Mar 6, 2019 via email

@Shreeshrii
Copy link
Collaborator Author

@jbarlow83 I have Python 3.5 but am not able to run the script.

python3 --version
Python 3.5.2

What is the command to use?

 python3 ./src/training/tesstrain.py
python3 ./src/training/tesstrain.py
  File "./src/training/tesstrain.py", line 66
    log.info(f"=== Starting training for language {ctx.lang_code}")

@jbarlow83
Copy link

jbarlow83 commented Mar 15, 2019 via email

@Shreeshrii
Copy link
Collaborator Author

I am using
NAME="Ubuntu"
VERSION="16.04.5 LTS (Xenial Xerus)"

Let me see if I can upgrade python.

@amitdo
Copy link
Collaborator

amitdo commented Mar 15, 2019

Debian stable also has 3.5.

@Shreeshrii
Copy link
Collaborator Author

Shreeshrii commented Mar 31, 2019

Please see https://github.com/Shreeshrii/tess4training which has the start of a new sub-repo for tesseract-ocr for running the Tesstutorials.

Please give it a try and let me know if there are any issues with it.

@amitdo
Copy link
Collaborator

amitdo commented Apr 3, 2019

https://abi-laboratory.pro/index.php?view=compat_report&l=tesseract&v1=4.0.0&v2=current&obj=ff668&kind=abi

So, I think that the next release should be 5.0.0, not 4.1.0.

@Shreeshrii Shreeshrii changed the title RFC: Tesseract 4.1.0 Planning RFC: Tesseract 5.0.0 Planning Apr 3, 2019
@Shreeshrii
Copy link
Collaborator Author

Changed the topic to say 5.0.0. Thanks @amitdo.

Also noticed just now that @stweil had setup 'Tesseract Next' as a project at https://github.com/tesseract-ocr/tesseract/projects/1

@zdenop
Copy link
Contributor

zdenop commented Apr 3, 2019

Why? IMO major number should be increase if we do backwards incompatible changes of public API. Excluding adding new renderers (that actually could be handled by provided external programs) changes are minor - usually related to code improvement.

@jbarlow83
Copy link

@zdenop: as @amitdo pointed out, the public API already has serious backward incompatible changes. Serious in the sense that the API is no longer compatible and ABI linkage will fail or crash.

@zdenop
Copy link
Contributor

zdenop commented Apr 7, 2019

Can somebody elaborate on this (serious API change) "a little bit"?

My understanding is that: we modified API => we need co change MINOR version if we are backward compatible with previous version. So:

  1. Extending of API is backward compatible.
  2. Removing of symbols could be problem, but I really doubt in current case it is issue (if yes, than please provide real life example for testing)
  3. Problem with data types I am not able to evaluate... Because first change here was releated to "monitor" I just made simple test: I took example from wiki and build it against 4.0.0 tesseract library (libtesseract.so.4). Then I uninstall tesseract and installed the latest code and run program again (without recompiling). It runs without problem and any error message (no linking or missing symbol).

So can somebody produce real life code where linkage will fail or crash? Without that I really do not see reason why to use 5.0.0. instead of 4.1.0 (=>we are backward compatible)

@jbarlow83
Copy link

jbarlow83 commented Apr 7, 2019

It's definitely not backward compatible.

Regarding removal of symbols, any symbol exported by libtesseract.so is fair game for linking and if symbols are removed linking will fail. A linker can't link to a symbol that isn't there anymore. It does seem unlikely anyone would use Tesseract's API for reversing a singly linked list, but it may be that header-only/inlined functions use exercise it. The removal of the exported symbols is still a backward compatibility breach of the "API contract".

One that shows the problem most clearly is ResultIterator. For example this code:

#include <tesseract/baseapi.h>

int main()
{
	tesseract::TessBaseAPI api;
        // (add some boilerplate here to OCR an image if desired)
        api.GetIterator()->GetBestLSTMSymbolChoices();
	return 0;
}

Disassembles to

0000000100000ecf        callq   0x100000f4e ## symbol stub for: __ZN9tesseract11TessBaseAPI11GetIter
atorEv
0000000100000ed4        movq    %rax, -0xc0(%rbp)
0000000100000edb        jmp     0x100000ee0
0000000100000ee0        movq    -0xc0(%rbp), %rax
0000000100000ee7        movq    (%rax), %rcx
0000000100000eea        movq    0x48(%rcx), %rcx
0000000100000eee        movq    %rax, %rdi
0000000100000ef1        callq   *%rcx   ## relative function call

Since this is a vtable, libtesseract will contain a symbol for the vtable itself. The method names are not in the vtable, only offsets. If the offsets change, we're not backward compatible.

The above disassembly basically says:

*(ResultIterator->vtable[0x48])(this);

and between master and v4.0, the function at address vtable[0x48] changed from being the address of GetBestLSTMSymbolChoices to the address of GetRawLSTMTimesteps. I think it should be pretty clear this will crash.

For your example from the wiki, ResultIterator::GetUTF8Text is not affected by the change in vtable; it's still at the same offset. So yes, it will work.

In the case of ChoiceIterator, several variables were added such that the size of the structure allocated will change. If caller C++ makes a copy of a ChoiceIterator (including an implicit copy), it will allocate the wrong size if linked against libtesseract 4, and this will cause data corruption.

@stweil
Copy link
Contributor

stweil commented Apr 8, 2019

Doesn't the vtable contain only the virtual functions? Then all functions which are not virtual will still be found. Changed sizes of classes are a problem for compatibility if the API users can use new or inline functions which access member variables.

@stweil
Copy link
Contributor

stweil commented Apr 8, 2019

The compatibility report lists few problems, and only 5 are classified as high.

Two of those are for removed symbols. I don't remember that we removed insert or reverse functions, so that still needs to be examined.

The other three are related to @noahmetzger's changes for the choice iterator. Maybe Noah can find a solution which is API compatible, so that could be solved.

There is one problem classified as medium for class ETEXT_DESC. This can be fixed, too.

So I see no reason why it should not be possible to make a version 4.1 which is backward compatible.

@amitdo
Copy link
Collaborator

amitdo commented Apr 8, 2019

Hi @jbarlow83,

Are any of the changes considered API breakage or only ABI breakage?

if I'm not mistaken, the semver is about API, and the soversion (libname.so.6) is about ABI.

@jbarlow83
Copy link

Yes the class vtable only contains virtual methods, but it's still a problem because vtable is a part of the public API and ABI. Classes sizes changes are a problem for just about any class that accesses its member variables, and for classes that have implicit constructors (because they will be created and inlined on the fly) or header-defined constructors (inlined on the fly). The workaround is to use the private class data pattern and ensure the implementation is not part of the API.

I agree these issues are probably fixable. My point was to try to explain why it's not something to ignore, either the version number gets a major bump or the issues get fixed. ResultIterator can be fixed by reordering the methods such that new virtual methods are appended. ChoicesIterator would be trickier - it might work to use private class data and 8 bytes of padding.

@stweil
Copy link
Contributor

stweil commented Apr 8, 2019

The removed functions from oldlist.cpp are the result of my commit f76d8a1. I think that nobody will miss those functions and don't think that 5.0.0 is needed because of them. On the other hand it would be easy to restore them.

@jbarlow83
Copy link

@amitdo As they say: "ABI=library API + compiler ABI".

Removing the list_rec methods is API breakage. The other problems are compiler ABI breakage, but it's all ABI breakage.

semver was written to be language agnostic. I'd argue that in the case of C++ libraries, backward compatibility means ABI compatibility, because without that, applications break.

As far as I can tell we ignore soversion in tesseract and just set it to the application version. We could start managing soversion separately and that would allow bumping the ABI version separately, if we want to make that distinction.

@amitdo
Copy link
Collaborator

amitdo commented Apr 8, 2019

Leptonica is an example for a (C-based) library that makes this distinction.

@zdenop
Copy link
Contributor

zdenop commented Apr 13, 2019

I would like to release 4.1.0 mainly as bugfix release + new features (extension of API with ALTO, LSTMBox and WordStrBox renderes).
This can happen in dedicated branch "4.x" where we can revert all necessary changes in API. So we can continue breaking API in master :-) (master will be tagged as 5.0.0alpha)
Is this proposal fine for you or is there other proposal how to move forward?

@jbarlow83
Copy link

Yes, that's quite fine.

Practically, after my first foray in the 3.0 days, I avoid any attempts to work with Tesseract at the API level because it's been too unstable. I really appreciate that we're paying more attention to it.

@zdenop
Copy link
Contributor

zdenop commented Apr 20, 2019

OK. Now the master is tagged as 5.0.0-alpha and we have 4.1 branch for making API backward compatible.
Can @noahmetzger revert changes related API?
@stweil: What is your suggestion to fix ETEXT_DESC? Also your original commit changes size of end_time
Also we need to find out minimum public headers...

@stweil
Copy link
Contributor

stweil commented Apr 22, 2019

Can we replace STRING and GenericVector by standard C++ types in the public API for 5.0.0? It's a new major version, so we are free to clean as much as possible and reasonable...

@zdenop
Copy link
Contributor

zdenop commented Apr 23, 2019

Yes, this the way. First we need fix API compatibility for 4.1 because than back-porting changes will be difficult.

@zdenop
Copy link
Contributor

zdenop commented Apr 29, 2019

@stweil @noahmetzger: any comments/update on 4.1.0?

@zdenop
Copy link
Contributor

zdenop commented May 1, 2019

Now we reach 100%
image

There still some (strange) report about Problems with Symbols (Low Severity), but I were and are protected:
image

@amitdo
Copy link
Collaborator

amitdo commented Jun 17, 2019

What's the timeline for final 5.0.0?

@zdenop
Copy link
Contributor

zdenop commented Jun 17, 2019

There is no timelime for 5.0.0

@amitdo
Copy link
Collaborator

amitdo commented Jul 3, 2019

IMO 5.0 (or 5.1) should be ready to be released no later than mid-February next year, so it can be packaged and shipped for Ubuntu 20.04 LTS.

@zdenop zdenop modified the milestone: 4.1.0 Jul 12, 2019
@zdenop zdenop changed the title RFC: Tesseract 5.0.0 Planning RFC: Tesseract 4.1.0 Planning Jul 12, 2019
@zdenop zdenop closed this as completed Jul 12, 2019
@amitdo amitdo added the RFC label Mar 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants