RFC: Tesseract 4.1.0 Planning #2249
Comments
Here is the link to issues marked for 4.1.0 milestone. This next link is to a longer list of issues marked as bugs. |
Suggestions:
People have a hard time following Ray's tutorial page on the wiki for training the LSTM engine with all its different options. A new repo with a minimal set of files for running the tutorial, plus bash scripts containing the commands, would be helpful. It could also be extended to show a sample of fine-tuning for a different language.
|
I tagged the current code as 4.1.0-rc1 so we can start preparing the release. It would be great if somebody could update the Changelog, release notes etc... |
@wrznr and I like the idea; we're also open to including this directly in tesseract and/or renaming the project to better reflect that it's about tesseract training. |
@kba and @wrznr Thank you for your response. Currently there are three different options for LSTM training:
There were earlier suggestions of changing from It would be great if there can be unified and simpler approach to tesseract LSTM training. Would you be open to enhancing You will have more independence and control with a separate project. But I am sure it could be directly included in tesseract, if that is your preference. The decision though will be up to the project owners and maintainers. |
I already created A |
@jbarlow83 I haven't tried the python script yet. Which version of python do I need for it? |
3.5 or newer I believe. Later edit: this is not accurate; use 3.6
|
@jbarlow83 I have Python 3.5 but am not able to run the script.
What is the command to use?
|
Python 3.6 or newer is required. I thought this was a reasonable
requirement since the baseline Linux system for Tesseract 4 is Ubuntu 18.04.
On Fri, Mar 15, 2019 at 2:27 AM Shreeshrii wrote:
@jbarlow83 I have Python 3.5 but am not able to run the script.
python3 --version
Python 3.5.2
What is the command to use?
python3 ./src/training/tesstrain.py
python3 ./src/training/tesstrain.py
File "./src/training/tesstrain.py", line 66
log.info(f"=== Starting training for language {ctx.lang_code}")
|
I am using Let me see if I can upgrade python. |
Debian stable also has 3.5. |
Please see https://github.com/Shreeshrii/tess4training which has the start of a new sub-repo for tesseract-ocr for running the Tesstutorials. Please give it a try and let me know if there are any issues with it. |
So, I think that the next release should be 5.0.0, not 4.1.0. |
Changed the topic to say 5.0.0. Thanks @amitdo. Also noticed just now that @stweil had set up 'Tesseract Next' as a project at https://github.com/tesseract-ocr/tesseract/projects/1 |
Why? IMO the major number should be increased if we make backwards-incompatible changes to the public API. Apart from adding new renderers (which could actually be handled by external programs), the changes are minor and mostly related to code improvement. |
Can somebody elaborate on this (serious API change) "a little bit"? My understanding is that if we modified the API but are backward compatible with the previous version, we need to change the MINOR version. So:
So can somebody produce real-life code where linking will fail or the program will crash? Without that, I really do not see a reason to use 5.0.0 instead of 4.1.0 (=> we are backward compatible). |
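The versioning rule being argued over here can be written down as a small sketch. This is only an illustration of the Semantic Versioning convention (breaking change bumps MAJOR, compatible addition bumps MINOR, fix bumps PATCH); the names `Change`, `Version`, `next_version` and `format_version` are hypothetical and not part of Tesseract.

```cpp
#include <cassert>
#include <string>

// Hypothetical types for illustration only; nothing here is Tesseract code.
enum class Change { BugFix, CompatibleAddition, Breaking };

struct Version {
    int major_v;
    int minor_v;
    int patch_v;
};

// Semantic Versioning convention:
//   breaking change      -> bump MAJOR, reset MINOR and PATCH
//   compatible addition  -> bump MINOR, reset PATCH
//   bug fix              -> bump PATCH
inline Version next_version(Version v, Change c) {
    switch (c) {
        case Change::Breaking:           return {v.major_v + 1, 0, 0};
        case Change::CompatibleAddition: return {v.major_v, v.minor_v + 1, 0};
        case Change::BugFix:             return {v.major_v, v.minor_v, v.patch_v + 1};
    }
    return v;  // not reached; keeps compilers quiet about missing return
}

inline std::string format_version(Version v) {
    return std::to_string(v.major_v) + "." + std::to_string(v.minor_v) + "." +
           std::to_string(v.patch_v);
}
```

Under this convention, starting from 4.0.0, a compatible addition such as a new renderer gives 4.1.0, while any breaking change gives 5.0.0. Whether the changes in master count as breaking is exactly what the thread is debating.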
It's definitely not backward compatible. Regarding removal of symbols, any symbol exported by libtesseract.so is fair game for linking, and if symbols are removed, linking will fail. A linker can't link to a symbol that isn't there anymore. It does seem unlikely anyone would use Tesseract's API for reversing a singly linked list, but it may be that header-only/inlined functions exercise it. The removal of the exported symbols is still a backward compatibility breach of the "API contract". One that shows the problem most clearly is ResultIterator. For example this code:
#include <tesseract/baseapi.h>
int main()
{
  tesseract::TessBaseAPI api;
  // (add some boilerplate here to OCR an image if desired)
  api.GetIterator()->GetBestLSTMSymbolChoices();
  return 0;
}
Disassembles to
Since this is a vtable, libtesseract will contain a symbol for the vtable itself. The method names are not in the vtable, only offsets. If the offsets change, we're not backward compatible. The above disassembly basically says: *(ResultIterator->vtable[0x48])(this); and between master and v4.0, the function at address For your example from the wiki, In the case of |
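The point that a vtable holds only offsets, not method names, can be modelled with a hand-rolled table of function pointers. This is a simplified sketch, not how any particular compiler lays out a real vtable, and every name in it (`get_text`, `get_confidence`, `get_choices`, `table_v1`, `table_v2`, `client_call_slot1`) is hypothetical.

```cpp
#include <cassert>
#include <string>

// A "vtable" modelled as an array of function pointers, called by slot index.
// Real vtable layouts are compiler- and ABI-specific; this only shows the idea.
using Slot = std::string (*)();

inline std::string get_text()       { return "get_text"; }
inline std::string get_confidence() { return "get_confidence"; }
inline std::string get_choices()    { return "get_choices"; }  // added in "v2"

// Layout the old client was compiled against: slot 0 = text, slot 1 = confidence.
inline const Slot* table_v1() {
    static const Slot table[] = {get_text, get_confidence};
    return table;
}

// A newer library inserts a function in the middle, shifting later slots by one.
inline const Slot* table_v2() {
    static const Slot table[] = {get_text, get_choices, get_confidence};
    return table;
}

// "Client" code built against v1: the slot index is baked in at compile time,
// just as a compiled virtual call bakes in an offset rather than a name.
inline std::string client_call_slot1(const Slot* vtable) {
    return vtable[1]();
}
```

With `table_v1()` the client reaches `get_confidence` as intended. With `table_v2()` the same compiled call silently lands on `get_choices`: nothing fails at link time, and the wrong function runs.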
Doesn't the vtable contain only the virtual functions? Then all functions which are not virtual will still be found. Changed sizes of classes are a problem for compatibility if the API users can use |
The compatibility report lists few problems, and only 5 are classified as high. Two of those are for removed symbols. I don't remember that we removed The other three are related to @noahmetzger's changes for the choice iterator. Maybe Noah can find a solution which is API compatible, so that could be solved. There is one problem classified as medium for class So I see no reason why it should not be possible to make a version 4.1 which is backward compatible. |
Hi @jbarlow83, are any of the changes considered API breakage or only ABI breakage? If I'm not mistaken, semver is about API, and the soversion (libname.so.6) is about ABI. |
Yes, the class vtable only contains virtual methods, but it's still a problem because the vtable is part of the public API and ABI. Class size changes are a problem for just about any class that accesses its member variables, and for classes that have implicit constructors (because they will be created and inlined on the fly) or header-defined constructors (inlined on the fly). The workaround is to use the private class data pattern and ensure the implementation is not part of the API. I agree these issues are probably fixable. My point was to explain why this is not something to ignore: either the version number gets a major bump or the issues get fixed. |
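A minimal sketch of the private class data (pimpl) pattern mentioned above, with hypothetical classes that are not taken from Tesseract. It contrasts a plain struct, whose size changes when a member is added, with a class whose public layout is a single pointer to an implementation defined out of the header.

```cpp
#include <cassert>
#include <memory>

// Hypothetical classes for illustration; none of this is Tesseract code.

// Without pimpl: members live in the public header, so adding one changes
// sizeof and breaks clients that were compiled against the old layout.
struct PlainV1 { int width; };
struct PlainV2 { int width; int height; };  // "next release" adds a member

// With pimpl: the public class holds only a pointer to a private Impl.
class Widget {
public:
    Widget();
    ~Widget();  // defined where Impl is complete
    int area() const;

private:
    struct Impl;                  // only declared in the public header
    std::unique_ptr<Impl> impl_;  // public layout stays one pointer
};

// --- the part below would normally live in the library's .cpp file ---
struct Widget::Impl {
    int width = 3;
    int height = 4;  // members can be added here without touching sizeof(Widget)
};

Widget::Widget() : impl_(new Impl) {}
Widget::~Widget() = default;
int Widget::area() const { return impl_->width * impl_->height; }
```

The trade-off is an extra heap allocation and indirection per object. In exchange, members can be added to `Impl` in a later release without changing the size or layout that client code was compiled against.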
The removed functions from |
@amitdo As they say: "ABI=library API + compiler ABI". Removing the semver was written to be language agnostic. I'd argue that in the case of C++ libraries, backward compatibility means ABI compatibility, because without that, applications break. As far as I can tell we ignore soversion in tesseract and just set it to the application version. We could start managing soversion separately and that would allow bumping the ABI version separately, if we want to make that distinction. |
Leptonica is an example for a (C-based) library that makes this distinction. |
I would like to release 4.1.0 mainly as a bugfix release + new features (extension of the API with ALTO, LSTMBox and WordStrBox renderers). |
Yes, that's quite fine. Practically, after my first foray in the 3.0 days, I avoid any attempts to work with Tesseract at the API level because it's been too unstable. I really appreciate that we're paying more attention to it. |
OK. Now master is tagged as 5.0.0-alpha and we have a 4.1 branch for making the API backward compatible. |
Can we replace |
Yes, this is the way. First we need to fix API compatibility for 4.1, because back-porting changes will be difficult later. |
@stweil @noahmetzger: any comments/update on 4.1.0? |
What's the timeline for final 5.0.0? |
There is no timeline for 5.0.0 |
IMO 5.0 (or 5.1) should be ready to be released no later than mid-February next year, so it can be packaged and shipped for Ubuntu 20.04 LTS. |
Planning page on wiki and RFC: Tesseract 4.0.0 – open tasks were used for discussion regarding 4.0.0 release.
Please prioritize and discuss here the tasks to be tackled for next release - 4.1.0.