Make TessdataManager able to save archive using LibArchive #4187
@stweil, shall I add a test?
ASSERT_HOST(is_loaded_);
std::vector<char> data;
Serialize(&data);
if (writer == nullptr) {
#if defined(HAVE_LIBARCHIVE)
  return SaveArchiveFile(filename);
With this change, TessdataManager::SaveFile will always write traineddata files in ZIP format, which is incompatible with Tesseract binaries that were built without LibArchive. I'm afraid that would cause problems for a lot of people.
I thought libarchive can deduce archive types.
Yes, it can, but the current code uses archive_write_set_format_zip instead of archive_write_set_format_by_name, so it will always write a ZIP file. And of course libarchive cannot write the proprietary traineddata format.
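To illustrate the difference, here is a minimal sketch of how an output file name could be mapped to a format name for archive_write_set_format_by_name instead of hard-coding ZIP. The helper FormatNameForFile and its extension table are hypothetical and not part of this PR, Tesseract, or libarchive; "zip", "ustar" and "7zip" are format names that libarchive accepts, but the choice of which extensions to support is an assumption.

```cpp
#include <algorithm>
#include <cctype>
#include <optional>
#include <string>

// Hypothetical helper (not part of Tesseract or libarchive): map an output
// file name to a format name that could be passed to
// archive_write_set_format_by_name(). Returns std::nullopt for unknown
// extensions, including the proprietary .traineddata container, which
// libarchive cannot write.
std::optional<std::string> FormatNameForFile(const std::string &filename) {
  const auto dot = filename.find_last_of('.');
  if (dot == std::string::npos) {
    return std::nullopt;  // no extension at all
  }
  std::string ext = filename.substr(dot + 1);
  // Compare extensions case-insensitively.
  std::transform(ext.begin(), ext.end(), ext.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  if (ext == "zip") return std::string("zip");
  if (ext == "tar") return std::string("ustar");
  if (ext == "7z") return std::string("7zip");
  return std::nullopt;  // caller falls back to the proprietary writer
}
```

A caller could use the returned name for the libarchive writer when it is present and fall back to the existing proprietary writer otherwise, so that plain .traineddata output stays readable by binaries built without LibArchive.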
I'd prefer a more general solution which allows different target formats. In addition, it should allow writing to a different output file and use long options. So the syntax might look like this:
Isn't automatic format simpler?
That's right, and this feature of LibArchive would also be used to implement my suggested solution. If we implement support for
Is there also an intention for Tesseract to read such converted data? If yes, then please be careful about changing the extension: it will break a lot of workflows that look for available/installed languages (AFAIR also GetAvailableLanguagesAsVector a.k.a
Nowadays it is quite common to use a private file extension instead of one indicating that the file is an archive (e.g. xlsx and odt are ZIP archives).
On the other hand, if the file extension is not changed and Tesseract is built without libarchive support, the error handling has to be improved to explain why Tesseract is not able to read the traineddata...
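The extension concern can be made concrete with a small sketch. It assumes that language discovery works by matching the .traineddata suffix on file names; the function LanguagesFromFileNames is hypothetical and is not Tesseract's actual implementation.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch, not Tesseract's actual code: derive language names
// from a directory listing by keeping only files ending in ".traineddata".
std::vector<std::string> LanguagesFromFileNames(
    const std::vector<std::string> &names) {
  static const std::string kSuffix = ".traineddata";
  std::vector<std::string> langs;
  for (const auto &name : names) {
    // Require a non-empty language name in front of the suffix.
    if (name.size() > kSuffix.size() &&
        name.compare(name.size() - kSuffix.size(), kSuffix.size(),
                     kSuffix) == 0) {
      langs.push_back(name.substr(0, name.size() - kSuffix.size()));
    }
  }
  return langs;
}
```

Under that assumption, a model saved as deu.zip would not be listed at all, while a ZIP archive that keeps the .traineddata name would be listed but could not be loaded by a binary built without libarchive.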
I thought it had already been decided to implement this feature. Now it seems to be debatable. My own requirement, i.e. inspecting the config file in the
TessDataManager able to save archive using LibArchive
-t option to combine_tessdata to transform proprietary .traineddata to archive file.