Is fine-tuning available for the available fonts? #3

Open
wolfassi123 opened this issue Mar 10, 2022 · 10 comments

Comments

@wolfassi123

@Shreeshrii I have downloaded your "ara-Scheherazade" trained data, as it is mentioned that it is fine-tuned for the "Traditional Arabic Font". I have used it on images in this font and it performed well, but there is still room for improvement. I tried to fine-tune this trained data so that it performs better on some of the images I am working on, but that did not work properly, since it is mentioned that fine-tuning requires data from the tessdata_best repo.
I want to improve it, especially on my own data. I also have an issue where Arabic numerals come out inverted, and I am set on solving it because I would rather not use two traineddata files (one for Arabic numerals and one for Arabic letters).
"ara-Scheherazade" recognizes both Arabic numbers and letters, so a single traineddata file can handle both. Do you know of a way to take "ara-Scheherazade" and improve it on my data? If not, how did you manage to train, from scratch, a traineddata file that recognizes both Arabic letters and numerals?

@Shreeshrii
Owner

It's been a while since I did that training. See https://github.com/Shreeshrii/tessdata_arabic/blob/master/build/tesstrain_zade.sh for the steps used at that time.

@wolfassi123
Author

@Shreeshrii in the file you shared above, the lstmtraining call that starts at line 126 uses $BaseLang.lstm and $BaseLang.traineddata, and the path indicates that they should be in the tessdata_best directory. Did you mean $Lang instead of $BaseLang? In my case, there is no Arabic.lstm or Arabic.traineddata in the tessdata_best directory.
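
For context, a fine-tuning call of the kind being discussed generally takes the shape below. This is only a sketch and is not copied from tesstrain_zade.sh: the function name `finetune`, the `DRY_RUN` switch, all paths, and the iteration count are hypothetical placeholders. The `lstmtraining` flags themselves are the ones documented for Tesseract LSTM training.

```shell
#!/bin/sh
# Sketch of a fine-tuning call; NOT the command from tesstrain_zade.sh.
# Function name, DRY_RUN switch, paths and iteration count are hypothetical.

finetune() {
    base_lang=$1    # model to start from, e.g. Arabic
    best_dir=$2     # directory holding $base_lang.traineddata and $base_lang.lstm
    out_prefix=$3   # prefix for the checkpoints written during training
    list_file=$4    # text file listing the .lstmf training files

    # With DRY_RUN set to a non-empty value the command is printed, not run.
    ${DRY_RUN:+echo} lstmtraining \
        --continue_from "$best_dir/$base_lang.lstm" \
        --traineddata "$best_dir/$base_lang.traineddata" \
        --model_output "$out_prefix" \
        --train_listfile "$list_file" \
        --max_iterations 400
}
```

Both `--continue_from` and `--traineddata` point at files for the same base model, which is why the .lstm file has to exist next to the .traineddata file before training starts.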

@Shreeshrii
Owner

Look in tessdata_best/script directory

@wolfassi123
Author

> Look in tessdata_best/script directory

Yep, I can find the Arabic.traineddata file there, but not the Arabic.lstm file.

@Shreeshrii
Owner

Lstm file is extracted from traineddata file. Use combine_tessdata command.

@wolfassi123
Author

> Lstm file is extracted from traineddata file. Use combine_tessdata command.

I usually run combine_tessdata like this: "combine_tessdata -e ./$ModelName/$BaseLang/$BaseLang.traineddata $Lang.lstm"
In https://github.com/Shreeshrii/tessdata_arabic/blob/master/build/tesstrain_zade.sh, you used the following instead: "combine_tessdata -d ./ara-Scheherazade-train/ara/ara.traineddata".
I can't find the generated .lstm files. Where should they appear after running combine_tessdata? I know that there is an ara.lstm file in the tessdata_best directory, but I still can't find the Arabic.lstm file, even after running the following in the previous cell: "combine_tessdata -d /content/tesstutorial/tesseract/tessdata/best/script/Arabic.traineddata"

@Shreeshrii
Owner

combine_tessdata -d only lists the components inside the traineddata file; it does not write any files. Use -e to extract the lstm file.

I haven't looked at this repo in a while. Extract Arabic.lstm from Arabic.traineddata and move to the required dir.
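
A minimal sketch of the extraction step described above, assuming the script model sits at a path such as tessdata_best/script/Arabic.traineddata. The function name `extract_lstm`, the `DRY_RUN` switch, and the directory names are hypothetical. The `combine_tessdata -e <traineddata> <output_component>` form is the documented one.

```shell
#!/bin/sh
# Sketch: pull the LSTM component out of a .traineddata file.
# Function name, DRY_RUN switch and directories are hypothetical.

extract_lstm() {
    traineddata=$1   # e.g. tessdata_best/script/Arabic.traineddata
    out_dir=$2       # directory where the training script looks for the .lstm

    # Derive "Arabic" from ".../Arabic.traineddata".
    base=$(basename "$traineddata" .traineddata)

    # The output name must end in .lstm so that combine_tessdata knows
    # which component to extract.
    # With DRY_RUN set to a non-empty value the command is printed, not run.
    ${DRY_RUN:+echo} combine_tessdata -e "$traineddata" "$out_dir/$base.lstm"
}
```

Calling `extract_lstm tessdata_best/script/Arabic.traineddata tessdata_best/script` would place Arabic.lstm next to Arabic.traineddata, which is where a `$BaseLang`-based path would look for it.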

@wolfassi123
Author

@Shreeshrii I have been using the pipeline shared above and have been training for the past few days. I also added my own generated text lines, hoping it would perform better, but I still have not reached the point where the traineddata performs well on both letters and numbers. It is better than some of my previous attempts, since it now detects numbers, but it gets some of them wrong. Do you have any idea how to solve this? I added around 10 000 lines containing Arabic numbers and another 10 000 containing Arabic dates, and I still have issues with numbers. I also came across issue #3233, although I am not getting the "can't encode transcription" error. There must be a way to train for both numbers and letters in Arabic and get the desired traineddata.
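
One check that may help when digits are misrecognized is which digit form the ground truth actually contains, since "Arabic numbers" can mean ASCII digits (0-9) or Arabic-Indic digits (٠-٩). The sketch below assumes, as a labeled assumption, that each training line has its own transcription file named `*.gt.txt`; the function name `count_digit_files` and the directory layout are hypothetical.

```shell
#!/bin/sh
# Sketch: count ground-truth files containing Arabic-Indic vs. ASCII digits.
# Assumes one *.gt.txt transcription per line image (an assumption, not
# something stated in this thread). The function name is hypothetical.

count_digit_files() {
    gt_dir=$1   # directory holding the *.gt.txt files

    # Each Arabic-Indic digit is matched literally, so no locale-dependent
    # character ranges are needed. grep -l prints one file name per match.
    arabic_indic=$(grep -l -e '٠' -e '١' -e '٢' -e '٣' -e '٤' \
                           -e '٥' -e '٦' -e '٧' -e '٨' -e '٩' \
                           "$gt_dir"/*.gt.txt | wc -l | tr -d ' ')

    ascii=$(grep -l '[0-9]' "$gt_dir"/*.gt.txt | wc -l | tr -d ' ')

    echo "arabic_indic=$arabic_indic ascii=$ascii"
}
```

If one of the two counts is close to zero, the model has seen very few examples of that digit form, which could explain wrong output for it.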

@Shreeshrii
Owner

I haven't had success with training both Arabic text and numbers.

@kylefoley76

Hi Shreeshrii, I've been trying to do some Arabic OCR, but I can only get about a 95% accuracy rate. Have you been able to do any better than that, and if so, how?
