Is fine-tuning available for the available fonts? #3
Comments
It's been a while since I did that training. See https://github.com/Shreeshrii/tessdata_arabic/blob/master/build/tesstrain_zade.sh for the steps used at that time.
@Shreeshrii In the file you shared above, the lstmtraining call that starts at line 126 uses $BaseLang.lstm and $BaseLang.traineddata, and the path says they should be found in the tessdata_best directory. Did you mean $Lang instead of $BaseLang? In my case, there is no Arabic.lstm or Arabic.traineddata in the tessdata_best directory.
Look in the tessdata_best/script directory.
Yes, I can find the Arabic.traineddata file there, but not the Arabic.lstm file.
The lstm file is extracted from the traineddata file. Use the combine_tessdata command.
This is the command I usually run with combine_tessdata: "combine_tessdata -e ./$ModelName/$BaseLang/$BaseLang.traineddata $Lang.lstm"
combine_tessdata -d just displays the info. Use -e to extract the lstm file. I haven't looked at this repo in a while. Extract Arabic.lstm from Arabic.traineddata and move it to the required directory.
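The extraction step described above can be sketched roughly as follows. This is only an illustration, not the script from the linked repository: the directory locations, the OUT_DIR variable, and the helper function names are assumptions made for the example. The only tool behavior relied on is that combine_tessdata -e writes the component named by the output file's extension.

```shell
# Sketch: extract the LSTM component from the script-level Arabic model.
# All paths and helper names here are assumptions for illustration.

TESSDATA_BEST="${TESSDATA_BEST:-./tessdata_best}"  # assumed local clone of tessdata_best
BaseLang="${BaseLang:-Arabic}"                     # script model name used in the thread
OUT_DIR="${OUT_DIR:-./extracted}"                  # hypothetical output directory

# Print where the script model is expected to live (tessdata_best/script).
traineddata_path() {
  printf '%s/script/%s.traineddata\n' "$1" "$2"
}

# Print where the extracted LSTM component will be written.
lstm_path() {
  printf '%s/%s.lstm\n' "$1" "$2"
}

# Run the extraction, with basic checks so failures are easy to diagnose.
extract_lstm() {
  src="$(traineddata_path "$TESSDATA_BEST" "$BaseLang")"
  dst="$(lstm_path "$OUT_DIR" "$BaseLang")"
  if ! command -v combine_tessdata >/dev/null 2>&1; then
    echo "combine_tessdata not found on PATH" >&2
    return 1
  fi
  if [ ! -f "$src" ]; then
    echo "missing $src (look in the script subdirectory)" >&2
    return 1
  fi
  mkdir -p "$OUT_DIR"
  # -e extracts the component named by the output extension (.lstm here).
  combine_tessdata -e "$src" "$dst"
}
```

Calling extract_lstm after defining these functions should leave Arabic.lstm in OUT_DIR, from where it can be moved to whatever directory the training script expects.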
@Shreeshrii I have been using the pipeline shared above and training for the past few days. I also added my own generated text lines, hoping the model would perform better, but I still have not reached a traineddata that performs well on both letters and numbers. It is better than some of my previous attempts, since it now detects numbers, but it still gets some of them wrong. Do you have any idea how to solve this? I added around 10 000 lines containing Arabic numbers and another 10 000 containing Arabic dates, and I am still having issues with numbers. I also came across issue #3233, although I am not getting the "can't encode transcription" error. There must be a way to train for both numbers and letters in Arabic and get the desired traineddata.
I haven't had success with training both Arabic text and numbers.
Hi Shreeshrii, I've been trying to do some Arabic OCR, but I can only get about a 95% accuracy rate. Have you been able to do any better than that, and if so, how?
@Shreeshrii I have downloaded your "ara-Scheherazade" traineddata, since it is described as fine-tuned for the "Traditional Arabic" font. I have used it on images in this font and it performed well, but there is still room for improvement. I tried to fine-tune it further so that it performs better on some of the images I am working with, but that did not work, because fine-tuning apparently requires starting from a model in the tessdata_best repo.
I would like to improve it, especially on my own dataset. I also have an issue where Arabic numerals come out inverted, and I am set on solving it with a single model, because I would rather not use two traineddata files (one for Arabic numerals and one for Arabic letters).
"ara-Scheherazade" recognizes both Arabic numbers and letters, so recognizing both with one traineddata is clearly possible. Do you know of any way to take "ara-Scheherazade" and improve it for my dataset? If not, how did you manage to train a traineddata that recognizes both Arabic letters and numerals?
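For reference, a fine-tuning run that continues from a tessdata_best model might look roughly like the sketch below. It is not the command from the linked script: every path, the iteration count, and the DRY_RUN switch are hypothetical values chosen for the example. The flags themselves (--continue_from, --traineddata, --train_listfile, --model_output, --max_iterations) are standard lstmtraining options.

```shell
# Sketch: continue training from an extracted tessdata_best LSTM.
# All paths and the iteration count are hypothetical placeholders.

BASE_LSTM="${BASE_LSTM:-./extracted/Arabic.lstm}"
BASE_TRAINEDDATA="${BASE_TRAINEDDATA:-./tessdata_best/script/Arabic.traineddata}"
TRAIN_LIST="${TRAIN_LIST:-./train/ara.training_files.txt}"   # list of .lstmf files
MODEL_OUTPUT="${MODEL_OUTPUT:-./output/ara_finetuned}"       # checkpoint prefix
MAX_ITERATIONS="${MAX_ITERATIONS:-3000}"                     # arbitrary example value
DRY_RUN="${DRY_RUN:-1}"                                      # 1 = only print the command

run_finetune() {
  # Build the command as positional parameters so paths with spaces survive.
  set -- lstmtraining \
    --continue_from "$BASE_LSTM" \
    --traineddata "$BASE_TRAINEDDATA" \
    --train_listfile "$TRAIN_LIST" \
    --model_output "$MODEL_OUTPUT" \
    --max_iterations "$MAX_ITERATIONS"
  if [ "$DRY_RUN" = "1" ]; then
    # Print the command for inspection instead of running it.
    printf '%s\n' "$*"
    return 0
  fi
  if ! command -v lstmtraining >/dev/null 2>&1; then
    echo "lstmtraining not found on PATH" >&2
    return 1
  fi
  "$@"
}
```

With DRY_RUN left at 1, run_finetune only prints the command line, which makes it easy to check the paths before committing to a long training run; setting DRY_RUN=0 executes it.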