Is fine-tuning available for the available fonts? #3

wolfassi123 · 2022-03-10T07:53:47Z

@Shreeshrii I have downloaded your "ara-Scheherazade" traineddata, which is described as fine-tuned for the "Traditional Arabic Font". I have used it on images set in this font and it performed well, but there is still room for improvement. I tried to fine-tune it further so that it performs better on some of the images I am working on, but that did not work properly: as far as I understand, fine-tuning has to start from a model in the tessdata_best repo.
I would like to improve it for my own dataset. I am also seeing Arabic numerals come out inverted, and I want to solve that without using two traineddata files (one for Arabic numerals and one for Arabic letters).
"ara-Scheherazade" recognizes both Arabic numerals and letters, so a single traineddata can evidently handle both. Do you know of a way to take "ara-Scheherazade" and improve it for my dataset? If not, how did you train a model from scratch that recognizes both Arabic letters and numerals?
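
To make the numeral problem concrete: if "inverted" means that each run of digits comes out in reverse order (an assumption, the thread does not spell it out), a post-processing pass over the OCR text could flip the runs back. The sketch below is a workaround outside of training, not a fix to the model, and the function name is made up for illustration.

```python
import re

# Runs of two or more ASCII digits or Arabic-Indic digits (U+0660 to U+0669).
DIGIT_RUN = re.compile(r"[0-9\u0660-\u0669]{2,}")


def reverse_digit_runs(text: str) -> str:
    """Reverse every run of digits in place, leaving all other text untouched.

    Assumes the OCR output has digit runs in reversed order; separators such
    as '/' or '.' split a number into independent runs and are not reordered.
    """
    return DIGIT_RUN.sub(lambda match: match.group(0)[::-1], text)
```

Dates and decimals would need extra handling, because only the digits inside each run are flipped and the order of the runs themselves stays as it was.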

Shreeshrii · 2022-03-10T09:20:28Z

It's been a while since I did that training. See https://github.com/Shreeshrii/tessdata_arabic/blob/master/build/tesstrain_zade.sh for the steps used at that time.

wolfassi123 · 2022-03-11T09:42:36Z

@Shreeshrii In the file you shared above, the lstmtraining call that starts at line 126 uses $BaseLang.lstm and $BaseLang.traineddata, and the path indicates they should be in the tessdata_best directory. Did you mean $Lang instead of $BaseLang? In my case there is no Arabic.lstm or Arabic.traineddata in the tessdata_best directory.
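
For context, a rough sketch of how a fine-tuning call to lstmtraining is typically assembled, showing where the base .lstm and .traineddata files are passed in. This is not the exact call from tesstrain_zade.sh; the paths, the list file name, and the iteration count are placeholders, and the helper name is hypothetical. The flags themselves (--continue_from, --traineddata, --old_traineddata, --train_listfile, --model_output, --max_iterations) are standard lstmtraining options.

```python
import shlex
from typing import List, Optional


def finetune_command(
    continue_from: str,
    traineddata: str,
    train_listfile: str,
    model_output: str,
    max_iterations: int = 400,  # placeholder value, tune for your data
    old_traineddata: Optional[str] = None,
) -> List[str]:
    """Build an lstmtraining argument list for fine-tuning an existing model."""
    cmd = [
        "lstmtraining",
        "--continue_from", continue_from,    # .lstm extracted from the base model
        "--traineddata", traineddata,        # traineddata describing the target charset
        "--train_listfile", train_listfile,  # text file listing the .lstmf training files
        "--model_output", model_output,      # prefix for the checkpoints that get written
        "--max_iterations", str(max_iterations),
    ]
    if old_traineddata is not None:
        # Only needed when the new unicharset differs from the base model's.
        cmd += ["--old_traineddata", old_traineddata]
    return cmd


# Print the command for inspection; all paths here are placeholders.
print(shlex.join(finetune_command(
    "tessdata_best/script/Arabic.lstm",
    "tessdata_best/script/Arabic.traineddata",
    "data/list.train",
    "output/finetuned",
)))
```

The list can be handed to subprocess.run(cmd, check=True) once the paths point at real files.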

Shreeshrii · 2022-03-11T09:56:51Z

Look in tessdata_best/script directory

wolfassi123 · 2022-03-11T10:10:52Z

> Look in tessdata_best/script directory

Yep, I can find the Arabic.traineddata file there, but not the Arabic.lstm file.

Shreeshrii · 2022-03-11T10:11:01Z

Lstm file is extracted from traineddata file. Use combine_tessdata command.

wolfassi123 · 2022-03-11T11:06:15Z

> Lstm file is extracted from traineddata file. Use combine_tessdata command.

I usually run combine_tessdata like this: "combine_tessdata -e ./$ModelName/$BaseLang/$BaseLang.traineddata $Lang.lstm"
In https://github.com/Shreeshrii/tessdata_arabic/blob/master/build/tesstrain_zade.sh, you used "combine_tessdata -d ./ara-Scheherazade-train/ara/ara.traineddata" instead.
I can't find the generated .lstm files. Where should they be after running combine_tessdata? I know there is an ara.lstm file in the tessdata_best directory, but I still can't find Arabic.lstm, even after running the following in the previous cell: "combine_tessdata -d /content/tesstutorial/tesseract/tessdata/best/script/Arabic.traineddata"

Shreeshrii · 2022-03-11T11:18:55Z

combine_tessdata -d just displays the info. Use -e to extract the lstm file.

I haven't looked at this repo in a while. Extract Arabic.lstm from Arabic.traineddata and move to the required dir.
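
A minimal sketch of that extraction step, assuming combine_tessdata is installed and on PATH. The helper names and the output directory are illustrative; the only real CLI behaviour relied on is that "combine_tessdata -e <traineddata> <output file>" unpacks the component matching the output file's extension.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def extract_lstm_command(traineddata: Path, out_dir: Path) -> List[str]:
    """Build the combine_tessdata call that extracts the .lstm component."""
    # The .lstm extension on the output name selects which component is unpacked.
    out_file = out_dir / (traineddata.stem + ".lstm")
    return ["combine_tessdata", "-e", str(traineddata), str(out_file)]


def extract_lstm(traineddata: Path, out_dir: Path) -> Path:
    """Run the extraction and return the path of the written .lstm file."""
    if shutil.which("combine_tessdata") is None:
        raise RuntimeError("combine_tessdata not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    cmd = extract_lstm_command(traineddata, out_dir)
    subprocess.run(cmd, check=True)  # raises CalledProcessError if the tool fails
    return Path(cmd[-1])
```

For example, extract_lstm(Path("tessdata_best/script/Arabic.traineddata"), Path("work")) would be expected to write work/Arabic.lstm, which can then be moved to wherever the training script looks for it.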

wolfassi123 · 2022-03-16T10:48:25Z

@Shreeshrii I have been using the pipeline shared above and training for the past few days. I also added my own generated text lines, hoping for better results, but I have not yet reached a traineddata that performs well on both letters and numbers. It is better than some of my previous attempts in that it now detects numbers, but it gets some of them wrong. Do you have any idea how to solve this? I added around 10,000 lines containing Arabic numbers and another 10,000 containing Arabic dates, and I still see problems with numbers. I also came across issue #3233, although I am not getting the "can't encode transcription" error. There must be a way to train on both numbers and letters in Arabic and get the desired traineddata.

Shreeshrii · 2022-03-16T11:00:24Z

I haven't had success with training both Arabic text and numbers.

kylefoley76 · 2023-12-23T20:19:52Z

Hi Shreeshrii, I've been trying to do some Arabic OCR, but I can only get about a 95% accuracy rate. Have you been able to do any better than that, and if so, how?
