kur_ara does not have Arabic unicharset. #14
Comments
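@Shreeshrii Have I understood correctly: does the traineddata need to be swapped between the packages?
tesseract-ocr-kur-ara -> tesseract-ocr-kur
tesseract-ocr-kur -> tesseract-ocr-kur-ara |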
There is no traineddata for kur in tessdata_fast.
I will unpack the traineddata, convert the dawgs to a word list, and see whether it is
possible to correct the kur_ara files.
Please do not make any change yet.
On Sat 24 Mar, 2018, 12:06 PM Alexander Pozdnyakov wrote:
> @Shreeshrii I understood correctly.
> traineddata need to change in packages?
> tesseract-ocr-kur-ara -> tesseract-ocr-kur
> tesseract-ocr-kur -> tesseract-ocr-kur-ara
|
Yes, the above change can be made. Currently kur_ara has Latin text only.
This cannot be done since there is no kur traineddata in tessdata_fast. |
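One way to check the claim that kur_ara contains only Latin text is to count, by script, the entries of the unicharset unpacked from the traineddata. The following is a minimal sketch, not part of the original thread. It assumes the plain-text unicharset layout in which the first line holds the entry count and every following line starts with the character, with the remaining fields ignored here. The function name `script_counts` and the sample data are hypothetical.

```python
import unicodedata


def script_counts(unicharset_text):
    """Roughly count unicharset entries by script using Unicode character names."""
    counts = {"ARABIC": 0, "LATIN": 0, "OTHER": 0}
    lines = unicharset_text.splitlines()
    for line in lines[1:]:  # assumption: the first line is the entry count
        if not line.strip():
            continue
        unichar = line.split(" ", 1)[0]  # assumption: the character is the first field
        if unichar == "NULL":  # placeholder entry, not a real character
            continue
        names = [unicodedata.name(ch, "") for ch in unichar]
        if any(name.startswith("ARABIC") for name in names):
            counts["ARABIC"] += 1
        elif any(name.startswith("LATIN") for name in names):
            counts["LATIN"] += 1
        else:
            counts["OTHER"] += 1
    return counts


# Hypothetical sample; a real unicharset line carries more fields than shown here.
sample = "5\nNULL 0\na 3\nb 3\nب 1\n7 8\n"
print(script_counts(sample))
```

A result with zero Arabic entries for a file named kur_ara would be consistent with the problem reported in this issue.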
Should I build kur_ara from ara.traineddata, e.g. by replacing the wordlist? Or is there an updated set of Arabic-script traineddata files that can be uploaded before the 4.0.0 release? ref: tesseract-ocr/langdata#83 (comment)
|
Maybe it should be 'kur_lat'. |
ok |
Was this issue solved by the renaming? |
kmr is Kurdish in Latin script. The renaming has fixed that issue. kur was Kurdish in Arabic script in Tesseract 3. We have still not restored kur or kur_ara. |
So you suggest restoring https://github.com/tesseract-ocr/tessdata/blob/3.04.00/kur.traineddata to the master branch of tessdata? |
That will not work, since Arabic script in 3.04 relied on the cube engine, which is no
longer in the codebase.
|
I will try to recreate using the wordlist and training text by fine-tuning.
|
https://github.com/Shreeshrii/tesstrain-ckb
ckb is the preferred prefix rather than kur_ara.
My fine-tuned training gives improved results compared to the official ara and
script/Arabic traineddata on the synthetic eval set.
|
Please see details at
tesseract-ocr/tessdata#88 (comment)
tesseract-ocr/langdata#116
tesseract-ocr/tessdata_best#23
@jbreiden @AlexanderP - FYI - regarding problem with packaged traineddata for kur_ara.