kur_ara does not have Arabic unicharset. #14

Shreeshrii · 2018-03-23T12:15:59Z

Please see details at

@jbreiden @AlexanderP - FYI - regarding problem with packaged traineddata for kur_ara.

AlexanderP · 2018-03-24T06:36:08Z

@Shreeshrii Did I understand correctly?
Does the traineddata need to change in the packages?
tesseract-ocr-kur-ara -> tesseract-ocr-kur
tesseract-ocr-kur -> tesseract-ocr-kur-ara

Shreeshrii · 2018-03-24T10:05:10Z

There is no traineddata for kur in tessdata_fast. I will unpack and convert the dawgs to word list and see if it is possible to correct kur_ara files. Please do not make any change yet.

Shreeshrii · 2018-03-24T11:55:58Z

@AlexanderP

tesseract-ocr-kur-ara -> tesseract-ocr-kur

Yes, the above change can be made. Currently kur_ara has Latin text only.

tesseract-ocr-kur -> tesseract-ocr-kur-ara

This cannot be done since there is no kur traineddata in tessdata_fast.
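A rough way to check the "Latin text only" observation above is to tally the scripts in the unpacked unicharset. This is a minimal sketch, not the method used in this thread. It assumes the unicharset is a text file whose first line is the entry count and whose remaining lines each start with the unichar followed by a space. The helper name `count_scripts` and the set of special entries skipped are illustrative assumptions.

```python
import unicodedata

# Special placeholder entries commonly seen at the top of a unicharset
# (assumption: treated here as non-script entries).
SPECIAL_ENTRIES = {"NULL", "Joined", "|Broken|0|1"}


def count_scripts(unicharset_lines):
    """Tally unichars as 'arabic', 'latin', or 'other' by Unicode character name."""
    counts = {"arabic": 0, "latin": 0, "other": 0}
    # Skip the first line, which is assumed to hold the number of entries.
    for line in unicharset_lines[1:]:
        if not line.strip():
            continue
        unichar = line.split()[0]
        if unichar in SPECIAL_ENTRIES:
            counts["other"] += 1
            continue
        names = [unicodedata.name(ch, "") for ch in unichar]
        if any(name.startswith("ARABIC") for name in names):
            counts["arabic"] += 1
        elif any(name.startswith("LATIN") for name in names):
            counts["latin"] += 1
        else:
            counts["other"] += 1
    return counts
```

The lines could come from a unicharset extracted with `combine_tessdata -u`, read with `open(path, encoding="utf-8").read().splitlines()`. A result with zero Arabic entries would be consistent with the observation above.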

Shreeshrii · 2018-03-24T12:07:13Z

@jbreiden @theraysmith

Should I build kur_ara from ara.traineddata, e.g. by replacing the word list?

Or is there an updated set of Arabic-script traineddata files that can be uploaded before the 4.0.0 release?

ref: tesseract-ocr/langdata#83 (comment)

I was going to push until I discovered a bug with the RTL word lists.
Then I also need to integrate this issues list, that I haven't looked at in a while, and rerun training.

amitdo · 2018-03-24T12:09:34Z

Maybe it should be 'kur_lat'.

AlexanderP · 2018-03-25T09:12:08Z

There is no traineddata for kur in tessdata_fast.
I will unpack and convert the dawgs to word list and see if it is possible
to correct kur_ara files.
Please do not make any change yet.

ok

stweil · 2019-12-17T18:08:33Z

Was this issue solved by the renaming?

Shreeshrii · 2019-12-19T12:41:22Z

kmr is Kurdish in Latin script. Renaming has fixed that issue.

In Tesseract 3, kur was Kurdish in Arabic script. We still have not restored kur or kur_ara.

stweil · 2019-12-19T14:24:06Z

So you suggest restoring https://github.com/tesseract-ocr/tessdata/blob/3.04.00/kur.traineddata to the master branch of tessdata?

Shreeshrii · 2019-12-19T14:29:30Z

That will not work, since Arabic script in 3.04 relied on cube, which is no longer in the codebase.

Shreeshrii · 2019-12-20T14:35:38Z

I will try to recreate using the wordlist and training text by fine-tuning.

Shreeshrii · 2020-01-29T10:48:29Z

https://github.com/Shreeshrii/tesstrain-ckb

ckb is the preferred prefix rather than kur_ara.

My fine-tuned training gives improved results compared to the official ara and script/Arabic traineddata on the synthetic eval set.

This was referenced Apr 21, 2018

Language request: Kurdish-Kurmanji tesseract-ocr/langdata#124

Closed

correct name kur_ara to kmr - Kurmanji (Latin script) #16

Merged
