Improve yor.traineddata for Yoruba #89
http://crubadan.org/languages/yo for Yoruba - An Crúbadán - Corpus Building for Minority Languages
Thanks @Shreeshrii for creating an issue for this. I looked at the Crúbadán corpus. Most of the URLs it scrapes from contain Yoruba that is not properly marked. Given the high noise-to-signal ratio, I don't think it would be good to train with that (or with most web-scraped data). I currently have two websites that reliably always have properly marked Yoruba. I am thinking of taking screenshots of the text and also passing in the text in text form. I think this would be a good starting point for improving the model. Does this idea sound good?
Making screenshots is not very useful. You need the text itself. A web crawler is what you need to use. Please list the URLs of those two sites. Did you try to extract the wordlist from yor.traineddata and examine it?
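(For reference, a rough sketch of how the wordlist could be pulled out for inspection, assuming Tesseract's training tools are installed; the exact component names are an assumption and differ between the legacy and LSTM formats.)

```sh
# Unpack the traineddata components into files prefixed with "yor."
combine_tessdata -u yor.traineddata yor.
# Convert the extracted word DAWG back into a plain wordlist for review
# (an LSTM-only model may ship yor.lstm-word-dawg instead of yor.word-dawg).
dawg2wordlist yor.unicharset yor.word-dawg yor.wordlist.txt
```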
@amitdo I meant my last message in the context of useful training data for Tesseract's yor.traineddata, not my project. Please confirm that this OCR system takes in only text, and not also images, to train its models to predict what text an image contains. The URLs are: Wikipedia (3) only has marked Yoruba on that first page. Every page it links to (and every other page on yo.wikipedia.com that I've seen) is not properly marked. This is not the case for 1 and 2.
The images for training data are created by the text2image tool. It renders images from text files using a variety of digital fonts.
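A minimal invocation might look like the following (the file names and the font are placeholders, not the ones used for the official yor.traineddata):

```sh
# Render page images plus box files from a UTF-8 Yoruba training text
# using one digital font; repeat per font to cover more styles.
text2image --text=yor.training_text \
           --outputbase=yor.NotoSans.exp0 \
           --font='Noto Sans' \
           --fonts_dir=/usr/share/fonts
```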
Ah, I see. I probably should have read the docs more carefully. But that's very interesting; I wouldn't have thought to do that. The only thing to note is that I broke them down into one sentence per line. I hope that doesn't affect the model. I will keep adding more as I find them.
Any updates on this? Anything I can be doing on my end? |
I am hoping that @theraysmith will include your resources for his next training. |
Any updates on this? |
@theraysmith
See https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/RF1rk3-z4uo/noQzBWbuCAAJ
Message from @Timilehin copied below
I am working on a side project in Yoruba that might be helpful. It predicts the right diacritics on unmarked Yoruba words. I imagine you could also run the OCR allowing only unmarked characters as output (maybe reduce the height of the scan window so it doesn't see the diacritics), then pipe the unmarked output through the tool I'm building and use the result as a fallback for when the image recognition is not sure.
My project right now needs more training data to make the model more robust. It is very tough to find properly marked Yoruba text on the internet. I have physical books and some scanned PDFs on archive.org that I want to convert to text, but yor.traineddata doesn't seem robust enough. It makes many mistakes, such as ọdọ instead of ẹdẹ.
Other times, it just spits out gibberish.
What can I provide to help make yor.traineddata much better, and in what quantity (e.g., 200 page images of Yoruba text plus the Yoruba text they contain)? I think both projects can reinforce each other. I look forward to hearing back.
link to proj -> https://github.com/Timilehin/Yoruba-Intonator
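A minimal sketch of the pipeline described in the copied message, assuming the legacy engine (character whitelisting is not honored by every version of the LSTM engine) and a hypothetical command-line entry point for the Yoruba-Intonator tool:

```sh
# Recognize only unmarked base characters (the whitelist is illustrative, not exhaustive).
tesseract page.png page_base -l yor \
  -c tessedit_char_whitelist="abdefghijklmnoprstuwyABDEFGHIJKLMNOPRSTUWY .,"
# Restore the diacritics with the external tool (this invocation is hypothetical).
python intonator.py < page_base.txt > page_marked.txt
```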