Empty list returned when working with Devanagri Script #38

Open
jovidsilva opened this issue Jun 14, 2021 · 2 comments
@jovidsilva

jovidsilva commented Jun 14, 2021

Hi, I'm working with texts in Devanagari script (a popular script used in India, unlike the Latin script used by English-like languages). When I try to generate keywords, an empty list is returned. The code is below.

from multi_rake import Rake

full_text="शेवणें आनी शेतकार एक आसलेलो शेतकार तेणें बरें शेत रोयलेलें रोयल्यार कितें जालें थाम वाडलें आनी इल्लें इल्लें करून पोटराक येयलें आनी थोडे दीस वयतकच कुचकुचीत गोट्याचें कणस सुटलें आनी वाऱ्याचेर बरें धोलूंक लागलें शेतकाराक सामकी उमेद जाली आतां म्हण लागलो रोकडेंच आपूण शेत लुंवतलो आनी भात घरा व्ह"

rake = Rake(max_words_unknown_lang=1)

keywords = rake.apply(full_text)

@vgrabovets
Owner

It's hard for me to fix this without at least basic knowledge of the script.
I can point you to the problem in the code, though.
There is a regexp, \p{L}+, that processes the input text so that words are counted properly. It keeps only letters, so "hello, world!" is transformed into "hello world".
When I pass शेवणें आनी शेतकार, it is transformed into शे वणें आनी शे तका र. The regexp introduces additional spaces that break the subsequent logic. To stay in line with the general logic, the text should have stayed as शेवणें आनी शेतकार.
Maybe we don't need to use the regexp for this script and can simply split sentences on whitespace? I have no idea whether this is the right thing to do.
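
A likely reason for the extra splits is that Devanagari vowel signs and similar diacritics are classified in Unicode as combining marks (general category M), not letters (category L), so a letters-only pattern breaks a word wherever such a sign occurs. The sketch below illustrates this with the standard-library unicodedata module instead of the library's own regexp. The helper split_by_category is hypothetical and written only for this illustration, and its output is not claimed to match the library's internal result exactly.

```python
import unicodedata


def split_by_category(text, keep_prefixes):
    """Return runs of characters whose Unicode general category starts
    with one of keep_prefixes; every other character is a separator.
    Hypothetical helper for illustration, not part of the library."""
    tokens, current = [], []
    for ch in text:
        if unicodedata.category(ch).startswith(keep_prefixes):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


sample = "शेवणें आनी शेतकार"

# Letters only (similar in spirit to \p{L}+): vowel signs act as separators,
# so each word is cut into several fragments.
letters_only = split_by_category(sample, ("L",))

# Letters plus combining marks (similar in spirit to [\p{L}\p{M}]+):
# the vowel signs stay attached and the words remain whole.
letters_and_marks = split_by_category(sample, ("L", "M"))

print(letters_only)
print(letters_and_marks)
print(sample.split())
```

For this sample, keeping marks together with letters yields the same tokens as splitting on whitespace. Whether either approach is appropriate for the rest of the keyword extraction logic is an open question that would need checking against the library's code.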

@jovidsilva
Author

jovidsilva commented Jun 25, 2021 via email
