Add Korean, Indonesian, and Hebrew support #47
Comments
I'm getting ready to run some Korean data. Do you have a recommendation for how to go about selecting elements for the process stage? spaCy has a Korean pipeline, and perhaps Moses would work, but I'm not sure. If I had a Korean IR test collection I wouldn't be asking this question ;-)
Stop words were merged in from pull request #48. I tested the pipeline on some Korean documents a few months back. I think Patapsco defaulted to the UD tokenization model. I had someone who reads Korean take a look, and she thought it was reasonable. UD tokenization stats are here: https://explosion.ai/blog/ud-benchmarks-v3-2 So the short answer is that you can set the language code and Patapsco should just work for Korean. One caveat: there is a race condition when Patapsco runs with multiple processes and tries to download a model for the first time. I need to restructure how the models get downloaded automatically in the multiprocessing setting.
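One common way to avoid this kind of first-download race is to guard the download with an atomically created lock file, so that only one process fetches the model while the others wait. The sketch below is illustrative only and is not Patapsco's actual code: `download_once`, the `.lock` and `.done` marker files, and the `download` callable are all hypothetical names, and it assumes every worker sees the same filesystem.

```python
import os
import time
from pathlib import Path


def download_once(model_dir, download, poll_seconds=0.1, timeout=600.0):
    """Run download(model_dir) at most once across concurrent callers.

    Hypothetical helper: `download` stands in for whatever fetches the model.
    """
    model_dir = Path(model_dir)
    done = model_dir.with_name(model_dir.name + ".done")  # download finished
    lock = model_dir.with_name(model_dir.name + ".lock")  # download in progress
    model_dir.parent.mkdir(parents=True, exist_ok=True)
    deadline = time.monotonic() + timeout

    while not done.exists():
        try:
            # O_CREAT | O_EXCL is atomic: only one caller can create the file.
            fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            # Someone else is downloading; wait and check again.
            if time.monotonic() > deadline:
                raise TimeoutError(f"timed out waiting for {model_dir}")
            time.sleep(poll_seconds)
            continue
        os.close(fd)
        try:
            # Re-check under the lock in case another caller just finished.
            if not done.exists():
                download(model_dir)
                done.touch()
        finally:
            lock.unlink()
    return model_dir
```

Each worker would call `download_once(...)` before loading the model. A lock file left behind by a crashed process would make other callers time out, so a real implementation would need some stale-lock handling. Running the job once with a single process to populate the model cache would presumably sidestep the race as well.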
Sorry @isoboroff - forgot to tag you in my response
Indeed, with an up-to-date repo Korean indexes fine and passes sanity-check searches. NTCIR 3-6 has Korean data in their CLIR track. I've reached out to Noriko Kando to get the data, but I bet Paul McNamee has it already.
Support the above languages in Patapsco