Add Korean, Indonesian, and Hebrew support #47

Open
dlawrie opened this issue Aug 22, 2022 · 4 comments

Comments

@dlawrie
Collaborator

dlawrie commented Aug 22, 2022

Support the above languages in Patapsco

@isoboroff

I'm getting ready to run some Korean data. Do you have a recommendation for how to go about selecting elements for the process stage? spaCy has a Korean pipeline, and perhaps Moses would work, but I'm not sure.

If I had a Korean IR test collection I wouldn't be asking this question ;-)

@cash
Member

cash commented Nov 22, 2022

Stop words were merged in from pull request #48. I tested the pipeline on some Korean documents a few months back. I think Patapsco defaulted to the UD tokenization model. I had someone who reads Korean take a look and she thought it was reasonable. UD tokenization stats here: https://explosion.ai/blog/ud-benchmarks-v3-2

So the short answer is that you can set the language code and Patapsco should just work for Korean.

I will note that there is an issue when running Patapsco with multiple processes: the first time it tries to download a model, the processes hit a race condition. I need to restructure how the models get downloaded automatically in the multiprocessing setting.
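For illustration only, here is a minimal sketch of one common way to serialize a first-time download across processes: an exclusive lock file plus an atomic rename. It is not Patapsco's code and not the planned fix. `fetch_once` and the `download` callback are hypothetical names, and the callback stands in for whatever actually fetches the model.

```python
import os
import time
from pathlib import Path


def fetch_once(target: Path, download, timeout: float = 60.0) -> Path:
    """Make sure `target` exists, letting only one process run `download`.

    `download` is a hypothetical callback that writes the model to the
    path it is given; it is a placeholder, not a Patapsco or spaCy API.
    """
    lock = target.with_name(target.name + ".lock")
    deadline = time.monotonic() + timeout
    while not target.exists():
        try:
            # O_EXCL makes creation fail if another process holds the lock.
            fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"gave up waiting for {lock}")
            time.sleep(0.1)  # someone else is downloading; wait and re-check
            continue
        try:
            # Re-check under the lock: the file may have appeared meanwhile.
            if not target.exists():
                tmp = target.with_name(target.name + ".part")
                download(tmp)
                os.replace(tmp, target)  # atomic: readers never see a partial file
        finally:
            os.close(fd)
            os.unlink(lock)
    return target
```

Each worker would call `fetch_once` before loading the model, so only the first caller downloads and the rest wait. A simpler alternative is to trigger the download once in the parent process before starting any workers. Note that a process that crashes while holding the lock leaves a stale `.lock` file behind, which this sketch only handles by timing out.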

@cash
Member

cash commented Nov 22, 2022

Sorry @isoboroff - forgot to tag you in my response

@isoboroff

Indeed, with an up-to-date repo Korean indexes fine and passes sanity-check searches.

NTCIR 3-6 has Korean data in their CLIR track. I've reached out to Noriko Kando to get the data, but I bet Paul McNamee has it already.
