Add Korean, Indonesian, and Hebrew support #47

dlawrie · 2022-08-22T15:24:18Z

Support the above languages in Patapsco

isoboroff · 2022-11-22T18:15:41Z

I'm getting ready to run some Korean data. Do you have a recommendation for how to go about selecting elements for the process stage? spaCy has a Korean pipeline... perhaps Moses would work, but I'm not sure.

If I had a Korean IR test collection I wouldn't be asking this question ;-)

cash · 2022-11-22T18:49:45Z

Stop words were merged in from pull request #48. I tested the pipeline on some Korean documents a few months back. I think Patapsco defaulted to the UD tokenization model. I had someone who reads Korean take a look and she thought it was reasonable. UD tokenization stats here: https://explosion.ai/blog/ud-benchmarks-v3-2

So the short answer is that you can set the language code and Patapsco should just work for Korean.

I will note that there is an issue when running Patapsco with multiple processes: the first time it tries to download a model, the processes can hit a race condition. I need to restructure how the models get downloaded automatically in the multiprocessing setting.
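
For illustration only, one common way to avoid this kind of download race is to guard the download with an exclusive file lock so that only the first process fetches the model and the others wait, then reuse it. This is a minimal sketch, not Patapsco's actual code: `download_once` and `download_fn` are hypothetical names, the model is assumed to be a single file, and `fcntl` assumes a Unix system.

```python
import fcntl
import os
from pathlib import Path


def download_once(model_path, download_fn):
    """Call download_fn(tmp_path) at most once, even across processes.

    download_fn is a hypothetical callable that writes the model to the
    path it is given; it stands in for whatever does the real download.
    """
    model_path = Path(model_path)
    model_path.parent.mkdir(parents=True, exist_ok=True)
    lock_path = model_path.with_name(model_path.name + ".lock")

    with open(lock_path, "w") as lock_file:
        # Blocks until any other process holding the lock releases it.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            # Re-check under the lock: another process may have finished
            # the download while this one was waiting.
            if not model_path.exists():
                tmp_path = model_path.with_name(model_path.name + ".tmp")
                download_fn(tmp_path)
                # Atomic rename, so readers never see a partial file.
                os.replace(tmp_path, model_path)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return model_path
```

A simpler workaround with the same effect is to run a single process once to fetch the model before starting the multi-process run.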

cash · 2022-11-22T19:19:10Z

Sorry @isoboroff - forgot to tag you in my response

isoboroff · 2022-11-23T13:10:34Z

Indeed, with an up-to-date repo Korean indexes fine and passes sanity-check searches.

NTCIR 3-6 has Korean data in their CLIR track. I've reached out to Noriko Kando to get the data, but I bet Paul McNamee has it already.
