Ambiguity between quantifiers and numerals #127

Open
GoogleCodeExporter opened this issue Apr 7, 2016 · 0 comments

@GoogleCodeExporter

Input text: “两门衣柜” (two-door wardrobe)
Expected result: “两门 衣柜”
Actual result: “两 门 衣柜”
Version: 2012u6

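While testing, I found that in the Chinese quantifier segmenter, “两” is first handled by the processCNumber method. Because that method cannot yet tell whether the following characters will combine into a larger number, it only records the position. Next, “两” is handled by the processCount method, which matches it in the quantifier dictionary and adds it to the orgLexemes set of AnalyzeContext. The character cursor is then incremented, and the character to process is now “门”. The numeral match in processCNumber fails, so the method tries to add a numeral based on the position it recorded earlier. However, the orgLexemes set ignores duplicates, so “两” stays fixed as a quantifier.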
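
The sequence above can be reproduced with a small model. This is only a sketch, not the library's code: `SketchLexeme`, `DuplicateSpanDemo`, and the type names are hypothetical, and a `TreeSet` stands in for orgLexemes on the assumption that two lexemes covering the same span compare as equal regardless of type.

```java
import java.util.Comparator;
import java.util.TreeSet;

// Hypothetical, simplified lexeme; not the analyzer's real class.
final class SketchLexeme {
    enum Type { CNUM, COUNT } // numeral, quantifier

    final int begin;
    final int length;
    final Type type;

    SketchLexeme(int begin, int length, Type type) {
        this.begin = begin;
        this.length = length;
        this.type = type;
    }
}

final class DuplicateSpanDemo {
    // Assumption: equality is decided by span only, so the type is not compared.
    static final Comparator<SketchLexeme> BY_SPAN =
            Comparator.<SketchLexeme>comparingInt(l -> l.begin)
                      .thenComparingInt(l -> l.length);

    // Replays the order of events from the report and returns the surviving type.
    static SketchLexeme.Type survivingTypeForFirstChar() {
        TreeSet<SketchLexeme> orgLexemes = new TreeSet<>(BY_SPAN);
        // Cursor on the first character: the numeral step only remembers the
        // position, while the quantifier step adds its lexeme right away.
        orgLexemes.add(new SketchLexeme(0, 1, SketchLexeme.Type.COUNT));
        // Cursor on the second character: the numeral step now flushes the
        // remembered numeral, but the span is already present, so add() is a no-op.
        orgLexemes.add(new SketchLexeme(0, 1, SketchLexeme.Type.CNUM));
        return orgLexemes.first().type;
    }

    public static void main(String[] args) {
        System.out.println(survivingTypeForFirstChar()); // prints COUNT
    }
}
```

In this model the quantifier reading wins purely because it was inserted first, which matches the behaviour reported here.
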
Is this handling intentional? If so, what is its purpose? If not, I suggest adding a check at this point.
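
One possible shape for such a check is sketched below. Everything here is hypothetical (`TypedSpan`, `SpanSetWithCheck`), and the rule that a numeral overrides a quantifier on the same span is my assumption, based only on the expected result above; the maintainers may prefer a different rule.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified lexeme record; not the analyzer's real class.
final class TypedSpan {
    enum Type { CNUM, COUNT } // numeral, quantifier

    final int begin;
    final int length;
    Type type;

    TypedSpan(int begin, int length, Type type) {
        this.begin = begin;
        this.length = length;
        this.type = type;
    }
}

final class SpanSetWithCheck {
    private final List<TypedSpan> spans = new ArrayList<>();

    // Returns true if a new span was stored, false if the span already existed.
    boolean add(int begin, int length, TypedSpan.Type type) {
        for (TypedSpan existing : spans) {
            if (existing.begin == begin && existing.length == length) {
                // The extra check: do not silently drop a numeral reading.
                // Assumption: a numeral overrides a quantifier on the same span.
                if (existing.type == TypedSpan.Type.COUNT
                        && type == TypedSpan.Type.CNUM) {
                    existing.type = TypedSpan.Type.CNUM;
                }
                return false;
            }
        }
        spans.add(new TypedSpan(begin, length, type));
        return true;
    }

    // Returns the stored type for a span, or null if the span is absent.
    TypedSpan.Type typeAt(int begin, int length) {
        for (TypedSpan s : spans) {
            if (s.begin == begin && s.length == length) {
                return s.type;
            }
        }
        return null;
    }
}
```

With this rule, adding the quantifier first and the numeral second leaves the span typed as a numeral, while the reverse order leaves the numeral untouched.
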
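In addition, I suggest supporting dynamic loading of dictionaries, ideally through an interface that users implement themselves, so that data sources such as databases can be supported. Ambiguity resolution could likewise expose an interface for extension. I am currently doing word-frequency-based segmentation and had to add my logic to the compareTo method of LexemePath, which makes upgrading the library very troublesome.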
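
To make the two requests concrete, here is a rough sketch of what such extension points could look like. None of these types exist in the library: `DictionarySource`, `PathArbitrator`, `FrequencyArbitrator`, and `ExtensionSketch` are hypothetical names, and a candidate segmentation is modelled as a plain list of words rather than a LexemePath.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical extension point: where dictionary words come from
// (a file, a database query, a remote service, and so on).
interface DictionarySource {
    Collection<String> loadWords();
}

// Hypothetical extension point: how two candidate segmentations are ranked.
interface PathArbitrator extends Comparator<List<String>> { }

// Example arbitrator that prefers the candidate with the higher total word frequency.
final class FrequencyArbitrator implements PathArbitrator {
    private final Map<String, Integer> frequencies;

    FrequencyArbitrator(Map<String, Integer> frequencies) {
        this.frequencies = frequencies;
    }

    private int score(List<String> path) {
        int total = 0;
        for (String word : path) {
            total += frequencies.getOrDefault(word, 0); // unknown words count as 0
        }
        return total;
    }

    @Override
    public int compare(List<String> a, List<String> b) {
        // The higher-scoring candidate sorts first.
        return Integer.compare(score(b), score(a));
    }
}

final class ExtensionSketch {
    // Builds an in-memory dictionary from any source the caller supplies.
    static Set<String> loadDictionary(DictionarySource source) {
        return new HashSet<>(source.loadWords());
    }

    // Picks the candidate that the supplied arbitrator ranks first.
    static List<String> pickBest(List<List<String>> candidates, PathArbitrator arbitrator) {
        return Collections.min(candidates, arbitrator);
    }
}
```

With hooks like these, frequency-based ranking would live in user code such as `FrequencyArbitrator` instead of a patched compareTo, so it would survive a library upgrade.
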

Original issue reported on code.google.com by [email protected] on 15 Oct 2013 at 6:26
