ik fails to output the right offsets if char filters are applied to the input stream #136

Open
GoogleCodeExporter opened this issue Apr 7, 2016 · 0 comments

Comments

@GoogleCodeExporter

Hi Team,
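I found that the ik tokenizer outputs wrong offsets for strings that have been processed by html_filter. BaseCharFilter, the base class of html_filter, holds two arrays, offsets and diffs: the offsets of the tokens after stripping, and the deltas needed to correct them relative to the source string. The ik code (I am using ik2012 FF hotfix1 from Google Code) does not handle these offsets and diffs. As a result, the offsets it outputs are offsets into the processed string with the HTML tags removed, not into the original input. I made a change on my GitHub and tested it roughly, and it seems to work. The main changes are in this GitHub pull request: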

https://github.com/xpandan/ik-analyzer/commit/7cc797ca78399cdae4f31181970e85db28be4e5d
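
To illustrate the problem described above, here is a minimal, self-contained sketch of how an offsets/diffs table can map a position in the stripped text back to the original text. It is a simplified model written for this issue, not Lucene's BaseCharFilter code and not the code in the linked commit; the class name `OffsetCorrectionSketch` and its methods `addCorrection` and `correct` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Simplified model of a char filter's offset-correction table.
 * Assumption: this is NOT Lucene's implementation; all names are hypothetical.
 */
final class OffsetCorrectionSketch {
    // offsets.get(i): position in the filtered text from which correction i applies.
    private final List<Integer> offsets = new ArrayList<>();
    // diffs.get(i): cumulative number of characters removed before that position.
    private final List<Integer> diffs = new ArrayList<>();

    /** Records a correction; call in increasing order of filteredOffset. */
    void addCorrection(int filteredOffset, int cumulativeDiff) {
        offsets.add(filteredOffset);
        diffs.add(cumulativeDiff);
    }

    /** Maps an offset in the filtered text back to the original text. */
    int correct(int filteredOffset) {
        int diff = 0;
        // Use the last correction whose position is <= the requested offset.
        for (int i = 0; i < offsets.size(); i++) {
            if (offsets.get(i) > filteredOffset) {
                break;
            }
            diff = diffs.get(i);
        }
        return filteredOffset + diff;
    }

    public static void main(String[] args) {
        String original = "<b>hello</b> world";
        String filtered = "hello world"; // what the tokenizer sees after stripping

        OffsetCorrectionSketch map = new OffsetCorrectionSketch();
        map.addCorrection(0, 3); // "<b>" (3 chars) removed before filtered offset 0
        map.addCorrection(5, 7); // "</b>" (4 more chars) removed before filtered offset 5

        int start = filtered.indexOf("world");
        int end = start + "world".length();

        // Uncorrected offsets point at the wrong span of the original input.
        System.out.println(original.substring(start, end)); // prints "lo</b"
        // Corrected offsets recover the token in the original input.
        System.out.println(original.substring(map.correct(start), map.correct(end))); // prints "world"
    }
}
```

In Lucene itself, `Tokenizer` provides a `correctOffset(int)` method that delegates to the char filter chain, so a tokenizer would normally pass its start and end offsets through it before setting them on the token. Whether the linked commit does exactly this is not stated in this issue.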

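html_strip itself also has quite a few bugs, so you can test with the mapping filter instead; the principle is the same. When you have time, please help review my code. I only started looking into Lucene recently for a project, so any advice is appreciated.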


Best,
Dan

Original issue reported on code.google.com by [email protected] on 12 Sep 2014 at 10:34
