Tailoring: denormalized Japanese code points in the default FCE table #52
Comments
@KL-7 what does 'failing in ICU' mean in this context? Have you filed a ticket?
@srl295, I believe that's a problem with the test data and not the implementation. I was looking for tailoring tests and I was quite disappointed when I found this note saying that CLDR no longer provides conformance tests. Out of frustration I used tests from an older version of CLDR from here. I ran ICU4J (as a reference implementation) on these tests, excluded those that failed, and used the rest as a test suite for our implementation.
@KL-7 Ouch... several times ouch. One thing that could have been done, or even still could be done, would be to request generation of newer data. As I mentioned, we don't get much notice of others picking up the data, period, until they are in some sense 'done' (as with TwitterCLDR's announcement). I've never heard of anyone actually using that test data, besides CLDR's own tests.

I don't want to scare you off by repeating myself, but please file tickets and use the mailing list. In any event, it may be better to use ICU's test cases. The one below is not comprehensive (there are others), but it is one place to start. I think it is consumed by both C and J.

http://source.icu-project.org/repos/icu/icu/trunk/source/test/testdata/DataDrivenCollationTest.txt

I assume Ruby has some mechanism for calling, or being called from, C or Java, so one could also consider testing by comparing results. Worst case, you could execute my usort sample and compare the output.

http://source.icu-project.org/repos/icu/icuapps/trunk/usort/
@srl295, I think we're in pretty good shape even now, because I'm using only tests that ICU4J passes (it's basically the results comparison that you mentioned). I found a lot of issues (hopefully, most of them) in my implementation that way. I had a hard time trying to track down ICU's test data, so I used what I had at hand. Regarding people using this data, I know for sure that at least

And about mailing lists... I don't use them a lot in general, and at the time I didn't feel brave enough to write something to the Unicode or CLDR mailing list. But you made me believe that it's not that scary =) Next time I need help or spot an issue I won't hesitate.
@KL-7 Hm, ZTM does not seem to be active presently, but it would have been good to have their input. Glad to have made things a little less scary.
It turned out that some code points occur in the default FCE table in denormalized form. Since we always normalize the given code points to NFD before lookup, the denormalized entries of the FCE table can never match and are effectively ignored. If processing the normalized and denormalized forms results in different collation elements, we end up with the wrong collation order.
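To make the failure mode concrete, here is a minimal Ruby sketch. It is not the TwitterCLDR code: the table, its weights, and the helper names (`nfd`, `lookup`, `normalize_keys`) are hypothetical and made up for illustration. It assumes a table with one entry stored under a precomposed key (U+30AC, ガ) and a lookup that normalizes its input to NFD and takes the longest matching prefix.

```ruby
# Hypothetical miniature of a collation element table. The weights are
# invented for illustration and are NOT taken from the real FCE table.
TABLE = {
  [0x30AB] => [[0x3D5A, 0x05, 0x05]], # KATAKANA LETTER KA
  [0x3099] => [[0x0000, 0x37, 0x05]], # COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
  [0x30AC] => [[0x3D5A, 0x05, 0x9F]]  # KATAKANA LETTER GA, stored precomposed (denormalized key)
}.freeze

# Normalize an array of code points to NFD using Ruby's built-in normalizer.
def nfd(code_points)
  code_points.pack('U*').unicode_normalize(:nfd).unpack('U*')
end

# Normalize the input, then repeatedly consume the longest prefix that is a table key.
def lookup(table, code_points)
  input = nfd(code_points)
  result = []
  until input.empty?
    len = input.size.downto(1).find { |n| table.key?(input.first(n)) }
    raise KeyError, format('no entry for U+%04X', input.first) if len.nil?
    result.concat(table[input.first(len)])
    input = input.drop(len)
  end
  result
end

# One possible workaround (an assumption, not a confirmed fix): re-key the
# table by the NFD form of every key, keeping the first entry on collision.
def normalize_keys(table)
  table.each_with_object({}) { |(key, ces), out| out[nfd(key)] ||= ces }
end

# The input U+30AC becomes [U+30AB, U+3099] after NFD, so the precomposed
# entry is never reached and two separate elements come back instead.
ignored = lookup(TABLE, [0x30AC])

# With NFD keys the same entry is found as a two-code-point match.
matched = lookup(normalize_keys(TABLE), [0x30AC])
```

In this sketch `ignored` holds the elements for U+30AB followed by U+3099, while `matched` holds the single element that was stored for U+30AC. Whether re-keying the table is the right fix for the real data is a separate question; the sketch only shows why the denormalized entry is unreachable.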
This issue affects only one test for Japanese tailoring, but it's possible that we simply don't have enough tests to reveal a bigger impact of this problem.
More details in the gist.