COMBINING GREEK YPOGEGRAMMENI case-folding #3

stevengj · 2014-07-16T13:00:10Z

The U+0345 combining character needs special handling, according to Jan Behrens (utf8proc author). In particular, you apparently need to do normalization both before and after case-folding (if you are doing normalization+casefolding on a string containing this character).

As a first pass, I'm not sure it's worth trying to solve this in a super-efficient manner. Just set a flag if the character is found (during decomposition?), and then run a second normalization pass after/before case-folding if necessary.
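A minimal sketch of the flag idea, assuming the check runs over a buffer of code points that has already been decomposed (so precomposed characters such as U+1FC3 have already been expanded to include U+0345). The helper name `needs_second_normalization` is hypothetical and not part of utf8proc:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* U+0345 COMBINING GREEK YPOGEGRAMMENI */
#define YPOGEGRAMMENI 0x0345

/* Hypothetical helper (not part of utf8proc): scan an already-decomposed
 * code point buffer and report whether U+0345 occurs in it. A caller
 * could use the result to decide whether a second normalization pass
 * around case folding is needed. */
bool needs_second_normalization(const int32_t *cps, size_t n)
{
    assert(cps != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (cps[i] == YPOGEGRAMMENI)
            return true;
    }
    return false;
}
```

The scan is linear and only triggers the extra pass for strings that contain the character, so ordinary strings pay almost nothing.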

stevengj · 2014-07-16T13:01:36Z

Chapter 3 of the Unicode standard says:

The invocations of normalization before case folding in the preceding definitions are to catch very infrequent edge cases. Normalization is not required before case folding, except for the character U+0345 COMBINING GREEK YPOGEGRAMMENI and any characters that have it as part of their decomposition, such as U+1FC3 GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI.
In practice, optimized versions of implementations can catch these special cases, thereby avoiding an extra normalization.
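One way to see why U+0345 is the exception: it has canonical combining class 240, but it case-folds to U+03B9, which is a starter (class 0). If the fold happens before canonical reordering, the new starter blocks any mark that should have sorted ahead of U+0345. The following is a toy illustration only, not utf8proc code; the `toy_*` functions are hypothetical and cover just the code points in the sequence U+03C9 U+0345 U+0342:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Canonical combining classes for the few code points used here
 * (values from UnicodeData.txt); everything else is treated as 0. */
int toy_ccc(int32_t cp)
{
    switch (cp) {
    case 0x0342: return 230;  /* COMBINING GREEK PERISPOMENI */
    case 0x0345: return 240;  /* COMBINING GREEK YPOGEGRAMMENI */
    default:     return 0;    /* starters such as U+03C9, U+03B9 */
    }
}

/* Toy case folding: only the mapping relevant here, U+0345 -> U+03B9. */
void toy_casefold(int32_t *cps, size_t n)
{
    assert(cps != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (cps[i] == 0x0345)
            cps[i] = 0x03B9;
}

/* Canonical reordering: stable insertion sort of adjacent non-starters
 * by combining class; starters (class 0) never move or get crossed. */
void toy_reorder(int32_t *cps, size_t n)
{
    assert(cps != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        for (size_t j = i; j > 0; j--) {
            int a = toy_ccc(cps[j - 1]), b = toy_ccc(cps[j]);
            if (b == 0 || a <= b)
                break;
            int32_t t = cps[j - 1];
            cps[j - 1] = cps[j];
            cps[j] = t;
        }
    }
}
```

Calling `toy_casefold` and then `toy_reorder` on U+03C9 U+0345 U+0342 leaves U+03C9 U+03B9 U+0342, whereas `toy_reorder` followed by `toy_casefold` gives U+03C9 U+0342 U+03B9.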

jiahao · 2014-07-17T23:34:36Z

It seems like this issue is directly related to making use of the data in SpecialCasing.txt.

stevengj · 2015-03-11T15:42:48Z

@jiahao, it seems like SpecialCasing.txt is more specific to upper/lower/titlecase rules, rather than casefolding per se.

andersk · 2016-05-13T20:41:57Z

In case it’s useful, here is the original bug report I sent to Jan (reply, reply, reply), and my test case:

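#include <stdio.h>
#include <stdlib.h>  // for free()
#include <utf8proc.h>

int main(void)
{
    // String literals are char arrays, so cast to the unsigned type utf8proc expects.
    const unsigned char *in = (const unsigned char *)"\xcf\x89\xcd\x85\xcd\x82";  // U+03C9 U+0345 U+0342 (ω+◌ͅ+◌͂)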
    unsigned char *out;
    utf8proc_map(in, 0, &out,
        UTF8PROC_CASEFOLD | UTF8PROC_DECOMPOSE | UTF8PROC_NULLTERM);
    printf("%s\n", out);  // Wrong: U+03C9 U+03B9 U+0342 (ω ι+◌͂)
    free(out);
    unsigned char *nfd = utf8proc_NFD(in);
    utf8proc_map(nfd, 0, &out,
        UTF8PROC_CASEFOLD | UTF8PROC_DECOMPOSE | UTF8PROC_NULLTERM);
    printf("%s\n", out);  // Right: U+03C9 U+0342 U+03B9 (ω+◌͂ ι)
    free(out);
    free(nfd);
    return 0;
}

Also, a hilarious graph.

stevengj added the bug label Jul 17, 2014

andersk mentioned this issue May 13, 2016

utf8proc case-folding+normalization bug with combining greek ypogegrammeni zephyr-im/zephyr#126

Open
