utf8proc seems difficult to use efficiently on strings #101

madscientist · 2017-02-19T21:48:06Z

Maybe this isn't an appropriate issue; if so, please feel free to close it. I have a string implementation and I need to do some basic UTF-8 operations on it: compute the length (in characters, not bytes), compare strings in a case-insensitive way (folding), and convert strings to upper or lower case. I need these done as efficiently as possible, as this has a real impact on my system. There are also a few more esoteric things I need, like reversing a UTF-8 string, but these don't need to be done super-efficiently.

I really would like something small and I only need UTF8, so ICU is too much.

utf8proc seems like a great per-character interface, but it seems difficult to use efficiently on entire strings. For example, there's no simple, fast string length function. Also, the way that the map functions always allocate new memory and can't be used on existing buffers is a major drawback: it necessitates a lot of extra copying in many situations. It seems like a folded comparison function could be written inside utf8proc a good bit more efficiently. Etc.

Maybe that's a goal of utf8proc: to provide a character-based interface and have users compose their own higher-level (string-based) algorithms using them: simplicity taking priority over efficiency? And/or perhaps the way Julia uses utf8proc just matches well with the current interface; it doesn't have a need for writing into existing buffers etc.?

stevengj · 2017-02-19T22:31:44Z

(It is possible to use utf8proc_decompose with a pre-allocated buffer.)

As for string length (in characters), you are right: utf8proc doesn't have a lot of basic functions like this (although you can easily compute string length by calling utf8proc_iterate repeatedly). It is also true that string length (in characters) is just not that useful an operation in most circumstances.
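For readers who want a concrete picture of the length computation, here is a minimal sketch. It does not call utf8proc_iterate. To stay dependency-free it swaps in a simpler technique: counting every byte that is not a UTF-8 continuation byte. The function name `utf8_codepoint_count` is hypothetical, and the sketch assumes the input is already valid UTF-8 (a loop over utf8proc_iterate would additionally report invalid sequences).

```c
#include <stddef.h>

/* Hypothetical helper, not part of utf8proc.
 * Counts code points in a UTF-8 buffer of `len` bytes.
 * Assumption: the input is valid UTF-8, so every byte that is not a
 * continuation byte (bit pattern 10xxxxxx) starts a new code point. */
size_t utf8_codepoint_count(const unsigned char *s, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

Note that this counts code points, not displayed characters: "e" followed by the combining acute accent U+0301 occupies three bytes and counts as two code points.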

Most of the current utf8proc developers are mainly using it for Julia, so we have only been motivated to implement functions needed for Julia: string normalizations, grapheme detection, some character-related queries. Julia already has a string-length function and UTF8 string iteration and many other functions, so we haven't been motivated to add those to utf8proc. The "goal" of utf8proc, as far as we've been concerned, is mainly to implement functions that we need that require access to a Unicode character database, but we're open to contributions of other useful functions.

I agree that a fast case-folded/normalized comparison function that requires no buffers seems possible to write and could be useful, even for Julia; a PR would be welcome.
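To illustrate the shape such a buffer-free comparison might take, here is a rough sketch: decode one code point from each string, fold both, compare, and stop at the first difference, so nothing is allocated. Everything here is hypothetical and not part of utf8proc. In particular, `fold_placeholder` only lowercases ASCII and stands in for a real Unicode mapping (for example utf8proc_tolower), and the decoder assumes valid UTF-8. The sketch also performs no normalization and cannot handle folds that expand one code point into several, both of which a real implementation would need to address.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from s (n > 0 bytes remaining).
 * Assumption: valid UTF-8; no error checking is done.
 * Returns the number of bytes consumed. */
static size_t decode_one(const unsigned char *s, size_t n, int32_t *cp)
{
    unsigned char b = s[0];
    size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
    /* Keep only the payload bits of the lead byte. */
    int32_t c = (len == 1) ? b : (b & (0x3F >> (len - 1)));
    size_t i;
    for (i = 1; i < len && i < n; i++)
        c = (c << 6) | (s[i] & 0x3F);
    *cp = c;
    return i;
}

/* Placeholder fold: ASCII only. A real version would consult the
 * Unicode character database instead. */
static int32_t fold_placeholder(int32_t cp)
{
    return (cp >= 'A' && cp <= 'Z') ? cp + ('a' - 'A') : cp;
}

/* Streaming comparison: returns <0, 0, or >0 like strcmp, comparing
 * folded code points pairwise without any intermediate buffer. */
int utf8_casecmp_sketch(const unsigned char *a, size_t alen,
                        const unsigned char *b, size_t blen)
{
    size_t i = 0, j = 0;
    while (i < alen && j < blen) {
        int32_t ca, cb;
        i += decode_one(a + i, alen - i, &ca);
        j += decode_one(b + j, blen - j, &cb);
        ca = fold_placeholder(ca);
        cb = fold_placeholder(cb);
        if (ca != cb)
            return ca < cb ? -1 : 1;
    }
    if (i < alen) return 1;   /* a is longer */
    if (j < blen) return -1;  /* b is longer */
    return 0;
}
```

The design point is the early exit: strings that differ in the first few characters are rejected without touching the rest, which a map-then-compare approach cannot do.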

madscientist · 2017-02-20T15:20:59Z

> (It is possible to use utf8proc_decompose with a pre-allocated buffer.)

Good point; I missed that.

> it is also true that string length (in characters) is just not that useful an operation in most circumstances.

Length computations are useful when computing display sizes required, or maximum potential buffer sizes, or sorting by length, or similar things. I agree they're not nearly so useful as straightforward strlen() but still needed.

Thanks for your thoughts @stevengj !

stevengj · 2017-02-20T22:40:22Z

For display sizes, you definitely don't want the length in codepoints; e.g. combining characters are zero width. You maybe want the sum of the charwidths, but even this is somewhat ambiguous because the displayed charwidth also depends on the font and terminal. I'm also skeptical of sorting by length for much the same reasons.

stevengj · 2024-01-04T00:31:43Z

> I agree that a fast case-folded/normalized comparison function that requires no buffers seems possible to write and could be useful, even for Julia; a PR would be welcome.

Note that such a function was implemented in Julia, and could be ported to C: https://github.com/JuliaLang/julia/blob/0f6c72c71bc947282ae18715c09f93a22828aab7/stdlib/Unicode/src/Unicode.jl#L268-L340
