utf8proc seems difficult to use efficiently on strings #101

Open
madscientist opened this issue Feb 19, 2017 · 4 comments

Comments

@madscientist
Contributor

Maybe this isn't an appropriate issue; if so, please feel free to close it. I have a string implementation and I need to do some basic UTF-8 operations on it: compute the length (in characters, not bytes), compare strings case-insensitively (folding), and convert to uppercase or lowercase. I need these done as efficiently as possible, as this has a real impact on my system. There are also a few more esoteric things I need, like reversing a UTF-8 string, but these don't need to be super-efficient.

I really would like something small and I only need UTF8, so ICU is too much.

utf8proc seems like a great per-character interface, but it seems difficult to use efficiently on entire strings. For example, there's no simple, fast string length function. Also, the way that the map functions always allocate new memory and can't be used on existing buffers is a major drawback: it necessitates a lot of extra copying in many situations. It seems like a folded comparison function could be written inside utf8proc a good bit more efficiently. Etc.

Maybe that's a goal of utf8proc: to provide a character-based interface and have users compose their own higher-level (string-based) algorithms using them: simplicity taking priority over efficiency? And/or perhaps the way Julia uses utf8proc just matches well with the current interface; it doesn't have a need for writing into existing buffers etc.?

@stevengj
Member

stevengj commented Feb 19, 2017

(It is possible to use utf8proc_decompose with a pre-allocated buffer.)

As for string length (in characters), you are right: utf8proc doesn't have many basic functions like this (although you can easily compute string length by calling utf8proc_iterate repeatedly). It is also true that string length (in characters) is just not that useful an operation in most circumstances.
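For reference, when the input is already known to be valid UTF-8, a code point count does not need a decoder at all: every code point contains exactly one byte that is not a continuation byte (`10xxxxxx`). The sketch below is plain C and does not call utf8proc; `count_codepoints` is a hypothetical name, and unlike a `utf8proc_iterate` loop it performs no validation.

```c
#include <assert.h>
#include <stddef.h>

/* Count code points in s[0..n), ASSUMING the bytes are valid UTF-8.
 * Hypothetical helper for illustration; not part of utf8proc. */
size_t count_codepoints(const char *s, size_t n)
{
    assert(s != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        /* Continuation bytes look like 10xxxxxx; any other byte starts a code point. */
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

On malformed input this silently returns a number that may not mean much, so a validating loop is the safer choice for untrusted data.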

Most of the current utf8proc developers are mainly using it for Julia, so we have only been motivated to implement functions needed for Julia: string normalizations, grapheme detection, some character-related queries. Julia already has a string-length function and UTF8 string iteration and many other functions, so we haven't been motivated to add those to utf8proc. The "goal" of utf8proc, as far as we've been concerned, is mainly to implement functions that we need that require access to a Unicode character database, but we're open to contributions of other useful functions.

I agree that a fast case-folded/normalized comparison function that requires no buffers seems possible to write and could be useful, even for Julia; a PR would be welcome.
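To illustrate the shape such a buffer-free comparison might take, here is a minimal sketch in plain C: decode one code point from each string, fold it, compare, and advance. Everything in it is an assumption for illustration. `decode`, `toy_fold`, and `folded_compare` are hypothetical names, and `toy_fold` covers only ASCII and Latin-1 letters; a real version would need the Unicode database for folding (for example via utf8proc's per-code-point case mapping) and would also have to handle normalization and one-to-many foldings such as ß, which this sketch does not attempt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from s[0..n). Returns bytes consumed, or 0 on
 * malformed input. Simplified: does not reject overlong forms or surrogates. */
static size_t decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    if (n == 0) return 0;
    unsigned char b = s[0];
    size_t len;
    uint32_t c;
    if (b < 0x80)                { len = 1; c = b; }
    else if ((b & 0xE0) == 0xC0) { len = 2; c = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { len = 3; c = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { len = 4; c = b & 0x07; }
    else return 0;               /* stray continuation or invalid lead byte */
    if (len > n) return 0;       /* truncated sequence */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    *cp = c;
    return len;
}

/* Placeholder fold: ASCII A-Z and Latin-1 U+00C0..U+00DE (except U+00D7). */
static uint32_t toy_fold(uint32_t c)
{
    if (c >= 'A' && c <= 'Z') return c + 32;
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7) return c + 32;
    return c;
}

/* Returns -1, 0, or 1 by folded code point order; -2 if either input is malformed.
 * No heap allocation and no intermediate buffers. */
int folded_compare(const char *a, size_t an, const char *b, size_t bn)
{
    assert(a != NULL && b != NULL);
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;
    while (an > 0 && bn > 0) {
        uint32_t ca, cb;
        size_t la = decode(pa, an, &ca);
        size_t lb = decode(pb, bn, &cb);
        if (la == 0 || lb == 0) return -2;
        ca = toy_fold(ca);
        cb = toy_fold(cb);
        if (ca != cb) return ca < cb ? -1 : 1;
        pa += la; an -= la;
        pb += lb; bn -= lb;
    }
    if (an == bn) return 0;      /* both strings exhausted */
    return an == 0 ? -1 : 1;     /* the shorter string sorts first */
}
```

The early return on the first differing code point is where the efficiency would come from: strings that differ near the start never get fully decoded or folded.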

@madscientist
Contributor Author

> (It is possible to use utf8proc_decompose with a pre-allocated buffer.)

Good point; I missed that.

> it is also true that string length (in characters) is just not that useful an operation in most circumstances.

Length computations are useful when computing required display sizes, or maximum potential buffer sizes, or sorting by length, or similar things. I agree they're not nearly as useful as a straightforward strlen(), but they are still needed.

Thanks for your thoughts @stevengj !

@stevengj
Member

For display sizes, you definitely don't want the length in codepoints; e.g. combining characters are zero width. You maybe want the sum of the charwidths, but even this is somewhat ambiguous because the displayed charwidth also depends on the font and terminal. I'm also skeptical of sorting by length for much the same reasons.
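A small plain-C sketch of the combining-character point: "é" can be written as one code point (U+00E9) or as "e" followed by U+0301 (combining acute accent). Both normally occupy one column, but their code point counts differ. The helper names are hypothetical, the input is assumed to be valid UTF-8, and `rough_columns` only knows about the Combining Diacritical Marks block (U+0300..U+036F); it ignores wide characters, other zero-width characters, and the font and terminal effects mentioned above.

```c
#include <assert.h>
#include <stddef.h>

static int is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

/* U+0300..U+036F are encoded as the two-byte sequences CC 80 .. CD AF. */
static int starts_combining_mark(const unsigned char *p, size_t remaining)
{
    if (remaining < 2) return 0;
    if (p[0] == 0xCC) return is_continuation(p[1]);          /* U+0300..U+033F */
    if (p[0] == 0xCD) return p[1] >= 0x80 && p[1] <= 0xAF;   /* U+0340..U+036F */
    return 0;
}

/* Code point count, assuming valid UTF-8. */
size_t count_codepoints(const char *s, size_t n)
{
    assert(s != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (!is_continuation((unsigned char)s[i]))
            count++;
    return count;
}

/* Very rough column estimate: one column per code point, except that
 * marks in U+0300..U+036F count as zero. Illustration only. */
size_t rough_columns(const char *s, size_t n)
{
    assert(s != NULL || n == 0);
    const unsigned char *p = (const unsigned char *)s;
    size_t cols = 0;
    for (size_t i = 0; i < n; i++) {
        if (is_continuation(p[i])) continue;
        if (starts_combining_mark(p + i, n - i)) continue;
        cols++;
    }
    return cols;
}
```

With these helpers, the decomposed form `"e\xCC\x81"` has two code points and the precomposed form `"\xC3\xA9"` has one, while `rough_columns` gives the same value for both.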

@stevengj
Member

stevengj commented Jan 4, 2024

> I agree that a fast case-folded/normalized comparison function that requires no buffers seems possible to write and could be useful, even for Julia; a PR would be welcome.

Note that such a function was implemented in Julia, and could be ported to C: https://github.com/JuliaLang/julia/blob/0f6c72c71bc947282ae18715c09f93a22828aab7/stdlib/Unicode/src/Unicode.jl#L268-L340
