Support unicode strings #302
Oh, this happens for simple character literals also
|
Right. We want to avoid making the same mistake as Haskell, which is notorious for having too many string types. The Text/ByteString distinction is fundamental, but the rest of Haskell's zoo of string types is just about navigating a performance-expressiveness tradeoff (the default String is just a lazy linked list of Char). Dex has it easier here. Our native list type is just a length-hidden table, whose physical representation is a contiguous array of unboxed data. So a Dex byte string represented as a plain list of bytes is already compact and efficient. But Unicode is harder. The simplest approach would be to treat strings as lists of UTF-32-encoded characters, like this:
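For concreteness, a minimal sketch of that representation, in the style of the other snippets in this thread (the exact names are illustrative):

Char = Word32       -- one character stored as a UTF-32 code point
String = List Char  -- O(1) indexing, at 4 bytes per character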
But that uses 4X more space than necessary for mostly-ASCII strings. Alternatively, we could use a wrapper around a UTF-8 encoded byte string:
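Again as an illustrative sketch (the wrapper name Utf8String is an assumption):

Utf8String = List Word8  -- the bytes of a UTF-8 encoding, hidden behind a newtype-style wrapper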
But then we lose the list-of-characters structure. Specifically, we no longer have O(1) random access to characters, which all of Dex's native array-processing idioms are built on. Or we could have both. Maybe a three-member zoo isn't so bad? |
To be honest, even with UTF-32 encoding you don't get O(1) random access to characters, because the encoding of a single "character" in the sense that we usually mean it might actually span multiple Unicode code points (which together form a single grapheme). Citing this SO answer:
In short, there is no way to reliably support O(1) indexing for Unicode strings, and IMO we shouldn't even pretend that we can do that. There are a lot of writeups from the Swift community about Unicode. They've spent a long time thinking about it, and I tend to agree with most of their conclusions (although their approach can have surprising results too, e.g. system updates changing which strings compare equal, due to changes in Unicode). IMO a reasonable plan for now would be to rename the current byte-based string type to something like a ByteString. |
JuliaLang went through this a while ago, and I think it ended up in a pretty excellent state with just one string type (vs. the two it had when I first started using Julia). It took me a long time to realize that even characters are not stored the way I think they are, which I think is an important trick for getting performance out of UTF-8. |
Hi, 👋🏻. I'm one of the co-creators of @JuliaLang and designed and implemented much of its string support. @oxinabox asked me to chime in here with some notes about that design.

The observation that O(1) character (code point) access is not that useful, since if you really want to deal with all the possible Unicode-awareness issues you'll need to do O(n) grapheme iteration of strings anyway and probably some kind of normalization in the process, is very much on point. Python decided to require O(1) indexing by character (code point) when they redesigned their strings in Python 3: CPython represents each string with a fixed-width encoding in which each code point takes 1, 2 or 4 bytes, depending on the largest code point value in the string. As a result, Python needs to transcode every string on the way in and on the way out, unless it happens to be pure ASCII.

A worse side effect of this design, in my opinion, is that it means Python cannot handle invalid string data. There's a purist perspective that says it shouldn't be possible to create an invalid string, but in practice invalid strings happen quite a lot, and if your string type cannot deal with invalid data, then robust programs that have to handle arbitrary data safely cannot use strings, at which point what is the point of a string type? Typical Python programs ignore the possibility of invalid string data and just fail irrecoverably when they encounter it. For example, file systems don't enforce valid file names: Windows will perfectly happily give you an invalid UTF-16 string as a file name, and UNIX will give you an arbitrary sequence of bytes that is meant to be displayed as if it were UTF-8 but might be anything. If your string type can't represent invalid strings, then you either can't use strings for file names, or there will be situations where your language can't, say, list the contents of the current directory.

Representing strings as an array of code points is tempting, but it has major problems: you have to transcode on the way in and on the way out, you typically use more memory than necessary, and you cannot represent invalid data at all.
Instead, I recommend leaving string data as-is: just a sequence of bytes. What you need to support strings well is to provide APIs for interpreting those bytes in various useful ways. The following are some of those useful ways.

As a sequence of code units. For ASCII, UTF-8, and Latin-1, the code unit is the byte, or UInt8 as we denote it in Julia. For UTF-16 and UCS-2 the code unit is the byte pair, i.e. UInt16. For UTF-32, the code unit is the four-byte sequence, i.e. UInt32. You want to work with code units in order to implement higher-level views of strings, like code points, graphemes, etc. Code units are not something you generally want to expose to users, but indexes in terms of code units are the only ones that are O(1), so you do generally want to index in terms of code units.

As a sequence of characters. This is perfectly well-defined for valid strings: each character is a code point as specified in Unicode. For invalid strings it's a bit less clear-cut, but the Unicode standard does specify how to split an invalid string into replacement characters, which implicitly gives a way to define what constitutes an invalid character sequence. The basic approach is this: decode well-formed code points (valid or invalid: some well-formed code points are invalid for other reasons, like surrogates in UTF-8 or overlong encodings); as soon as you encounter a code unit that cannot be a valid part of the current sequence, the prior sequence is one invalid character and the next code unit starts a new character (invalid or not). I can go into more detail if you need, but you can iterate an arbitrary code unit sequence in a well-defined way as a sequence of valid and invalid characters, each of which corresponds to a sequence of code units in the original string. For UTF-8 a valid or invalid character is 1-4 bytes. For UTF-16, a valid character is 1-2 code units and an invalid character is always 1 code unit (the only way for a UTF-16 character to be invalid is for it to be an unpaired surrogate). For UTF-32, a valid or invalid character is any single four-byte code unit. In general, an invalid code unit sequence is at most as long as the longest valid code unit sequence.

As a sequence of graphemes. Grapheme iteration is specified by the Unicode standard. We use the utf8proc library (which was abandoned, but we took over maintenance of it, so it's now maintained by us). utf8proc is a C library for various Unicode UTF-8 operations like grapheme iteration, uppercasing and lowercasing, etc. A grapheme corresponds to a contiguous sequence of characters in the original string, which can in turn be mapped back to a contiguous sequence of code units in the original string. Any invalid character should be its own invalid grapheme; the latest Unicode standard tells you how to split a sequence of valid characters into graphemes.

There are also "grapheme clusters", so this isn't even the end of the line for ways of iterating strings, and there's also normalization, which is something you may or may not want to do, but I think that's enough for this post. I believe Rust, Go and Swift generally approach strings along similar lines with slightly different choices of terminology and API details. I don't know how they handle invalid characters, but Julia's approach of allowing both valid and invalid characters to be represented by their code unit sequences has been highly successful and makes it easy to write robust string code that can handle invalid data and let the user decide how to handle it.
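To connect this back to Dex: these views could be exposed as separate functions over a byte-backed string type. The signatures below are purely illustrative; the names codeUnits, characters and graphemes are assumptions, not something proposed in this thread.

codeUnits  : String -> List Word8       -- raw bytes; the only view with O(1) indexing
characters : String -> List Character   -- code points, or clusters of invalid bytes
graphemes  : String -> List String      -- each grapheme is itself a small substring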
Julia only ships with UTF-8 strings built in, and we represent characters as UInt32 values. That UInt32 doesn't hold a code point value, but rather a sequence of 1-4 string bytes corresponding to a valid or invalid character sequence taken directly from the original string. Not only does this allow representing both valid and invalid character sequences, it also avoids needing to decode UTF-8 into code points when going from string to character and then encode again when going from character to string. You can do both equality and inequality comparisons of UTF-8 characters directly, without decoding them, because with this representation the ordering of characters is the same as the ordering of their byte sequences (a very handy little property of UTF-8, which also means you can sort UTF-8 strings by comparing raw bytes).

Julia strings are indexed by code unit index, but what you get back is a character. This is a slightly odd arrangement, since not all indices are valid. If you do something like get an index and then just add or subtract 1 from it, you might get an invalid string index, and if you use it you may get an error. The correct way to do arithmetic with string indices is to use the nextind and prevind functions. Depending on available compiler technology and performance requirements, there may be better approaches. One option is to let people index by character and just have indexing be O(n), or do some kind of index caching to amortize the O(n) cost. I believe Raku (the artist formerly known as Perl 6) does this. That seems too clever by half to me, but it's definitely a very Perlish thing to do. Another approach that we may investigate for Julia 2 is to return a StringIndex type that captures the code unit index along with the string that the index refers to, which would allow index arithmetic to do the right thing.

It's probably worth talking about Swift's approach briefly. Take this with a grain of salt because I'm not a Swift programmer and I may be quite wrong or out of date about the details here. Swift essentially skipped all the way to the "grapheme" layer in the above breakdown of ways of interacting with strings: when you iterate a Swift string, what you get represents a grapheme (cluster?) and is a union of things: it can be a simple character or an entire sequence of bytes, since grapheme clusters can be nearly arbitrarily big. And I believe that normalization is done on the fly, automatically, when doing things like string comparisons. That approach makes sense if you view Swift as only a language for writing user-facing applications, where strings are small and it's important to get these fussy Unicode things right consistently; in that context you should favor correctness and not care about performance. But I think Swift has come to regret that as people have started using the language to write server applications and technical computing applications. After all, if you're trying to implement an HTTP server, you really don't want every string operation to involve grapheme iteration and Unicode normalization; it becomes a performance nightmare. And frankly, most people don't care about normalization or graphemes. There's also a question of whether normalization is always appropriate. Sure, people are surprised when two strings that print the same aren't equal, but is it really right to say that they're equal as strings? People also expect that if two strings are equal then they are represented by the same sequence of bytes, which isn't true if you're automatically normalizing strings.
To me the Swift approach seems to take it too far: if I want normalization, I can do it explicitly. Just because two strings print the same way doesn't mean they're the same. I also want to be able to work with characters, which can be represented with a fixed size, and not be forced to deal with graphemes, which are arbitrary-length substrings. Again, if I really want to make sure that I'm working with how the user perceives the data, I'll normalize it and use a grapheme API. To summarize: keep string data as raw bytes; provide code unit, character and grapheme views on top of it; and make sure that invalid data can flow through generic string operations instead of being an immediate error.
I think that about covers it. |
Hi Stefan, thanks for your extremely thorough write-up. You point out that it's often possible to tolerate invalid Unicode-like data rather than throwing an error immediately. This is interesting because it's a counterpoint to the "parse, don't validate" philosophy. To me the obvious design, putting aside performance considerations, would have been something like this:
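For instance, a minimal "parse, don't validate" sketch along those lines (the exact definitions are illustrative):

CodePoint = Word32        -- invariant: a valid Unicode scalar value
String = List CodePoint   -- constructing one requires successful validation/decoding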
But your point is that this is too conservative. If your OS gives you an invalid sequence of bytes as a file name, there's no way to store it as a Unicode string like this, which is a shame if you just wanted to print it out. I think you're arguing for a notion of a "character" that includes invalid data, something like
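A hypothetical sketch of that notion (the ADT syntax and constructor names are assumptions):

data Character =
  ValidChar Word32           -- a decoded, valid code point
  InvalidBytes (List Word8)  -- a maximal run of bytes that does not decode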
We'll have to think about it. Part of the appeal of a static type system is precisely to avoid invalid data. If UNIX file names are truly byte strings rather than Unicode strings, then arguably we should be representing them as byte strings. The performance issues are another matter. As you say, for best performance, you'd like to represent strings as packed UTF-8 bytes rather than as arrays of decoded code points. |
Here's the thing: yes, UNIX file names are truly byte strings. So is any string you read from any input source, until you validate it. You could just represent it as a vector of bytes, but that's not very convenient. If you read a file name (or a line) and want to do something with it before validating it, you still want it to behave like a string. I don't know if you have a REPL, but if you do, you'd want this unvalidated string to print itself like a string in the REPL instead of printing itself like a byte vector, because reading a file name and seeing a list of numeric byte values instead of text is not very helpful.

The only point where you really need to worry about validity is the point where you ask a question like "what is the code point of this character?" At that point you need an operation that can fail on invalid data, and the language has to pick a policy for strings coming from unvalidated sources. Roughly, the options are: (1) insist on validity and raise an error as soon as invalid data is encountered; (2) return raw byte vectors from such sources and make the user convert them to strings explicitly; or (3) have a string type that tolerates invalid data and only surfaces the problem when an operation actually requires validity.
Option 1 seems like an obvious non-starter (despite Python taking this approach; it does at least provide options to return "byte strings" instead). Option 2 is possible and seems like what you're suggesting, @dougalm. But it's fairly annoying. I can promise that users are going to be annoyed and wonder why this programming language returns vectors of bytes all over the place instead of returning strings like normal languages do. That leaves option 3.

A crucial point regarding this option is that dealing with invalid string data is very much an "if", not a "when": there are surprisingly few string operations that you cannot sensibly perform on invalid string data. You can concatenate and search invalid string data just fine. If you concatenate a valid string with an invalid string, you get another invalid string, but it's still the right concatenation. Consider listing a directory and filtering the file names by some pattern: that works fine whether or not every name happens to be valid UTF-8.

There are, of course, situations where you do need to deal with invalid string data. Suppose you're parsing a string into an integer: you have to iterate characters and check that each one is between '0' and '9', and an invalid character simply fails that check like any other non-digit. You might also want to ensure that all strings are valid before saving them somewhere, which you can of course do: just test whether they are valid or not. If you want to encode validity at the type level, you can also do that by having a valid string type, such that validity is checked when converting from unvalidated strings to the validated string type.

Bottom line: it is normal, not exceptional, to deal with potentially invalid string data coming from external sources; don't make it unnecessarily annoying or difficult. In particular, don't force the user to deal with invalid characters until they actually have to do so. Allow string operations like concatenation, splitting and search to work on invalid strings (just concatenate the underlying data, etc.). Only force users to deal with invalid data when they do something that cannot be sensibly defined for invalid data, like computing a code point. |
Wow, thanks a lot for putting in the time to write such extensive comments, @StefanKarpinski! I do agree with the case you make for supporting invalid data in strings as a good default. This is crucial if we ever want Dex to interact with other systems (even the file system!). Those were often written in pre-Unicode times, so byte strings are the natural unit they work with internally, even though that's not the abstraction users expect. But it doesn't make sense to be overly formal and bury everyone in a ton of explicit {en|de}coding. I think that a reasonable plan might be something like the following (we might want to wrap some types in newtypes, but I skipped those for simplicity of presentation):

-- Byte strings
Byte = Word8
ByteString = List Byte
-- (Usually) Human-readable strings
def StringIndex (s: String) = <an index set that's a compiler builtin;
the unit of iteration is a single code point
(or a sequence of invalid bytes)>
String = <compiler builtin, a list of bytes that represents a potentially invalid UTF-8 sequence>
Character = Word32 -- a code point, or a cluster of invalid bytes
CodePoint = Word32 -- a valid code point
indexString : (s: String) -> StringIndex s -> Character
codepoint : Character -> Maybe CodePoint

Then, if the user writes a loop over StringIndex s, the unit of iteration is a single code point (or a cluster of invalid bytes), and indexString gives them the corresponding Character. We can easily emit good code for such loops, since they just walk the underlying byte array sequentially. One interesting issue with those loops is that, unlike with all other index sets, they are more difficult to parallelize, because the computation of the byte offset for a given index depends on having decoded all of the preceding characters. |
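As a concrete illustration of how this API might be used, here is a sketch; the function name countValid and the exact loop and case syntax are assumptions, not part of the proposal above.

-- count how many characters of s are valid code points
def countValid (s:String) : Int =
  flags = for i:(StringIndex s).
    case codepoint (indexString s i) of
      Just _  -> 1
      Nothing -> 0
  sum flags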
Seems like a good plan. I think that as long as the Maybe is only at the point where you want to call codepoint, this should be fine. |
@apaszke I would make a case for keeping three things separate: binary strings; text strings which are not validated or whose encoding is unknown (a very frequent case, where people have data in 8-bit character sets such as ANSI Latin-1 or Windows CP1252); and validated Unicode strings (possibly UTF-8 encoded). I'd recommend looking at the Unicode standard's section on conformance: http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G7404

Instead of the 3 options Stefan presented above, I'd recommend a 4th, which is to use a union type, so that each string returned from something like a directory listing is either a validated Unicode string or a raw byte string, depending on whether its contents validate. |
This seems incorrect