Unfortunately, as long as you believe that you can index into a Unicode string, your code is going to break. The only question is how soon.
I actually like UTF-8 because it will break very quickly, and force the programmer to do the right thing. The first time you hit é or € or an emoji, you'll have a multibyte character, and you'll need to deal with it.
All the other options will also break, but later on:
- If you use UTF-16, then é and € will work, but emoji will still result in surrogate pairs.
- If you use a 4-byte representation, then you'll be able to treat most emoji as single characters. But then somebody will build é from two separate code points, "e" + U+0301 COMBINING ACUTE ACCENT, or you'll run into a flag or skin-tone emoji, and once again you're back at square one.
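The composed-vs-decomposed problem is easy to see in practice. A minimal Python sketch using the standard-library unicodedata module:

```python
import unicodedata

# The same visible "é" can be one code point or two.
decomposed = "e\u0301"   # 'e' + U+0301 COMBINING ACUTE ACCENT
composed = "\u00e9"      # precomposed LATIN SMALL LETTER E WITH ACUTE

print(len(decomposed))   # 2 code points
print(len(composed))     # 1 code point
print(decomposed == composed)   # False: different code point sequences
# After NFC normalization the two compare equal:
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Any code that compares or slices strings without normalizing first will treat these two "é"s as different.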
You can't really index Unicode characters like ASCII strings. Written language is just too weird for that. But if you use UTF-8 (with a good API), then you'll be forced to accept that "str" is hopeless very quickly. It helps a lot if your language has separate types for "byte" and "Unicode codepoint", however, so you can't accidentally treat a single byte as a character.
Depending on the task at hand, iterating by UTF-8 byte or by code point can make sense, too. And the definition of these is frozen regardless of Unicode version, which makes these safer candidates for "fundamental" operations. There is no right unit of iteration for all tasks.
If I want to know how much memory to allocate, bytes are it. If I want to know how much screen space to allocate, font rendering metrics are it. If I want to do word-breaking, grapheme clusters are it.
None of these are fundamental.
There are languages whose orthographies don't fit the Unicode grapheme cluster specification, but they're complex enough that I doubt there's any way to deal with them properly other than having someone proficient in them looking over your text processing or pawning it off to a library. At least with grapheme clusters your code won't choke on something as simple as unnormalized Latin text.
And the way you take it all into account is by refusing to accept any defaults. So, for example, a string type should not have a "length" operation at all. It should have "length in code points", "length in graphemes" etc operations. And maybe, if you do decide to expose UTF-8 (which I think is a bad idea) - "length in bytes". But every time someone talks about the length, they should be forced to specify what they want (and hence think about why they actually need it).
Similarly, strings shouldn't support simple indexing - at all. They should support operations like "nth codepoint", "nth grapheme" etc. Again, forcing the programmer to decide every time, and to think about the implications of those decisions.
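As a sketch of what that could look like, here is a hypothetical `Text` wrapper in Python; a real `len_graphemes` would need a segmentation library, so it's left out:

```python
class Text:
    """A string wrapper with no bare 'length' and no bare indexing.

    Every measurement names its unit, so callers are forced to think
    about which one they actually want. (Hypothetical API sketch.)
    """

    def __init__(self, s):
        self._s = s

    def len_bytes(self, encoding="utf-8"):
        return len(self._s.encode(encoding))

    def len_codepoints(self):
        return len(self._s)

    def nth_codepoint(self, n):
        return self._s[n]  # Python's str already indexes by code point


t = Text("café")
print(t.len_bytes())       # 5 (é is two bytes in UTF-8)
print(t.len_codepoints())  # 4
```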
It wouldn't solve all problems, of course. But I bet it would reduce them significantly, because wrong assumptions about strings are the most common source of problems.
As for indexing, strings shouldn't require indexing, period. That's the ASCII way of thinking, especially with fixed-width columns and such. You should be thinking relatively: for example, find the first space, then, starting from that point in the string, check that the next character is a letter. When you build your code that way, you don't fall into the trap of byte indexing, or take the performance hit of code point indexing (UTF-8) or grapheme indexing (all encodings).
For example, I work for a company that does business in the (US) Medicare space. Every Medicare beneficiary has a HICN -- Health Insurance Claim Number -- and HICNs come in different types which need to be identified. Want to know how to identify them? By looking at prefix and suffix characters in specific positions, and the length of what comes between them. For example, the prefix 'A' followed by six digits means the person identified is the primary beneficiary and was first covered under the Railroad Retirement Board benefit program. Doing this without indexing and length operations is madness.
These data types can and should first be subjected to some basic checks to ensure they're not nonsense (e.g., something expected to be a numeric value probably should not contain Linear B code points, and it's probably a good idea to at least throw a regex at it first, though applying regexes to Unicode also has quirks people don't often expect at first...).
For the prefix, the suffix, and the number of characters between them, you do the above, but use a second iterator to find the suffix. Then you either keep track of how many characters you advanced, or ask how many characters lie between the two iterators.
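As a rough Python sketch of that iterator style (the "'A' followed by six digits" rule is the one from the HICN example above; real HICN validation has many more type codes, and `is_rrb_primary` is a made-up name):

```python
def is_rrb_primary(hicn):
    """Check the simplified HICN form: prefix 'A' followed by six digits.

    Works by iterating rather than indexing, so the same logic carries
    over to a language whose string iterators walk code points or
    graphemes instead of bytes.
    """
    it = iter(hicn)
    if next(it, None) != "A":
        return False
    rest = list(it)
    return len(rest) == 6 and all(c.isdigit() for c in rest)


print(is_rrb_primary("A123456"))  # True
print(is_rrb_primary("A12345"))   # False: only five digits after the prefix
```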
It's very easy to think about it this way, as that's how a normal (non-programmer) human would do it. Basically, the code literally does what you wrote in English above.
My point is that iterators are much faster than indexing when the underlying string system uses graphemes. You can do pretty much anything just as easily, or more easily, with iterators as with indexing. The big exception is fixed-width columnar text files. I've seen a lot of these in financial situations, but fortunately those systems are ASCII-based, so it's not an issue.
If you want to argue that there should be ways to iterate over graphemes and index based on graphemes, then that is a genuine difference, but splitting semantic hairs over whether you're indexing or iterating doesn't get you a solution.
My argument is that iterators are far superior to indexing when using graphemes (or code points stored as UTF-8, though grapheme support is superior). And they don't hurt when used on ASCII or fixed-width strings either, so the code will work with either string format. No hairs, split or otherwise, here.
Now if the spec thinks identifiers are just a collection of code points then it's being imprecise. But things would still work if the lexer/parser you wrote returns identifiers as a bunch of graphemes because ultimately they're just a bunch of code points strung together.
It's only in situations where you need to truncate identifiers to a certain length that graphemes become important. Normalizing them when matching identifiers would probably be a good idea, too.
Based on your description, the correct solution is probably to use a structure or class of a more regular format to store the decoded HICN in pre-broken form. If they really only allow numbers in runs of text you might save space and speed comparison/indexing by doing this.
Doing these operations on sequences of code points can be perfectly safe and correct, and in 99.99%+ of real-world cases probably will be perfectly safe and correct. My preference is for people to know what the rare failure cases are, and to teach how to watch out for and handle those cases, while the other approach is to forbid the 99.99% case to shut down the risk of mis-handling the 0.001% case.
Just like it is better to have something like .nth(X) as a function for stepping to a numbered node, so too does a language's string type demand operations like .nth_printing(X), .nth_rune(X), and .nth_octet(X), to make it clear to any programmer working with that code what the intent is.
I prefer to have a system where 100% of the cases are valid and teaching people corner cases is not required. We all know how well teaching people about surrogate pairs went. And we're not forbidding the 99.99% case but providing an alternative way to accomplish the exact same thing. The vast majority of code uses index variables as a form of iterator anyways so it's not that big of a change.
The main reason people keep clinging to indexing strings is that's all they know. Most high level languages don't provide another way of doing it. People who program in C quickly switch from indexing to pointers into strings. Give a C programmer an iterator into strings and they'll easily handle it.
As for length in bytes, a good way to handle most use cases there is to have a function that truncates the string to fit into a certain number of bytes. That way you can make sure it fits into whatever fixed buffer you have, with the truncation happening at the grapheme level.
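A code-point-level version of that function is a few lines of Python. (Grapheme-level truncation works the same way, but the boundary test needs a segmentation library, so this sketch only promises not to split a code point.)

```python
def truncate_to_bytes(s, max_bytes):
    """Truncate s so its UTF-8 encoding fits in max_bytes, chopping
    a little more if needed so no code point is split in half."""
    encoded = s.encode("utf-8")
    if len(encoded) <= max_bytes:
        return s
    # errors='ignore' drops the partial trailing sequence left by the cut.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


print(truncate_to_bytes("héllo", 2))  # 'h' -- cutting mid-é drops the whole é
print(truncate_to_bytes("héllo", 3))  # 'hé'
```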
Size in memory/bytes you could get trivially for any string (and this doesn't change with whether you chose bytes, graphemes or code points or whatever to iterate).
Screen space is irrelevant/orthogonal to encoding -- it appears at the font level and the font rendering engine that will give the metrics will accept whatever encoding it is.
Exactly. That's why measurements of string length shouldn't ever assume I'm looking for a unit-of-offset for a monospaced font.
The problem is that most naive programmers think that's what a string length is and should be.
I also wish Python would expose more of the Unicode database than it does; I've had to turn to third-party modules that basically build their own separate database for some Unicode stuff Python doesn't provide (like access to the Script property).
Depends on what you want to index into it for. I'll admit that once upon a time I opposed adding a "truncate at N characters" template helper to Django since there was a real risk it would cut in the middle of a grapheme cluster, and I don't particularly care for the compromise that ended up getting it added (it normalizes the string-to-truncate to a composed form first to try to minimize the chance of slicing at a bad spot).
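The compromise looks roughly like this in Python (not Django's actual code, just the idea, using the standard-library unicodedata module):

```python
import unicodedata

def truncate_chars(s, n):
    """Compose first (NFC), then cut at n code points.

    Composition reduces -- but does not eliminate -- the chance of
    slicing through a grapheme cluster.
    """
    return unicodedata.normalize("NFC", s)[:n]


# 'cafe' + COMBINING ACUTE ACCENT is five code points; after NFC it's
# four, so a cut at 4 keeps the accent instead of silently dropping it.
print(truncate_chars("cafe\u0301", 4))  # café
```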
But when you get right down to it, what I do for a living is write web applications, and sometimes I have to write validation that cares about length, or about finding specific things in specific positions, and so indexing into a string is something I have to do from time to time, and I'd rather have it behave as a sequence of code points than as a sequence of bytes in a variable-width encoding.
As to whether UTF-8 forces people to deal with Unicode up-front, I very strongly disagree; UTF-8 literally has as a design goal that it puts off your need to think about anything that isn't ASCII.
I had the need to write grapheme-level word wrap in Rust. Here it is. It assumes all graphemes have the same visible width. This is used mostly for debug output, not for general text rendering.
That is my use case for Python as well.
> sometimes I have to write validation that cares about length,
That's where a truncation function that understands grapheme clusters would come in so handy. Tell it that you want to truncate to n bytes maximum and let it chop a bit more so as not to split a grapheme cluster.
Fortunately my database does not have fixed-width strings, so I rarely bump into this one.
> or about finding specific things in specific positions, and so indexing into a string is something I have to do from time to time
I write my code to avoid this. Yes, I still have to use an index, because that's what Python supports, but it would be trivial to convert the code to another language that supports string iterators.
Combining accents are a great example: if you're an American, especially in the 80s, it's easy to assume that you only need a couple of accents like you used in Spanish and French classes and that's really simple for converting old data to a new encoding. Later, it becomes obvious that far more are needed but by then there's a ton of code and data in the wild so you end up needing the concept of normalization for compatibility.
(That's the same lapse which led to things like UCS-2 assuming 2^16 characters, even though that's not enough for a full representation of Chinese alone.)
I think it's also worth remembering the combination of arrogance and laziness which was not uncommon in the field, especially in the 90s. I remember impassioned rants about how nobody needed anything more than ASCII from programmers who didn't want to have to deal with iconv, thought encoding was too much hassle, claimed it was too slow, etc. as if that excused not being able to handle valid requests. About a decade ago I worked at a major university where the account management system crashed on apostrophes or accents (in a heavily Italian town!) and it was just excused as the natural order of things so the team could work on more interesting problems.
Take the country flag emoji. They're actually two separate code points. The 26 code points used are just special country-code letters A to Z. The pair of letters is the country code and shows up as a flag. So just 26 codes make all the flags in the world, and new flags can be added easily without adding more code points.
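Those 26 "special letters" are the Regional Indicator Symbols at U+1F1E6..U+1F1FF, so building a flag is just an offset calculation (Python sketch; `flag` is a made-up helper name):

```python
def flag(country_code):
    """Map a two-letter ISO country code onto the Regional Indicator
    block (U+1F1E6 corresponds to 'A'), yielding two code points that
    fonts render as a single flag."""
    base = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A
    return "".join(chr(base + ord(c) - ord("A")) for c in country_code.upper())


print(flag("US"))       # 🇺🇸
print(len(flag("US")))  # 2 -- one visible flag, two code points
```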
Another example is the new skin-tone emoji. The new codes are just the colour, combined with the existing emoji code points. Older software just shows the normally coloured emoji, but you may see a square box or question-mark symbol next to it.
Still not answering the question though.
For one, when the Unicode standard was originally designed it didn't have emoji in it.
Second, if it was the arbitrary addition of thousands of BS symbols like emoji that necessitated such a design, we could do without emoji in Unicode entirely (or Klingon or whatever).
So, the question is rather: why not a design that doesn't need "normalization" and runes, code points, and all that...
Using less memory (like utf-8 allows) I guess is a valid concern.
Plus, since some visible characters are made up of many combining code points, the number of precomposed single code points would be huge.
As to your second point, it seems to me a little closed-minded. The whole point of a universal character set is that languages can be added to it, whether they're textual, symbolic or pictographic.
Representing all languages is ok as a goal -- adding klingon and BS emojis not so much (from a sanity perspective, if adding them meddled with having a logical and simple representation of characters).
So, it comes down to "the fact that some visible characters are made up of many graphemes the number of single code points would be huge" and "while for some languages it's feasible to normalize them to single code points, for other languages it would not be".
Wouldn't 32 bits be enough for all possible valid combinations? I see e.g. that: "The largest corpus of modern Chinese words is as listed in the Chinese Hanyucidian (汉语辞典), with 370,000 words derived from 23,000 characters".
And how many combinations are there of stuff like Hangul? I see that's 11,172. Accents in languages like Russian, Hungarian, Greek should be even easier.
Now, having each accented character as a separate code point might take some lookup tables -- but we already require tons of complicated lookup tables for string manipulation in UTF-8 implementations IIRC.
I'm curious why you think that UTF-8 requires complicated lookup tables.
Because in the end it's still a Unicode encoding, and still has to deal with BS like "equivalence", right?
Which is not mechanically encoded in the err, encoding (e.g. all characters with the same bit pattern there are equivalent) but needs external tables for that.
And I added that while this might need some lookup tables, we already have those in UTF-8 too anyway (a non fixed width encoding).
So the reason I didn't mention UTF-16 and UTF-32 is because those are already fixed-size to begin with (and increasingly less used nowadays except in platforms stuck with them for legacy reasons) -- so the "competitor" encoding would be UTF-8, not them.
Because language is messy. At some point you have to start getting into the raw philosophy of language and it's not just a technical problem at that point but a political problem and an emotional problem.
Take accents as one example: in English a diaresis is a rare but sometimes useful accent mark to distinguish digraphs (coöperate should be pronounced as two Os, not one OOOH sound like in chicken coop) the letter stays the same it just has "bonus information"; in German an umlaut version of a letter (ö versus o) is considered an entirely different letter, with a different pronunciation and alphabet order (though further complicated by conversions to digraphs in some situations such as ö to oe).
Which language is "right"? The one that thinks that diaresis is merely a modifier or the one that thinks of an accented letter as a different letter from the unmodified? There isn't a right and wrong here, there's just different perspectives, different philosophies, huge histories of language evolution and divergence, and lots of people reusing similar looking concepts for vastly different needs.
Similarly, the Spanish ñ is a single letter to Spanish speakers, but the ~ accent may be a tone marker in another language, important to the pronunciation of the word and a modifier to the letter rather than a letter on its own.
There's the case of the overlaps where different alphabets diverged from similar origins. Are the letters that still look alike the same letters? 
Math is a language with a merged alphabet of latin characters, arabic characters, greek characters, monastery manuscript-derived shorthands, etc. Is the modern Greek Pi the same as the mathematical symbol Pi anymore? Do they need different representations? Do you need to distinguish, say in the context of modern Greek mathematical discussions the usage of Pi in the alphabet versus the usage of mathematical Pi?
These are just the easy examples in the mostly EFIGS space most of HN will be aware of. Multiply those sorts of philosophical complications across the spectrum of languages written across the world, the diversity of Asian scripts, and the wonder of ancient scripts, and yes the modern joy of emoji. Even "normalization" is a hack where you don't care about the philosophical meaning of a symbol, you just need to know if the symbols vaguely look alike, and even then there are so many different kinds of normalization available in Unicode because everyone can't always agree which things look alike either, because that changes with different perspectives from different languages.
 An excellent Venn diagram: https://en.wikipedia.org/wiki/File:Venn_diagram_showing_Gree...
> You can't really index Unicode characters like ASCII strings
But then why do strings-are-UTF8 languages like Go or D make it so easy? Why give `len(txt)` and direct indexing such lightweight syntax when, as you point out, you can't do that? Why make it trivial to split code points or composed character sequences, while something like proper string truncation is brutally hard?
UTF-8's fail-fast property has not enabled more Unicode-savviness. Instead it just lets programmers pretend that we still are in the land of C strings.
I like Rust's approach. It's a strings-are-UTF8 language but strings (both str and String):
- are not directly indexable
- force you to be explicit when iterating: you iterate over either `s.chars()` or `s.bytes()`
- are called out in the docs as being a vector of unsigned 8-bit integers internally
- support a len() method that is called out as returning the length of that vector
- can be sliced if you reaaaally need to get around inability to index directly but attempting to slice in the middle of a character causes a panic
They should have called that one bytelen() then.
And how do you get a proper offset for slicing? Do you then have to interpret the UTF-8 bytes yourself, or can you somehow get it via the chars() iterator or something similar?
To clarify: strings in Go are not necessarily UTF-8. String literals will be, because the source code is defined to be UTF-8, but strings values in Go can contain any sequence of bytes: https://blog.golang.org/strings
Note that this prints 2, because the character contains two bytes in UTF-8, even though the two bytes correspond to one codepoint: https://play.golang.org/p/BqGzW1O2WX
Go also has the concept of a rune, which is separate from a byte and a string, and makes this easier when you're working with raw string encodings.
The blog post I linked to explains this in more detail, but in short: the `strings` package provides essentially the same functions as the `bytes` package does, except applied to work on UTF-8 strings. There are other packages for dealing with other text encodings.
The `for range` syntax is the one "special case", and it was done because the alternative (having it range over bytes instead of codepoints) is almost never desirable in practice, and it's easier to manually iterate the few times you do need it than it would be to import a UTF-8 package just to iterate over a string 99.9% of the time.
Iterating over bytes is done all the time, of course, but usually at that point you're dealing with an actual slice of bytes that you want to iterate over, not a string.
A byte array is a representation of a string, for sure. But strings themselves are higher-level abstractions. It shouldn't be that easy to mix the two.
An equivalent situation would be if integers were byte arrays. So len(x) would give you 4, for example, and you could index its individual bytes, except you would almost never actually do that in practice, and occasionally you'd end up doing the wrong thing by mistake.
If any language actually worked that way, everyone would be up in arms about it. Unfortunately, the same passes for strings, because of how conditioned we are to treat them as byte sequences.
Calling it "char" in C was probably the second million dollar mistake in the history of PL design, right after null.
Languages like Python 3 that try to be so Unicode-pure that they crash or ignore legal Linux filenames are insane.
But does POSIX require support for arbitrary byte sequences in filenames, or does it merely use bytes (in locale encoding) as part of its ABI? I suspect the latter, since OS X is Unix-certified, and IIRC it does use UTF-16 for filenames on HFS - so presumably their POSIX API implementation maps to that somehow. If that's correct, then that's also the sane way forward - for the sake of POSIX compatibility, use byte arrays to pass strings around, but for the sake of sanity, require them to be valid UTF-8.