UTF-8 is a non-starter because it encourages people to go back to pretending "byte == character" and writing code that will fall apart the instant someone uses any code point above U+007F. Or they'll pat themselves on the back for being clever and knowing that "really" UTF-8 means "rune == code point == character", and also write code that blows up, just in a different set of cases.
And yes, high-level languages should have string types rather than "here's some bytes, you deal with it". Far too many real-world uses for textual data require the ability to do things like length checks, indexing and so on, and it doesn't matter how many times you insist that this is completely wrong and should be forbidden to everyone everywhere; the use cases will still be there.
The 'fundamental' problem here is that the average programmer doesn't understand 'strings': the topic seems easy but is actually very hard (well, not even hard, just big and tedious). Even more so now that many people can have careers without really knowing what a 'byte' is or how it relates to their code.
All the time. Want to truncate user text? You need grapheme clusters. Reverse text? Grapheme clusters. Get what the user thinks of as the length of text? Grapheme clusters. Not saying it’s a good idea to make them any sort of default because of that, though; you’re right that it should be explicit.
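As a rough sketch of what the truncation case looks like, assuming the third-party Python "regex" module (its \X pattern matches an extended grapheme cluster; the function name is made up for illustration):

    # Truncate to user-perceived characters rather than code points, using
    # the third-party "regex" module's \X (extended grapheme cluster) pattern.
    import regex

    def truncate_graphemes(text: str, limit: int) -> str:
        clusters = regex.findall(r'\X', text)   # one entry per grapheme cluster
        return ''.join(clusters[:limit])

    # A code point slice can cut an emoji ZWJ sequence in half; a cluster
    # slice keeps it intact.
    s = "abc\U0001F469\U0001F3FF\u200D\U0001F3EB"   # "abc" + woman-teacher emoji
    print(s[:4])                     # ends with a dangling WOMAN code point
    print(truncate_graphemes(s, 4))  # "abc" plus the whole emoji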
How often does one reverse text? And when do users care about text length? Almost always (again, in my experience) in the context of rendering - when deciding on line lengths or break points - so at a point where you know and care about much more than just 'the string': font, size, things you only care about in the context of displaying. Not something that should be part of the 'core' interface of a programming language.
I mean, I think we agree here; my point was that I too used to think that grapheme clusters mattered, but when I started counting in my code, it turned out they didn't. Sure, I can think of cases where it theoretically would matter, but I'm talking about what you actually use, not what you think you will use.
Plus sometimes you’re the one building the text widget with the truncation option.
I don't know who expects emoji to count as one character, but they'd be surprised by Twitter's behavior: something like 👩🏿‍🏫 counts as 4 characters (woman, dark skin tone, zero width joiner, school).
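For what it's worth, you can see the same count from Python; that emoji is a single grapheme cluster built from four code points:

    # "Woman teacher: dark skin tone" is one grapheme cluster but four code
    # points, which is roughly what Twitter's count reflects.
    s = "\U0001F469\U0001F3FF\u200D\U0001F3EB"
    print(len(s))                    # 4 (code points)
    print([hex(ord(c)) for c in s])  # ['0x1f469', '0x1f3ff', '0x200d', '0x1f3eb']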
When validating many types of data submitted via the web.
> UTF-8 is a non-starter because it encourages people to go back to pretending "byte == character" and writing code that will fall apart the instant someone uses any code point above U+007F.
These are somewhat different concerns: you can provide cluster-based manipulation as the default and still advertise that the underlying encoding is UTF-8 and guarantees 0-cost encoding (and only validation-cost decoding) between proper strings and bytes (and thus "free" bytewise iteration, even if that's exposed as a specific independent view).
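As a toy sketch of that split (the class and property names are invented for illustration; Python is only standing in for a language whose strings really are UTF-8 underneath):

    # Toy sketch: storage is validated UTF-8 bytes, the byte view is free,
    # and the default iteration is cluster-based. Uses the third-party
    # "regex" module for grapheme segmentation. All names are invented.
    import regex

    class Text:
        def __init__(self, data: bytes):
            data.decode('utf-8')      # validation cost only; raises if invalid
            self._data = data

        @property
        def bytes(self) -> bytes:     # zero-cost byte view (same object back)
            return self._data

        def __iter__(self):           # default, grapheme-cluster view
            return iter(regex.findall(r'\X', self._data.decode('utf-8')))

    t = Text("héllo".encode('utf-8'))
    print(list(t))   # ['h', 'é', 'l', 'l', 'o']
    print(t.bytes)   # b'h\xc3\xa9llo'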
> Far too many real-world uses for textual data require the ability to do things like length checks, indexing and so on
This is not a trivial concern, e.g. a "real-world use for length checks" might be a check on the encoded length, the code point length or the grapheme cluster length. Having a "proper" string type doesn't exactly free you from this issue, and far too many languages fail on at least one (and possibly all) of these use cases, just for length queries.
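Concretely, the same short string already gives three different answers depending on which "length" you mean (Python here, with the third-party "regex" module for the cluster count):

    # Encoded length vs. code point length vs. grapheme cluster length.
    import regex

    s = "e\u0301\U0001F1E8\U0001F1E6"    # 'e' + combining acute + Canadian flag
    print(len(s.encode('utf-8')))        # 11 UTF-8 bytes
    print(len(s))                        # 4 code points
    print(len(regex.findall(r'\X', s)))  # 2 grapheme clusters: 'é' and the flag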
You can, but I'd avoid it. If you want to offer multiple levels of abstraction, you'll want to offer a code point abstraction in between graphemes and bytes. And for several reasons you'll probably want to either have, or have the ability to emit, fixed-width units.
Python does this internally (3.3 and newer): the internal storage of a string is chosen at runtime, based on the widest code point in the string, and uses whichever fixed-width encoding (latin-1, UCS-2 or UCS-4) will accommodate that. This dodges problems like accidental leakage of the low-level storage into the high-level abstraction (prior to 3.3, Python, like a number of languages, would "leak" the internal implementation detail of surrogate pairs into the high-level string abstraction).
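Both effects are observable from inside Python 3.3+ (the byte counts from sys.getsizeof include fixed per-object overhead, so only the growth matters here):

    # PEP 393: storage width is picked per string from its widest code point,
    # and astral characters are one item each, not a surrogate pair.
    import sys

    ascii_s  = "a" * 100           # fits in latin-1: 1 byte per code point
    bmp_s    = "\u20ac" * 100      # euro sign forces UCS-2: 2 bytes each
    astral_s = "\U0001D11E" * 100  # musical G clef forces UCS-4: 4 bytes each

    for s in (ascii_s, bmp_s, astral_s):
        print(len(s), sys.getsizeof(s))   # same length, growing storage

    print(len("\U0001D11E"))  # 1 (a single code point, not two surrogates)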