
The most correct way to expose Unicode to a programmer in a high-level language is to make grapheme clusters the fundamental unit, as they correspond to what people think of as "characters". Failing that, exposing strings as sequences of code points is a second-best choice.

UTF-8 is a non-starter because it encourages people to go back to pretending "byte == character" and writing code that will fall apart the instant someone uses any code point > 007F. Or they'll pat themselves on the back for being clever and knowing that "really" UTF-8 means "rune == code point == character", and also write code that blows up, just in a different set of cases.

And yes, high-level languages should have string types rather than "here's some bytes, you deal with it". Far too many real-world uses for textual data require the ability to do things like length checks, indexing and so on, and it doesn't matter how many times you insist that this is completely wrong and should be forbidden to everyone everywhere; the use cases will still be there.




That's silly. How often have you had to work with grapheme clusters without also using a text rendering engine? But the number of times you need to know the number of bytes a string takes, even when using scripting languages, is much higher. The only way to deal with this is to not make assumptions, and not have a string.size() function, but specific accessors for 'size in bytes', 'number of code points' and (potentially, if the overhead is warranted) 'number of grapheme clusters'.
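A minimal Python sketch of what such explicit accessors could look like. The function names are made up for illustration, and the grapheme count is only a rough approximation (combining marks, zero width joiner, skin tone modifiers), not the full Unicode segmentation rules; real code would use a proper segmentation library.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER, glues emoji into one visible unit


def size_in_bytes(text: str, encoding: str = "utf-8") -> int:
    """Size of the text once encoded (hypothetical accessor name)."""
    return len(text.encode(encoding))


def code_point_count(text: str) -> int:
    """Number of code points; in Python 3 this is what len() returns."""
    return len(text)


def _extends_cluster(ch: str) -> bool:
    """Rough test for code points that attach to the preceding one."""
    if unicodedata.category(ch) in ("Mn", "Mc", "Me"):  # combining marks
        return True
    if ch == ZWJ:
        return True
    return 0x1F3FB <= ord(ch) <= 0x1F3FF  # emoji skin tone modifiers


def approx_grapheme_count(text: str) -> int:
    """Approximate count of user-perceived characters (NOT full UAX #29)."""
    count = 0
    prev = ""
    for ch in text:
        # A new cluster starts unless this code point attaches to the
        # previous one or directly follows a zero width joiner.
        if count == 0 or not (_extends_cluster(ch) or prev == ZWJ):
            count += 1
        prev = ch
    return count


accented = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
print(size_in_bytes(accented))          # 3
print(code_point_count(accented))       # 2
print(approx_grapheme_count(accented))  # 1
```

Three different answers for one short string, which is the argument against a single size() in a nutshell.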

The 'fundamental' problem here is that the average programmer doesn't understand 'strings', because they seem so easy but are actually very hard (well, not even hard, just big and tedious). Even more so now that many people can have careers without really knowing what a 'byte' is or how it relates to their code.


> How often have you had to work with grapheme clusters without also using a text rendering engine?

All the time. Want to truncate user text? You need grapheme clusters. Reverse text? Grapheme clusters. Get what the user thinks of as the length of text? Grapheme clusters. Not saying it’s a good idea to make them any sort of default because of that, though; you’re right that it should be explicit.
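To make the reversal and truncation cases concrete, here is a small Python sketch. The helper names are hypothetical, and the cluster splitting only handles combining marks (it is not a full grapheme segmenter), but it is enough to show where code point operations go wrong.

```python
import unicodedata


def split_simple_clusters(text: str) -> list[str]:
    """Group each base code point with the combining marks that follow it.

    Simplified on purpose: handles combining marks only, not emoji
    sequences or the other Unicode segmentation rules.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the mark to the previous base
        else:
            clusters.append(ch)
    return clusters


def reverse_clusters(text: str) -> str:
    """Reverse the text without separating marks from their base."""
    return "".join(reversed(split_simple_clusters(text)))


def truncate_clusters(text: str, limit: int) -> str:
    """Keep at most `limit` clusters, never cutting inside one."""
    return "".join(split_simple_clusters(text)[:limit])


word = "cafe\u0301"  # "café" spelled with a combining acute accent

# Code point operations: the accent is detached or silently dropped.
print(word[::-1] == "\u0301efac")  # True: the accent now comes first
print(word[:4] == "cafe")          # True: the accent is gone

# Cluster-aware versions keep 'e' and its accent together.
print(reverse_clusters(word) == "e\u0301fac")       # True
print(truncate_clusters(word, 4) == "cafe\u0301")   # True
```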


Truncating text is almost always (in my experience) a UI thing, where you pass a flag to some UI widget saying 'truncate this and that on overflow' and while rendering, it can then truncate using grapheme clusters.

How often does one reverse text? And when do users care about text length? Almost always (again, in my experience) in the context of rendering - when deciding on line length or break points, so when you know and care about much more than just 'the string' - but also font, size, things you only care about in the context of displaying. Not something that should be part of the 'core' interface of a programming language.

I mean I think we agree here; my point was that I too used to think that grapheme clusters mattered, but when I started counting in my code, it turned out they didn't. Sure, I can think of cases where it would theoretically matter, but I'm talking about what you actually use, not what you think you will use.


I’m biased towards websites, but truncating text server-side to provide an excerpt is something I need to do pretty often. Providing a count of remaining characters is maybe less common, but Twitter, Mastodon, etc. need to do it, and people expect emoji (for example) to count as one.

Plus sometimes you’re the one building the text widget with the truncation option.


Twitter's count of "characters" is code points after normalization[1].

I don't know who expects emoji to count as one character, but they'd be surprised by Twitter's behavior: something like 👩🏿‍🏫[2] counts as 4 characters (woman, dark skin tone, zero width joiner, school).

[1] https://developer.twitter.com/en/docs/basics/counting-charac... [2] https://emojipedia.org/female-teacher-type-6/
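A quick Python sketch of "code points after normalization", using NFC from the standard library. This only mirrors the counting rule as described above; it is not Twitter's actual implementation, which may apply further rules, and the function name is made up.

```python
import unicodedata


def normalized_code_point_count(text: str) -> int:
    """Count code points after NFC normalization (hypothetical helper)."""
    return len(unicodedata.normalize("NFC", text))


# Woman + dark skin tone modifier + zero width joiner + school
teacher = "\U0001F469\U0001F3FF\u200D\U0001F3EB"

# List the individual code points behind the single visible emoji.
for ch in teacher:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

print(normalized_code_point_count(teacher))   # 4
print(normalized_code_point_count("e\u0301")) # 1, NFC composes it to U+00E9
```

Normalization helps with the accented letter, but it does nothing for the emoji sequence, since there is no single precomposed code point for it.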


> And when do users care about text length?

When validating many types of data submitted via the web.


> The most correct way to expose Unicode to a programmer in a high-level language is to make grapheme clusters the fundamental unit, as they correspond to what people think of as "characters".

> UTF-8 is a non-starter because it encourages people to go back to pretending "byte == character" and writing code that will fall apart the instant someone uses any code point > 007F.

These are somewhat different concerns: you can provide cluster-based manipulation as the default and still advertise that the underlying encoding is UTF-8, guaranteeing 0-cost encoding (and only validation-cost decoding) between proper strings and bytes (and thus "free" bytewise iteration, even if that's as a specific independent view).

> Far too many real-world uses for textual data require the ability to do things like length checks, indexing and so on

This is not a trivial concern: a "real-world use for length checks" might be a check on the encoded length, the code point length or the grapheme cluster length. Having a "proper" string type doesn't exactly free you from this issue, and far too many languages fail at at least one and possibly all of these use cases, just for length queries.


> and still advertise that the underlying encoding is UTF-8

You can, but I'd avoid it. If you want to offer multiple levels of abstraction, you'll want to offer a code point abstraction in between graphemes and bytes. And for several reasons you'll probably want to either have, or have the ability to emit, fixed-width units.

Python does this internally (3.3 and newer): the internal storage of a string is chosen at runtime, based on the widest code point in the string, and uses whichever fixed-width encoding (latin-1, UCS-2 or UCS-4) will accommodate that. This dodges problems like accidental leakage of the low-level storage into the high-level abstraction (prior to 3.3, Python, like a number of languages, would "leak" the internal implementation detail of surrogate pairs into the high-level string abstraction).
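You can observe this from the outside, at least on CPython. The sketch below estimates the storage used per code point by comparing sys.getsizeof for two strings of different lengths; getsizeof reports an implementation detail, so treat the numbers as CPython-specific, and the helper name is made up.

```python
import sys


def bytes_per_code_point(sample_char: str, n: int = 1000) -> float:
    """Estimate per-code-point storage for strings made of `sample_char`.

    Subtracting the sizes of two strings cancels out the fixed object
    header, leaving only the cost of the extra n code points.
    """
    small = sys.getsizeof(sample_char * n)
    large = sys.getsizeof(sample_char * (2 * n))
    return (large - small) / n


print(bytes_per_code_point("a"))           # ASCII only
print(bytes_per_code_point("\u0100"))      # needs more than latin-1
print(bytes_per_code_point("\U0001F600"))  # outside the BMP

# No surrogate pair leaks into the abstraction: one code point, length 1.
print(len("\U0001F600"))  # 1
```

On CPython 3.3 and newer I'd expect this to report 1, 2 and 4 bytes per code point respectively, matching the latin-1 / UCS-2 / UCS-4 choice described above.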



