
> The most correct way to expose Unicode to a programmer in a high-level language is to make grapheme clusters the fundamental unit, as they correspond to what people think of as "characters".

> UTF-8 is a non-starter because it encourages people to go back to pretending "byte == character" and writing code that will fall apart the instant someone uses any code point > 007F.

These are somewhat different concerns. You can provide cluster-based manipulation as the default and still advertise that the underlying encoding is UTF-8, guaranteeing zero-cost encoding (and only validation-cost decoding) between proper strings and bytes, and thus "free" bytewise iteration, even if only through a separate, dedicated view.
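To make that concrete, here is a minimal sketch in Python. `Utf8Text` is a hypothetical type invented for illustration, not an existing API. It assumes the storage is always validated UTF-8, so exposing the bytes is free and only construction from raw bytes pays for validation. Python has no validate-only call, so the sketch decodes and discards the result.

```python
class Utf8Text:
    """Hypothetical string type whose storage is validated UTF-8."""

    __slots__ = ("_data",)

    def __init__(self, data: bytes):
        # Bytes -> text: the only cost is validation.
        # bytes.decode raises UnicodeDecodeError on malformed input.
        data.decode("utf-8")
        self._data = bytes(data)

    @classmethod
    def from_str(cls, s: str) -> "Utf8Text":
        # Convenience constructor for the demo.
        return cls(s.encode("utf-8"))

    def as_bytes(self) -> memoryview:
        # Text -> bytes: a read-only view over the storage, no copy.
        return memoryview(self._data)

    def __str__(self) -> str:
        return self._data.decode("utf-8")


text = Utf8Text.from_str("héllo")
byte_view = text.as_bytes()  # bytewise iteration happens on this view
```

Cluster-based operations would live on the type itself, while `as_bytes()` is the separate bytewise view.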

> Far too many real-world uses for textual data require the ability to do things like length checks, indexing and so on

This is not a trivial concern: a "real-world" length check might be on the encoded length, the code point length, or the grapheme cluster length. Having a "proper" string type doesn't free you from this issue, and far too many languages fail at one or more of these use cases, possibly all of them, just for length queries.
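A small Python illustration of the three lengths diverging on one string. The grapheme count here is a deliberately rough approximation (it only skips combining marks); `approx_grapheme_count` is a hypothetical helper, and real segmentation follows the Unicode text segmentation rules (UAX #29), which also cover things like ZWJ emoji sequences.

```python
import unicodedata


def approx_grapheme_count(s: str) -> int:
    """Rough approximation: count code points that are not combining marks.

    This is NOT full grapheme cluster segmentation; it only handles the
    simple base-character-plus-combining-marks case.
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


# "e" followed by U+0301 COMBINING ACUTE ACCENT: displays as a single "é".
s = "e\u0301"

encoded_len = len(s.encode("utf-8"))      # bytes in the UTF-8 encoding
codepoint_len = len(s)                    # code points (Python 3 str length)
grapheme_len = approx_grapheme_count(s)   # user-perceived characters (approx.)

print(encoded_len, codepoint_len, grapheme_len)  # prints "3 2 1"
```

Which of the three a length check should use depends on whether the limit is about storage, protocol fields, or what the user sees.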




> and still advertise that the underlying encoding is UTF-8

You can, but I'd avoid it. If you want to offer multiple levels of abstraction, you'll want to offer a code point abstraction in between graphemes and bytes. And for several reasons you'll probably want to either have, or have the ability to emit, fixed-width units.
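As a rough sketch of what "emit fixed-width units" buys you, in Python: with UTF-32 every code point occupies exactly four bytes, so the i-th code point can be found by offset arithmetic, which variable-width UTF-8 does not allow. `code_point_at` is a hypothetical helper for illustration only.

```python
s = "a€😀"

utf8 = s.encode("utf-8")        # variable width: 1 + 3 + 4 bytes
utf32 = s.encode("utf-32-le")   # fixed width: 4 bytes per code point, no BOM


def code_point_at(buf: bytes, i: int) -> str:
    """Return the i-th code point of a UTF-32-LE buffer via offset arithmetic."""
    unit = buf[4 * i : 4 * i + 4]
    return chr(int.from_bytes(unit, "little"))


third = code_point_at(utf32, 2)  # same result as s[2]
```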

Python does this internally (3.3 and newer): the internal storage of a string is chosen at runtime based on the widest code point in the string, using whichever fixed-width encoding (Latin-1, UCS-2 or UCS-4) will accommodate it. This avoids accidental leakage of the low-level storage into the high-level abstraction. Before 3.3, Python (on narrow builds), like a number of languages, would "leak" the internal implementation detail of surrogate pairs into the high-level string abstraction.
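You can observe this from the outside, at least on CPython 3.3+. The sketch below relies on `sys.getsizeof`, whose exact numbers are a CPython implementation detail, so treat it as a demonstration rather than a guarantee. `bytes_per_code_point` is a hypothetical helper that measures how much a string grows when 1000 more copies of a character are added.

```python
import sys


def bytes_per_code_point(ch: str) -> int:
    """Estimate per-code-point storage by comparing two string sizes.

    Subtracting the sizes cancels out the fixed object header.
    Assumes CPython 3.3+ (flexible string representation).
    """
    small = sys.getsizeof(ch * 1000)
    large = sys.getsizeof(ch * 2000)
    return (large - small) // 1000


for ch in ("a", "é", "€", "😀"):
    print(repr(ch), bytes_per_code_point(ch))

# Indexing and length work on code points, never on surrogate halves.
assert len("😀") == 1
```

On CPython this should show 1 byte per code point for "a" and "é", 2 for "€", and 4 for "😀".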



