Hacker News new | past | comments | ask | show | jobs | submit login

> Grapheme clusters are overrated in their importance for processing. The list of times you want to iterate over grapheme clusters:

> 1. You want to figure out where to position the cursor when you hit left or right.

> 2. You want to reverse a string. (When was the last time you wanted to do that?)

You missed the big one:

3. You want to determine the logical (and often visual) length of a string.

Sure, there are some languages where logical-length is less meaningful as a concept, but there are many, many languages in which it's a useful concept, and can only be easily derived by iterating grapheme clusters.

Visual length of a string is measured in pixels and millimetres, not characters. In a font/graphics library, not in a text processing one.

Sorry, visual length as in visual number of "character-equivalent for purposes of word length" things. Those things are close to, but not exactly the same as, grapheme clusters, so the latter can often be used as an imperfect (but much more useful than unicode points or bytes) proxy for the former.

There's no perfect representation of number-of-character-equivalents that doesn't require understanding of the language being handled (and it's meaningless in some languages as I said), but there are many written languages in which knowing the length in those terms is both extremely useful and extremely hard to do without grapheme cluster identification.

>character-equivalent for purposes of word length

Serious question: why would you want to do this?

I know it's fashionable to limit usernames to X characters... but why? The main reason I've seen has been to limit the rendered length so there are some mostly-reliable UI patterns that don't need to worry about overflows or multiple lines. At least until someone names themselves:


Which is 20 characters, no spaces, and will break loads of things.

(I'm intentionally ignoring "db column size" because that depends on your encoding, so it's unrelated to graphemes)

Serious question: why would you want to do this?

Have you never, in your entire life, encountered a string data type with a length rule? All sorts of ID values (to take an obvious example) either have fixed length, or a set of fixed lengths such that every valid value is one of those lengths, and many are alphanumeric, meaning you cannot get round length checks by trying to treat them as integers. Validating/understanding these values also often requires identifying what code point, not what grapheme, is at a specific index.

Plus there are things like parsing algorithms for standard formats. To take another example: you know how people sometimes repost the Stack Overflow question asking why "chucknorris" turns into a reddish color when used as a CSS color value? HTML5 provides an algorithm for parsing a (string) color declaration and turning it into a 24-bit RGB color value. That algorithm requires, at times, checking the length in code points of the string, and identifying the values of code points at specific indices. A language which forbids those operations cannot implement the HTML5 color parsing algorithm (through string handling; you'd instead have to do something like turn the string into a sequence of ints corresponding to the code points, and then manually manage everything, and why do that to yourself?).

Yes. All instances I've seen have been due to byte-size restrictions (so it depends on encoding) or for visual reasons (based on fundamentally flawed assumptions). With exceptions for dubious science around word-lengths between languages / as a difficulty/intelligence proxy, or just having fun identifying patterns. (interesting, absolutely, but of questionable utility / bias)

But every example you've given have been around visuals, byte sizes, or code points (which are unambiguously useful, yes). Nothing about graphemes.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact