I've read dozens (hundreds?) of Unicode-related blog posts for many different languages, with long debates and discussions about the hurdles of counting graphemes, but they always forget to explain why one would need to do it; it's just assumed to be important or interesting. This specific post just says: "Let's say you want to count the number of symbols in a given string, for example. How would you go about it?" and then goes into a multi-page explanation, which is even incomplete (as you correctly noticed).
In my programming work, I can't remember many cases in which counting graphemes has been useful. I usually need to either:
1) count the number of bytes in the Unicode encoding I'm using / going to use, for low-level stuff like buffers/sockets/memory/etc.
2) ask a graphics library to tell me how big the string will be on screen, in pixels (with the given fonts, layout, hints, and whatnot).
Counting graphemes only sounds useful for things like command-line terminals. E.g., if I were to write a command-line interface library (à la getopt()) that automatically word-wraps the usage text at the 80th column, I would need to count graphemes, in the unlikely case I had to support Tamil or Korean for such a specialized use case.
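(Even then, the counting itself is nearly a one-liner once you pull in a segmentation library; here's a minimal sketch using the unicode-segmentation crate, which is my choice of library, not something getopt() gives you. And even this oversimplifies, since terminal cells are yet another unit: East Asian wide characters occupy two columns.)

    use unicode_segmentation::UnicodeSegmentation;

    // Rough "column" count for fixed-width usage-text wrapping.
    // Caveat: graphemes only approximate terminal cells; wide CJK
    // characters take two cells (wcwidth territory, not graphemes).
    fn usage_columns(line: &str) -> usize {
        line.graphemes(true).count() // true = extended grapheme clusters
    }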
tl;dr: counting graphemes is a very complicated problem that you probably never need to solve.
> in the unlikely case I had to support Tamil or Korean for such a specialized use case.
Why is it "unlikely" that you would want your software to support users of other languages?
You mentioned the following examples in your grandparent post:
- 'நி' (Tamil letter NA + Tamil Vowel Sign I)
- Hangul made of conjoining Jamo (such as '깍': 'ᄁ' + 'ᅡ' + 'ᆨ')
I don't speak either language, but it doesn't seem unreasonable to me that pressing Delete would delete just the vowel sign in Tamil, or just the last component within the Hangul character. In fact, that might be just what the user wants?
My Korean is pretty poor, but I think that's exactly what one wants. If you mistype a letter, you want to retype that letter, not the whole syllable. However, this should work uniformly: it shouldn't matter whether the syllable is represented as a single codepoint or made up of conjoining jamo.
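For what it's worth, normalization is what makes that uniformity achievable: NFD decomposes a precomposed syllable into conjoining jamo, and NFC puts it back together. A quick sketch with the unicode-normalization crate (my own illustration, not anything from the post):

    use unicode_normalization::UnicodeNormalization;

    fn main() {
        let precomposed = "깍";                         // single codepoint U+AE4D
        let jamo: String = precomposed.nfd().collect(); // ᄁ + ᅡ + ᆨ
        assert_eq!(precomposed.chars().count(), 1);
        assert_eq!(jamo.chars().count(), 3);
        // NFC recomposes, so both spellings compare equal after normalizing:
        assert_eq!(precomposed, jamo.nfc().collect::<String>());
    }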
I'm not, but I think it's the only sane thing for a text editor to do if you don't want it to incorporate a ton of language-specific rules. UAX #29 actually does make a distinction between "legacy" and "extended" grapheme clusters: if you're handling "delete", you'll want legacy clusters, which keep the two Tamil marks separate; but for text selection, extended clusters combine them. (It's a little more complicated than that, but Unicode provides properties that let you handle the "preferred" editing behavior for a script while remaining mostly language-agnostic.)
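Concretely, the unicode-segmentation crate exposes exactly this distinction via a boolean flag; a small sketch (mine, not from the post):

    use unicode_segmentation::UnicodeSegmentation;

    fn main() {
        let s = "\u{0BA8}\u{0BBF}"; // Tamil letter NA + vowel sign I: "நி"
        // Extended clusters (the UAX #29 default) keep the pair together:
        assert_eq!(s.graphemes(true).count(), 1);  // suits text selection
        // Legacy clusters split off the spacing mark:
        assert_eq!(s.graphemes(false).count(), 2); // suits "delete"
    }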
Hangul is trickier, but input happens through an IME that "composes" the characters before they are committed to the editor.
The IME will perform component-wise deletion, but once the text is committed, the editor will operate on graphemes. It's not a perfect solution, but keeping the language's composition/decomposition rules inside the IME seems preferable.
I was specifically referring to the use case of translating a command-line usage text (à la --help). I'd assume that translating that into Tamil is not exactly common (statistically speaking); otherwise, all getopt()-like libraries would already support this for me.
The reality seems to be that the "size" of text is entirely dependent on context, and even forward-thinking articles on the subject seem to get hung up on counting things that don't matter.
Heck, even in one graphemically straightforward language you can get silliness: http://www.images.generallyawesome2.com/photos/funny/photos/...
The only reason counting graphemes is hard is that detecting grapheme boundaries is hard.
If so, sure, I agree. The counting example is contrived. Truncation, navigation, text selection, etc. are more interesting and practical applications.
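Truncation in particular falls out almost directly once boundary detection is handled; a sketch with the unicode-segmentation crate (my illustration, not something the article prescribes):

    use unicode_segmentation::UnicodeSegmentation;

    // Keep at most `n` user-perceived characters, never splitting a
    // grapheme cluster (so e.g. "நி" survives truncation intact).
    fn truncate_graphemes(s: &str, n: usize) -> &str {
        match s.grapheme_indices(true).nth(n) {
            Some((byte_offset, _)) => &s[..byte_offset],
            None => s, // fewer than n graphemes; nothing to cut
        }
    }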
The issues related to combining marks are not UTF-16 problems and are not solved by converting to codepoints.
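Right: "e" + U+0301 (combining acute) is two codepoints and one grapheme no matter how it's encoded. A tiny demonstration, again leaning on the unicode-segmentation crate:

    use unicode_segmentation::UnicodeSegmentation;

    fn main() {
        let s = "e\u{0301}"; // 'e' + COMBINING ACUTE ACCENT, displays as "é"
        assert_eq!(s.chars().count(), 2);         // codepoints
        assert_eq!(s.encode_utf16().count(), 2);  // UTF-16 code units
        assert_eq!(s.graphemes(true).count(), 1); // what the user sees
        // Codepoint and UTF-16 counts agree here, and both miss the
        // grapheme: combining marks are orthogonal to the encoding.
    }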
If you have an easy way to deprecate UTF-16 everywhere it's used (without breaking anything that currently works), I'm all ears. Unicode is a pragmatic standard, not a perfect one, and there are historical mistakes in it. But for better or for worse, they exist and will probably stay.
E.g. http://dheeb.files.wordpress.com/2011/07/gbu.pdf
Fun times ahead...