
Perl 6 has invented its own normalisation, called NFG, which normalises at the grapheme level by creating synthetic code points for graphemes made up of multiple code points where necessary. This vastly simplifies operations on Unicode strings and gives semantics that are intuitive, producing the results you would expect from the visual appearance of a string.
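
To make the idea concrete, here is a rough sketch in Python of what grapheme-level normalisation could look like. This is not how Perl 6 implements NFG: the function name `to_nfg_like`, the use of negative integers as synthetic IDs, and the "base character plus following combining marks" clustering rule are all assumptions made for illustration, and the clustering is much cruder than real grapheme segmentation.

```python
import unicodedata

def to_nfg_like(text):
    """Return one integer per (approximate) grapheme.

    Hypothetical helper: clusters that fit in a single code point keep
    that code point; multi-code-point clusters get a made-up negative ID.
    """
    nfc = unicodedata.normalize("NFC", text)

    # Crude clustering: a combining mark (general category M*) attaches
    # to the cluster before it. Real segmentation has many more rules.
    clusters = []
    for ch in nfc:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)

    synthetic = {}  # cluster string -> assumed synthetic ID
    units = []
    for cluster in clusters:
        if len(cluster) == 1:
            units.append(ord(cluster))
        else:
            units.append(synthetic.setdefault(cluster, -(len(synthetic) + 1)))
    return units
```

With this sketch, "e" followed by a combining acute collapses to the single precomposed code point U+00E9, while "x" followed by a combining acute (which has no precomposed form) gets a synthetic ID. Either way the result has one element per visible character, so indexing and length work on graphemes.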

It feels to me like Unicode was designed for font renderers and other such software rather than programs that have to deal with Unicode input and output.

If you're a font renderer it makes sense to have separate code points for each component of a grapheme, since it'd be more complex to split a single precomposed code point into the individual components that need to be drawn. Having separate components also allows reuse (though as the article shows, there's plenty of visual, non-semantic duplication).

But as a result, the operation "length of string in terms of what a user would consider as separate characters or grapheme clusters" is a hard problem that basically requires all the core aspects of a font renderer other than the actual display code.
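
A small Python illustration of the mismatch, using only the standard `unicodedata` module. The helper `approx_grapheme_count` is a hypothetical name, and it is a deliberate simplification: it only skips combining marks, so it gets emoji ZWJ sequences, flags, Hangul jamo and similar cases wrong, which is exactly the hard part.

```python
import unicodedata

def approx_grapheme_count(text):
    """Count code points that are not combining marks.

    Simplified assumption: every non-mark code point starts a new
    user-perceived character. Real segmentation needs more rules.
    """
    return sum(
        1 for ch in text
        if not unicodedata.category(ch).startswith("M")
    )

word = "cafe\u0301"  # "café" spelled with a combining acute accent

code_points_as_typed = len(word)
code_points_nfc = len(unicodedata.normalize("NFC", word))
code_points_nfd = len(unicodedata.normalize("NFD", word))
perceived = approx_grapheme_count(word)
```

Here the code point length depends on the normalisation form (5 as written and in NFD, 4 in NFC), while the user-perceived length stays at 4 however the string is encoded.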

Which is fine, and probably reasonable, but dear lord does it make it difficult to use.
