
Perl 6 has invented its own normalisation called NFG which normalises at the grapheme level by creating synthetic code points for multi-char graphemes where necessary. This vastly simplifies operations on Unicode strings and gives semantics that are intuitive - producing results you would expect from the visual appearance of a string.
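To make the idea concrete, here is a rough Python sketch of an NFG-like encoding. It is not how Perl 6 implements it: the `SyntheticTable` class, its method names, and the use of negative integers as synthetic IDs are all assumptions for illustration, and the clustering rule (a base character plus any following combining marks) is a deliberate simplification of real grapheme segmentation.

```python
import unicodedata


class SyntheticTable:
    """Hypothetical NFG-like encoder: one integer per (approximate) grapheme."""

    def __init__(self):
        self._ids = {}        # cluster string -> synthetic id (negative int)
        self._clusters = []   # position k holds the cluster with id -(k + 1)

    def encode(self, text):
        clusters = []
        # Compose first, so clusters with a precomposed form become one code point.
        for ch in unicodedata.normalize("NFC", text):
            # Simplification: any mark (category M*) attaches to the previous cluster.
            if clusters and unicodedata.category(ch).startswith("M"):
                clusters[-1] += ch
            else:
                clusters.append(ch)
        return [self._to_int(c) for c in clusters]

    def _to_int(self, cluster):
        if len(cluster) == 1:
            return ord(cluster)  # a real code point, no synthetic needed
        if cluster not in self._ids:
            self._clusters.append(cluster)
            self._ids[cluster] = -len(self._clusters)
        return self._ids[cluster]

    def decode(self, ints):
        return "".join(
            chr(i) if i >= 0 else self._clusters[-i - 1] for i in ints
        )


table = SyntheticTable()
encoded = table.encode("q\u0301x")  # "q" + combining acute has no precomposed form
print(len(encoded))                 # length now counts graphemes, not code points
```

Once a string is a flat list of integers like this, length and indexing work per grapheme without rescanning the text.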

It feels to me like Unicode was designed for font renderers and other such software rather than programs that have to deal with Unicode input and output.

If you're a font renderer it makes sense to have separate codepoints for each component of a grapheme; it would be more complex to split a single precomposed codepoint into the individual components that need to be drawn. Having separate components also allows reuse (though as the article shows, there's plenty of visual, non-semantic duplication).

But as a result, the operation "length of string in terms of what a user would consider as separate characters or grapheme clusters" is a hard problem that basically requires all the core aspects of a font renderer other than the actual display code.
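A small Python sketch of the gap between code point length and what a user would count. The helper `approx_grapheme_clusters` is a hypothetical name, and its rule (attach combining marks to the preceding character) is only an approximation: it ignores ZWJ emoji sequences, regional indicator pairs, Hangul jamo, and the other cases that real segmentation rules cover.

```python
import unicodedata


def approx_grapheme_clusters(text):
    """Group each base character with the combining marks that follow it.

    Approximation only: full grapheme segmentation has many more rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Me", "Mc"):
            clusters[-1] += ch   # mark: extend the current cluster
        else:
            clusters.append(ch)  # anything else starts a new cluster
    return clusters


word = "noe\u0308l"  # "noël" written with a combining diaeresis
print(len(word))                            # counts code points: 5
print(len(approx_grapheme_clusters(word)))  # counts approximate graphemes: 4
```

Even this toy version needs the Unicode character database to answer "how long is this string", which is roughly the complaint.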

Which is fine, and probably reasonable, but dear lord does it make it difficult to use.
