In practice, most applications that require a chars(str) function can get away with returning the wrong result for things outside the BMP. With UTF-8, by contrast, you need to start caring as soon as you hit words like "café".
Even if you do require chars(str) for large strings outside the BMP, those characters were so rare before emojis that you could spend a single bit on a "contains any non-BMP?" flag and almost always do the work in O(1) time, as opposed to O(n) for UTF-8.
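To make the flag idea concrete, here's a minimal sketch (hypothetical names, not any real library): store UTF-16 code units plus one bit recording whether any surrogate is present. If the bit is clear, the code-unit count *is* the character count, so chars() is O(1); only strings that actually contain non-BMP characters pay the O(n) scan.

```python
# Hypothetical sketch of the "waste a single bit" idea. U16String and
# encode_utf16_units are illustrative names, not a real API.

def encode_utf16_units(s: str) -> list[int]:
    """Encode a Python str to a list of UTF-16 code units."""
    units = []
    for ch in s:
        cp = ord(ch)
        if cp > 0xFFFF:                      # non-BMP: needs a surrogate pair
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))    # high surrogate
            units.append(0xDC00 | (cp & 0x3FF))  # low surrogate
        else:
            units.append(cp)
    return units

class U16String:
    def __init__(self, s: str):
        self.units = encode_utf16_units(s)
        # the "single bit": does any code unit fall in the surrogate range?
        self.has_non_bmp = any(0xD800 <= u <= 0xDFFF for u in self.units)

    def chars(self) -> int:
        if not self.has_non_bmp:
            return len(self.units)           # O(1): one unit per character
        # slow path: count every unit that is not a trailing (low) surrogate
        return sum(1 for u in self.units if not (0xDC00 <= u <= 0xDFFF))

# "café" is all-BMP, so the O(1) fast path applies
assert U16String("café").chars() == 4
# an emoji forces the O(n) path but still gives the right answer
assert U16String("hi😀").chars() == 3
```

The same trick gives you O(1) charoffset(str, N) on the fast path, since character index and code-unit index coincide.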
Sorry, that just generates garbage when dealing with things outside the BMP. That can be a lot more common than you think, e.g. when dealing with Chinese characters in a context where Han unification is not welcome (e.g., in China).
Yes, you're right that it generates garbage, but that's beside the point.
The point is that a huge number of programmers, especially in the 90s and early 00s, would argue for UTF-16 on the basis of it being a fixed-width encoding in practice. Maybe they didn't know that it actually wasn't, or maybe they knew and didn't care because they never had to deal with anything outside the BMP.
The overlap between Windows programmers producing software for, e.g., the U.S. or European markets and those who would ever have encountered a non-BMP character used to be tiny, until emojis came along.
So yes, not in theory, but in practice you could get away with treating UTF-16 as a fixed-width encoding like UCS-2 for a huge number of applications, reaping the benefits of constant-time chars(str) and charoffset(str, N).
The garbage is super annoying. Please stop. Human scripts are O(N), too bad. You can build indices (indeed you must, for large documents), but you can't really avoid this being O(N).
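The index idea can be sketched like this (a toy illustration, with arbitrary names and checkpoint spacing): record the byte offset of every K-th character in a UTF-8 buffer once, up front, and character-to-byte lookup drops from an O(n) scan to an O(K) one.

```python
# Toy checkpoint index over a UTF-8 buffer. K, build_index, and
# char_to_byte are illustrative choices, not a real library API.

K = 64  # checkpoint spacing; an arbitrary choice for this sketch

def build_index(data: bytes) -> list[int]:
    """Byte offsets of characters 0, K, 2K, ... in UTF-8 `data`."""
    index, chars = [], 0
    for i, b in enumerate(data):
        if b & 0xC0 != 0x80:        # not a continuation byte: a char starts here
            if chars % K == 0:
                index.append(i)
            chars += 1
    return index

def char_to_byte(data: bytes, index: list[int], n: int) -> int:
    """Byte offset of the n-th character, scanning at most K characters."""
    i = index[n // K]               # jump to the nearest checkpoint
    for _ in range(n % K):          # then walk forward character by character
        i += 1
        while i < len(data) and data[i] & 0xC0 == 0x80:
            i += 1                  # skip continuation bytes
    return i

text = ("café " * 50) + "😀"
data = text.encode("utf-8")
idx = build_index(data)
n = len(text) - 1                   # Python strings index by code point
assert data[char_to_byte(data, idx, n):].decode("utf-8") == "😀"
```

Which is exactly the point: the index makes lookups cheap, but building it is still an O(N) pass over the text.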
And we're not even talking about normalization.
People get upset about these things and blame Unicode, but the problems are not with Unicode -- they are semantic problems with our scripts that Unicode deals with about as well as can be hoped for.
The only thing I'd remove from Unicode is precomposed characters and the associated normal forms, NFC and NFKC. But note that even that wouldn't remove the need for normalization.
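A quick illustration of why normalization is needed regardless of encoding: "é" can be one precomposed code point or an "e" followed by a combining accent, and the two forms compare unequal (and even disagree on chars()) until you normalize. This uses Python's standard unicodedata module.

```python
# Two spellings of "café" that render identically but differ at the
# code-point level; normalization maps between them.
import unicodedata

nfc = "caf\u00E9"         # precomposed: U+00E9 LATIN SMALL LETTER E WITH ACUTE
nfd = "cafe\u0301"        # decomposed: "e" + U+0301 COMBINING ACUTE ACCENT

assert nfc != nfd                        # naive comparison fails
assert len(nfc) == 4 and len(nfd) == 5   # even the character count disagrees
assert unicodedata.normalize("NFD", nfc) == nfd
assert unicodedata.normalize("NFC", nfd) == nfc
```

Note that dropping precompositions would eliminate the NFC form above, but you'd still have to normalize: combining marks can come in different orders, which is what canonical ordering in NFD sorts out.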