
The transition to multilingual programming - d0mine
http://www.curiousefficiency.org/posts/2014/08/multilingual-programming.html
======
d0mine
> Choosing UTF-8 aims to treat formatting text for communication with the user
> as "just a display issue". It's a low impact design that will "just work"
> for a lot of software, but it comes at a price:

> - because encoding consistency checks are mostly avoided, data in different
> encodings may be freely concatenated and passed on to other applications.
> Such data is typically not usable by the receiving application.

> - for interfaces without encoding information available, it is often
> necessary to assume an appropriate encoding in order to display information
> to the user, or to transform it to a different encoding for communication
> with another system that may not share the local system's encoding
> assumptions. These assumptions may not be correct, but won't necessarily
> cause an error - the data may just be silently misinterpreted as something
> other than what was originally intended.

> - because data is generally decoded far from where it was introduced, it
> can be difficult to discover the origin of encoding errors.

It seems the surrogateescape error handler reintroduces these issues (only for
Unicode strings instead of bytes this time): a + b is no longer well-defined,
since the Unicode strings a and b may contain lone surrogates for different
reasons. JSON permits lone surrogates in its strings, so such data can easily
spread over the network too.

> - as a variable width encoding, it is more difficult to develop efficient
> string manipulation algorithms for UTF-8. Algorithms originally designed for
> fixed width encodings will no longer work.

> - as a specific instance of the previous point, it isn't possible to split
> UTF-8 encoded text at arbitrary locations. Care needs to be taken to ensure
> splits only occur at code point boundaries.
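The splitting caveat from the quote takes only a couple of lines to demonstrate (the example word is arbitrary):

```python
# Slicing UTF-8 bytes at an arbitrary offset can land mid-code-point.
data = "naïve".encode("utf-8")  # ï occupies two bytes: 0xC3 0xAF
assert len(data) == 6  # 5 characters, 6 bytes

head = data[:3]  # cuts between the two bytes of ï
try:
    head.decode("utf-8")
except UnicodeDecodeError:
    pass  # truncated multi-byte sequence

# A safe split must back up to a code point boundary:
assert data[:2].decode("utf-8") == "na"
```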

It seems like premature optimization to claim that UTF-8 being a variable-
width encoding is a performance bottleneck in most applications.

And if we want to show the data to a user, then we should handle user-perceived
characters (the \X regex), which may span several Unicode code points, e.g., to
avoid splitting a Unicode string inside a character.
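A simplified stdlib-only sketch of the idea: it only groups combining marks with their base character, whereas real grapheme segmentation (UAX #29, or \X in the third-party regex module) handles many more cases.

```python
import unicodedata

def naive_graphemes(s):
    """Group each combining mark with the preceding base character.
    (A rough approximation of user-perceived characters, not full \\X.)"""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach combining mark to its base
        else:
            clusters.append(ch)
    return clusters

s = "cafe\u0301"  # 'é' as 'e' + U+0301 COMBINING ACUTE ACCENT: 5 code points
assert len(s) == 5
assert naive_graphemes(s) == ["c", "a", "f", "e\u0301"]  # 4 perceived characters

# Naive code-point slicing splits inside the last character:
assert s[:4] == "cafe"  # the accent is silently dropped
```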

---

Unicode by default is more complex than the article makes it appear; e.g., see
🎅 𝕹 𝖔 𝕸 𝖆 𝖌 𝖎 𝖈 𝕭 𝖚 𝖑 𝖑 𝖊 𝖙 🎅 (it is written for Perl, but the Unicode issues
are mostly universal):
[https://stackoverflow.com/a/6163129](https://stackoverflow.com/a/6163129)

