

When is a string not a string? - lelf
http://codeblog.jonskeet.uk/2014/11/07/when-is-a-string-not-a-string/

======
adamtj
edit: Well, aren't I a fool. I just read the article more carefully and
realized that a string isn't a string when it's not a string -- that is, when
it's not well-formed. Of course weird things happen given illegal input.
However, I'll leave this up, since most of it still applies.

A string is not a string when it's serialized and you can't distinguish
between the abstract data and its physical representation.

UTF-16 can in fact represent every Unicode character: code points above
U+FFFF are simply encoded as pairs of 16-bit units called surrogate pairs. To
say otherwise is to repeat a very common misunderstanding -- a
misunderstanding that I will free you from. "Unicode" really isn't hard if
you know the right way to think about it.

Unfortunately, it seems that Microsoft doesn't know the right way to think
about it. .NET "strings" aren't really strings at all. They're arrays of
16-bit integers (UTF-16 code units), not sequences of characters. The
internal physical representation leaks through the abstraction. The author of
this article stumbled upon one place where the abstraction leaks and was
surprised because, like Microsoft, he doesn't understand the distinction
between strings and their serialized representations.
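Java's strings have the same design as .NET's: an array of 16-bit UTF-16 code
units. So a minimal Java sketch shows the same leak (the class name and the
example character, U+1D54A, are just illustrative choices of mine):

    public class CodeUnitLeak {
        public static void main(String[] args) {
            // U+1D54A (MATHEMATICAL DOUBLE-STRUCK CAPITAL S) is one character,
            // but it lies above U+FFFF, so UTF-16 stores it as two code units.
            String s = "\uD835\uDD4A";
            System.out.println(s.length());                      // 2: counts 16-bit units
            System.out.println(s.codePointCount(0, s.length())); // 1: counts characters
            System.out.printf("%04X%n", (int) s.charAt(0));      // D835: half a character
        }
    }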

What this article describes is essentially this: take a Unicode string and
serialize it to bytes as UTF-16, then deserialize those bytes as UCS-2. UCS-2
is a different encoding from UTF-16: it treats every 16-bit unit as exactly
one character and knows nothing of surrogate pairs, so mixing the two can
easily mangle the original string. If you then serialize both the original
string and the mangled string with UTF-8, you'll get different bytes for
each.
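Java has no UCS-2 charset to demonstrate that exact mismatch, but here is a
minimal sketch of the same failure mode with a different mismatched pair,
UTF-8 and Latin-1 (the string and class name are arbitrary):

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class MismatchedPair {
        public static void main(String[] args) {
            String original = "café";
            // Serialize with one encoding, deserialize with another.
            byte[] bytes = original.getBytes(StandardCharsets.UTF_8);
            String mangled = new String(bytes, StandardCharsets.ISO_8859_1);
            System.out.println(mangled);  // cafÃ© -- mangled, as predicted
            // Re-serializing both with UTF-8 now yields different bytes.
            System.out.println(Arrays.equals(
                    original.getBytes(StandardCharsets.UTF_8),
                    mangled.getBytes(StandardCharsets.UTF_8)));  // false
        }
    }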

Serializing a string with UTF-16 then deserializing with UCS-2 is like trying
to save an image as a .JPG then load it as a .BMP. If it happens to work, you
won't have the same image you started with. It will be mangled. This should
surprise no one.

The only sane way to think about Unicode is this: a string is an abstract
data structure. It is like an object of some class: you can't write it
directly to a file or send it across a network without first serializing it
to some physical representation as a sequence of bytes. Just as there are
many possible ways to serialize a particular object to a sequence of bytes,
there are many ways to serialize a particular string to a sequence of bytes.
A byte is not a character any more than a byte is a Date object.
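"Many ways" is easy to see: hand the same string to different encoders and
you get different byte sequences. A minimal sketch (the string and the three
encodings are arbitrary choices):

    import java.nio.charset.StandardCharsets;

    public class ManySerializations {
        public static void main(String[] args) {
            String s = "héllo";  // one abstract string, five characters
            // Three different physical representations of the same string.
            System.out.println(s.getBytes(StandardCharsets.UTF_8).length);      // 6
            System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length);   // 10
            System.out.println(s.getBytes(StandardCharsets.ISO_8859_1).length); // 5
        }
    }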

As you might expect, you must use a matched serializer/deserializer pair, or
the object you load may not be the same as the object you saved, if the load
works at all. That's what happened in this article: he saved a string in
UTF-16 format, then loaded those bytes with a UCS-2 reader. As with JPG vs.
BMP, the results were scrambled.
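The edit above points at the other failure mode: if the "string" was never a
well-formed character sequence to begin with (say, a lone surrogate), then no
faithful serialization of it exists, and every round trip is lossy. A sketch
in Java, whose UTF-8 encoder substitutes '?' for the bad unit (.NET
substitutes U+FFFD instead, with the same lossy result):

    import java.nio.charset.StandardCharsets;

    public class IllFormed {
        public static void main(String[] args) {
            // A lone high surrogate: fine as an array element,
            // ill-formed as UTF-16 text.
            String bad = "X\uD800Y";
            byte[] utf8 = bad.getBytes(StandardCharsets.UTF_8);
            String back = new String(utf8, StandardCharsets.UTF_8);
            System.out.println(back);             // X?Y
            System.out.println(bad.equals(back)); // false: the round trip lost data
        }
    }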

A "string" is an abstract sequence of characters.

A "character" is a named and numbered element of a particular "character set".
Some characters are present in multiple different character sets and are often
assigned different numbers in each.

A "character encoding" is a bi-directional mapping from a physical sequence of
bytes to a abstract sequence of characters, i.e. a string object. A character
encoding is typically tied to a particular character set.

"Unicode" is a widely used standard that defines a particular character set
and a number of different character encodings such as UTF-8 and UTF-16. The
numbers assigned to each Unicode character are called "code points".
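For instance, the character 'A' is number 65 in ASCII and in Unicode (U+0041)
but number 193 in EBCDIC. A sketch, assuming the JDK's extended charsets
(which include the EBCDIC charset IBM037) are available:

    import java.nio.charset.Charset;

    public class CharacterSets {
        public static void main(String[] args) {
            System.out.println((int) 'A');  // 65: 'A' in Unicode (and ASCII)
            byte ebcdic = "A".getBytes(Charset.forName("IBM037"))[0];
            System.out.println(ebcdic & 0xFF);  // 193: the same 'A' in EBCDIC
        }
    }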

If you can understand the distinction between the abstract data that a string
represents and its many possible physical representations, you can peel back
the layers of confusion and make sense of any character encoding problem.

Conclusion: Text is not hard, if you have the right mental model for it.

