
Plain Text Doesn’t Exist: Unicode and encodings demystified - perseus323
http://10kloc.wordpress.com/2013/08/25/plain-text-doesnt-exist-unicode-and-encodings-demystified/
======
ghc
Contrary to the editorialized title (which I'm sure will be changed soon),
here's the best article on Unicode I've ever read:
[http://www.joelonsoftware.com/articles/Unicode.html](http://www.joelonsoftware.com/articles/Unicode.html)

------
stormbrew
I feel like we're at a point now where articles that just try to 'demystify'
unicode are almost teaching the controversy if they don't come out and
actually say how you should deal with encodings in new apps.

It's about time we actually start pressing for the idea that utf-16 was a
terrible idea and that utf-8 should be the dominant wire format for unicode,
with ucs4 if you really need to have a linear representation.
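
To illustrate what I mean by a linear representation, a rough Python sketch
(my own, nothing from the article):

    # UCS-4/UTF-32 spends one fixed-width unit per codepoint, so the Nth
    # character is a plain 4-byte offset; UTF-8 and UTF-16 units vary in size.
    s = "naïve 😀"                                     # 7 codepoints
    assert len(s.encode("utf-32-le")) == 4 * len(s)    # fixed width
    assert len(s.encode("utf-8")) != len(s)            # variable width
    assert len(s.encode("utf-16-le")) != 2 * len(s)    # variable too (surrogate pair)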

Utf-16 is confusing, complicated, and implementations are routinely broken
because the corner cases are rarer. I really hope we're not still stuck with
it in 50 years.

~~~
millstone
I agree about wire formats. However, programming languages that use UTF-8 as
the primary string representation tend to have inferior Unicode support to
those that use UTF-16. I think this is because with UTF-8, a lot of stuff just
seems to work and so it's easy to ignore the issues, while UTF-16 forces you
to come to grips with the realities of Unicode.

For example, consider a function to open a file by name. If your strings are
UTF-8, you can just pass your null-terminated buffer to fopen() or something
and things will work fine for most files. But if your strings are internally
UTF-16, you have to think about which encoding to use, and when you research it
you discover that, holy crap, this stuff differs across OSes, so we'd better
take this problem seriously.
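
To make the contrast concrete, here's a rough sketch in Python rather than C
(the filename is made up, and the platform guard is only there to keep the
happy path honest):

    import sys

    name = "café.txt"   # hypothetical non-ASCII filename

    # Internal UTF-8: the encoded bytes can usually be handed straight to a
    # POSIX filename API, and things seem to just work for most files.
    if sys.platform != "win32":
        with open(name.encode("utf-8"), "w") as f:
            f.write("ok")

    # Internal UTF-16: there is no pass-through. You must pick a target
    # encoding, and the right answer differs per OS (UTF-8-ish bytes on most
    # Unixes, the UTF-16 wide-char APIs on Windows).
    utf16_name = name.encode("utf-16-le")
    # open(utf16_name, "w")  # would create a mojibake name on a UTF-8 filesystem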

~~~
stormbrew
> However, programming languages that use UTF-8 as the primary string
> representation tend to have inferior Unicode support to those that use
> UTF-16.

I'm really curious which languages drove you to this conclusion. Right now
we're in a state where the most mature implementations of unicode in general
are utf-16 because of the ucs2 accident, so it wouldn't exactly surprise me,
but I'd still like to see the proof.

------
joel_perl_prog
Interesting talk by Nick Patch of Shutterstock about Unicode, why they use
Perl, and what he calls best practices for working with encodings:
[http://www.youtube.com/watch?v=X2FQHUHjo8M](http://www.youtube.com/watch?v=X2FQHUHjo8M)

------
greenyoda
HN guidelines ask you to retain the original title of the article, not to
inject your own editorial opinion[1]. This article is a decent quick overview
for someone who has never studied Unicode before, but a bit light on the
details compared to what's available, say, in Wikipedia's articles on
Unicode[2].

[1]
[http://ycombinator.com/newsguidelines.html](http://ycombinator.com/newsguidelines.html)

[2]
[https://en.wikipedia.org/wiki/Unicode](https://en.wikipedia.org/wiki/Unicode)

------
KayEss
>Other common myths include: Unicode can only support characters up to 65,536

Not really a myth. This was UCS-2, and it was the situation when a load of
important early adopters started with Unicode. Windows, Java and JavaScript all
got burnt by this and ended up with UTF-16 as a result. Even Python 2.x on
Linux is UTF-16 under the covers :(

>Unicode is just a standard way to map characters to magic numbers and there
is no limit on the number of characters it can represent.

Unicode now limits itself to 21 bits of data (codepoints up to U+10FFFF). That
limit is exactly what the surrogate pair coding of UTF-16 can cover.
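
A quick sketch of how those two facts fit together (my arithmetic, not the
article's):

    # A codepoint above U+FFFF is split into a high/low surrogate pair; the
    # 10 + 10 payload bits plus the 0x10000 offset reach exactly U+10FFFF,
    # which is why Unicode stops there.
    cp = 0x1F600                    # 😀, outside the Basic Multilingual Plane
    v = cp - 0x10000                # 20 bits left to distribute
    high = 0xD800 + (v >> 10)       # top 10 bits    -> high surrogate
    low = 0xDC00 + (v & 0x3FF)      # bottom 10 bits -> low surrogate
    assert chr(cp).encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")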

------
cbr
At first there was no plain text. Then ASCII became standard enough that if
you passed a file to someone they would be able to read it without you needing
to tell them an encoding. Extensions of ASCII and other options spread, and
there was no way to have plain text include characters beyond the 128 of ASCII.
Over the past decade or so we've been converging on UTF-8, however, and the
probability that some random piece of non-ASCII text is UTF-8 has become high
enough to rely on.

There is plain text, and it is UTF-8.

------
Argorak
By the way: A very interesting article for those interested in a bit of
context around Unicode and how it doesn't solve everything (especially from an
Asian perspective) is this one:

[http://web.archive.org/web/20090627072117/http://www.jbrowse...](http://web.archive.org/web/20090627072117/http://www.jbrowse.com/text/unij.html)

------
taspeotis
Character encodings are a pain in the ass. If you want some examples, Michael
Kaplan posts [1] about these sorts of things in way too much detail.

[1]
[http://blogs.msdn.com/b/michkap/archive/tags/unicode+lame+li...](http://blogs.msdn.com/b/michkap/archive/tags/unicode+lame+list/)

------
TheZenPsycho
This seems strangely similar to Joel Spolsky's article on Unicode.

[http://www.joelonsoftware.com/articles/Unicode.html](http://www.joelonsoftware.com/articles/Unicode.html)

Though not exactly plagiarised, it shares some obvious parallels.

------
derleth
> The original ASCII standard defined characters from 0 to 126.

0 to 127. 127 is a power of two minus one, which should be a hint:
specifically, it's two to the seventh minus one. ASCII defines codepoints for
all possible combinations of seven bits, which is 128 possible codepoints, so
the enumeration ends at 127 if you count starting from zero, as computer
programmers are wont to do.
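
Or, spelled out trivially in code:

    # 7 bits give 2**7 = 128 codepoints, numbered 0 through 127.
    assert 2 ** 7 == 128
    assert list(range(128))[-1] == 127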

> all possible 127 ASCII characters

128 characters, as mentioned above.

> the ASCII guys, who by the way, were American

ASCII stands for American Standard Code for Information Interchange. The
ethnocentrism was unfortunate but it isn't like you weren't warned.

> The numbers are called “magic numbers” and they begin with U+.

He can call them "magic numbers" but everyone else calls them "codepoints".

> UTF-8 was an amazing concept: it single handedly and brilliantly handled
> backward ASCII compatibility making sure that Unicode is adopted by masses.
> Whoever came up with it must at least receive the Nobel Peace Prize.

I'm sure Ken Thompson and Rob Pike will be happy to hear someone thinks that
way.

~~~
spacehunt
>> UTF-8 was an amazing concept: it single handedly and brilliantly handled
>> backward ASCII compatibility making sure that Unicode is adopted by masses.
>> Whoever came up with it must at least receive the Nobel Peace Prize.

> I'm sure Ken Thompson and Rob Pike will be happy to hear someone thinks that
> way.

If that's the case then I think the designers of GB18030 deserve it more,
because they achieved an encoding that is able to map all Unicode codepoints
while being backwards compatible with GB2312, which is itself backwards
compatible with ASCII.
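
A small check of that claim, assuming Python's bundled gb18030/gb2312 codecs
behave as documented:

    # GB18030 leaves ASCII alone, keeps the GB2312 byte sequences for GB2312
    # characters, and still round-trips codepoints outside the BMP.
    assert "A".encode("gb18030") == b"A"
    assert "中".encode("gb18030") == "中".encode("gb2312")
    assert "\U0001F600".encode("gb18030").decode("gb18030") == "\U0001F600"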

But seriously, UTF-8 is like sliced bread after having dealt with
we-thought-64K-was-enough-so-let's-all-use-16-bits UCS-2/UTF-16.

~~~
Dylan16807
And even more so: if the current 20.1-bit format turns out to be insufficient,
UTF-8 is happy to expand to at least 31 bits, whereas UTF-16 would be forced to
undergo a very awkward redesign, such as dedicating an entire plane to double-
surrogates.
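
For the curious, a toy sketch of that original scheme (my own code; standard
decoders reject anything above U+10FFFF these days, per RFC 3629):

    # Original UTF-8 just keeps adding longer lead bytes (111110xx, 1111110x)
    # for 5- and 6-byte sequences, which is how it reaches 31 bits unchanged.
    def utf8_original(cp):
        if cp < 0x80:
            return bytes([cp])
        for n, lead in ((2, 0xC0), (3, 0xE0), (4, 0xF0), (5, 0xF8), (6, 0xFC)):
            if cp < (1 << (5 * n + 1)):
                tail = [0x80 | ((cp >> (6 * i)) & 0x3F) for i in range(n - 1)]
                return bytes([lead | (cp >> (6 * (n - 1)))] + tail[::-1])
        raise ValueError("needs more than 31 bits")

    assert utf8_original(0x20AC) == "€".encode("utf-8")              # agrees in range
    assert utf8_original(0x7FFFFFFF) == b"\xfd\xbf\xbf\xbf\xbf\xbf"  # 6 bytes, 31 bits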

~~~
millstone
The flip side of this is that there is such a thing as invalid UTF-8, i.e. a
sequence of bytes that does not form a sequence of valid code units and so
cannot be decoded as UTF-8. UTF-16 does not suffer from that particular
problem.

Of course, there's a long way to go from "a sequence of valid code units" to
"a valid string", but it's still relevant.

~~~
KayEss
>UTF-16 does not suffer from that particular problem

What about surrogate pairs? You can't have only one 16-bit word of a pair and
still have a valid UTF-16 sequence. This problem is really easy to run into if
you substring a UTF-16 sequence naively.
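
Roughly, in Python terms (slicing the raw code units to mimic a naive
length-based substring):

    units = "a😀".encode("utf-16-le")   # 3 code units: 'a' plus a surrogate pair
    first_two = units[:4]               # naive "substring" of the first 2 units
    try:
        first_two.decode("utf-16-le")
    except UnicodeDecodeError:
        print("unpaired high surrogate: not a valid UTF-16 sequence")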

~~~
millstone
Right, there are still plenty of ways you can screw up a string. My point was
merely that UTF-8 has one failure mode (invalid code units) that UTF-16 does
not suffer from, and that has implications for encoding converters and APIs.

