

UTF-8: Bits, Bytes, and Benefits - alexkon
http://research.swtch.com/2010/03/utf-8-bits-bytes-and-benefits.html

======
js2
The Wikipedia page on UTF-8 -- <http://en.wikipedia.org/wiki/UTF-8> -- is
excellent, and includes a link to this wonderful anecdote from Rob Pike:
<http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt>

_UTF-8 was designed, in front of my eyes, on a placemat in a New Jersey diner
one night in September or so 1992.

What happened was this. We had used the original UTF from ISO 10646 to make
Plan 9 support 16-bit characters, but we hated it. We were close to shipping
the system when, late one afternoon, I received a call from some folks, I
think at IBM - I remember them being in Austin - who were in an X/Open
committee meeting. They wanted Ken and me to vet their FSS/UTF design. We
understood why they were introducing a new design, and Ken and I suddenly
realized there was an opportunity to use our experience to design a really
good standard and get the X/Open guys to push it out. We suggested this and
the deal was, if we could do it fast, OK. So we went to dinner, Ken figured
out the bit-packing, and when we came back to the lab after dinner we called
the X/Open guys and explained our scheme. We mailed them an outline of our
spec, and they replied saying that it was better than theirs (I don't believe
I ever actually saw their proposal; I know I don't remember it) and how fast
could we implement it? I think this was a Wednesday night and we promised a
complete running system by Monday, which I think was when their big vote was._

------
pilif
Totally agreeing on the benefits. But remember one thing: you English-speaking
people out there have it easy because of points 1-3 in the article.

This stops being the case the moment you are dealing with any non-ASCII
character. At that point some assumptions stop being valid -- for instance, it
is no longer true that every document you receive is a valid UTF-8 encoded
document.

If you treat arbitrarily encoded data as UTF-8 then, depending on your
environment, you will either get exceptions thrown at you or see question
marks all over the place.

Combine this with the fact that browsers sometimes are not quite sure what
they are doing: I've seen them send latin1 while telling the server it's
utf-8, or the other way around.

The Rails people tried to detect utf-8-ness using a snowman and, lately, a
checkmark (<http://railssnowman.info/>).

Finally, keep in mind that if data isn't utf-8, it's impossible to accurately
detect its encoding without actually doing language analysis.

Soon, you'll notice that utf-8 isn't magically solving problems.

It's funny how many English-speaking people talking to English-speaking
audiences think that slapping a "; charset=utf-8" onto their Content-Type
headers suddenly makes their site utf-8 compliant.

As long as their content is 7-bit English and their users send data in 7-bit
English, they might as well have left the charset alone or set it to ASCII,
since ASCII = utf-8 as long as the high bit isn't set.
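The "ASCII = utf-8" point is easy to check for yourself; a minimal Python sketch (the sample strings here are just illustrative):

```python
# Every ASCII string encodes to the same bytes in ASCII and UTF-8,
# because UTF-8 uses a single byte (high bit clear) for U+0000..U+007F.
text = "plain 7-bit English"
assert text.encode("ascii") == text.encode("utf-8")

# The equivalence ends at the first non-ASCII character: in UTF-8 the
# 'é' becomes two bytes, while latin-1 uses the single byte 0xE9.
assert "café".encode("utf-8") == b"caf\xc3\xa9"
assert "café".encode("latin-1") == b"caf\xe9"
```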

The application I'm maintaining, while multilingual, thankfully targets
countries whose languages can be expressed in latin-1, so that's what we are
(still) using.

I made various attempts at going utf-8, but in the end, between browsers lying
and external third-party APIs still insisting on latin1, these efforts never
bore fruit.

~~~
qntm
I don't see any solution to character encoding problems while web browsers
actively lie about what they are sending. But that's not a problem with UTF-8,
or even with any specific encoding.

~~~
pilif
Of course not. I was just saying that the fact that every ASCII document is
also a UTF-8 document only makes life easier for people dealing with ASCII.

The moment you leave that safe area, all the usual problems will come and
haunt you.

------
jharsman
If you're representing text which contains lots of characters that aren't in
ASCII, like say Chinese, UTF-8 will consume much more storage than necessary.
There are many languages where non-ASCII characters are extremely common.

He also misses a very useful property of UTF-8: it never contains null bytes
(except when encoding U+0000 itself). If you use e.g. UTF-16 for mostly-ASCII
text, it will contain lots of nulls, which trips up all sorts of heuristics
for detecting binary files in various programs.
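The null-byte point can be demonstrated in a few lines; a Python sketch using an arbitrary ASCII sample string:

```python
text = "mostly ASCII text"

# UTF-8 never produces a 0x00 byte except for U+0000 itself ...
assert b"\x00" not in text.encode("utf-8")

# ... while UTF-16 pads every ASCII character with a zero high byte,
# which is exactly what confuses binary-file heuristics.
utf16 = text.encode("utf-16-le")
assert utf16.count(b"\x00") == len(text)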

UTF-8 is a very clever way to avoid problems on systems suffering under the
mistaken assumption that text is 8-bit byte strings ( _cough_ UNIX _cough_ ),
but that doesn't make it the ideal choice every time.

It is still very common for cross-platform tools not to handle file names
with non-ASCII characters, for example. Both Mercurial and Git suffered from
this last I looked.

The reason is that they treat file names as byte strings instead of text in
some encoding, and therefore cannot translate to the proper encoding on
platforms which treat file names as Unicode text, like Windows and OS X. OS X
also uses a somewhat unconventional normalization form, which means you need
to handle normalization as well.

~~~
pornel
> _[…] characters that aren't in ASCII, like say Chinese, UTF-8 will consume
> much more storage than necessary._

I don't think that's a problem in practice. Taking for example:

<http://www.baidu.com/s?wd=%D0%C2+%CE%C5>

The page takes 37KB in GBK, 39KB in UTF-8 and 73KB in UTF-16! (UTF-16 doubles
the cost of the HTML markup but only saves ⅓ on the text.)

Even in pure Chinese plain text UTF-16 doesn't win by much. I've tested some
random article: 18KB in UTF-16, 25KB in UTF-8. It's down to 9.5KB vs 10KB
after gzipping.
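The underlying ratio is fixed by the encodings themselves: BMP CJK characters take 3 bytes in UTF-8 but 2 in UTF-16, while ASCII flips the other way. A quick Python check (the sample text is arbitrary):

```python
han = "中文" * 100          # 200 CJK characters from the BMP
assert len(han.encode("utf-8")) == 600      # 3 bytes per character
assert len(han.encode("utf-16-le")) == 400  # 2 bytes per character

# For ASCII-heavy markup the ratio reverses: UTF-16 doubles the size.
ascii_text = "x" * 200
assert len(ascii_text.encode("utf-16-le")) == 2 * len(ascii_text.encode("utf-8"))
```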

~~~
jharsman
I'm not surprised it doesn't make much of a difference with HTML, since the
markup contains so much ASCII. But 25 vs 18 kB is almost 40% more. That could
be significant depending on how much text you're storing.

But it's a nitpick really, I just thought he should have noted some of the
disadvantages of UTF-8 as well.

------
justin_vanw
This article is retarded in lots of ways, but I'll quickly debunk one point.

 _5. Substring search is just byte string search._

Nope. The same character sequence can have multiple Unicode representations,
and therefore multiple UTF-8 byte representations. Yes, this comes up in the
real world, especially on the web.

Reference:
<http://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms>

The only way to sanely handle Unicode is to load it using a codec in the
language of your choice and manipulate it using the tools your language
supplies for manipulating strings, OR to read all the insane Unicode specs
and implement your own. Treating Unicode as raw bytes will end in tears.
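To make the normalization point concrete, here is a small Python sketch using the standard unicodedata module ("café" is just an example word):

```python
import unicodedata

nfc = "caf\u00e9"    # é as one precomposed code point (NFC)
nfd = "cafe\u0301"   # 'e' + COMBINING ACUTE ACCENT (NFD)

# Equivalent text, different code points, different UTF-8 bytes:
assert nfc != nfd
assert nfc.encode("utf-8") not in ("menu: " + nfd).encode("utf-8")

# Byte string search only works after normalizing both sides:
needle = unicodedata.normalize("NFC", nfc).encode("utf-8")
haystack = unicodedata.normalize("NFC", "menu: " + nfd).encode("utf-8")
assert needle in haystack
```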

~~~
pornel
The article is correct if by substring you mean an exact sequence of code
points. This is still remarkable, because even that is not always possible in
variable-width encodings.

The problem of searching for semantically equivalent text (taking into
account decomposed forms, ligatures, etc.) is higher-level and applies to all
Unicode encodings (i.e. using UCS-2 or UTF-32 doesn't solve it either).

~~~
chronomex
Danger, Danger Will Robinson! UTF-8 strings can include overlong encodings
(e.g. using 3 bytes to encode a code point that's normally encoded in 1 byte).

The Unicode standard disallows creation of these forms, but until recently
(<http://www.unicode.org/versions/corrigendum1.html>) it only "strongly
discouraged" interpreting them. I do not doubt that there is code in the wild
which will interpret them as characters.
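A strict decoder rejects overlong forms outright; for example, Python's UTF-8 codec refuses the classic two-byte overlong encoding of '/':

```python
overlong_slash = b"\xc0\xaf"  # '/' (U+002F) illegally padded to two bytes

try:
    overlong_slash.decode("utf-8")
    assert False, "a conforming decoder must reject overlong forms"
except UnicodeDecodeError:
    pass  # rejected, as the standard now requires

# The only valid encoding of '/' is the single byte 0x2F:
assert "/".encode("utf-8") == b"\x2f"
```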

------
endian
On the subject of cool byte-string properties of encodings:
<http://keyjson.org>

map(encode, sorted(values)) == sorted(map(encode, values))

 _< / shameless plug >_

------
riffraff
what does

> UTF-8 sequences sort in code point order. You can verify this by inspecting
> the encodings in the table above. This means that Unix tools like join, ls,
> and sort (without options) don't need to handle UTF-8 specially.

mean? Isn't ordering a property that is external to the encoding and
language-dependent? (Though AFAICT ls on OS X seems to sort the Italian and
Hungarian alphabets fine, even if it's unable to handle character length
properly :) )

EDIT: got it _code point order_ I'm an idiot
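The property the article claims is indeed only about code point order, and it can be checked mechanically; a Python sketch with arbitrary sample strings:

```python
# Sorting strings (Python compares them code point by code point) and
# sorting their UTF-8 encodings as raw bytes give the same order --
# the property that lets sort(1) and friends ignore UTF-8 entirely.
words = ["Zürich", "abc", "中", "\U0001F600", "~"]
assert sorted(words) == sorted(words, key=lambda s: s.encode("utf-8"))
```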

~~~
pornel
OS X filenames use the decomposed form, e.g. Å is represented as A˚ ("A"
followed by "combining ring above"), which makes it sort just after A.
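This is easy to reproduce with Python's unicodedata module (the sample names are arbitrary):

```python
import unicodedata

# NFD splits Å into 'A' plus COMBINING RING ABOVE (U+030A) ...
assert unicodedata.normalize("NFD", "\u00c5") == "A\u030a"

# ... so a decomposed "Åa" sorts right after names starting with "A",
# while the precomposed form (U+00C5) would sort after "B".
assert sorted(["Aa", "A\u030aa", "B"]) == ["Aa", "A\u030aa", "B"]
assert sorted(["Aa", "\u00c5a", "B"]) == ["Aa", "B", "\u00c5a"]
```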

------
alecco
I used to think UTF was a great thing, but I came to realize it isn't good to
go through all this trouble just to keep using a 70s US-centric API. UTF is
error-prone and the problems are often subtle and hard to catch.

Also, the same API has serious issues with buffer overflows and off-by-one
errors. It would be great to move on.

~~~
alexkon
What do you suggest instead? UTF-32?

