
UTF-8 – "The most elegant hack" - raldu
http://hackaday.com/2013/09/27/utf-8-the-most-elegant-hack/
======
sytelus
I continue to be amazed by Ken Thompson's name popping up in unexpected places
even after all these decades. When I first learned that the co-creator of Unix
also pioneered regular expression matching, I was surprised. Then I came to
know about that elegant hack of hiding malware in a compiler so that it's
practically invisible. And then ed... then chess... and then UTF-8... and then
Go... It feels like Ken Thompson is the Newton of Computer Science. If you
randomly open a physics book to a section 3-4 times, you are very likely to
read something about Newton's contributions. What's even cooler about him is
this humble, down-to-earth line from his Turing Award lecture:

 _I am a programmer. On my 1040 form, that is what I put down as my
occupation. As a programmer, I write programs._

It would be great to create a website that showcases all of his contributions
in detail.

~~~
Demiurge
If Ken Thompson is the Newton of Computer Science (as opposed to, say, Unix),
who is Niklaus Wirth?

~~~
enneff
Or indeed Claude Shannon.

~~~
tripzilch
Shannon is more like the Feynman of Computer Science:
[http://en.wikipedia.org/wiki/Claude_Shannon#Hobbies_and_inve...](http://en.wikipedia.org/wiki/Claude_Shannon#Hobbies_and_inventions)

~~~
arithma
The fixed point is, of course, en.wikipedia.org/wiki/John_von_Neumann

------
mojuba
We should be grateful to UTF-8 for saving us from a fixed-width multibyte
Internet (so persistently pushed by MS, IBM and others), which would have made
little sense in this predominantly 8-bit world. I remember someone saying at
the time: the future is 16-bit, and modems and floppy disks will cope anyway,
so why not switch to Unicode now? Somehow that sounded absurd to me.

Anyway, 20 years later, hardware is still mostly 8-bit, and basically nobody
cares about Unicode apart from font designers and the Unicode Consortium.

(On a side note, UTF-8 as a hack is a distant relative of Huffman coding,
itself a beautiful thing.)

~~~
contingencies
_Anyway, 20 years later, hardware is still mostly 8-bit, and basically nobody
cares about Unicode apart from font designers and the Unicode Consortium._

Err, failure to parse sarcasm?

 _Most_ people on most networks use it daily. On mobile networks it's used for
tens of thousands of messages per hour, since SMS in almost every country with
a non-Romanized script uses the UCS-2 encoding. I've had to write a raw SMS
PDU generator for a whole bunch of human languages, including right-to-left
ones like Farsi and Arabic, as well as Chinese, Japanese, Korean and Thai.
Furthermore, Unicode is now used by everyone in China, which has finally
mostly migrated from GB encoding. That means "most of the internet uses it
daily".
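
A minimal sketch of the encoding step being described (Python; a real SMS PDU
also needs headers, addresses and length fields, all omitted here -- UCS-2
coincides with UTF-16BE for BMP text):

    msg = "سلام"                         # Farsi greeting, right-to-left script
    user_data = msg.encode("utf-16-be")  # two bytes per character
    print(user_data.hex())               # "0633064406270645"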

I guess your point may have been sarcastic, because the above is obvious to
me. Either that, or you have a _really_ bad case of a merry centrism.

~~~
ufmace
I think he means "nobody cares about Unicode" in the sense that nobody cares
about the technology itself. The guy in Japan probably loves that he can make
a website in Japanese and have it look right on any system in the world. He
probably loves even more that this just works without him having to know or
care about encoding systems. Even most programmers using modern languages and
frameworks can get away with not paying much attention to encodings and still
probably have good support for international characters.

~~~
contingencies
OK, thanks. So you've translated that "nobody cares" apparently means
"everyone uses it and it works really well". And that's supposed to be a
point? God, I must be getting old.

(Edit: I am not getting frustrated with you, rather my incapacity to grok the
original post.)

~~~
jlgreco
I don't know why you are getting frustrated with him, he just (correctly, as
far as I can tell) interpreted the above comment for you.

------
nilsbunger
One really cool property of UTF-8 I never realized before is that a decoder
can easily align to the start of a code point: every byte that has '10' in
bits 7 and 6 is _not_ the beginning of one. That's really useful if you want
to randomly seek in a text without decoding everything in between.

Between that and backwards compatibility with ASCII, I'd say it's a pretty
neat hack.
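
A minimal sketch of that property (Python, not from the video): continuation
bytes always match the bit pattern 10xxxxxx, so from any byte offset you can
back up to a code point boundary before decoding.

    def align_to_codepoint(buf: bytes, pos: int) -> int:
        """Back up from an arbitrary offset to the start of a code point."""
        while pos > 0 and (buf[pos] & 0xC0) == 0x80:  # 10xxxxxx = continuation
            pos -= 1
        return pos

    text = "héllo, 世界".encode("utf-8")
    start = align_to_codepoint(text, 2)  # byte 2 is the middle of 'é'
    assert text[start:].decode("utf-8").startswith("é")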

~~~
eridius
This was actually one of the design goals of UTF-8. A decoder should be able
to pick up a stream at any point and not lose a single valid codepoint.

~~~
kazagistar
This assumes you know the byte alignments. What if all you have is a stream of
zeros and ones?

~~~
ars
There's no way to align a bit stream whose bytes use all 8 bits. (Whatever
code you choose for the alignment mark could also occur as valid data, since
there are no spare values.)

You need to add extra bits so you can find the alignment, and once you do that
you have the answer to your question.

~~~
wgd
While you may be correct, it's interesting to note the existence of Consistent
Overhead Byte Stuffing
([http://en.wikipedia.org/wiki/Consistent_Overhead_Byte_Stuffi...](http://en.wikipedia.org/wiki/Consistent_Overhead_Byte_Stuffing)),
which sort of, kind of, if you squint, manages to add a 257th symbol to the
data stream at a constant worst-case overhead of one extra byte per 254 bytes
of data.

The original paper also discusses how to use the "Eliminate all zero bytes"
behavior in order to set up an unambiguous sync pattern to indicate the
beginning of transmissions in a bit-level wire protocol.
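
For reference, a compact COBS encoder (a Python sketch of the algorithm in
that paper): each zero byte is replaced by the distance to the next one, so
the output contains no zeros and a 0x00 byte becomes an unambiguous frame
delimiter.

    def cobs_encode(data: bytes) -> bytes:
        """Encode so the output has no 0x00 bytes (<= 1 byte overhead per 254)."""
        out, block = bytearray(), bytearray()
        full = False                 # True right after flushing a full block
        for b in data:
            full = False
            if b == 0:               # replace the zero with a distance code
                out.append(len(block) + 1)
                out += block
                block.clear()
            elif len(block) == 253:  # 254 non-zero bytes: special code 0xFF
                block.append(b)
                out.append(0xFF)
                out += block
                block.clear()
                full = True
            else:
                block.append(b)
        if block or not full:        # final (possibly empty) group
            out.append(len(block) + 1)
            out += block
        return bytes(out)

    assert cobs_encode(b"\x11\x22\x00\x33").hex() == "0311220233"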

------
bbq
I wonder how computers would process text if societies with more complex
alphabets had been at the foundation of the industry instead of English-
speaking societies. What if Intel, Microsoft, IBM, and Apple were all Japanese
companies and had grown in a global market where English was not dominant? A
big if, sure. Certainly, there must be glimpses of this in the history of
computing.

~~~
hrktb
You can look at Sharp or Toshiba building word processors, which were
dedicated electronic appliances, but which slowly gave way to generic
computers as those became cheap and good enough to manage complex input
methods and non-English character display (i.e. enough pixels per screen to
have recognizable characters).

It's funny to think that the concept of a fully functional typewriter was
foreign to Japan until dedicated computers could be built.

~~~
bbq
Huh, that's interesting. Perhaps computing simply took the path of least
resistance to express itself in society (for better or worse).

------
jcampbell1
As a self-taught newbie programmer, one of my early questions was, "How do
letters get drawn on the screen?"

My limited understanding is that it works like:

bits (0's and 1's) -> encoding (e.g. UTF-8) -> glyphs (e.g. Unicode) -> some
insanely complicated black box. This black box knows how to do all sorts of
things like kerning, combining cháracters, bizarre punctuation，and other
magic.

I understand UTF-8 and Unicode, but I have no idea how all the other magic
works. Why is AV nicely kerned, and 我. nicely spaced? Apparently this is a
really hard problem, because my trusty old code editor TextMate didn't get it
right. Unicode to screen is a terribly hard problem.

~~~
gonnakillme
One way to think about text encoding is that it provides a nice way to
represent indexes into a big array of glyphs: "Unicode". Fonts map (some of)
these indexes to graphic representations and provide "font hinting" -- clues
as to how different letters should be rendered at different sizes -- and
kerning information.

Usually your desktop environment is the thing that handles all of this,
providing widgets for other programs to combine to create GUIs. I can't find
it right now, but you can probably find e.g. GNOME/GTK's implementation.

(Re: TextMate: code is typically rendered in a monospace font -- where every
glyph has the same width -- so that it has the same alignment in every font
with every renderer.)

~~~
MarcusBrutus
I think the terminology you use is not right. Unicode maps integers to
graphemes (not glyphs) and vice-versa (i.e. it's a 1-1 mapping). Fonts map
graphemes to glyphs.

~~~
ygra
This is even more wrong. Unicode maps integers to abstract characters. Fonts
map characters to glyphs and layout engines map glyphs to graphemes. Keep in
mind that graphemes can be composed of multiple parts, e.g. ligatures or
combining characters.

------
purpleturtle
This is a Computerphile video. You can find the rest here:
[http://www.youtube.com/user/Computerphile/videos](http://www.youtube.com/user/Computerphile/videos)

Really fascinating interviews with luminaries.

~~~
GyrosOfWar
Brady Haran's videos are in general very good. All of his videos are in this
style, concentrated on a single topic and in general shorter than ten minutes.
I especially like the SixtySymbols channel (Physics and Astronomy). He seems
to have a talent for asking the right questions and/or finding the right
people to answer his questions.

------
jacquesm
UTF-8 is a beautiful hack, but the way applications handle UTF-8 text and deal
with corruption ranges from excellent to horrible. I've had a text editor balk
at me after an hour of work, refusing to save with an 'xx character can not be
encoded' error and no hint whatsoever as to _where this character was_.
Playing manual divide and conquer without being able to save (just undo the
changes) is pretty scary. Finally it turned out to be a quote that looked just
like every other quote.

~~~
webreac
The UTF-8 hack is beautiful. All the problems with handling corruption come
from standardization committees. Their only purpose is to avoid admitting
that UCS-2 and UTF-16 were very bad ideas that should be killed.

------
symisc_devel
Note that the UTF-8 encoding mechanism inspired the variable-length integer
encoding
([http://sqlite.org/src4/doc/trunk/www/varint.wiki](http://sqlite.org/src4/doc/trunk/www/varint.wiki))
introduced by the SQLite author (DRH), which I hope will become popular among
programmers.
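
As a rough illustration of the UTF-8-like idea (a simplified Python sketch,
not SQLite4's actual byte layout): the first byte alone determines the total
length, small values fit in one byte, and encoded values sort correctly under
a plain bytewise/memcmp comparison.

    def encode_varint(n: int) -> bytes:
        """Sketch: lead byte determines length; output is memcmp-ordered."""
        assert 0 <= n < 2 ** 64
        if n < 0xF8:                 # small values are a single literal byte
            return bytes([n])
        payload = n.to_bytes((n.bit_length() + 7) // 8, "big")
        return bytes([0xF7 + len(payload)]) + payload  # lead 0xF8..0xFF

    assert encode_varint(200) < encode_varint(0xF8) < encode_varint(300)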

~~~
nly
Unfortunately many variable-length integer encodings aren't well thought out.
The integer encoding used in Protocol Buffers, for instance, allows multiple
valid encodings of any particular integer (there's no one true canonical
form)... which can make the format difficult to use with things like hashing
and digital signatures.
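
For example (a Python sketch of the base-128 varint decoding rule): in the
Protocol Buffers wire format the value 1 can arrive either as the canonical
0x01 or padded out as 0x81 0x00, and a standard decoder accepts both, which
is exactly what breaks byte-level hashing and signing.

    def decode_varint(data: bytes) -> int:
        """Base-128 varint: 7 payload bits per byte, least-significant first."""
        value = shift = 0
        for b in data:
            value |= (b & 0x7F) << shift
            shift += 7
            if not b & 0x80:         # high bit clear: last byte of the number
                return value
        raise ValueError("truncated varint")

    assert decode_varint(b"\x01") == decode_varint(b"\x81\x00") == 1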

~~~
masklinn
FWIW UTF-8 allows multiple encodings of any particular integer. Decoders
_should_ reject non-minimal encodings (they are a security risk, as it becomes
possible to smuggle an ASCII payload in a non-ASCII form, which blindsides
some security systems) but don't always do so. And of course if you don't
decode UTF-8 data at any point and don't validate it either, you're still
fucked.
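
The classic concrete case (Python; the bytes are the well-known overlong '/'):
U+002F is canonically the single byte 0x2F, but the two-byte form 0xC0 0xAF
carries the same payload bits, and lenient decoders that accept it have let
path-traversal filters be bypassed. Python's decoder rejects it, as the
standard requires:

    overlong_slash = b"\xc0\xaf"   # 110_00000 10_101111 -> payload 0x2F ('/')
    try:
        overlong_slash.decode("utf-8")
    except UnicodeDecodeError as err:
        print("rejected non-minimal form:", err)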

------
ppierald
UTF-8 wins for the internet because most of the payload of an HTTP
request/response is 7-bit ASCII characters, so it is the most efficient
overall, even for languages whose characters, given the complexity of their
character sets, would individually be more compact in UTF-16.

------
sambeau
I once wrote a programming language that used UTF-8. I was rather proud of
this hack that succinctly allowed UTF-8 variable / function names. I went
searching for it and, sure enough, I still like it.

    
    
      BAD_UTF8 = [\xC0\xC1\xF5-\xFF];
      UTF8_CB  = [\x80-\xBF];
      UTF8_2B  = [\xC2-\xDF];
      UTF8_3B  = [\xE0-\xEF];
      UTF82    = UTF8_2B UTF8_CB;
      UTF83    = UTF8_3B UTF8_CB UTF8_CB;
      UTF8     = UTF82 | UTF83 ;
      
      ATOM     = ([_a-zA-Z]|UTF8)([_a-zA-Z0-9]|UTF8)*;

~~~
ygra
This looks limited to the BMP in the same way the broken utf8 charset for
MySQL is. You're simply ignoring the existence of the other 16 planes. While
not necessarily a problem in practice, especially with identifiers in
programs, it still hints at a greater problem: you chose to call this UTF-8,
which it isn't.

~~~
sambeau
Yes, I chose to stop there for pragmatic reasons; however, the intention was
always to add the other 4, 5 & 6 byte forms if/when needed.

This was as much as I needed to test my wide-char code (and to parse written
languages, e.g. Chinese). At least, left like this, my code would error
properly if it encountered a character it didn't know.

------
cma
I thought 7 bits were used in ASCII because terminals needed the 8th as a
parity bit, not because machines dealt much in 7-bit entities.

~~~
Patrick_Devine
That is definitely true for early terminals; however, the parity bit got
tossed after things got less lossy. There was an 8-bit ASCII definition called
"Extended ASCII" which was (is?) used for things like curses, and which
describes a lot of characters that have been remapped in UTF-8.

------
chris_wot
I once wrote a potted history of the precursors of Unicode, starting from the
telegraph codes:

[http://randomtechnicalstuff.blogspot.com.au/2009/05/unicode-...](http://randomtechnicalstuff.blogspot.com.au/2009/05/unicode-and-oracle.html)

~~~
derleth
Tom Jennings did something similar which only goes up through ASCII:
[http://www.wps.com/J/codes/](http://www.wps.com/J/codes/)

You might also know Tom Jennings from having created FidoNet and having
written the BIOS that became the basis for the Phoenix Technologies BIOS.

[http://en.wikipedia.org/wiki/Tom_Jennings](http://en.wikipedia.org/wiki/Tom_Jennings)

~~~
chris_wot
That's awesome! Thanks for sharing this, it's really detailed. I got in a
mention of the Gauss-Weber telegraph alphabet though :-)

Everything else, he covers in far more detail!

------
omn1
This always reminds me of the Microsoft long filename support for Windows NT.
They basically use the unused bits of the old DOS 8.3 file entry (8 characters
for the filename, 3 characters for the extension) to signal a long file name,
then allocate enough directory space to store the name depending on its
length. The genius part is that older systems only see the shorter filenames
and keep on working.

The specification of the format is at [1], although I would love to see a
nice drawing of the hack.

[1]:
[http://www.cocoadev.com/index.pl?MSDOSFileSystem](http://www.cocoadev.com/index.pl?MSDOSFileSystem)
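
The bit-level trick, as far as I understand it (a Python sketch; the constants
are from the published FAT specification): long-name entries set the attribute
combination read-only + hidden + system + volume-label, which no pre-VFAT tool
ever matches, so old systems skip those entries and see only the 8.3 alias.

    # Attribute flags of a 32-byte FAT directory entry.
    ATTR_READ_ONLY, ATTR_HIDDEN, ATTR_SYSTEM, ATTR_VOLUME_ID = 0x01, 0x02, 0x04, 0x08
    ATTR_LONG_NAME = ATTR_READ_ONLY | ATTR_HIDDEN | ATTR_SYSTEM | ATTR_VOLUME_ID

    def is_long_name_entry(attr: int) -> bool:
        """True for the hidden entries carrying pieces of the long file name."""
        return attr & 0x3F == ATTR_LONG_NAME   # 0x3F masks the defined bits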

~~~
justin66
Your URL is broken but it sure sounds like you're describing the FAT32 long
filename hack which appeared in Windows 95. That has nothing to do with NTFS.

~~~
jychang
VFAT [1] existed in WinNT 3.5, which was released in 1994.

[1]
[http://en.wikipedia.org/wiki/File_Allocation_Table#Long_file...](http://en.wikipedia.org/wiki/File_Allocation_Table#Long_file_names)

------
smagch
You may also want to see UTF-8 Everywhere manifesto.
[https://news.ycombinator.com/item?id=3906253](https://news.ycombinator.com/item?id=3906253)

------
elwell
Describing a hack that uses a varying number of bytes for different characters
as elegant makes me cringe. However, I have no better proposal.

~~~
crystaln
Seems perfectly sensible to me. Why should characters be of fixed size?

~~~
lukasLansky
Getting the i-th char in O(1) is a nice thing to have.

~~~
simias
UTF-8 is a nice format for storage/transmission. If you're going to do some
heavy processing with your text you're supposed to convert it to a fixed-width
format in memory (typically UTF-32).
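
In Python terms (a sketch of the trade-off; Python's str already is the
fixed-width view): raw UTF-8 bytes don't support O(1) character indexing,
while the decoded form does.

    s = "naïve – 日本語"
    raw = s.encode("utf-8")
    print(len(raw), len(s))   # 20 bytes vs 11 characters
    assert s[6] == "–"        # O(1) indexing by code point, not by byte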

~~~
nostrademons
Most heavy text-processing applications I know actually tokenize text to words
(well, technically terms), keep a lexicon mapping from the term ID to textual
representation, and then work in term space. Individual letters are usually
not semantically meaningful in most languages (both human and machine), and so
your analysis becomes much easier if you operate in a space that _is_
semantically meaningful.
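
A toy sketch of that term-space approach (Python; the names are illustrative,
not from any particular system):

    lexicon: dict[str, int] = {}    # term -> term ID

    def term_id(term: str) -> int:
        """Intern a term, assigning the next free ID on first sight."""
        return lexicon.setdefault(term, len(lexicon))

    doc = "the quick brown fox jumps over the lazy dog"
    print([term_id(t) for t in doc.split()])
    # [0, 1, 2, 3, 4, 5, 0, 6, 7] -- both occurrences of 'the' share ID 0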

------
userulluipeste
I never understood why a simpler solution wasn't adopted, with bytes like:

    0xxxxxxx

for the backward-compatible ASCII range, and

    1xxxxxxx

meaning the byte doesn't stand alone, but is followed by one (or more)
complementary byte(s).

This way, the ASCII range would be defined as the "first order", and

    1xxxxxxx 0xxxxxxx

would be the "2nd order" character range, and so on:

    1xxxxxxx 1xxxxxxx ... 0xxxxxxx

as an N-byte sequence would form the "Nth order" character range (with an
extensible N, of course).

This kind of encoding would have accommodated any range! If I am missing
something, I'm all ears.
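
For concreteness, a sketch of my reading of the proposal (Python): seven
payload bits per byte, a set high bit meaning "more bytes follow", a clear
high bit ending the sequence. One property worth noting: the terminating byte
always lands in the ASCII range, so a byte-level search for, say, '/' can
match the tail of a multibyte character, and a reader joining mid-stream
can't tell lead bytes from the interior of a longer sequence -- UTF-8's
10xxxxxx continuation marking avoids both.

    def encode_proposed(codepoint: int) -> bytes:
        """Sketch of the scheme above: high bit set = another byte follows."""
        groups = []
        while True:
            groups.append(codepoint & 0x7F)   # 7 payload bits per byte
            codepoint >>= 7
            if codepoint == 0:
                break
        groups.reverse()
        return bytes(0x80 | g for g in groups[:-1]) + bytes([groups[-1]])

    assert encode_proposed(ord("A")) == b"A"    # ASCII passes through
    # U+65E5 (日) ends in byte 0x65, which is also plain ASCII 'e':
    assert encode_proposed(0x65E5) == b"\x81\xcb\x65"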

------
t_hozumi
I think there is still a fundamental problem with string encoding.

The problem is that decoders cannot know what encoding a byte stream was
encoded in without additional information, and such information is often lost
or omitted, as you can see all over the web.

In such a situation, all a decoder can do is guess. This is the reason why we
still suffer from mojibake.

A possible solution would have been to prepend the encoding information to
the bytes as one or two header bytes.

For example:

UTF-8 = 0b00000001

UTF-16 = 0b00000010

Shift_JIS = 0b00000011

EUC-JP = 0b00000100

and so on.
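
A sketch of that tagging idea (Python; the IDs are the hypothetical ones
listed above):

    ENCODINGS = {0x01: "utf-8", 0x02: "utf-16", 0x03: "shift_jis", 0x04: "euc-jp"}

    def tag(text: str, enc_id: int) -> bytes:
        """Prefix the payload with one byte naming its encoding."""
        return bytes([enc_id]) + text.encode(ENCODINGS[enc_id])

    def untag(data: bytes) -> str:
        return data[1:].decode(ENCODINGS[data[0]])

    assert untag(tag("文字化け", 0x03)) == "文字化け"   # "mojibake", round-tripped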

Of course this is not a practical solution, because everyone would have to
switch their decoders and encoders to this protocol at once.

------
anonymouz
I wouldn't really call it a "hack", rather an instance of a standard way of
producing variable-length codes, namely a prefix code [1]. That it was also
made to be self-synchronizing is of course neat, but again, I would reserve
"hack" for one-off tricks rather than for applications of well-known concepts
to the problems they were designed to solve.

[1]
[https://en.wikipedia.org/wiki/Prefix_codes](https://en.wikipedia.org/wiki/Prefix_codes)
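
The prefix property is easy to see in code (a Python sketch): the lead byte
alone determines the length of the whole sequence, so no valid encoding is a
prefix of a different one.

    def utf8_length(lead: int) -> int:
        """Total sequence length, determined by the lead byte alone."""
        if lead < 0x80: return 1     # 0xxxxxxx: plain ASCII
        if lead < 0xC0: raise ValueError("continuation byte, not a lead byte")
        if lead < 0xE0: return 2     # 110xxxxx
        if lead < 0xF0: return 3     # 1110xxxx
        if lead < 0xF8: return 4     # 11110xxx
        raise ValueError("invalid lead byte")

    for ch in "a¢€𐍈":                # 1-, 2-, 3- and 4-byte examples
        enc = ch.encode("utf-8")
        assert utf8_length(enc[0]) == len(enc)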

