
UTF-8 history (2003) - antipropaganda
https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
======
Camillo
With all due respect, it seems to me that the documents contradict the claim
in the initial email.

Rob's initial claim is that Ken came up with UTF-8 entirely from scratch,
without even looking at IBM's proposal.

But Ken's oldest document from Sep 2, 1992 has his changes simply appended
after the original FSS-UTF document from IBM (starting at "We define 7 byte
types", as mentioned in the email), and notes the changes from original spec.

Then the final document that he sent out (Sep 8) is basically the FSS-UTF
document with his changes applied (this is also mentioned in the email!).

There are two changes, basically:

1\. Use 10 instead of 1 as the prefix for continuation bytes, so you can
synchronize from an arbitrary location.

2\. Once the bits are reassembled, use the value as-is instead of adding a
bias constant, which simplifies the code at the cost of a tiny bit of packing
efficiency.
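
To make the two changes concrete, here is a minimal sketch of the resulting bit layout, restricted to the modern 1 to 4 byte range. The names `utf8_encode` and `utf8_decode_one` are my own, not taken from the FSS-UTF or Plan 9 documents, and the sketch deliberately skips the checks a real decoder needs (overlong forms, surrogates, truncated input).

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point; every continuation byte gets the 10 prefix."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        n, lead = 1, 0xC0          # 110xxxxx + 1 continuation byte
    elif cp < 0x10000:
        n, lead = 2, 0xE0          # 1110xxxx + 2 continuation bytes
    elif cp < 0x110000:
        n, lead = 3, 0xF0          # 11110xxx + 3 continuation bytes
    else:
        raise ValueError("code point out of range")
    out = [lead | (cp >> (6 * n))]
    for i in range(n - 1, -1, -1):
        out.append(0x80 | ((cp >> (6 * i)) & 0x3F))   # 10xxxxxx
    return bytes(out)


def utf8_decode_one(data: bytes) -> int:
    """Decode the first character; the reassembled bits are used as-is."""
    first = data[0]
    if first < 0x80:
        return first
    if first >> 5 == 0b110:
        n, cp = 1, first & 0x1F
    elif first >> 4 == 0b1110:
        n, cp = 2, first & 0x0F
    elif first >> 3 == 0b11110:
        n, cp = 3, first & 0x07
    else:
        raise ValueError("not a lead byte")
    for b in data[1:1 + n]:
        if b >> 6 != 0b10:
            raise ValueError("expected a continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp                       # no bias constant added


print(utf8_encode(0x20AC).hex())    # e282ac
```

Because no bias is added, a 2-byte sequence could in principle carry a value below 0x80; those redundant (overlong) forms are the small packing cost mentioned in point 2.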

Given the documentation provided, I would say that the fairest description of
the development of UTF-8 is that IBM came up with the initial design, and Plan
9 made two improvements on it to produce the final UTF-8 design.

~~~
leshow
One thing I was confused about. The document says there are 7 byte types, but
I thought UTF-8 was variable width up to only 4 bytes. Did I misunderstand
something?

~~~
hyperman1
Both are correct: The original UTF-8 encoding allows sequences of up to 6
bytes and can encode values up to 2^31. But because UTF-16 can only reach 17
planes of 64K values, Unicode has a hard limit of 0x10FFFF, a little over 2^20
codepoints.

This means UTF-8 encoded values of more than 4 bytes can never represent a
valid unicode codepoint even if they produce a valid 32 bit numerical value.
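
A small sketch of the arithmetic, assuming the six lead-byte layouts from the 1992 document (the seventh byte type is the continuation byte). The names `original_length` and `MAX_CODE_POINT` are made up for illustration.

```python
MAX_CODE_POINT = 0x10FFFF   # highest value reachable through UTF-16

# Exclusive upper bounds for sequences of 1..6 bytes: the payload is
# 7, 11, 16, 21, 26 and 31 bits respectively.
_LIMITS = (0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000)


def original_length(value: int) -> int:
    """Bytes needed for `value` under the original 1 to 6 byte scheme."""
    if value < 0:
        raise ValueError("negative value")
    for length, limit in enumerate(_LIMITS, start=1):
        if value < limit:
            return length
    raise ValueError("value needs more than 31 bits")


print(original_length(MAX_CODE_POINT))   # 4
print(original_length(0x200000))         # 5
```

Everything that needs 5 or 6 bytes lies above `MAX_CODE_POINT`, which is why 4 bytes is enough today.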

------
kstenerud
"Why didn't we just use their FSS/UTF? As I remember, it was because in that
first phone call I sang out a list of desiderata for any such encoding, and
FSS/UTF was lacking at least one - the ability to synchronize a byte stream
picked up mid-run, with less that one character being consumed before
synchronization."

I'm of two minds about this. On the one hand, such an ability is pretty
useless in modern systems filled with checksums at every stage, and reduces
the bit efficiency of UTF-8. On the other hand, this was the only valid-y
reason at the time to rewrite a UTF implementation from scratch.

If not for that requirement, we would have just had UTF-8 implemented as
regular VLQs.

Edit: Actually, now that I think about it, VLQ already does satisfy the
synchronization requirement. Just scan for the next cleared high bit. At most
1 character consumed, and far less bit wastage.

~~~
jstimpfle
The ability to pick up byte streams mid-run is not useless. It's required for
example to jump to arbitrary locations in a file and make sense of the data
you find there.

Imagine a text editor displaying a large CSV file. Wouldn't you mind having to
read everything between two locations if you jump forward from the one to the
other? Or read everything from the start if you jump backwards? Even if the
text editor stores its own synchronization points, it has to read the file
completely at least once, which can be annoying for very large files.

Also, many of the simpler text tools that only look for ASCII bytes wouldn't
work. For example, printing the last lines.

Also, I don't know that other system, but what about robustness if there's one
bad byte somewhere in the file?
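
For reference, resynchronizing in UTF-8 after landing at an arbitrary offset can be sketched like this (`sync_forward` is a made-up helper name, and the sketch assumes the buffer is otherwise well-formed UTF-8):

```python
def sync_forward(buf: bytes, pos: int) -> int:
    """Return the first index >= pos that is not a continuation byte."""
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:   # 10xxxxxx
        pos += 1
    return pos


data = "aé€z".encode("utf-8")      # 61 | c3 a9 | e2 82 ac | 7a
start = sync_forward(data, 4)      # offset 4 is in the middle of "€"
print(data[start:].decode("utf-8"))   # z
```

At most three continuation bytes are skipped in modern UTF-8, and a byte below 0x80 is always a complete ASCII character, never part of a longer sequence.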

~~~
kstenerud
You can do the same thing using VLQ, with less wastage. Just scan for the next
cleared high bit.
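
A rough sketch of that scan, assuming the usual big-endian VLQ layout where every byte except the last has its high bit set (`vlq_encode` and `vlq_sync_forward` are illustrative names, not from any spec):

```python
def vlq_encode(value: int) -> bytes:
    """Big-endian VLQ: 7 bits per byte, high bit set on all but the last."""
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(0x80 | (value & 0x7F))
        value >>= 7
    return bytes(reversed(groups))


def vlq_sync_forward(buf: bytes, pos: int) -> int:
    """Return the index just past the next byte with a cleared high bit."""
    while pos < len(buf) and buf[pos] & 0x80:
        pos += 1
    return min(pos + 1, len(buf))


stream = vlq_encode(0x41) + vlq_encode(0x20AC) + vlq_encode(0x7A)
print(stream.hex())                  # 41c12c7a
print(vlq_sync_forward(stream, 1))   # 3
```

One consequence of this layout: the final byte of a multi-byte value looks like a plain 7-bit byte, so a byte-level search for an ASCII character can match inside another character (here `b","` occurs in `stream` although no comma was encoded).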

~~~
mirekrusin
You can also construct malicious payloads which will DoS many implementations
with architecture like that.

~~~
kstenerud
You can also trivially defend against that by having a max travel distance of,
say, 4.

~~~
mirekrusin
Exactly, so you arrived at the same conclusion - fixed distance so you can
safely pick up stream at arbitrary position.

~~~
kstenerud
The difference is that the distance limits would be inherent to the codec
itself, rather than taking up extra space in the data stream.

------
twsted
"So we went to dinner, Ken figured out the bit-packing..."

I cannot find the words to describe how much I admire these guys (ken, rob,
drm, etc).

------
Pulletwee12549
Nice post, quite interesting.

I think Rob Pike may have one of the coolest email addresses in the world
(r@google.com).

~~~
mci
Back when I worked at Google, a friend of mine got an unpleasant email from r@
because his Python script had a bug: it sent alerts to the individual
characters of my-friend@google.com

------
jquast
Markus Kuhn did a great deal of work on Unicode for Unix, see the rest of the
same folder,
[https://www.cl.cam.ac.uk/~mgk25/ucs/](https://www.cl.cam.ac.uk/~mgk25/ucs/)

such as examples/UTF-8-demo.txt, a beautiful demo file for testing terminal
emulators,

or the wcwidth.c file, which I discovered was the origin of the same C function
found in all OSes. I forked it and maintain it in Python form as the public
"wcwidth" module.

Anyway, I just wanted to point out the treasure trove in the parent folder!

------
daveslash
I believe these are photos of the exact diner. I can't confirm though.
[https://www.flickr.com/photos/ajstarks/albums/72157631470798...](https://www.flickr.com/photos/ajstarks/albums/72157631470798..).

Source:
[https://news.ycombinator.com/item?id=19565980](https://news.ycombinator.com/item?id=19565980)

~~~
filleokus
Fixed flickr url:
[https://www.flickr.com/photos/ajstarks/sets/7215763147079887...](https://www.flickr.com/photos/ajstarks/sets/72157631470798870)

~~~
daveslash
Thank you.

------
janvdberg
Obligatory Computerphile link:
[https://www.youtube.com/watch?v=MijmeoH9LT4](https://www.youtube.com/watch?v=MijmeoH9LT4)

------
microtonic1
Back from 2003... how is it possible? Even nowadays I still find places which
miss the Ñ when printing documents...

~~~
jandrese
Behold the power of corporate inertia.

------
xvilka
Obligatory UTF-8 Everywhere[1] link.

[1] [http://utf8everywhere.org/](http://utf8everywhere.org/)

------
lugg
There was a really good talk on the entire history of Unicode recently:

[https://youtu.be/IXNIqThaSs8](https://youtu.be/IXNIqThaSs8)

Some might find it more digestible than the linked page.

