
I � Unicode [pdf] - beefburger
http://seriot.ch/resources/talks_papers/i_love_unicode_softshake.pdf
======
jrochkind1
Unicode is just about the most technically successful standard I have ever
seen; it's pretty amazing.

The weird and complicated parts are all a result of the weirdness and
complexity of the domain -- the universe of human written language. All the
solutions are amazingly elegant for the domain they are in -- including
solutions to legacy backwards compatibility where possible, which have made
unicode as successful at catching on as it has been. The decisions on
compromises between practical legacy compatibility and pure elegance were _just
right_.

The only major mis-step was the "UCS-2" mistake, before they realized more
bytes really were going to be needed, sadly now stuck in Java and making
proper unicode support in Java way harder than it should be.

But in general, if only all our standards dealing with very complex problems
could be as elegantly designed and executed as unicode.

~~~
thristian
Something that often gets missed out of the Unicode story is that originally
there were two groups. The first was the Unicode consortium, who wanted to
combine all the world's existing character encodings, and had picked 16-bit
units as a comfortable representation, which would have been more than enough
for their stated goal.

When Unicode 1.0 came out, there were a bunch of people forming the ISO 10646
committee to produce a character encoding that would cover every human-written
character ever, even the ones that weren't already part of an existing
encoding, but 16 bits would definitely _not_ be sufficient for that. On the
other hand, creating two entirely separate standards wouldn't be a great idea
either, so they joined forces and created Unicode 2.0 with astral planes and
surrogate pairs and all that expansion business.

The point is, we shouldn't blame the Unicode consortium for short-sightedness,
we should blame scope-creep.

~~~
jrochkind1
Hm, except I think the scope broadening was a _great_ idea. A standard that
only covered the human-written characters that had encodings created for them
at that time would have been much less useful than what we've got.

Thanks for your explanation of the history, showing that it wasn't exactly
short-sightedness. Still, I wouldn't "blame" scope-creep -- maybe it's just
another unusual example of the standards-makers involved here managing to make
the right decision at almost every point, even when it involved 'competition'
between standards bodies.

The UCS-2 leftover stuff is one of the biggest problems in practical Unicode
at the moment, alas.

~~~
thristian
Oh, certainly. In an ideal world we would have had the ISO 10646 scope from the
start, combined with maybe UTF-8. I do occasionally come across people
"explaining" UTF-16 by saying the Unicode consortium couldn't count, which I
feel is unfair even if it's a lie-to-children.

------
guard-of-terra
It's a shame they use ISO-8859-5 as an example because it was never used by
anyone in practice. It's a stillborn standard.

First, we had IBM-866 and КОИ-8 aka KOI8-R, then painfully switched to
WINDOWS-1251, and then Unicode. ISO-8859-5 was never adopted by anyone.

~~~
Moru
That's funny, this survey says it's still in use:
http://w3techs.com/technologies/details/en-iso885905/all/all
Quelle.ru uses iso-8859-5.

~~~
guard-of-terra
http://w3techs.com/technologies/details/en-windows1251/all/all

Three orders of magnitude more for the de-facto standard. Which is, according
to the site, the third most popular after Unicode and Latin-1.

Of course, it's technically possible to use ISO-8859-5, and some misguided
software might even default to it. But it's exceedingly rare and there's no
point to it. Even KOI8-R is ten times more widely used.

------
TheLoneWolfling
Here are my thoughts on Unicode:

Options:

1) Use UTF-32 everywhere. When space is an issue, just compress it -
especially on disk. If you need random access to a string, use a seekable
compression algorithm on it on-the-fly. Alternatively, use a compression
algorithm with checkpoints and maintain a sorted list of where each checkpoint
starts and how far along in the associated decompressed text it is,
effectively rolling your own (a sketch of this checkpoint index follows these
options). Note that this method doesn't work well with writes.

2) Use an interesting variant of a rope. Use a rope, but a) keep track of
"logical characters" instead of code points - what unicode calls graphemes,
IIRC - and b) have each node carry an encoding, requiring that all characters
within a specific node have the same width. This allows pretty much everything
to be sublinear. If you allow a bit in a node for "special" nodes (i.e.
reversed, lazy-loaded, slice of another node, that sort of thing), reversing,
among other things, is actually truly O(1). Bunches of optimizations here -
you want to fall back to a "node" that's a flat array for small strings, you
want to potentially use overlong encodings internally where appropriate (i.e.
if you have one 1-byte character in a bunch of 2-byte characters, that sort of
thing), you want to have some encodings that aren't fixed-width (for things
like reading a bunch of bytes from a file), and you want to have an encoding
that's "unknown" / binary data.
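
A minimal sketch of the checkpoint idea from option 1, in Python with zlib;
the chunk size and helper names here are illustrative, not from the talk:

    import bisect
    import zlib

    CHECKPOINT_EVERY = 4096  # characters per chunk; an illustrative value

    def compress_with_checkpoints(text):
        # Compress independently decompressible chunks, recording how many
        # characters of decompressed text precede each one.
        chunks, starts, offset = [], [], 0
        for i in range(0, len(text), CHECKPOINT_EVERY):
            piece = text[i:i + CHECKPOINT_EVERY]
            chunks.append(zlib.compress(piece.encode('utf-32-le')))
            starts.append(offset)
            offset += len(piece)
        return chunks, starts

    def char_at(chunks, starts, n):
        # Binary-search the sorted checkpoint list, then decompress only
        # the single chunk containing character n.
        k = bisect.bisect_right(starts, n) - 1
        piece = zlib.decompress(chunks[k]).decode('utf-32-le')
        return piece[n - starts[k]]

Random access then costs one chunk's decompression instead of the whole
string's; as noted above, a write in the middle still forces re-compressing at
least the affected chunk.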

Thoughts:

1) Why on earth does any higher-level language still use byte or codepoint
counts for length? And why don't lower-level languages at least have a way to
count / index by graphemes? (The sketch after these thoughts shows how far
apart the counts can be.)

2) I do not like UTF-8 / 16. It's effectively bad Huffman coding. It's an
attempt to save space, but it doesn't even do that well. About the only
advantage of UTF-8 is that ASCII maps to it reasonably well. And it has a
bunch of disadvantages, chief among them being that if you write a single
multibyte character, you potentially have to rewrite the entire string.
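
To make thought 1 concrete, here is a minimal illustration; it assumes the
third-party Python "regex" module, whose \X pattern matches one grapheme
cluster:

    import regex  # third-party; the stdlib re module has no \X

    s = 'e\u0301'  # 'e' + COMBINING ACUTE ACCENT: one user-perceived character
    print(len(s.encode('utf-8')))        # 3 (bytes)
    print(len(s))                        # 2 (code points)
    print(len(regex.findall(r'\X', s)))  # 1 (grapheme cluster)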

~~~
nabla9
User-perceived characters are not graphemes, they are grapheme clusters.

You can look at Unicode strings on at least four different levels of
abstraction: bytes, code points, code units and grapheme clusters. The only
advantage UTF-32 has over the others is that all code units fit into a single
code point (at least I think so).

If you want a vector where each user-perceived character and whitespace
matches one element in the vector, probably the easiest way is to create a
vector where each element is a short Unicode string matching one grapheme
cluster (sketched below).
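
A minimal sketch of that representation in Python, assuming the third-party
regex module's \X pattern for grapheme clusters:

    import regex

    def cluster_vector(s):
        # One element per user-perceived character (grapheme cluster).
        return regex.findall(r'\X', s)

    v = cluster_vector('nai\u0308ve')  # 'naïve' with a combining diaeresis
    print(len(v))  # 5
    print(v[2])    # the whole cluster 'ï', never half of it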

~~~
deathanatos
> The only advantage UTF-32 has over the others is that all code units fit
> into a single code point (at least I think so)

All code points can be encoded as a single code unit in UTF-32. Code points
are the things like U+0065 LATIN SMALL LETTER E; code units are what you
encode code points as in a given Unicode encoding — i.e., octets in UTF-8 and
32-bit integers in UTF-32.
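
A quick illustration in Python (the character is just an example that lies
outside the Basic Multilingual Plane):

    ch = '\U0001F600'  # a single code point, U+1F600
    print(len(ch.encode('utf-8')))           # 4 code units (octets)
    print(len(ch.encode('utf-16-le')) // 2)  # 2 code units (a surrogate pair)
    print(len(ch.encode('utf-32-le')) // 4)  # 1 code unit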

~~~
nabla9
Yes. Thank you.

Things that don't necessarily fit into a single UTF-32 code unit: combining
character sequences and grapheme clusters.
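
For instance, in Python (NFC normalization is the standard way to collapse a
combining sequence into a single precomposed code point, but not every
sequence has one):

    import unicodedata

    s = 'x\u0301'  # 'x' + COMBINING ACUTE ACCENT has no precomposed form
    print(len(unicodedata.normalize('NFC', s)))  # still 2 code points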

------
grimgrin
Love Unicode? Then Butts Institute may be for you!

http://butts.institute

But in all seriousness, you may enjoy this thing my friend cooked up.

"With over a million billion codepoints, Unicode offers a vast array of unique
characters — perfect for microblogging. [Butts Institute] helps you keep your
own personal Unicode character updated, instantly, as often as you like! It's
fast, convenient, fun, social, and totally free!"

Just make an options request to:

curl -X OPTIONS http://u.butts.institute

https://gist.githubusercontent.com/shmup/e92dad275bcca9287aa8/raw/6a2c7474b5dca418a01768b8e993fedb9d901e18/gistfile1.txt

------
asgard1024
Unicode kinda jumped the shark with all the emoji... they might as well encode
all frequent words/meanings.

Though I like them. The only emoticon I always miss is a "shrug", something
like either "I don't know" or "I don't care".

~~~
Animats
I've seen Unicode characters for Facebook, Twitter, etc. icons. So far,
they've been in user-defined fonts, in user-defined expansion space. But I
suspect there will be pressure to put them in the standard.

~~~
thristian
The original set of emoji used in Japanese phones had ten characters reserved
for country flags, but as you can imagine ISO were very much not keen to
include one particular set of countries to the exclusion of all the others, or
to specify a particular flag design for a particular country. The solution
they went with was to add 26 "regional indicator symbol" characters, and if
you use those characters to spell out an ISO country code, rendering software
is supposed to look up the current flag of the named country and display it.
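
A minimal sketch of that scheme in Python (the helper name is mine, not from
the standard):

    def flag(iso_code):
        # Map each ASCII letter to its regional indicator symbol,
        # U+1F1E6 (A) through U+1F1FF (Z).
        return ''.join(chr(0x1F1E6 + ord(c) - ord('A')) for c in iso_code.upper())

    print(flag('JP'))  # two regional indicators; rendered as the Japanese flag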

If they went to such lengths to avoid adding flags to the spec, imagine how
much pushback you'd get for trying to add company logos.

------
rwg
Another fun Unicode-related bug in OS X 10.9:

    % printf 'Unicode strike\xcd\x9bs again' | LANG=en_US.UTF-8 od -tc
    Assertion failed: (width > 0), function conv_c, file /SourceCache/shell_cmds/shell_cmds-175/hexdump/conv.c, line 137.
    0000000    U   n   i   c   o   d   e       s   t   r   i   k   e 
    zsh: done       printf 'Unicode strike\xcd\x9bs again' | 
    zsh: abort      LANG=en_US.UTF-8 od -tc

I don't know if this is fixed in OS X 10.10 — I filed a bug with Apple a year
ago, but it was marked as a duplicate of another bug. The only thing I can see
about that other bug is that it's now closed.

~~~
kalleboo
Not crashing for me in 10.10

    ~$ printf 'Unicode strike\xcd\x9bs again' | LANG=en_US.UTF-8 od -tc
    0000000    U   n   i   c   o   d   e       s   t   r   i   k   e    ͛  **
    0000020    s       a   g   a   i   n
    0000027

------
mathias
More details on the Unicode regex problems in JavaScript (slide 62) and how
ES6 will solve most of these issues:
https://mathiasbynens.be/notes/es6-unicode-regex

------
walrus
To whoever changed the title: it really was supposed to be "I � Unicode", not
"I Love Unicode". The "�" symbol is embedded in the document as a raster
image, so the author really meant for it to be that; it wasn't just a font
rendering issue on your end.

~~~
dang
We changed it back.

Edit: Normally we take attention-grabbing Unicode glyphs out of titles since
they disrupt the placid bookishness of HN's front page. But this one is so
tasteful and content-appropriate that it seems obviously a special case.

~~~
walrus
Thanks!

