
Dark corners of Unicode - zeitg3ist
http://eev.ee/blog/2015/09/12/dark-corners-of-unicode/
======
LeoNatan25
I am not sure whether these are dark corners of Unicode or just complex
language support. When dealing with text, you need to educate yourself about
text. International text is much more complex than localized text, but if you
intend to support it, learn it. Sure, frameworks exist that hide a lot of the
complexity, but then you hit a specific bug or corner case where the framework
fails you, and since you are not familiar with the intricacies of the
language-specific problem, you are clueless about how to proceed and fix it,
whereas if you had invested the time to learn about it, it would have been
much easier.

A similar (albeit rather simpler and more limited) problem is calendrical
calculation, where people who have little to no grasp of how to perform
correct date operations build complex calendar applications and fail
spectacularly on some edge cases.

Call me crazy, but if you are dealing with text, set aside some time for
research before you start your development.

~~~
chipsy
As you point out, many of these features are dependent on how well they're
handled by the application developer. If Unicode were "just" about rendering
pre-made glyphs consistently it would not be nearly so hard, but it aims to do
more - sorting, capitalization, string length - and that's also why it tends
to fall over in production systems. Nobody is ever going to get Unicode
completely right, just a subset of it for the languages they've successfully
localized.

The simplest character encoding you could hope to work with is something like
a single-line calculator or vending machine display - fixed-width, no line
breaks, just the Arabic numerals and maybe a decimal point, some mathematical
symbols, or a limited English alphabet to display "INSERT CASH". Any
featureset above that produces issues. Just line breaks alone are responsible
for all sorts of strange behaviors.

I think it's a bit magical that we've managed to do so much with text given
the starting situation. At each step - from early telegraph encodings through
the proliferation of emoji - the implementations had to codify things that
were previously left open, develop rules around their use, etc. We've made
language more systematic than it ever was in history, so that machines can
parse and process it.

~~~
ygra
Unicode is complex because language is complex. Before we had that complexity
in Unicode there were a lot of languages and scripts that couldn't be
represented accurately (or at all) in computers. Mixing scripts in one
document was nigh impossible. I'm not sure that's a world we want back.

It's funny, though, how people are not even aware of the fact that different
languages are different until they see something break with Unicode. Casing
and collation rules were language-specific long before Unicode; it's just
that lots of software didn't even _attempt_ to do them right. Again, a world
I'd rather not have back.

------
jrochkind1
For locale-aware, correct sorting of Unicode strings based on the Unicode
Collation Algorithm, the open source libraries Twitter released are pretty
awesome.

ruby: [https://github.com/twitter/twitter-cldr-rb](https://github.com/twitter/twitter-cldr-rb)

javascript: [https://github.com/twitter/twitter-cldr-js](https://github.com/twitter/twitter-cldr-js)

Human written language is pretty complicated. The Unicode standards
(including the Common Locale Data Repository, the Unicode Collation
Algorithm, normalization forms, associated standards and algorithms, etc.)
are a pretty damn amazing approach to dealing with it. It's not perfect, but
it's amazing it's as well-designed and complete as it is. It's also not easy
to implement solutions based on the Unicode standards from scratch, because
it's complicated.
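
The libraries above are Ruby and JavaScript; as a rough illustration of the
same idea, here's a minimal sketch in Python using only the standard library.
(The locale name is an assumption and must be installed on your system; the
Twitter libraries implement the full CLDR-backed algorithm.)

    # Naive codepoint order vs. locale-aware collation.
    import locale

    words = ["zebra", "Öl", "apple", "orange"]

    # Codepoint order sorts "Öl" after "zebra", which is wrong for German.
    print(sorted(words))

    # strxfrm produces locale-specific sort keys; "de_DE.UTF-8" is assumed
    # to be available, which varies by OS.
    locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
    print(sorted(words, key=locale.strxfrm))  # groups "Öl" with the Os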

------
wereHamster
I don't agree with the section about JavaScript strings. Those are proper
strings, just encoded in UTF-16.

> JavaScript’s string type is backed by a sequence of unsigned 16-bit
> integers, so it can’t hold any codepoint higher than U+FFFF and instead
> splits them into surrogate pairs.

You just contradicted yourself. Surrogate pairs are exactly what allow UTF-16
to encode any codepoint.

Once you start talking about in-memory representation, you need to agree on
an encoding, UTF-8 and UTF-16 being the most common. wchar_t could be UTF-16
or UCS-2.
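
To make that concrete, here is the surrogate-pair arithmetic itself, sketched
in Python (U+1F600 is just an arbitrary example):

    # How UTF-16 encodes a codepoint above U+FFFF as a surrogate pair.
    cp = 0x1F600
    v = cp - 0x10000                 # 20 bits left to encode
    high = 0xD800 + (v >> 10)        # lead (high) surrogate
    low = 0xDC00 + (v & 0x3FF)       # trail (low) surrogate
    print(hex(high), hex(low))       # 0xd83d 0xde00

    # Python's own codec produces the same pair (little-endian, no BOM):
    assert chr(cp).encode("utf-16-le") == \
        high.to_bytes(2, "little") + low.to_bytes(2, "little")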

~~~
masklinn
JavaScript strings are not UTF-16; you'd only ever see codepoints if that
were the case. JavaScript "strings" are UCS-2, and it's trivial to
demonstrate: "\ud83c" is a valid JavaScript string, but it's not valid
UTF-16.

Here's the relevant section of the Unicode FAQ on the subject:

> UCS-2 does not describe a data format distinct from UTF-16, because both use
> exactly the same 16-bit code unit representations. _However, UCS-2 does not
> interpret surrogate code points_, and thus cannot be used to conformantly
> represent supplementary characters.

A correct UTF-16 implementation would interpret surrogate code points,
validate that they're paired, and prevent access to either surrogate via
_string_ operations.
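
Python happens to allow the same demonstration: its str type permits lone
surrogates, but its UTF-16 codec (correctly) rejects them. A minimal sketch:

    # A lone surrogate is a legal Python string, as in JavaScript...
    s = "\ud83c"

    # ...but it is not valid UTF-16: the codec refuses to encode it.
    try:
        s.encode("utf-16")
    except UnicodeEncodeError as e:
        print(e)                     # surrogates not allowed

    # A properly paired surrogate sequence decodes to a single codepoint:
    print(b"\x3c\xd8\xfe\xdf".decode("utf-16-le"))   # U+1F3FE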

~~~
wereHamster
ES6 did get some new functions to correctly deal with surrogate pairs in
strings. In the end, JS strings are just a sequence of 16-bit values, with
the unfortunate caveat that many string functions interpret those as UCS-2
and only some new functions as UTF-16.

When you come across an invalid sequence while decoding particular input
(like "\ud83c"), you generally have three choices: throw an exception, skip
the invalid part, or replace it with a replacement character. The default
JavaScript behavior is to be lenient. But if you need more control over the
decoding behavior, you can use StringView or TextDecoder, which is part of
this spec:
[https://encoding.spec.whatwg.org/](https://encoding.spec.whatwg.org/)
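
The same three choices exist as error handlers in Python's codecs, which
makes them easy to demonstrate (the input is a deliberately truncated UTF-8
sequence, chosen arbitrarily):

    # Throw, skip, or replace when decoding hits an invalid sequence.
    bad = b"\xf0\x9f\x8f"            # truncated 4-byte UTF-8 sequence

    try:
        bad.decode("utf-8")                                  # 1. throw
    except UnicodeDecodeError as e:
        print("strict: ", e)

    print("ignore: ", bad.decode("utf-8", errors="ignore"))  # 2. skip -> ""
    print("replace:", bad.decode("utf-8", errors="replace")) # 3. -> U+FFFD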

~~~
masklinn
> ES6 did get some new functions to correctly deal with surrogate pairs in
> strings. In the end, JS strings are just a sequence of 16-bit values

Which is exactly why they are not and cannot be UTF-16.

> The default JavaScript behavior is to be lenient.

The JavaScript behaviour is to have UCS-2 "strings".

------
heycam
Interesting that Firefox takes the decomposed Hangul and renders it as whole
syllables, while Chrome shows them as the sequence of individual jamos.
[http://mcc.id.au/temp/hangul.png](http://mcc.id.au/temp/hangul.png)
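
For anyone unsure what "decomposed Hangul" means at the codepoint level, a
quick Python sketch:

    # NFD turns a precomposed Hangul syllable into its individual jamo.
    import unicodedata as ud

    jamos = ud.normalize("NFD", "\ud55c")      # U+D55C HANGUL SYLLABLE HAN
    print([f"U+{ord(c):04X}" for c in jamos])  # ['U+1112', 'U+1161', 'U+11AB']
    # Whether that sequence renders as one syllable block or as three
    # separate jamo glyphs is up to the renderer, hence the difference
    # between the browsers.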

------
jhallenworld
He's rendering normalized text, but normalization is only for string
comparisons...

I don't understand why emoji are width 1 either... really, EastAsianWidth.txt
from the Unicode standard needs to match what fixed-width terminal emulators
actually do.

I've been dealing with all of this recently in JOE: [http://sourceforge.net/p/joe-editor/mercurial/ci/default/tre...](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/NEWS.md)

In particular JOE now finally renders combining characters correctly. It now
stores a string for each character cell which includes the start character and
any following combining characters. If any of them change, JOE re-emits the
entire sequence.

But which characters are combining characters? I expect \p{Mn} and \p{Me},
but U+1160-U+11FF need to be included as well and aren't. It's crazy that
these are not counted as combining characters. Now I'm going to have to check
how zero-width joiner is handled in terminal emulators. JOE is not changing
the start character after a joiner into a combining character, ugh...
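
A sketch of that classification in Python, with the jamo range special-cased
by hand (the function name is mine, not JOE's):

    # Which codepoints should share a cell with the preceding character:
    # general categories Mn/Me, plus the Hangul jamo medial vowels and
    # final consonants (U+1160-U+11FF), which Unicode doesn't class as marks.
    import unicodedata

    def joins_previous_cell(ch: str) -> bool:
        if unicodedata.category(ch) in ("Mn", "Me"):
            return True
        return 0x1160 <= ord(ch) <= 0x11FF

    for ch in ("\u0301", "\u20dd", "\u1161", "a"):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
              f"{joins_previous_cell(ch)}")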

~~~
jrochkind1
Well, there are multiple normalization forms in Unicode, and the OP isn't
clear about this. (Perhaps because the Python library he's using also isn't
as clear as it ought to be? I dunno.)

'Compatibility' normalizations are mainly for comparison (including
indexing/search/retrieval) and sorting, although there might be other uses.
But indeed, you should not expect a 'compatibility' normalization to render
the same as the un-normalized input that produced it.

The 'canonical' normalization outputs ought to render the same as the
un-normalized input, but rendering systems don't always get it quite right.

For the web, the W3C recommends a canonical normalization.

The Unicode documentation on normalization forms is actually pretty readable
and straightforward, for being a somewhat confusing topic.
[http://unicode.org/reports/tr15/](http://unicode.org/reports/tr15/)

> Normalization Forms KC and KD [compatibility normalizations] must not be
> blindly applied to arbitrary text. Because they erase many formatting
> distinctions, they will prevent round-trip conversion to and from many
> legacy character sets, and unless supplanted by formatting markup, they may
> remove distinctions that are important to the semantics of the text. It is
> best to think of these Normalization Forms as being like uppercase or
> lowercase mappings: useful in certain contexts for identifying core
> meanings, but also performing modifications to the text that may not always
> be appropriate.

The compatibility normalizations are pretty damn useful for
indexing/search/retrieval though. Anyone storing non-ascii text in Solr or
ElasticSearch (etc) probably wants to be familiar with them -- as a general
rule of thumb, you probably want to do a compatibility normalization before
indexing and again on query input.
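
The difference is easy to see from Python, where unicodedata exposes all four
forms (the sample string is arbitrary):

    # Canonical forms only recompose/decompose the é; compatibility forms
    # additionally fold the ligature and the superscript, which is why they
    # suit indexing but must not be blindly applied to arbitrary text.
    import unicodedata as ud

    s = "ﬁancé²"    # U+FB01 ligature + "anc" + U+00E9 + U+00B2
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        out = ud.normalize(form, s)
        print(form, out, [f"U+{ord(c):04X}" for c in out])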

~~~
eevee
OP here. You say this all, _and yet_, if I google for "unicode strip
accents"...

Top-voted answer uses NFD, one below it uses NFKD:
[http://stackoverflow.com/questions/517923/what-is-the-best-w...](http://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string)

NFKD:
[http://www.perlmonks.org/?node_id=835238](http://www.perlmonks.org/?node_id=835238)

NFD:
[http://www.perlmonks.org/?node_id=1105025](http://www.perlmonks.org/?node_id=1105025)

NFD:
[http://www.perlmonks.org/?node_id=485681](http://www.perlmonks.org/?node_id=485681)

NFD: [http://drillio.com/en/software/java/remove-accent-diacritic/](http://drillio.com/en/software/java/remove-accent-diacritic/)

NFKD:
[https://gist.github.com/j4mie/557354](https://gist.github.com/j4mie/557354)

Two and a half of the first six results blindly apply NFKD to arbitrary text.
_All_ of them use normalization.

Sad state of affairs.
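
For reference, here is the recipe those results converge on, with the
NFD/NFKD choice made explicit so the difference is visible (the sample input
is arbitrary):

    # The usual "strip accents" recipe: decompose, drop combining marks.
    import unicodedata as ud

    def strip_accents(text: str, form: str = "NFD") -> str:
        return "".join(c for c in ud.normalize(form, text)
                       if not ud.combining(c))

    s = "café ﬁn²"
    print(strip_accents(s, "NFD"))    # 'cafe ﬁn²'  -- only the accent goes
    print(strip_accents(s, "NFKD"))   # 'cafe fin2' -- formatting folded too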

~~~
taejo
"Strip accents" is not a well-defined operation outside of a specific locale.
Does "Ö" have an accent or not? In German, yes: it's an O with an umlaut. In
English, yes: it's an O with some funny dots on it (heavy metal umlauts?). In
the "New Yorker" dialect of English, it's an O with a dieresis. But in
Hungarian, Finnish, Turkish, and many others, it's _not_ : it's the letter
between O and P, or between O and Ő, or after Z, or...

If you do want to do this, you should know that it only makes sense in your
own locale, and you shouldn't be surprised that the methods are somewhat
ad-hoc. (I'm not saying you _shouldn't_ do this: I've done it myself.)

~~~
ygra
In German, the history of the letter and orthographic rules even dictate that
ö be written as oe in such cases (that's what it evolved from and that's what
the two dots are; i.e. it's not a diaeresis in German, despite looking the
same).

------
arm
_“Also, I strongly recommend you install the Symbola font, which contains
basic glyphs for a vast number of characters. They may not be pretty, but
they’re better than seeing the infamous Unicode lego.”_

I disagree with the notion that Symbola is not a pretty font. As I mentioned
here¹, the glyphs Symbola has for the Mathematical Alphanumeric Symbols block
are quite beautiful². (It may help that I'm using the non-hinted version on a
HiDPI display, though... still, that implies it will look even better when
printed on paper with an inkjet or laser printer, since those still produce
more DPI than the typical HiDPI monitor.)

――――――

¹ —
[https://news.ycombinator.com/item?id=10198620](https://news.ycombinator.com/item?id=10198620)

² —
[http://f.cl.ly/items/2h2p0r1F1h2E1y2o2y0c/Screen%20Shot%2020...](http://f.cl.ly/items/2h2p0r1F1h2E1y2o2y0c/Screen%20Shot%202015-09-10%20at%2011.51.55%20AM.png)

------
VLM
It's a good article. There is a direct analogy with the article asking if
HTML is a semantic markup language or a binary graphics art format, with the
two groups barely overlapping except in failure and mostly not being very
interested in each other.

~~~
eponeponepon
That sounds like an article I'd like to read - can anyone provide a link?

------
alister
> _I strongly recommend you install the Symbola font, which contains basic
> glyphs for a vast number of characters. They may not be pretty, but they’re
> better than seeing the infamous Unicode lego._

Well, I installed the Symbola font as he suggested but I'm still seeing lots
of Unicode lego in the article.

I'm using Windows 7 and the latest version of Firefox. I set Symbola as the
default font in Firefox and unchecked the box that says, "Allow pages to
choose their own fonts, instead of my selections above".

What could I be doing wrong? I would assume that if the author recommends
Symbola font, he's checked that Symbola has representations for all the
symbols he's using.

~~~
arm
Symbola certainly does support most of the characters used in that article
(though not all). Specifically, it doesn't have glyphs for the CJK characters
or the regional indicator symbols used on the page, so you'll need another
font for those. (OS X v10.7+ includes the 'Apple Color Emoji' font, which has
glyphs for the regional indicator symbols, while on Windows 7, 'Segoe UI
Symbol' includes glyphs for them, though you need to have installed the
update¹ for the font first.)

Anyway, firstly, there's no need to uncheck 'Allow pages to choose their own
fonts, instead of my selections above'. You can go ahead and let the page
specify whatever fonts it wants. If you don't have the font(s) specified in
the webpage's stylesheet, or the webpage contains a character that the
current font doesn't have a glyph for, your OS/browser will substitute
another font on your system that does have a glyph for that character (if one
exists), so you can recheck that option. You don't even have to explicitly
pick Symbola as the font for any type of text in Firefox at all, since your
OS should apply font substitution automatically whenever the fonts chosen
there don't have a glyph for a character on whatever webpage you're on. In
fact, it's impossible for any one font to contain all of Unicode right now:
even OpenType fonts can only contain a maximum of 65,536 glyphs, while
Unicode has more than 120,000 assigned codepoints, so font substitution is
absolutely necessary. (So you can change the fonts in Firefox back to the
defaults if you like.)

Secondly, after you install the font, you may have to restart the computer or
close Firefox before it actually picks up on the new font.

Thirdly, the version of Symbola that was linked to in the article is an old
one. I’d recommend this² one instead (covers more codepoints).

――――――

¹ — [https://support.microsoft.com/en-us/kb/2729094](https://support.microsoft.com/en-us/kb/2729094)

² —
[https://web.archive.org/web/20150625020428/http://users.teil...](https://web.archive.org/web/20150625020428/http://users.teilar.gr/~g1951d/)

------
TazeTSchnitzel
I predict someone will complain, as usual, that Unicode could and should be
regular and programmer-friendly and everything.

My response would be this: [http://xkcd.com/1576/](http://xkcd.com/1576/)

Unicode is merely as complex as that which it encodes: human language.

------
huuu
This is why you should provide a locale to most string functions in C#, for
example. I think this article is mostly about bad Unicode support in
programming languages.
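
The standard illustration is Turkish casing. Python's str.lower() is
locale-independent, so this sketch can only show the problem, not the
locale-parameterized fix that C#'s overloads provide:

    # Turkish distinguishes dotted İ/i from dotless I/ı. A locale-unaware
    # lower() maps I -> i (wrong for Turkish) and turns İ into i plus a
    # stray combining dot (U+0307).
    print("DİYARBAKIR".lower())   # 'di̇yarbakir', not the Turkish 'diyarbakır'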

