
Dark corners of Unicode (2015) - madars
https://eev.ee/blog/2015/09/12/dark-corners-of-unicode/
======
Manishearth
This article is pretty great, however I would like to caution against using
the word "character" in discussions about Unicode. The post does say that it's
a fuzzy concept, but also says that it's "Basically a thing in the Unicode
table somewhere.", which is the definition of a code point. Unicode itself
gives multiple incompatible meanings for the term
([http://unicode.org/glossary/#character](http://unicode.org/glossary/#character))
and only seems to use it in non-normative text -- so the post isn't wrong, but
has the potential to cause confusion.

Often when you want to say character you really mean "grapheme cluster"[1]. A
Devanagari consonant cluster (of one or more consonants) with an optional vowel
(and misc diacritics) is a grapheme cluster. A Latin letter with accent
mark(s) is a grapheme cluster. A Hangul syllable block (precomposed or built
from individual jamo) is a grapheme cluster. A flag or multicultural family
emoji is a grapheme cluster.

When talking about characters, the question arises whether a Hangul syllable
made of decomposed jamo is one character or three. Or whether the combination
of a [Devanagari consonant + virama] "character" and a [Devanagari consonant +
vowel] "character" is one character or two. Grapheme clusters specify this,
and they usually map to the places where the notion of a character matters:
text selection, offsets in editing, etc. "Code point" does not universally map
to any tangible concept; it's a concept made up for the sake of specifying
Unicode. You only care about code points when dealing with UTF-32 strings or
when implementing operations on Unicode text. "Glyph" is also sometimes what
we mean when we say "character", though that's more useful on the rendering
end of things.

[1]: to be pedantic, "extended grapheme cluster", because Unicode gave a
rigorous definition of grapheme cluster and later decided to change it.
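
To make the code point versus grapheme cluster distinction concrete, here is a
minimal sketch using only Python's standard `unicodedata` module. The sample
syllable (U+D55C) and the variable names are illustrative choices, not
something taken from the article.

```python
import unicodedata

# One Hangul syllable block, stored precomposed (a single code point).
precomposed = "\uD55C"

# NFD splits it into its conjoining jamo: lead consonant, vowel, final consonant.
decomposed = unicodedata.normalize("NFD", precomposed)

# Both strings render as one grapheme cluster, yet their code point counts differ.
code_point_counts = (len(precomposed), len(decomposed))

# NFC recomposes the jamo, so the two spellings compare equal after normalization.
round_trips = unicodedata.normalize("NFC", decomposed) == precomposed
```

So "one character or three" depends entirely on which layer you count at: the
reader sees one syllable either way.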

~~~
d0mine
I find the following hierarchy helpful: bytes -> code units -> code points ->
extended grapheme clusters -> user-perceived characters

In Python, a Unicode string is an immutable sequence of Unicode code points. It
has nothing to do with UTF-32 (Python uses a flexible internal representation).
To get "user-perceived characters" (approximated by eXtended grapheme
clusters):

    
    
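      import regex    # third-party, linked below
      import wcwidth  # third-party, linked below
      chars = regex.findall(r'\X', unicode_text)  # \X = one extended grapheme cluster
      width_in_terminal = wcwidth.wcswidth(unicode_text)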
    

In practice, whatever is produced by the default method of iterating over a
string is often called a character in that programming language (a code point,
a UTF-16 code unit, or even a byte).

[https://pypi.python.org/pypi/regex](https://pypi.python.org/pypi/regex)
[https://pypi.python.org/pypi/wcwidth](https://pypi.python.org/pypi/wcwidth)
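
The lower layers of that hierarchy can be counted with the standard library
alone. This is a rough sketch; `layer_sizes` and the sample string are made up
for illustration, and the top two layers (grapheme clusters, user-perceived
characters) still need something like the `regex` module above.

```python
def layer_sizes(text: str) -> dict:
    """Count the same text at three layers of the hierarchy."""
    return {
        "utf8_bytes": len(text.encode("utf-8")),
        # utf-16-le avoids the byte order mark; each code unit is 2 bytes.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "code_points": len(text),
    }

# "e" + COMBINING ACUTE ACCENT, then GRINNING FACE (outside the BMP).
sample = "e\u0301\U0001F600"
sizes = layer_sizes(sample)
```

A reader would call `sample` two characters, which matches none of the three
counts.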

~~~
labster
In Perl 6:

    
    
        $width-in-terminal = $text.chars;
        $codepoints = $text.codes;
        $bytes = $text.encode('utf8').bytes;
    

The .chars method should be the fastest, because Perl 6 internally stores
strings as fully composed characters (Normalization Form Grapheme, NFG). It's
much better than having to do regex hacks like in Python.

~~~
dom0

        $width-in-terminal = $text.chars;
    

^- very likely wrong.

wcswidth _computes_ the cell-width (or column-count) of a Unicode string,
which is unrelated to the count of graphemes, EGCs or code points. For
example, Latin characters are one cell/column wide, whereas many CJK
characters occupy two cells/columns even though each is still one EGC.

A typical application is printing CJK things to a terminal mask, progress
display or similar.
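
A rough idea of what such a width computation involves, sketched with the
standard library's `unicodedata.east_asian_width`. This is an approximation
under simplifying assumptions (it ignores control characters, zero-width
joiners and emoji sequences) and is not a replacement for wcswidth;
`approx_cell_width` is a made-up name.

```python
import unicodedata

def approx_cell_width(text: str) -> int:
    """Rough terminal column count; an approximation, not a wcswidth replacement."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks share the cell of their base character
        # 'W' (wide) and 'F' (fullwidth) characters take two cells.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width

latin_width = approx_cell_width("abc")
cjk_width = approx_cell_width("\u65E5\u672C\u8A9E")  # three CJK ideographs
```

Both strings are three code points and three EGCs long, but they need a
different number of columns.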

~~~
labster
Ah right. I guess we were meaning different things by width. And you're
probably more correct here.

------
gotthemwmds
I ran into this at a consulting job recently...

MySQL claims to support utf8, but in reality its "utf8" is a three-byte
subset. You need utf8mb4 to support certain common Kanji characters.
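
For anyone wanting to check their own data: MySQL's legacy utf8 charset (an
alias for utf8mb3) stores at most three bytes per character, which covers
exactly the code points up to U+FFFF. A minimal sketch of that check follows;
the function name and the two sample characters are my own choices.

```python
def fits_in_three_byte_utf8(text: str) -> bool:
    """True if every code point is in the BMP, i.e. needs at most 3 UTF-8 bytes."""
    return all(ord(ch) <= 0xFFFF for ch in text)

bmp_kanji = "\u6F22"          # a kanji inside the BMP
non_bmp_kanji = "\U00020BB7"  # a variant of U+5409 that lives outside the BMP

bmp_ok = fits_in_three_byte_utf8(bmp_kanji)
non_bmp_ok = fits_in_three_byte_utf8(non_bmp_kanji)
non_bmp_bytes = len(non_bmp_kanji.encode("utf-8"))
```

Any string for which this returns False needs utf8mb4.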

This company had spent untold thousands (possibly millions) trying to convert
gigantic databases (and I don't use the term gigantic loosely...) from utf8 to
utf8mb4 because some of their Japan-based clients were using Kanji.

Sounds easy right? Wrong. utf8mb4 comes with some technical "gotchas" (google
it) that had delayed the attempt to change to it by almost a year.

Anyway, I found this pretty amusing, and got a huge paycheck to explain to
them just how screwed they were.

~~~
DCoder
> _utf8mb4 comes with some technical "gotchas" (google it)_

I know InnoDB limits index sizes to 767 bytes, meaning VARCHAR(255) using utf8
can have all 255 characters indexed, but VARCHAR(255) using utf8mb4 can only
index 191 characters (floor(767/4) == 191).
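
The arithmetic, spelled out as a small sketch (the 767-byte figure is the one
quoted above; the helper name is hypothetical):

```python
INNODB_INDEX_LIMIT_BYTES = 767  # limit quoted in the comment above

def max_indexable_chars(bytes_per_char: int,
                        limit: int = INNODB_INDEX_LIMIT_BYTES) -> int:
    """Worst case: every character takes the maximum number of bytes."""
    return limit // bytes_per_char

utf8_chars = max_indexable_chars(3)     # legacy 3-byte utf8
utf8mb4_chars = max_indexable_chars(4)  # 4-byte utf8mb4
```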

After a quick Google search, that seems to be the most common gotcha. What
other gotchas did you have in mind?

~~~
gotthemwmds
This was definitely the first thing that came up, as you found.

To be honest, I just don't remember. There was something about something that
made something scary to the PM who was in charge of it all? That is about the
best I can come up with.

I want to say they needed to index more than 191 chars, but that seems like a
stupid thing to say. Who needs to index that many chars?

If I remember, I'll edit :)

edit: I guess I should say I was consulted to do some unrelated things, then
helped them with some MySQL stuff that came up towards the end of the
contract, then the utf8mb4 stuff came up, and I spent some time going through
it with them. It was not the main focus of the contract, which is part of why
I don't remember it very well. Just something that came up in the day to
day...

------
saghm
> How’s that for depending on global mutable state?

This is quite possibly the greatest pun I've ever encountered

------
xg15
My takeaway is that Turkish "I" and English "I" as well as English "æ" and
Icelandic "æ" are crammed into the same character even though that causes all
sorts of problems, but there are about 20 different characters that represent
"x"...

~~~
maxlybbert
The solution to the problem, by the way, is to determine what the user expects
and give it to them. That is, you can't just sort words without defining what
sort you want to do (according to German phone book sort? Portuguese
dictionary sort? etc.).
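
As a toy illustration of why the sort has to be named: German dictionary order
treats "ü" like "u", while phone book order treats it like "ue". The sketch
below hard-codes just those umlaut rules; the table names and sample surnames
are made up, and real code would use proper locale-aware collation (for
example CLDR data via ICU) instead.

```python
# Simplified key tables; real collation handles far more cases than umlauts.
DICTIONARY = str.maketrans({"\u00E4": "a", "\u00F6": "o", "\u00FC": "u"})
PHONE_BOOK = str.maketrans({"\u00E4": "ae", "\u00F6": "oe", "\u00FC": "ue"})

def sort_words(words, table):
    # casefold() lowercases and also turns sharp s into "ss" before the mapping.
    return sorted(words, key=lambda w: w.casefold().translate(table))

names = ["Mutter", "Mugler", "M\u00FCller"]  # the last one is "Mueller" with u-umlaut
dictionary_order = sort_words(names, DICTIONARY)
phone_book_order = sort_words(names, PHONE_BOOK)
```

The same three names come out in two different orders, and both are correct
for their purpose.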

------
nonsince
Where do you sort the Dutch "ij"? The obvious answer is "between 'ii' and
'ik'" but it's actually just the print representation of "ÿ", a letter that is
essentially only used in freehand nowadays. So, do you sort it in place of
"ÿ", when in the Dutch locale? What if you have a borrowed word that happens
to have "ij" in it, like "hijack" (which really is used in Dutch)? In
practice, you sort it as if it's English (i.e. between "ii" and "ik"), but
that leads to confusion because when capitalising you treat it as a single
letter. The titlecase form of "ijzer" (iron) is "IJzer". I guess the correct
way would probably be whatever they do in dictionaries and phone books, but
I'm an expat so I have no idea what ordering those use.

~~~
xg15
Based on other cases, I guess the "unicode-style" solution would be to add a
new codepoint DUTCH CHARACTER IJ OR ÿ which is cased and sorted accordingly.

Of course then you'd have to teach people to use that character when writing
instead of just typing "ij". And one day, there will be someone who, for
stylistic reasons, needs tight control over when it's rendered as "ij" and
when as "ÿ"...

~~~
pitdicker
It already exists: U+0132 and U+0133.

Here is a link with some nice information:
[http://www.uazone.org/multiling/euroml/annex02.html](http://www.uazone.org/multiling/euroml/annex02.html)
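
A quick sketch of how those two code points behave in Python, compared with
typing "i" and "j" separately. The variable names are illustrative only.

```python
import unicodedata

ij = "\u0133"  # the single-code-point form of the Dutch "ij"

name = unicodedata.name(ij)
upper = ij.upper()                    # maps to U+0132
naive_title = "ijzer".title()         # two separate letters: only "i" is capitalised
digraph_title = (ij + "zer").title()  # one letter: the whole digraph is capitalised

# Compatibility normalization turns the digraph back into two plain letters.
as_two_letters = unicodedata.normalize("NFKC", digraph_title)
```

So the casing works out if the text uses U+0133, which is exactly the "teach
people to use that character" problem mentioned above.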

------
elFarto
One of my favourite dark corners of Unicode is IDS, or Ideographic Description
Sequence. It allows you to describe characters that are not encoded in
Unicode, but can be described by a combination of existing ones.

For example, the Chinese character for the word Biang[1] can be described with:

⿺辶⿳穴⿲月⿱⿲幺言幺⿲長馬長刂心

[1]
[https://en.wikipedia.org/wiki/Biangbiang_noodles#Chinese_cha...](https://en.wikipedia.org/wiki/Biangbiang_noodles#Chinese_character_for_bi.C3.A1ng)
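
An IDS is prefix notation: each description operator is followed by its two or
three operands, which may themselves be sequences. Below is a minimal sketch of
a recursive parser for the sequence above. It assumes only the original twelve
operators U+2FF0..U+2FFB (later Unicode versions added more, which are not
handled), and `parse_ids` and `components` are made-up names.

```python
# Arity of the original twelve IDS operators (U+2FF0..U+2FFB).
# Assumption: operators added in later Unicode versions are not covered.
TERNARY = {"\u2FF2", "\u2FF3"}
OPERATORS = {chr(cp) for cp in range(0x2FF0, 0x2FFC)}

def parse_ids(sequence: str):
    """Parse an IDS into nested tuples: (operator, operand, operand[, operand])."""
    def parse(pos: int):
        ch = sequence[pos]  # IndexError here means the sequence ended early
        if ch not in OPERATORS:
            return ch, pos + 1
        arity = 3 if ch in TERNARY else 2
        operands = []
        pos += 1
        for _ in range(arity):
            operand, pos = parse(pos)
            operands.append(operand)
        return (ch, *operands), pos

    tree, end = parse(0)
    if end != len(sequence):
        raise ValueError("trailing characters after a complete sequence")
    return tree

def components(tree):
    """Flatten the tree into its leaf components, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for operand in tree[1:] for leaf in components(operand)]

biang = parse_ids("⿺辶⿳穴⿲月⿱⿲幺言幺⿲長馬長刂心")
leaves = components(biang)
```

The outermost operator takes two operands: the walking radical and everything
else, which is itself a three-part vertical stack.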

~~~
username223
> "Made up of 58 strokes in its traditional form (43 in simplified
> Chinese)..."

I'm glad I don't have to try to type that with 8 fingers and 2 thumbs. Writing
Chinese with a keyboard sounds insane:
[http://www.slate.com/articles/news_and_politics/explainer/20...](http://www.slate.com/articles/news_and_politics/explainer/2006/02/what_does_a_chinese_keyboard_look_like.html)

~~~
amake
The very first input scheme the article talks about is Pinyin, with which
you'd input biang by typing... "b" "i" "a" "n" "g" then selecting from a list
(most likely a list of length 1). Not insane or even difficult.

(Except that biang is not encoded in Unicode yet so you can't type it anyway.)

~~~
username223
Think about it, though. You type phonetically in one script, then select a
character in another script that might be pronounced in a similar way. That's
like entering Hangul syllables that sound similar to what you want, then
choosing the right English character sequences.

~~~
masklinn
> Think about it, though

What's there to think about? How else would you input a script with more than
50k characters?

> You type phonetically in one script, then select a character in another
> script that might be pronounced in a similar way.

Sure. Japanese works the same way, you input in kana or romaji, then select
the suitable kanji (or kanji sequence).

Of course it only works when you have a regular phonology; that would be
completely impossible for English, since by and large orthography and
pronunciation have no relation.

------
michaelvoz
As someone who wrote the LTR/RTL text shaping for Uber Maps, I know the pains
of this way too well. Off-by-one Unicode errors were causing Chinese
characters to get appended to the end of Arabic words! Great read.

------
tingletech
I made this once to browse unicode symbols
[http://tingletech.github.io/unicodetoy/](http://tingletech.github.io/unicodetoy/)

~~~
delan
That’s neat! I made this tool for Unicode in general:
[https://www.azabani.com/labs/charming/](https://www.azabani.com/labs/charming/)

------
adamnemecek
Dark corners of Unicode seems to be a tautology.

~~~
paulddraper
I'd say "truism", but yes.

------
Animats
From the comments on that page, some registrars are now registering domains
with emoji in them. RFC 3490 didn't contemplate that, so it is not,
apparently, disallowed. No idea if IDNA does normalization for the color
modifiers.

~~~
my_first_acct
RFC 3490 has been replaced by RFC 5890. Sadly, emojis are disallowed. But some
registries (for example, .ws, according to rumors) may allow them to be
registered anyway.

~~~
ubernostrum
You could always try to register the corresponding punycode and see if it
works...
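
For reference, a sketch of computing that punycode form with Python's built-in
"punycode" codec. It deliberately skips every IDNA mapping and validation step
and only does the raw encoding; `to_ace_label` is a hypothetical helper, and
whether any registry accepts the result is a separate question.

```python
def to_ace_label(label: str) -> str:
    """Return the ASCII-compatible ("xn--") form of a single domain label."""
    if label.isascii():
        return label  # plain ASCII labels need no encoding
    return "xn--" + label.encode("punycode").decode("ascii")

pile_of_poo = "\U0001F4A9"
ace = to_ace_label(pile_of_poo)

# Decoding the part after the prefix recovers the original code point.
round_trip = ace[len("xn--"):].encode("ascii").decode("punycode")
```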

------
pm24601
Just as an aside, as humans we invented the 'weirdness' of these different
human languages.

In the pencil-and-paper era, we had no problems ever with this.

It's only because we tried to 'simplify' things that we ran into problems :-)

As an additional aside, the OP just talked about sorting words within the same
language.

What about sorting across languages? (For example, names?)

~~~
gizmo686
It's worse than that. None of this weirdness is part of natural human language.
It is all part of the weirdness of the writing system, which is entirely
invented. What we are facing now is difficulties in interfacing different
implementations of the same technology (writing).

------
pja
The double-width emoji bug has at least been fixed as of 2016 & the Unicode 9
release. That string works fine in my VTE based gnome-terminal today.

(I was the one who finally prodded the relevant Unicode cttee into fixing this
bug. They did all the heavy lifting of writing proposals and steering the
change through though: Thanks Ken et al!)

------
BuuQu9hu
(2015)

~~~
leonatan
This will be relevant in 2025 as well.

~~~
gpvos
Yes, but it describes the Unicode situation in 2015. Unicode has probably
already gained a few more dark corners since then, what with the proliferation
of emoji, but that is not reflected in the article.

~~~
username223
Oh God yes. Skin tone combining characters, anyone?

[http://www.unicode.org/L2/L2014/14173-emoji-skin-tone.pdf](http://www.unicode.org/L2/L2014/14173-emoji-skin-tone.pdf)

(I was fine with unrealistic, inhuman, Simpsons-style yellow...) I imagine
fine gradations of locale-dependent zero-width gender identity modifiers will
be added at some point. Unicode is a horror-show that will be producing bugs
for decades to come. Every time you see a bug caused by "\r\n" vs. "\n",
double-encoded HTML entities, or "smart" quotes, remember that Unicode is
orders of magnitude more complex.
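
For what it's worth, the skin tone modifiers are ordinary code points
(U+1F3FB..U+1F3FF) that follow the base emoji, so at the code point level a
modified emoji is two items, not one. A small sketch, with made-up names:

```python
# Emoji modifiers for skin tone occupy U+1F3FB..U+1F3FF.
SKIN_TONE_MODIFIERS = {chr(cp) for cp in range(0x1F3FB, 0x1F400)}

def strip_skin_tone(text: str) -> str:
    """Drop skin tone modifiers, leaving the default-coloured base emoji."""
    return "".join(ch for ch in text if ch not in SKIN_TONE_MODIFIERS)

thumbs_up_medium = "\U0001F44D\U0001F3FD"  # THUMBS UP SIGN + a skin tone modifier
code_points = len(thumbs_up_medium)
plain = strip_skin_tone(thumbs_up_medium)
```

Anything that truncates, reverses or slices by code point can split such a
pair.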

