
Unicode in five minutes (2013) - jstanley
https://richardjharris.github.io/unicode-in-five-minutes.html
======
necovek
I originally laughed at "in five minutes", but even though I do not think the
article reads in five minutes, it does a surprisingly good job of covering the
basics: so good job!

I do wonder if it is clear for people who are unfamiliar with Unicode. Can
anyone who is mostly unfamiliar with the details the article covers say how
comprehensible it is?

I would also add a mention of the standard Unicode collation table, which does
a passable job for many languages at the same time (the Unicode Collation
Algorithm is mentioned, and that table is its default, but I think this
property of most UCA implementations is worth highlighting).
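
For illustration, a rough sketch using the third-party PyICU bindings (assuming
they are installed); a root-locale collator uses that default table and already
orders text far more sensibly than a plain code-point sort:

    from icu import Collator, Locale

    # Root-locale collator: the default UCA ordering, no language-specific tailoring.
    collator = Collator.createInstance(Locale.getRoot())
    words = ["Zebra", "eel", "École", "apple"]
    print(sorted(words))                           # ['Zebra', 'apple', 'eel', 'École'] -- raw code-point order
    print(sorted(words, key=collator.getSortKey))  # ['apple', 'École', 'eel', 'Zebra'] -- default UCA order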

As for the article's gotchas, multilingual text gets even more complex once you
go past five minutes, even for "simple" European scripts. E.g. in
Bosnian/Croatian/Serbian written in the Latin alphabet, "nj" is capitalized as
"Nj" or "NJ" depending on the rest of the word — e.g. "Njegoš" or "NJEGOŠ";
confusingly, Unicode also includes digraph characters for both capitalization
forms (the eternal tension in Unicode between encoding letters, glyphs or
characters), even though they are linguistically equivalent — in practice they
are never used, which makes their inclusion even more perplexing (the digraph
is always spelled out as two characters, and there was no historical reason for
it either, since none of the 8-bit encodings had it)! "nj" will also sometimes
be two distinct letters, especially in loanwords like "konjugovan" — this makes
collation harder, since the proper order would be "konjugovan", "kontakt",
"konj".

All of this is why I like to joke how Cyrillic script is technically much
better for all of these languages, even though it is basically in official use
only for the Serbian language — in Cyrillic, there is no conundrum in either
of the above examples since nj=њ (or нј), Nj/NJ=Њ, and the order is clear:
конјугован, контакт, коњ.

~~~
pmiller2
> I originally laughed at "in five minutes", but even though I do not think
> the article reads in five minutes, it does a surprisingly good job of
> covering the basics: so good job!

Slightly off topic, but just to riff on this a bit: maybe books and articles
called "$THING in $NUMBER_OF $TIME_PERIODS" or "Learn $THING in $NUMBER_OF
$TIME_PERIODS" should be retitled "$NUMBER_OF $TIME_PERIODS with $THING." It
would be more accurate, not imply any sort of mastery, and, on top of that,
sound a little more dignified. But, maybe it wouldn't sell as many books,
so... ¯\\_(ツ)_/¯.

------
Wistar
Joel Spolsky's 2003 Joel On Software piece: "The Absolute Minimum Every
Software Developer Absolutely, Positively Must Know About Unicode and
Character Sets (No Excuses!)"

[https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/)

~~~
bmn__
This has meanwhile fallen behind the times. I would recommend the submitted
article rather than Spolsky's.

~~~
Rels
I'm not so sure about that: Spolsky's article seems way better at introducing
someone to Unicode if they don't know anything about it. The OP article goes
way deeper and has more interesting insights about Unicode itself, though.

Disclaimer: might be biased because I've discovered Unicode through Spolsky's
article.

------
herodotus
This is a really great summary of Unicode. I wish it had been available when I
first started getting into the complexities of multilingual string searching
and normalization. Ultimately, reading the official documentation
(unicode.org) was necessary, but a succinct and clearly written introduction
like this would have saved me hours (if not days) of effort.

~~~
hombre_fatal
Yeah, this is the kind of cut-to-the-damn-chase I want 90% of the time as an
experienced developer touching technology I don't necessarily deep-dive every
day, like an actual example of what NFKC does.

Even if it's too topical to be actionable in every case, it gives you the
general idea and vocabulary to put together useful search queries when you
want to know more.
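
For instance, a minimal sketch (using Python's stdlib unicodedata) of the kind
of NFKC example I mean:

    import unicodedata

    # NFKC applies compatibility decompositions: the single ligature code point
    # U+FB03 becomes the three letters f, f, i. NFC leaves it untouched.
    print(unicodedata.normalize("NFKC", "\uFB03"))  # 'ffi'
    print(unicodedata.normalize("NFC", "\uFB03"))   # 'ﬃ'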

~~~
andrepd
Honestly. It's so frustrating when people go on tangents and say in three
paragraphs what you could say in two short sentences. An example: the rust
book.

------
begriffs
More unicode, in more minutes:

[https://begriffs.com/posts/2019-05-23-unicode-icu.html](https://begriffs.com/posts/2019-05-23-unicode-icu.html)

------
wcarss
Something complementary -- because this article just takes a moment to talk
about the different encoding schemes -- a wonderful, terse, very informative
video describing how UTF-8 encoding works and why (with a little history) by
Tom Scott/Computerphile:
[https://www.youtube.com/watch?v=MijmeoH9LT4](https://www.youtube.com/watch?v=MijmeoH9LT4)

------
jermier
I always loved the whimsy present in Unicode. For nostalgia, here's an HN post
from 2010 pointing to the `Unicode Snowman for You` site (which is still up!)

[https://news.ycombinator.com/item?id=2035572](https://news.ycombinator.com/item?id=2035572)

And the site:

[http://xn--n3h.net/](http://xn--n3h.net/)

~~~
mysterypie
According to the HTML source, the original site was:

[http://unicodesnowmanforyou.com/](http://unicodesnowmanforyou.com/)

I wish I understood what the keepers of Unicode were thinking by including so
much bloat in a _character set_ (or character encoding). I realize that
Unicode is going to have a huge number of symbols no matter what, if they're
going to represent all the world's languages and math and punctuation, but I'd
draw the line at emoticons, emojis, playing card symbols, and snowmen.

~~~
Dylan16807
Well, right now it's about two percent of Unicode, right?

And people use them as text, so there's a reason to add them and not much
reason to refuse them.

~~~
mysterypie
You might be right, but where are you getting the 2% from? Are you thinking of
_just_ emoticons, emojis, playing card symbols, and snowmen? There's more than
that I'd question.

~~~
Dylan16807
I looked up how many emoji there were, added some for wingdings, and rounded
up a bit.

What else would you question? Would it be more than 1500 more, which would
bump it from 2 to 3 percent?

------
rurban
It misses the security considerations for names, e.g. for filenames or variable
names. Almost nobody knows about or implements those.

~~~
pacaro
If you're interested in this...
[https://en.wikipedia.org/wiki/Homoglyph](https://en.wikipedia.org/wiki/Homoglyph)
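
A tiny Python sketch of why it matters: the two strings below look identical
but compare unequal, because one hides a Cyrillic letter.

    import unicodedata

    for ch in ("a", "\u0430"):    # Latin 'a' vs Cyrillic 'а', visually identical
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    # U+0061 LATIN SMALL LETTER A
    # U+0430 CYRILLIC SMALL LETTER A
    print("paypal" == "p\u0430ypal")  # False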

~~~
jermier
Also recently spotted as an avenue for attack in the wild:

Magecart group uses homoglyph attacks to fool you into visiting malicious
websites: [https://www.zdnet.com/article/magecart-group-uses-homoglyph-...](https://www.zdnet.com/article/magecart-group-uses-homoglyph-attacks-to-fool-you-into-visiting-malicious-websites/)

Homoglyph attacks used in phishing campaign and Magecart attacks
[https://securityaffairs.co/wordpress/106916/hacking/homoglyp...](https://securityaffairs.co/wordpress/106916/hacking/homoglyph-attacks-phishing-campaign.html)

[https://cisomag.eccouncil.org/homoglyph-attacks/](https://cisomag.eccouncil.org/homoglyph-attacks/)

------
UncleEntity
Unicode is weird...this prints out backwards (including the comma and space)
in the python3 repl:

    
    
    >>> [chr(0x07c0+i) for i in range(10)]
    ['߀', '߁', '߂', '߃', '߄', '߅', '߆', '߇', '߈', '߉']

0..9 in the N'Ko script BTW...

~~~
hombre_fatal
I don't get what you mean by backwards.

    
    
    py3> [chr(0x07c0+i) for i in range(10)]
    ['߀', '߁', '߂', '߃', '߄', '߅', '߆', '߇', '߈', '߉']

    js> [...Array(10)].map((_,i)=>String.fromCodePoint(0x07c0+i))
    ['߀', '߁', '߂', '߃', '߄', '߅', '߆', '߇', '߈', '߉']

~~~
diath
I believe he's trying to print the 0..9 range by providing the proper start
and end point for those characters but instead gets 9..0 (I don't know the
script but I'm basing it off by the 0 at the end). So for instance 0x07c0
stands for 0 in Nko script, and this is his starting point, but the entire
sequence ends up being reversed. I'm not sure how comparing it to JS helps
here other than I guess pointing out that it's also behaving unexpectedly.

~~~
hombre_fatal
Wait, I just realized the results in my repl (0..9) are reversed from what I
pasted into HN (9..0). And if you shrink the width of the browser to force my
HN snippets to wrap, it changes the order. And it selects in the reverse order
on click and drag.

I spoke way too soon. Unicode _is_ weird. My apologies to our friend
UncleEntity.

~~~
diath
It looks like it depends on how your terminal (or the browser, or anything
that renders it) handles Unicode (which I guess just means that Unicode is
hard to get right):
[https://i.imgur.com/8FPNYMP.png](https://i.imgur.com/8FPNYMP.png)

~~~
necovek
It's how the directionality (right-to-left or left-to-right) gets decided that
is complicated for mixed text (and it is ultimately nothing but a heuristic).

I must admit that I was surprised that the following snippet kept the LTR
order in my terminal:

    >>> [(chr(ord('0')+i), chr(0x07c0+i)) for i in range(10)]
    [('0', '߀'), ('1', '߁'), ('2', '߂'), ('3', '߃'), ('4', '߄'), ('5', '߅'), ('6', '߆'), ('7', '߇'), ('8', '߈'), ('9', '߉')]
    >>> [(chr(0x07c0+i), chr(ord('0')+i)) for i in range(10)]
    [('߀', '0'), ('߁', '1'), ('߂', '2'), ('߃', '3'), ('߄', '4'), ('߅', '5'), ('߆', '6'), ('߇', '7'), ('߈', '8'), ('߉', '9')]
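
(For reference, the property the rendering side consults here is each
character's bidirectional class, which Python exposes; what a given terminal
then does with it is another matter:)

    import unicodedata

    # Print the bidi class that display layers use to decide ordering.
    for ch in ("0", "\u07c0"):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {unicodedata.bidirectional(ch)}")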

------
wingi
The variation selector link is dead, but is archived.

[https://web.archive.org/web/20160417233039/http://babelstone...](https://web.archive.org/web/20160417233039/http://babelstone.blogspot.co.uk/2007/06/secret-life-of-variation-selectors.html)

~~~
obelos
I've worked with Unicode for years and thought I had a good handle on its
mechanics until I discovered this feature of the system last year. I was
puzzling out why some symbol code points sometimes render in flat character
style and other times as more graphic emoji, even when the same font and same
code point is used in each case. Turned out it was a matter of applying VS15
or VS16 as a combining character, and which was the default for a given code
point. Incredibly detailed stuff that this archived BabelStone article goes
into in much greater depth than the bit I wrote about my exploration:
[https://khephera.net/posts/a-unicode-woe-solved/](https://khephera.net/posts/a-unicode-woe-solved/)
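
For a quick demo (how each line renders depends entirely on the font and
renderer): appending VS15 (U+FE0E) requests the flat text style and VS16
(U+FE0F) the emoji style for a code point such as U+2603 SNOWMAN.

    snowman = "\u2603"         # SNOWMAN
    print(snowman + "\uFE0E")  # + VS15: text presentation requested
    print(snowman + "\uFE0F")  # + VS16: emoji presentation requested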

------
jitteriest
> it gives a (double-story) and a (single-story) the same codepoint.

But they did see fit to have ɑ (LATIN SMALL LETTER ALPHA), which is distinct
from α (GREEK SMALL LETTER ALPHA).

~~~
bmn__
[https://www.unicode.org/faq/basic_q.html#5](https://www.unicode.org/faq/basic_q.html#5)

------
nabla9
This is the first short intro to Unicode I have seen where the reader does not
leave thinking that one user-perceived character must be just one code point.
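
A quick illustration in Python:

    import unicodedata

    s = "e\u0301"                                 # 'e' + COMBINING ACUTE ACCENT
    print(s, len(s))                              # renders as é, yet len() says 2 code points
    print(len(unicodedata.normalize("NFC", s)))   # 1 after canonical composition
    print(len("\U0001F1F3\U0001F1F4"))            # a flag emoji stays 2 code points; no composed form exists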

~~~
a1369209993
Although they do mistakenly refer to ffi (U+FB03) as a character. Still better
than most intros though.

~~~
bmn__
It is a character (using Unicode's nomenclature).

~~~
a1369209993
It's not up to Unicode to decide; "ffi" is three distinct characters, not one.

~~~
a1369209993
And for that matter, even [unicode 88] admits that it isn't a character.

unicode 88:
[http://www.unicode.org/history/Unicode88.pdf](http://www.unicode.org/history/Unicode88.pdf)
search for "A ligature is a glyph"

------
ngcc_hk
I was just reading an article about the Japanese surname Saito and how
problematic the idea of a "uni"code (and in particular Han unification, an
idea that perhaps should have been dropped) is in real-life situations. Yes,
you may have a code point, but it is only part of the problem, especially
where human names are concerned.

------
jariel
This is a good introduction; unfortunately, Unicode may ultimately be a
problem in and of itself.

To start, consider that the term 'character' used in the article, though
'generally correct' ... is definitely not correct in the broadest sense.

Western, Cyrillic and Asian scripts boil down to 'characters' with some
complexity, maybe involving ligatures ('Straße'), but it falls apart quickly
for other languages.

Unfortunately, rather than creating rigorously applied definitions for things,
and applying them consistently, even Unicode falls into this bureaucratic trap
of vagaries with their own definitions.

So Unicode works well for most things, but then it falls off a cliff.

Here is the definitions section [1]

Have a look at even just the definitions of 'Character', 'Grapheme' and
'Grapheme Cluster' - and you start to see how confusion sets in very quickly.

Consider that in Unicode ... there isn't really such a thing as a 'character'
- it's just an unspecific word we use that has no technical application!
(When we say 'character', generally what we mean is 'Grapheme Cluster'.)

Language is itself a rabbit hole of complexity, so any standard trying to
manage it will be painful - but it feels as though the true corner cases of
Unicode are actually unbounded.

In short, there are too many pragmatic loose ends. In any scenario where you
think you have an algorithm sorted out, there are probably holes in it if you
cared to go looking for them for a specific language.

It's not 'bad', but it's not the uber solution, it's frayed at the edges.

[1] [https://unicode.org/glossary/](https://unicode.org/glossary/)

~~~
matvore

> Consider that in Unicode ... there isn't really such a thing as a 'character'

This is a really important consideration, since it helps you realize the
immense difficulty of wrapping your own logic for character-aware handling -
unless you are deliberately limiting your scope, like only handling NFC-
normalized text of a limited number of languages.
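
A small sketch of that difficulty, using the third-party regex module (assumed
available), whose \X pattern matches one extended grapheme cluster; the stdlib
re module has no equivalent:

    import regex  # third-party; pip install regex

    s = "g\u0308"                   # 'g' + COMBINING DIAERESIS
    print(len(s))                   # 2 code points
    print(regex.findall(r"\X", s))  # ['g̈'] -- one grapheme cluster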

