
A Programmer’s Introduction to Unicode - ingve
http://reedbeta.com/blog/programmers-intro-to-unicode/
======
kps
Does anyone familiar with Unicode development know why combining characters
_follow_ the base character? Prefix combining characters would have had two
nice properties: [1] Keyboard dead keys could simply generate the combining
character, rather than maintain internal state, and [2] it would be possible
to find the end of a combining sequence without lookahead.

~~~
Manishearth
A bunch of characters are prepended combining marks (so lookahead doesn't
quite work):

    0600..0605    ; Prepend # Cf   [6] ARABIC NUMBER SIGN..ARABIC NUMBER MARK ABOVE
    06DD          ; Prepend # Cf       ARABIC END OF AYAH
    070F          ; Prepend # Cf       SYRIAC ABBREVIATION MARK
    08E2          ; Prepend # Cf       ARABIC DISPUTED END OF AYAH
    0D4E          ; Prepend # Lo       MALAYALAM LETTER DOT REPH
    110BD         ; Prepend # Cf       KAITHI NUMBER SIGN
    111C2..111C3  ; Prepend # Lo   [2] SHARADA SIGN JIHVAMULIYA..SHARADA SIGN UPADHMANIYA

While Unicode-specced grapheme boundaries (UAX #29) do not deal with these,
Indic scripts have an "infix" combining mark that takes a regular code point
on either side to form a consonant cluster. (Also, Hangul is based on
conjoining code points where none of the code points is really a "combiner";
they're all equally "regular" characters.)
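
For what it's worth, UAX #29's handling of prepended marks has changed over
time (the Prepend class was emptied in Unicode 6.1 and repopulated in 9.0).
With current data, the third-party `regex` module's \X, which matches an
extended grapheme cluster, does tie a prepended mark to the character after
it -- a small sketch:

    import regex  # third-party; pip install regex

    # ARABIC NUMBER SIGN (a Prepend character) followed by two Arabic-Indic digits
    text = "\u0600\u0661\u0662"
    print(regex.findall(r"\X", text))
    # ['\u0600\u0661', '\u0662'] -- the sign clusters with the digit *after* it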

I suspect the actual answer to your question is "historical accident". But,
the dead key argument doesn't quite hold -- I can equally ask why keyboard
dead keys aren't typed after the character.

~~~
kps

      > I can equally ask why keyboard dead keys aren't typed after the character.

That does have an answer: keyboards predate electronics. A mechanical
typewriter key press slams a type bar into the ribbon, and on the way back,
catches a mechanism that releases the carriage (which is pulled along by a
spring) for one column. A dead key is very simple; it just omits the catch. A
postfix dead key would be more complicated, since it would have to add an
early backspace action, and harder to press, since backspacing pulls against
the carriage spring.

~~~
Manishearth
This is pretty cool and interesting. Thanks!

------
scrollaway
Just recently had to deal with Unicode line-breaking (aka word wrapping) in JS
(Canvas).

I love that Unicode defines it nicely and strictly, but god damn is it a
complicated beast.

[http://www.unicode.org/reports/tr14/](http://www.unicode.org/reports/tr14/)

~~~
raphlinus
I think "nicely and strictly" is an overstatement here. You still have to deal
with southeast Asian scripts (which require a dictionary to find line break
opportunities), and then there's the whole complex regex for numeric
expressions (Example 7 in the Examples section), which ICU implements. I
didn't bother with that in xi-unicode (it's not clear it improves matters
much), but I do want to get Thai breaking nicely.

On top of that, the Unicode rules do a very poor job with things like URLs and
email addresses. The Android text stack has its own layer (WordBreaker in the
minikin lib) on top of that which recognizes those and implements its own
rules.

But TR14 is a good start, for sure.

------
jcranmer
A link to the Unicode Roadmap would also be helpful for understanding some of
how Unicode is allocated.

The smallest allocatable block of Unicode is a run of 16 code points. In the
BMP, there are just 8 such blocks remaining--1 of them is scheduled for use in
Unicode 10, and 3 of them are tentatively reserved for a current proposal.
Note, however, that there are ~10k unassigned code points within the BMP.
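
(You can eyeball that last figure with Python's unicodedata; the exact count
depends on the Unicode version your interpreter ships with:)

    import unicodedata

    # Count BMP code points whose general category is Cn ("unassigned").
    unassigned = sum(1 for cp in range(0x10000)
                     if unicodedata.category(chr(cp)) == "Cn")
    print(unassigned)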

After the BMP, the next most-full plane is the SIP (plane 2), which is
basically entirely reserved for "random rare characters in Chinese/Japanese"
(although only about 85% or so of it is considered assigned as of Unicode 10).
Plane 3, the TIP, is more or less reserved for SIP overflow and historical
Chinese scripts, although only about 25% of it has tentative reservations.

Around half of the SMP is already tentatively reserved for scripts, although
I'm not sure if the Script Encoding Initiative's list of scripts to encode
([http://linguistics.berkeley.edu/sei/scripts-not-
encoded.html](http://linguistics.berkeley.edu/sei/scripts-not-encoded.html))
are all on the Unicode roadmap pages. There are about 200 scripts left to
encode, although some of them may be consolidated in Unicode's script
terminology (for example, Unifon is proposed for Latin Extended-D).

I think the set of remaining historical scripts to encode is considered more
or less complete, although several scripts definitely need a lot more research
to actually encode (Mayan hieroglyphics is probably the hardest script left,
since it requires rather complex layout constraints).

------
keithgabryelski
Here is my take on this subject:

A Practical Guide to Character Sets and Encodings or: What’s all this about
ASCII, Unicode and UTF-8?

[https://medium.com/@keithgabryelski/a-practical-guide-to-
cha...](https://medium.com/@keithgabryelski/a-practical-guide-to-character-
sets-and-encodings-b5362447456f#.61il3o5k2)

~~~
nabla9
>Character Sets: a collection of characters associated with numeric values.
These pairings are called “code points”.

This is a very ambiguous definition, and it can be very confusing. I'm sure
there are many people who have read Joel Spolsky's Unicode intro and come away
confused.

Using ASCII as an example is confusing because an ASCII character maps into
several different Unicode concepts:

1. byte

2. code point

3. encoded character

4. grapheme

5. grapheme cluster

6. abstract character

7. user-perceived character

The mapping from user-perceived characters to abstract characters is not
total, injective, or surjective. Some abstract characters need more than one
code point to express them. You can't split a sequence of Unicode code points
arbitrarily at code point boundaries; you must split at grapheme cluster
boundaries instead.
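
A small Python sketch of that last point, using the third-party `regex` module
(its \X pattern matches an extended grapheme cluster per UAX #29):

    import regex  # third-party; pip install regex

    s = "e\u0301"                     # "é" as e + COMBINING ACUTE ACCENT
    print(len(s))                     # 2 code points...
    print(regex.findall(r"\X", s))    # ...but one grapheme cluster: ['é']
    print(s[:1])                      # slicing between code points splits it: 'e'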

~~~
keithgabryelski
it's a practical guide, not comprehensive -- for most people this is a great
start, especially if they are familiar with ASCII

you'll notice I didn't cover collation -- why? because explaining that would
dilute the process of understanding UTF-x and Unicode

~~~
nabla9
It's pedagogically wrong and extremely misleading.

When people start with an introduction like this, they end up thinking they have
learned more than they actually have.

I point this out because I was one of those misled by several previous
articles explaining Unicode encoding the same wrong way as you did. When I ran
into trouble and asked for help, everybody around me was misguided in the same
way. I didn't know I had to dig into the manuals, because everybody explained
that this is how it is. Then I had to teach everyone else that we had all
learned it wrong.

Many people deal only with ASCII or other easy Western alphabets, and they can
work with Unicode for years before they run into trouble.

------
stirner
U+0041 LATIN CAPITAL LETTER A is "A", not "a" (in the Unicode Codespace
section towards the beginning).

~~~
anonymousiam
Yes, and for those who don't have the ASCII collating sequence memorized, the
lower-case A ("a") is U+0061.

------
faragon
UTF-8 encoding could use up to 6 bytes per character, addressing up to 2^31
codes (UTF-8, 1993 version [1]). So even though it is currently limited to 4
bytes, it could be expanded to 6 at any time.

[1]
[https://en.m.wikipedia.org/wiki/UTF-8](https://en.m.wikipedia.org/wiki/UTF-8)

~~~
lasthemy
In 2003, RFC 3629 removed all 5- and 6-byte encodings, effectively limiting it
to 4 bytes. Of course it could be expanded at any time, but that would be a
significant change to established practice, and it would directly contradict
the rationale in RFC 3629 (that because most people use 4 bytes in practice,
allowing 5 and 6 constituted a security flaw).

Source: the same Wikipedia article you linked.
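
You can see the post-RFC-3629 limit in any modern decoder; Python, for
example, rejects the old 5-byte form outright:

    # 0xF8 led a 5-byte sequence in the 1993 scheme; this one would have
    # decoded to U+200000, the first code point needing 5 bytes.
    >>> b"\xF8\x88\x80\x80\x80".decode("utf-8")
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte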

~~~
faragon
Sure, that's why I pointed out that it could be expanded at any time: the
encoding already supports its expansion, by design :-)

~~~
jcranmer
The limiting factor on Unicode is UTF-16. There are only enough surrogates for
16 astral planes, which is why Unicode has 17 16-bit planes.

~~~
faragon
UTF-16 has reserved codes as well, so it could be expanded to cover 2^32
codes, too.

~~~
jcranmer
The range U+D800-DFFF is reserved for UTF-16 surrogates, split into high
surrogates (U+D800-DBFF) and low surrogates (U+DC00-DFFF). That means every
surrogate pair can encode 10 + 10 bits of information, which is where the 16
astral planes come in (4 bits select a plane, 16 bits address within it).
Otherwise, there are just 128 code points in unallocated blocks in the BMP.

There is no space for expansion without reassigning private use areas or
changing the encoding mechanism of surrogates--which is currently completely
specified (every valid surrogate pair produces a valid code point).
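
A sketch of the arithmetic (the helper is made up, but the constants come
straight from the UTF-16 spec):

    def decode_surrogate_pair(hi: int, lo: int) -> int:
        """Combine a UTF-16 high/low surrogate pair into a code point."""
        assert 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF
        # 10 payload bits from each half, offset past the BMP.
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

    # U+1F600 GRINNING FACE is encoded as the pair D83D DE00:
    assert decode_surrogate_pair(0xD83D, 0xDE00) == 0x1F600
    # The highest reachable value is exactly the top of plane 16:
    assert decode_surrogate_pair(0xDBFF, 0xDFFF) == 0x10FFFF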

------
chickenbane
Reminds me a lot of Jesse Wilson's excellent talk, "Decoding the secrets of
binary data". It's a fun video to watch on a lazy Saturday =D

[https://www.youtube.com/watch?v=T_p22jMZSrk](https://www.youtube.com/watch?v=T_p22jMZSrk)

------
alblue
You might also like my presentation on the history of Unicode, explaining
where it (and other codes like it) came from.

[https://www.infoq.com/presentations/unicode-
history](https://www.infoq.com/presentations/unicode-history)

------
srean
Is there any reasonably popular encoding that is "hole free / complete", in
the sense that any sequence of bytes has a valid decoding? I use B85 or B64
when needed, but was wondering if there is a Unicode encoding that will do the
job.

~~~
masklinn
> Is there any reasonably popular encoding that is "hole free / complete", in
> the sense that any sequence of bytes has a valid decoding?

That's a property of the ISO-8859 encodings. It turns out to be pretty
terrible, as you have no way of distinguishing utter garbage/nonsense from
actual text.

> I use B85 or B64 when needed, but was wondering if there is a Unicode
> encoding that will do the job.

Base64 and Base85 are pretty much the opposite of Unicode encodings. Encodings
turn text (abstract Unicode code points) into actual bytes you can store;
Base64 and Base85 turn bytes into ASCII or ASCII-compatible text.
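
In Python terms, a sketch (Python's "latin-1" codec maps all 256 byte values,
including the 0x80-0x9F control range):

    blob = bytes(range(256))
    text = blob.decode("latin-1")            # never raises: every byte is "valid"
    assert text.encode("latin-1") == blob    # and it round-trips

    # By contrast, UTF-8 has holes -- a lone continuation byte is invalid:
    b"\x80".decode("utf-8")                  # raises UnicodeDecodeError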

~~~
srean
> Base64 and 85 are pretty much the opposite

I know, and that's the purpose I use it for: to smuggle in binary data as
text. But there's no reason why an encoding scheme should not be dual-purpose.
Thanks for the tip about ISO-8859.

~~~
edblarney
In b64, you're going to want to pick a short set of characters that can be
encoded clearly and simply, i.e. ASCII chars--and the shorter the set the
better, i.e. one addressable in 6 bits.

You take the first six bits of your binary data and convert them to some ASCII
char mapping, then the next 6 bits, and so on.

You can't do that with Unicode, and it wouldn't make sense for any other
encoding standard.

They are completely different things.
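
A toy sketch of that 6-bits-at-a-time mapping in Python, checked against the
stdlib (real implementations work on 3-byte groups instead of building a giant
bit string):

    import base64

    ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

    def toy_b64encode(data: bytes) -> str:
        bits = "".join(f"{byte:08b}" for byte in data)   # all bits, in order
        bits += "0" * (-len(bits) % 6)                   # pad to a multiple of 6
        chars = (ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6))
        return "".join(chars) + "=" * (-len(data) % 3)   # standard '=' padding

    assert toy_b64encode(b"hello") == base64.b64encode(b"hello").decode("ascii")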

~~~
Moru
Maybe srean wants a more efficient alternative to base64 for getting a binary
file into some text-only medium like email. This depends totally on where you
want to put it. If it's an email you are pretty much stuck with base64; if
it's a string that nothing else touches, you can use the binary data directly
(e.g. ISO-8859) :)

~~~
srean
That's indeed right. B85 is already a little more efficient than B64, but I
was wondering if one could abuse Unicode for this.

It's a really silly, stupid situation I need this for: exchanging data with
Python, and I can only use Unicode strings.

~~~
masklinn
> That's indeed right. B85 is already a little more efficient than B64, but I
> was wondering if one could abuse Unicode for this.

Sure, kinda: [https://github.com/pfrazee/base-emoji](https://github.com/pfrazee/base-emoji)
(it's a base256 using emoji), but then you still need to encode that text,
which is going to require 4 bytes per symbol, so I'm not sure you're going to
get any actual gain over B64/B85 in the end. There's also the option of using
a subset of the U+0100~U+07FF range, as it encodes to 2 bytes in both UTF-8
and UTF-16 (though there are diacritics in these blocks, and some of the code
points are reserved but not allocated, so…).
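
A hypothetical sketch of that second idea (the helpers are made up): pack each
byte into the U+0100..U+01FF run, which is entirely assigned (Latin
Extended-A/B) and encodes to 2 bytes in both UTF-8 and UTF-16. You get 8 bits
per character versus base64's 6, though measured in UTF-8 bytes it's actually
worse than B64:

    def pack(data: bytes) -> str:
        # One code point in U+0100..U+01FF per input byte.
        return "".join(chr(0x0100 + b) for b in data)

    def unpack(text: str) -> bytes:
        return bytes(ord(ch) - 0x0100 for ch in text)

    blob = bytes(range(256))
    assert unpack(pack(blob)) == blob
    assert len(pack(blob)) == len(blob)                      # 1 char per byte
    assert len(pack(blob).encode("utf-8")) == 2 * len(blob)  # but 2 UTF-8 bytes each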

------
majestic8
Can someone help me understand why prefixes used in UTF-8 jump from "0" to
"110", "1110", "11110" and so on? Why is "10" missing?

~~~
pkaye
"10" is used as a prefix for the bytes after the first. This gives it the
self-synchronization property if it somehow ends up in the middle of a
sequence. See the first table in this Wikipedia link:
[https://en.wikipedia.org/wiki/UTF-8](https://en.wikipedia.org/wiki/UTF-8)

~~~
jcranmer
Additionally, 110xxxxx tells you that the character is two bytes, 1110xxxx
three bytes, and 11110xxx four bytes; i.e., the count of leading 1 bits equals
the number of bytes in the sequence.
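
A sketch of that rule as a byte classifier (a made-up helper; a real decoder
must also validate continuation bytes and reject overlong forms):

    def utf8_byte_class(byte: int) -> str:
        if byte >> 7 == 0b0:
            return "ASCII, 1-byte sequence"      # 0xxxxxxx
        if byte >> 6 == 0b10:
            return "continuation byte"           # 10xxxxxx
        if byte >> 5 == 0b110:
            return "lead byte, 2-byte sequence"  # 110xxxxx
        if byte >> 4 == 0b1110:
            return "lead byte, 3-byte sequence"  # 1110xxxx
        if byte >> 3 == 0b11110:
            return "lead byte, 4-byte sequence"  # 11110xxx
        return "invalid in UTF-8"                # 111110xx and up (see RFC 3629)

    # 1-, 2-, 3-, and 4-byte sequences, respectively:
    for b in "a\u00e9\u20ac\U0001F600".encode("utf-8"):
        print(f"{b:08b}  {utf8_byte_class(b)}")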

------
GnarfGnarf
Excellent article! Best explanation of UTF-16 I've seen.

------
andrewl
There's also Joel Spolsky's _The Absolute Minimum Every Software Developer
Absolutely, Positively Must Know About Unicode and Character Sets (No
Excuses!)_ :

[https://www.joelonsoftware.com/2003/10/08/the-absolute-
minim...](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-
every-software-developer-absolutely-positively-must-know-about-unicode-and-
character-sets-no-excuses/)

~~~
thomk
Haha, I was going to post the same thing! If you like Joel Spolsky you might
like this video where he angrily abuses everyone about how bad they are at
Excel and shows you some very cool tricks:

[https://www.youtube.com/watch?v=0nbkaYsR94c&t=1206s](https://www.youtube.com/watch?v=0nbkaYsR94c&t=1206s)

~~~
slededit
For most of that I was getting so annoyed "JUST USE A TABLE!!!", but he got
there in the end.

For those that don't know Joel Spolsky used to be a PM on Excel back in the
early 90s.

