A Programmer’s Introduction to Unicode

kps · on March 4, 2017

Anyone familiar with Unicode development know why combining characters follow the base character? Prefix combining characters would have had two nice properties: [1] Keyboard dead keys could simply generate the combining character, rather than maintain internal state, and [2] it would be possible to find the end of a combining sequence without lookahead.

Manishearth · on March 4, 2017

A bunch of characters are prepended combining marks (so lookahead doesn't quite work)

    0600..0605    ; Prepend # Cf   [6] ARABIC NUMBER SIGN..ARABIC NUMBER MARK ABOVE
    06DD          ; Prepend # Cf       ARABIC END OF AYAH
    070F          ; Prepend # Cf       SYRIAC ABBREVIATION MARK
    08E2          ; Prepend # Cf       ARABIC DISPUTED END OF AYAH
    0D4E          ; Prepend # Lo       MALAYALAM LETTER DOT REPH
    110BD         ; Prepend # Cf       KAITHI NUMBER SIGN
    111C2..111C3  ; Prepend # Lo   [2] SHARADA SIGN JIHVAMULIYA..SHARADA SIGN UPADHMANIYA

While Unicode-specced grapheme boundaries (UAX 29) do not deal with these, Indic scripts have an "infix" combining mark that takes two regular code points on either side to form a consonant cluster. (Also, Hangul is based on conjoining code points where none of the code points is really a "combiner", they're all equally "regular" characters)

I suspect the actual answer to your question is "historical accident". But, the dead key argument doesn't quite hold -- I can equally ask why keyboard dead keys aren't typed after the character.

kps · on March 4, 2017

  > I can equally ask why keyboard dead keys aren't typed after the character.

That does have an answer: keyboards predate electronics. A mechanical typewriter key press slams a type bar into the ribbon, and on the way back, catches a mechanism that releases the carriage (which is pulled along by a spring) for one column. A dead key is very simple; it just omits the catch. A postfix dead key would be more complicated, since it would have to add an early backspace action, and harder to press, since backspacing pulls against the carriage spring.

Manishearth · on March 7, 2017

This is pretty cool and interesting. Thanks!

derefr · on March 4, 2017

Unicode is in large part about being resilient in the face of partial corruption or https://en.wikipedia.org/wiki/Bit_slip.

So the answer probably lies in what the Unicode codepoint lexing algorithm must do to disregard "dependent" combining characters, if it hits a corrupted base character.

scrollaway · on March 4, 2017

Just recently had to deal with Unicode line-breaking (aka word wrapping) in JS (Canvas).

I love that Unicode is very nicely and strictly defining it, but god damn is it a complicated beast.

http://www.unicode.org/reports/tr14/

raphlinus · on March 4, 2017

I think "nicely and strictly" is an overstatement here. You still have to deal with southeast Asian scripts (which require a dictionary to find line break opportunities), and then there's the whole complex regex for numeric expressions (Example 7 in the Examples section), which ICU implements. I didn't bother with that in xi-unicode (it's not clear it improves matters much), but I do want to get Thai breaking nicely.

On top of that, the Unicode rules do a very poor job with things like URLs and email addresses. The Android text stack has its own layer (WordBreaker in the minikin lib) on top of that which recognizes those and implements its own rules.

But TR14 is a good start, for sure.

jcranmer · on March 4, 2017

A link to the Unicode Roadmap would be helpful to also understand some of how Unicode is allocated.

The smallest allocatable block of Unicode is a 16 code point block of characters. In the BMP, there are just 8 such blocks remaining--and 1 of them is scheduled for use in Unicode 10 and 3 of them are tentatively reserved for a current proposal. Note however that there are ~10k unassigned code points within the BMP.

After the BMP, the next most-full block is the SIP (plane 2), which is basically entirely reserved for "random rare characters in Chinese/Japanese" (although only about 85% or so of it is considered assigned as of Unicode 10). Plane 3, the TIP, is more or less reserved for SIP overflow and historical Chinese scripts, although only about 25% of it has tentative reservations.

Around half of the SMP is already tentatively reserved for scripts, although I'm not sure if the Script Encoding Initiative's list of scripts to encode (http://linguistics.berkeley.edu/sei/scripts-not-encoded.html) are all on the Unicode roadmap pages. There's about 200 scripts left to encode, although some of them may be consolidated in Unicode's script terminology (for example, Unifon is proposed for Latin Extensions D).

I think the set of remaining historical scripts to encode is considered more or less complete, although several scripts definitely need a lot more research to actually encode (Mayan hieroglyphics is probably the hardest script left, since it requires rather complex layout constraints).

keithgabryelski · on March 4, 2017

Here is my take on this subject:

A Practical Guide to Character Sets and Encodings or: What’s all this about ASCII, Unicode and UTF-8?

https://medium.com/@keithgabryelski/a-practical-guide-to-cha...

nabla9 · on March 4, 2017

>Character Sets: a collection of characters associated with numeric values. These pairings are called “code points”.

This is very ambiguous definition and it can be very confusing. I sure there are many people who have read Joel Spolsky's Unicode intro and left confused.

Using ASCII as an example is confusing because ASCII character maps into several different Unicode concepts:

1. byte

2. code point

3. encoded character

4. grapheme

5. grapheme cluster

6. abstract character

7. user perceived character

Mapping from user perceived character to abstract characters is not total,injective, or surjective. Some abstract characters need more than one code point to express them. You can't split sequence of Unicode code points arbitrarily in code point boundaries, you must use grapheme clusters instead.

pwdisswordfish · on March 4, 2017

> 1. byte

Actually 'code unit', which may have any number of bits, depending on the encoding. Otherwise spot on.

keithgabryelski · on March 4, 2017

it's a practical guide, not comprehensive -- for most people this is a great start, especially if they are familiar with ASCII

you'll notice I didn't cover collation -- why? because explaining that would dilute process of understand UTF-x and UNICODE

nabla9 · on March 4, 2017

It's pedagogically wrong and extremely misleading.

When people start with introduction like this, they end up thinking they have learned more than they actually have.

I point this out because I was one of those mislead by several previous articles explaining Unicode encoding the same wrong way as you did. When I ran into trouble and asked help, everybody around me was misguided the same way. I didn't know I had to dig into manuals because everybody explained that this is how it is. Then I had to teach everyone else that we had all learned it wrong.

Many people deal only with ascii or other easy western alphabets and they can work years with Unicode before they hit into trouble.

Manishearth · on March 4, 2017

Please don't call code points "characters". This is wrong and/or confusing.

http://manishearth.github.io/blog/2017/01/14/stop-ascribing-...

derefr · on March 4, 2017

The problem is mostly caused with how programming languages present text strings. There's usually a String class, with methods that say they manipulate characters. Usually, though, they either manipulate bytes, or manipulate code-points, and it's often not clear which.

Which is usually a symptom of a deeper problem: using the same ADT to represent "known-valid decoded text" and "a slice of bytes that would maybe decode to text in some unspecified encoding", such that the methods manipulating that ADT are completely incoherent.

Honestly, I'm surprised we don't see more programming languages like Objective-C, that have very clear distinctions between their "Data" type and "String" type, where encoded text is an NSData (a buffer of bytes) while decoded, valid text is an NSString, and all the methods on NSStrings operate on the grapheme clusters that decoded text is composed of, rather than on the bytes or codepoints or "characters" that are only relevant to encoded text.

Manishearth · on March 5, 2017

Swift is especially good at this because almost all string ops are high level. Splitting is on EGCs (inherited from objc no doubt), and equality is normalized equality. You need to explicitly ask for other operations.

wwalexander · on March 4, 2017

U+0041 LATIN CAPITAL LETTER A is "A", not "a" (in the Unicode Codespace section towards the beginning).

anonymousiam · on March 4, 2017

Yes, and for those who don't have the ASCII collating memorized, the lower case A ("a") is U+0061.

faragon · on March 4, 2017

UTF-8 encoding could use up to 6 bytes per character, addressing up to 2^32 codes (UTF-8, 1993 version [1]). So even being currently limited to 4 bytes, it could be expanded to 6 anytime.

[1] https://en.m.wikipedia.org/wiki/UTF-8

lasthemy · on March 4, 2017

In 2003, RFC 3629 removed all 5 and 6 byte encodings, effectively limiting it to 4-bytes. Of course it could be expanded at any time, but that would be a significant change to established practice, and directly contradict the rationale in RFC 3629 (that because most people use 4 bytes in practice, allowing 5 and 6 constituted a security flaw).

Source: the same Wikipedia article you linked.

faragon · on March 4, 2017

Sure, that's why I pointed it could be expanded anytime, because the encoding already supports its expansion, by design :-)

jcranmer · on March 4, 2017

The limiting factor on Unicode is UTF-16. There's only enough surrogates for 16 astral planes, which is why Unicode has 17 16-bit planes.

faragon · on March 4, 2017

UTF-16 has reserved codes as well, so it could be expanded for covering 2^32 codes, too.

jcranmer · on March 4, 2017

The range U+D800-DFFF is reserved for UTF-16 surrogates, specifically in two pairs of low and high surrogates. That means every surrogate pair can encode 10 + 10 bits of information, which is where the 16 astral planes (4 bits of 16-bit planes) comes in. Otherwise, there are just 128 code points in unallocated blocks in the BMP.

There is no space for expansion without reassigning private use areas or changing the encoding mechanism of surrogates--which is currently completely specified (each surrogate pair will produce a valid code point).

chickenbane · on March 5, 2017

Reminds me a lot of Jesse Wilson's excellent talk, "Decoding the secrets of binary data". It's a fun video to watch on a lazy Saturday =D

https://www.youtube.com/watch?v=T_p22jMZSrk

alblue · on March 4, 2017

You might also like my presentation on the history of Unicode, explaining where it (and other codes like it) came from.

https://www.infoq.com/presentations/unicode-history

srean · on March 4, 2017

Is there any reasonably popular encoding that is "hole free / complete" in the sense any sequence of bytes has a valid decoding ? I use B85 or B64 when needed but was wondering if there is a Unicode encoding that will do the job.

masklinn · on March 4, 2017

> Is there any reasonably popular encoding that is "hole free / complete" in the sense any sequence of bytes has a valid decoding ?

That's a property of the ISO-8859 encodings, it turns out to be pretty terrible as you have no way of distinguishing utter garbage/nonsense from actual text.

> I use B85 or B64 when needed but was wondering if there is a Unicode encoding that will do the job.

Base64 and 85 are pretty much the opposite of Unicode encodings. Encodings turn text (abstract unicode codepoints) into actual bytes you can store, Base64 and Base85 turn bytes into ASCII or ASCII-compatible text.

srean · on March 4, 2017

> Base64 and 85 are pretty much the opposite

I know and that's the purpose I use it for: to smuggle in binary data as text, but no reason why an encoding scheme should not be dual purpose. Thanks for the tip about ISO-8859

edblarney · on March 4, 2017

In b64, you're going to want to pick a short set of characters that can be encoded clearly and simply, i.e. ascii chars. And the shorter the better, i.e. even in 6 bits.

You take the first six bits of your binary data and convert to some ascii char mapping. And then the the next 6 bits and so on.

You can't do that with unicode, and it wouldn't make sense for any other encoding standard.

They are completely different things.

Moru · on March 5, 2017

Mabe srean wants a more efficient way of base64 a binary file into some text-only media like email. This depends totally on where you want to put it. If it's an email you are pretty stuck with base64, if it's a string that nothing else touches, you can use the binary data directly (eg: iso-8859) :)

srean · on March 6, 2017

That's indeed right. B85 is already a little more efficient than B64, but was wondering if one could abuse Unicode for this.

Its a really silly, stupid situation I need this for, exchanging data with Python and I can only use Unicode strings.

masklinn · on March 9, 2017

> That's indeed right. B85 is already a little more efficient than B64, but was wondering if one could abuse Unicode for this.

Sure, kinda: https://github.com/pfrazee/base-emoji (it's a base256 using emoji), but then you still need to encode that text, which is going to require 4 bytes per symbol, so I'm not sure you're going to get any actual gain over B64/B85 in the end. There's also the option of using a subset of the U+0100~U+07FF range (though it contains a diacticial block which may not be ideal) as it encodes to 2 bytes in both UTF-8 and UTF-16 (though there are diacritics in these blocks, and some of the codepoints are reserved but not allocated so…).

doubleunplussed · on March 5, 2017

Python hit this issue in the last few years after the transition to unicode-everything. Since the filesystem is supposed to have an encoding, but doesn't enforce it, Python needed to choose what to do with a filename that contained bytes that were invalid for the encoding. In order to make a round trip of decoding and encoding result in the same bytes, they decided to use certain surrogates to represent the invalid bytes:

https://www.python.org/dev/peps/pep-0383/

So with most systems being UTF8, this means that the encoding is "UTF8, except with bytes that would be invalid UTF8 mapping to this range of surrogates". So you've got an encoding with no holes, which is compatible with all valid UTF8 (since those surrogates are invalid UTF8).

It's not common, so it doesn't match that criterion of yours,but I think it's a great solution and hope it catches on. We need a solution for "text, except with graceful fallback when someone decided to put arbitrary bytes in there"

srean · on March 6, 2017

Aha ! thanks for the anecdote. My situation is indeed very similar, and yeah Python is involved, but that's not where I would put the blame.

doubleunplussed · on March 10, 2017

Oh, I'm not saying Python is to blame in any sense, rather, that the core devs had the same problem as you and hit on a neat solution.

srean · on March 13, 2017

Oh ! sorry never thought you were saying Python was the source of the problem.

Since Python happened to be a common feature common between our two scenarios, wanted to caution other HN'ers not to put blame on Python. In my case Python was indeed doing the right thing.

jcranmer · on March 4, 2017

Every 1-byte charset that uses high bits (i.e., not UTF-7, ASCII, or EBCDIC) has this property. Depending on how you feel about whether unpaired surrogates constitute a valid decoding, UTF-16 has that property as well.

It's not a very good property to have. UTF-8's invalid decodings means that it's very easy to detect if you're not using UTF-8: iconv's charset detector will rule certainty of UTF-8 if it sees just 3 valid UTF-8 multibyte sequences and no invalid sequences, and that's pretty much the only reliable detector.

Manishearth · on March 4, 2017

Javascript. It uses bad UTF-16, where unpaired surrogates are allowed. They're meaningless, but allowed.

majestic8 · on March 4, 2017

Can someone help me understand why prefixes used in UTF-8 jump from "0" to "110", "1110", "11110" and so on? Why is "10" missing?

Someone · on March 4, 2017

https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt:

"Below are the guidelines that were used in defining the UCS transformation format: [...] 6) It should be possible to find the start of a character efficiently starting from an arbitrary location in a byte stream."

If they used "10" as a marker for "this is the start of a two-byte sequence", it could not have been used for "this is a byte in a multi-byte sequence, but not the first one"

pkaye · on March 4, 2017

"10" is used as a prefix for the bytes after the first. This gives it the self-synchronization property if it somehow ends up in the middle of a sequence. See the first table in this Wikipedia link: https://en.wikipedia.org/wiki/UTF-8

jcranmer · on March 4, 2017

Additionally, 110xxxxx tells you that the character is two bytes, 1110xxxx three bytes, and 11110xxx four bytes, i.e., number of bytes in number = leading 1 count.

nemoniac · on March 4, 2017

It's called a prefix code. It's a fundamental idea in coding theory.

https://en.wikipedia.org/wiki/Prefix_code

https://en.wikipedia.org/wiki/Coding_theory

nicwolff · on March 4, 2017

The goal is that by reading any byte you can tell if you are at the start of a character sequence, so we have to start each byte with some prefix – otherwise continuation bytes might sometimes look like start bytes. If we did as you suggest, we'd have to prefix continuation bytes with "111110", leaving only two bits of data in each!

GnarfGnarf · on March 4, 2017

Excellent article! Best explanation of UTF-16 I've seen.

andrewl · on March 4, 2017

There's also Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!):

https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...

rspeer · on March 4, 2017

Can an article be honorably retired? Joel's article was important 14 years ago, but that was a time when it was not clear that UTF-8 was going to become the default encoding of the Internet, and a time before emoji. In 2003, codepoints outside the Basic Multilingual Plane could be safely ignored unless you were a historical linguist; now, people use them every day.

This article is much better at telling you relevant things (with nice data-driven visualizations!) about how Unicode is used now.

ryandrake · on March 4, 2017

The content might be a little long in the tooth, but the exasperation about how programmers don't seem to give a shit about character encoding is, sadly, still relevant today. Unicode has been around since, what, the 90's? And UTF-8 has pretty much won the encoding war since 2008 or so. Yet, I still encounter software that fails to handle anything but ASCII, and it's 2017! Pick a random open source library, or look at a random company's software, and you're likely to find software that doesn't handle character encodings well. I think web software tends to be much better than desktop/mobile/embedded for some reason. Maybe an artifact of web guys tending to work with more recent/modern libraries and frameworks.

Who's seen this before: You're getting up to speed on your new company's legacy code base. Q: "Hey, senior developer, This API takes a string as an argument and we need to display it, fine. What encoding are we using?" A: ¯\_(ツ)_/¯

stevekemp · on March 4, 2017

I don't disagree, but when you're an old-school developer it's hard to get this stuff right!

I wrote a console-based mail client, and I struggled with UTF for months before it was all done, and even then there were niggles caused by libraries - for example passing a C++ string to a Lua script, and then using the `string:len()` lua function would return the number of bytes, rather than the length of the rendered string. Something I had to work around:

https://github.com/lumail/lumail2/issues/277

chrisseaton · on March 4, 2017

> rather than the length of the rendered string

Do unicode strings really have a rendered length?

https://eev.ee/blog/2015/09/12/dark-corners-of-unicode/#comb...

> So the “length” of ... characters depends ... on the display font

rspeer · on March 4, 2017

Well, when you're writing a console program, you can assume your font is monospaced and you can assume you have a function like wcwidth() that will tell you how many monospaced cells a string will take up.

I know that eev.ee points out that wcwidth() is inconsistent (emoji screwed it up, then Unicode 9 gave a consistent width to emoji, but most libraries aren't on Unicode 9 so the situation is currently more inconsistent than before). But the situation is a lot better than just saying "I don't know, blame fonts".

ryandrake · on March 5, 2017

And don't forget, it's not sum(glyph sizes) either. You ought to be taking into account kerning, too.

vorg · on March 5, 2017

> Can an article be honorably retired?

Putting a link to that Spolski article in the comments on HN, reddit, etc whenever some other article on Unicode is posted has become an in-joke. You'd find it impossible to "retire" it, even honorably. Though andrewl didn't follow the apparent protocol of referring to that article with some adjective like "excellent" or "brilliant".

nabla9 · on March 4, 2017

Joel's article confuses characters and code points and can be very misleading.

>In Unicode, a letter maps to something called a code point which is still just a theoretical concept.

This is incorrect. Letter or character (user perceived character) can be multiple code points. Code points don't have intrinsic meaning across languages. If you want to count number of letters in a text or edit text, you must work at grapheme cluster level. Absolute minimum about Unicode must at least mention grapheme clusters and user perceived characters.

Manishearth · on March 4, 2017

I wrote about this a bit in http://manishearth.github.io/blog/2017/01/14/stop-ascribing-...

blipmusic · on March 4, 2017

Yes! Vietnamese and International Phonetic Alphabet says hello (among others).

thomk · on March 4, 2017

Haha, I was going to post the same thing! If you like Joel Spolsky you might like this video where he angrily abuses everyone about how bad they are at Excel and shows you some very cool tricks:

https://www.youtube.com/watch?v=0nbkaYsR94c&t=1206s

slededit · on March 5, 2017

For most of that I was getting so annoyed "JUST USE A TABLE!!!", but he got there in the end.

For those that don't know Joel Spolsky used to be a PM on Excel back in the early 90s.

drewmate · on March 4, 2017

I considered myself reasonably good at Excel prior to watching this. Thanks for posting this; I learned a lot!