
The Absolute Minimum Every Software Developer Must Know About Unicode - llambda
http://www.joelonsoftware.com/articles/Unicode.html
======
pjscott
A few things I wish I could tell all newbies about staying out of trouble with
Unicode:

* Use UTF-8 for external text, whenever possible. If your collaborators have other ideas, bribe them with tasty cookies or something, because this right here solves a lot of hassles. There are some circumstances in which a different encoding might have its advantages, but it is tremendously reassuring to be able to say "Ah, text! I shall decode it as UTF-8!" and be right. This has the advantage of being compatible with ASCII input, and avoiding the perennial UTF-16/UCS-2 confusion.

* Make it explicit that you're using UTF-8. For example, if you're making a web page, be sure to set "Content-Type: text/html; charset=utf-8" in the HTTP headers, to make the browser's content encoding detection trivially correct.

* When dealing with strings in your favorite programming language, always know whether it's an array of Unicode code-points, or of bytes in UTF-8, or some third messed-up thing. Not all strings are the same kind of thing! Unless your programming language has this distinction enforced by its type system, of course.

* Be aware when you're crossing the boundary between Unicode code points and Unicode in some external encoding. Decoding can fail, so be prepared; it's best to reject invalid text as early as possible. (There's a sketch of this after the list.)

* When in doubt, use other people's code for Unicode handling. For most of the crazy crap you run into in the wild, there is well-tested crazy-crap-handling code.
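
A minimal Python 2 sketch of the decode-at-the-boundary point above (the
function name is just illustrative):

    def read_text(raw_bytes):
        # Decode once, at the edge of the program. Decoding is strict
        # by default, so invalid UTF-8 fails here rather than later.
        try:
            return raw_bytes.decode("utf-8")
        except UnicodeDecodeError:
            raise ValueError("rejecting invalid UTF-8 input early")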

~~~
Afton
> * When dealing with strings in your favorite programming language, always
> know whether it's an array of Unicode code-points, or of bytes in UTF-8, or
> some third messed-up thing.

What is an 'array of Unicode code-points'. I actually reread the article
thinking that my unicode chops were getting rusty if I didn't know what that
meant, but after rereading it, I still don't know what you mean.

[I wrote a half a dozen 'do you mean _____' suggestions, but I think I'll just
let you explain. :) ]

~~~
pjscott
Sorry for the unclear wording; I meant "an array of fixed-width values, each
of which stores a single Unicode code-point". For example, any program that
stores text in UCS-4: an array of 32-bit values, each holding a single code
point.

This is how way too many people think that UTF-16 works: each code point gets
16 bits, and you have an array of them, so you can count characters, do O(1)
random indexing, and so on. This is a harmful myth, of course: code points do
not correspond neatly to glyphs, and UTF-16 is a variable-width encoding. Most
text does stay inside the basic multilingual plane, though, so a lot of people
can get away with pretending it's fixed-width, until they can't.

The most maddening instance of this confusion that I've seen so far is in
Python's Unicode string handling. Guess what happens when you run this Python
code to find the length of a string containing a single Unicode code-point:

    print len(u"\U0001d11e")

This will print either 1 or 2, depending on what flags the Python interpreter
was compiled with! If it was compiled one way (the default on Mac OS X), it
uses an internal string representation that it sometimes treats as UTF-16 and
sometimes as UCS-2. With another set of flags (the default on Ubuntu, IIRC) it
will use UCS-4 and do the Right Thing. Java is at least consistent: its
strings are always arrays of UTF-16 code units, so length() returns 2 here and
codePointCount() returns 1, but you have to write the literal explicitly as a
UTF-16 surrogate pair: "\uD834\uDD1E".

The redeeming virtue that both share is that they will do the right thing if
you treat everything as variable-width encoded, use the provided methods for
encoding and decoding, and avoid the hairy parts left over from when people
naively assumed that UTF-16 and UCS-2 were the same thing and that they ought
to be enough for anybody.
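
For instance, whichever way Python was compiled, it's the encoded form that
is stable:

    s = u"\U0001d11e"
    print len(s)                  # 1 or 2, depending on the build
    print len(s.encode("utf-8"))  # always 4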

~~~
Afton
Thanks, that's a great clarification.

------
jwr
Amusingly enough, Joel's company does not follow Joel's preaching.

I tried to convince Fog Creek to abandon the obsolete, non-standard and
proprietary 8-bit Windows-CP1252 character encoding in the E-mails that
FogBugz sends. They refused, reason given: "joelonsoftware.com is a blog. Fog
Creek Software is a business."

~~~
spolsky
Not sure what you're talking about; FogBugz sends UTF-8 email.

~~~
solutionyogi
I just looked at my FogBugz email from today morning, it is indeed using
CP1252 character set. See: <http://i.imgur.com/nV3V6.png>

I think this could be because you use messagelabs.com to send out the emails.

~~~
tedunangst
It sends CP1252 emails if all the characters used fit. If they don't, it uses
UTF8. That way, people with old mailers can still read them (more
realistically, people reading the raw email can read them out of the database
or from traffic dumps), because .NET has a tendency to base64 encode the body
the instant you set charset to UTF8, and I don't unbase64 in my head.

Try including the funny character of your choice in an email. It should show
up just fine.
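
The charset selection amounts to something like this (a sketch; the function
name is just illustrative):

    def pick_charset(body):
        # Prefer CP1252 when the text fits, so old mailers and raw
        # dumps stay readable; otherwise fall back to UTF-8.
        try:
            body.encode("cp1252")
            return "Windows-1252"
        except UnicodeEncodeError:
            return "UTF-8"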

~~~
solutionyogi
You are right. I created another test case with Chinese character and the
email is now UTF-8. See: <http://i.imgur.com/LeaX8.png>

I think this is a very good compromise.

------
toyg
I remember when this was first posted. RSS was new, Movable Type 2.x was hot
shit, the future of the web was XML and real hackers were parsing it with
regular expressions and duct tape. Everything was supposed to be Unicode,
except the webdev community was 99.999% North American and everybody knew
plaintext "is supposed to be ASCII" anyway.

Fast forward 8 years... I joined Stack Overflow about a week ago and my
first accepted answer was about Unicode and string handling in Python 2.x.
Just yesterday I was thinking I should reread this exact post, to refresh a
few points and keep shooting fish in that barrel. I guess I should be grateful
Python 3 has gone full-Unicode... except that now people ask how to emit ASCII
with it. And the Python community is among the most clued-up on the subject
(probably as a reaction to how badly it was handled in 2.x).

And I'm not even a fucking software developer.

------
cydonian_monk
"EBCDIC is not relevant to your life. We don't have to go that far back in
time."

I wish this was the case. And for 99% of you it might be. But there's still
lots of EBCDIC (and antique COBOL) out in the wild that increasingly has to
interact with the real world. One of the chunks of telephony software I'm
responsible for has to interface with an EBCDIC system, requiring some basic
translation to UTF-16. (And then it eventually has to pass some of this
now-UTF-16 data back to the EBCDIC system.) Not exactly difficult until you
get to the
various numeric encodings and no one can decide if they want Binary-Coded
Decimal, Pic9s, pure int, signs, which endianness (if they even know about
endianness), et cetera.

EBCDIC wasn't dead in 2003 when Joel wrote this and it certainly (and
unfortunately) isn't dead in 2012.
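
For what it's worth, the plain-text side of that translation is the easy
part; Python, for one, ships EBCDIC codecs (a sketch, assuming code page 037):

    ebcdic = "\xc8\x85\x93\x93\x96"  # "Hello" in EBCDIC code page 037
    text = ebcdic.decode("cp037")    # to Unicode for the modern side
    back = text.encode("cp037")      # and back for the EBCDIC system

It's the numeric fields, as above, where generic codecs stop helping.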

~~~
leftnode
Yup, if you have to work with old iSeries servers, or just about anything
dealing with old retailers in eCommerce, you've probably touched it at one
time or another.

Fortunately, the recode program makes it easy to switch between different
encodings.
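
The invocation is just source..target, and recode -l lists the charset names
your build knows (IBM037 below is an assumption; check that list):

    recode IBM037..UTF-8 legacy.txt  # converts the file in place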

------
mattdeboard
I don't mind the repost. Every time I read this I've learned more, gotten more
context and encountered more problems with character encoding, so every time I
get more and more out of the essay.

I wouldn't even think about reading it unless I saw it posted on HN. However,
as long as I've got to deal with character encoding problems like this:

<http://ibm-china.jobs/branch-admin-lan-zhou/jobs-in/>

I'll still reap some benefit out of re-reading this essay.

------
buff-a
FTA: "The traditional store-it-in-two-byte methods are called UCS-2 (because
it has two bytes) or UTF-16 (because it has 16 bits), and you still have to
figure out if it's high-endian UCS-2 or low-endian UCS-2."

Written in 2003, 7 years after UCS-2 was obsoleted by UTF-16 because Unicode
2.0 was too big for just 16 bits. UCS-2 is not UTF-16. Windows NT wasn't
UTF-16 either, IIRC.

Of course, Microsoft kept telling everyone that 16 bits per character is
Unicode.

[1] <https://en.wikipedia.org/wiki/UTF-16>

------
dlitz
Here's my follow-up: The Absolute Minimum Every Software Developer Must Know
About Unicode _encodings_ :

Just use UTF-8.

~~~
DrJokepu
It's really not that simple.

If you do a lot of string manipulations, you're better off with either UTF-32
or dumbified (16-bit fixed) UTF-16, otherwise you will have to count
characters from the beginning of the string every time you need to access the
nth character within the string. Moreover, if you deal with a text with a lot
of characters between 0x0800 and 0xFFFF (e.g. East-Asian languages) you're
much better off with UTF-16 as you will save a whole byte per character.
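
For instance:

    ch = u"\u4e2d"               # a CJK character, U+4E2D
    len(ch.encode("utf-8"))      # 3 bytes
    len(ch.encode("utf-16-le"))  # 2 bytes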

~~~
pjscott
How often do you need random access by code point? In every case I can recall,
if random access by UTF-8 offset wasn't the right thing (which it usually
was), then random access by code point wouldn't be either. Almost all string
offsets you'll ever have to deal with come from having a program look at a
string, and in that case, you can just use UTF-8 byte offsets. What sort of
text processing are you doing where this isn't the case?

As for East Asian text, you have a point: it _will_ usually be shorter in
UTF-16 than UTF-8. Before making this decision, though, ask yourself how much
that extra space is worth to you. Is it worth dealing with possible encoding
hassles? (The answer to this may be yes, but it's a question that should be
asked.) Also, on a lot of data, there are many characters from the ASCII range
mixed in with the East Asian text. I did an experiment a while back where I
downloaded some random web pages in Chinese, Japanese, Korean, and Farsi, and
compared their sizes in UTF-8 and UTF-16. Because so much of each document
was HTML tags, the pages in all four languages ended up smaller in UTF-8.

~~~
DrJokepu
Maybe I'm missing something, but I'm not sure I understand your question. How
would you write even a simple parser without being able to access the contents
of the string randomly by the character index?

~~~
pjscott
You _can_ access the contents randomly. Just use the index in bytes, rather
than characters. Let's look at a really simple parsing task as an example:
splitting tab-delimited strings, in UTF-8. First, you find the indexes in the
string (in bytes) of the tabs, then you use those to split out substrings.
This is exactly the same code you would use with plain ASCII text, and in fact
a lot of programs designed to process ASCII work unmodified with UTF-8.

Another example: for a lex-and-yacc type of parser, you can use regular
expressions to split a string into tokens, and then use a parser on that
stream-of-tokens representation. None of this requires character indexing;
just byte indexing.
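
In Python 2 terms, the tab-splitting example is just (a sketch):

    # Safe because no UTF-8 multi-byte sequence can contain the tab
    # byte 0x09, so byte indexes never land inside a character.
    line = u"col1\t\u4e2d\u6587\tcol3".encode("utf-8")
    fields = [f.decode("utf-8") for f in line.split("\t")]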

------
tszming
Seems like the article needs some updates?

>> In UTF-8, every code point from 0-127 is stored in a single byte. Only code
points 128 and above are stored using 2, 3, in fact, up to _6_ bytes.

In Wikipedia:

>> UTF-8 encodes each of the 1,112,064[7] code points in the Unicode character
set using one to _four_ 8-bit bytes

Reference: <http://en.wikipedia.org/wiki/UTF-8>

(Both are right for their time: the original design went up to six bytes, but
RFC 3629 later restricted UTF-8 to four.)

~~~
nodemaker
I don't get this, actually.

Say a UTF8 string is ae 31 c1 12.

Now how do we decide whether the characters are "ae", "31", "c1", "12", or
"ae 31" and "c1 12", or even "ae", "31 c1" and "12"?

EDIT: Never mind!..found my answer here
[http://stackoverflow.com/questions/1543613/how-does-
utf-8-va...](http://stackoverflow.com/questions/1543613/how-does-
utf-8-variable-width-encoding-work)

~~~
pjscott
The tldr is that UTF-8 is a prefix code: no valid character is a prefix of any
other.

<http://en.wikipedia.org/wiki/Prefix_code>
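
Concretely, the lead byte encodes the sequence length, and every continuation
byte starts with the bits 10, so a decoder always knows where a character
begins:

    0xxxxxxx                             one byte (ASCII)
    110xxxxx 10xxxxxx                    two bytes
    1110xxxx 10xxxxxx 10xxxxxx           three bytes
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  four bytes

(Which also means "ae 31 c1 12" above isn't valid UTF-8 at all: 0xae is a
continuation byte with no lead byte before it.)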

------
skrebbel
I'm not sure why HN's duplicate checker doesn't find this, but:

<http://news.ycombinator.com/item?id=1219065>

<http://news.ycombinator.com/item?id=1618201>

<http://news.ycombinator.com/item?id=1987621>

~~~
llambda
Sometimes it misses, although in some cases I think it might be related to
age: after a certain time it's fun to revisit, and since the user base
changes, no doubt some current readers weren't around when it was last posted.

------
chokma
Also worth a read: <http://stackoverflow.com/a/6163129> (Perl and Unicode and
why it's really hard)

~~~
Joeri
Ouch. And here I thought PHP made it needlessly complicated to use UTF-8:
[http://malevolent.com/weblog/archive/2007/03/12/unicode-
utf8...](http://malevolent.com/weblog/archive/2007/03/12/unicode-utf8-php-
mysql/)

~~~
Tobu
Perl actually has some of the best Unicode support around (second only to
ICU). This guy worked on it, so he's listing all the edge cases (the bulk of
the post), explaining how to get rid of backward-compatible brokenness (maybe
Perl needs a quicker way to saner defaults), and explaining how to do work
that requires explicit handling.

------
mafro
I've wanted to get one of these for years:
<http://www.cafepress.com/nucleartacos.26721820>

(I do not work for cafepress)

------
lini
The only thing I remembered from my first lesson on character codes was: ASCII
a stupid question, get a stupid ANSI!

------
rdtsc
I suspect the reason so many people don't know about this is that they don't
care enough. (I don't care enough.) This is not an interesting enough topic
for most devs to bother with unless something has broken or isn't working
right and Unicode "problems" are suspected to be the reason.

~~~
jwr
…and this is exactly why this essay should be regularly reposted. If you write
any software that is used by people other than yourself, you should care,
because they will most likely find it broken.

Also, if you write software for money, ignoring Unicode is just plain
incompetent. I can't count the number of times I've had packages shipped to
"Bia&#322;y Kamie&#324;" or "Bia&amp;#322;y Kamie&amp;#324;" street instead of
"Biały Kamień". If you expose even a single name or address field in your
software, you need to handle Unicode.

And yes, you need to handle Unicode even if you want to limit yourself only to
the US market. Your customer might have an umlaut or an accent in his name.

~~~
rdtsc
> If you write any software that is used by people other than yourself, you
> should care,

I agree, and I should know better. I even opened the link, but after reading
for 10 seconds I closed it. I just wasn't motivated enough. I even had to fix
bugs and issues related to this just last month. But every time I do, I just
go and find out enough to solve the problem and never really dig deeper.
Don't really know why it is this way.

~~~
outworlder
How did you manage to learn anything in your profession, then? Really, it is
a short and well-written post that shouldn't take more than a few minutes out
of your schedule.

~~~
rdtsc
> How did you manage to learn anything in your profession then?

I cared about it. I will forget to eat and sleep if I am studying something I
am motivated about. The rest I just learn as needed, only if I am forced to
(read: "when stuff breaks"). Not a very good approach, I guess.

------
alexchamberlain
Off topic: has anyone got a link about case-insensitive string comparison
that also ignores accent differences?

For instance, when I search on Facebook for names, I never type accented e's,
but it still brings up my friends who have accented e's in their names.

~~~
wazoox
Depends upon the language I guess. Here's an example in Perl:
<http://stackoverflow.com/a/5163247/93865>
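
In Python, a rough approximation is to strip combining marks after NFD
decomposition and then lowercase; matching of the kind Facebook does needs
locale-aware collation (ICU territory), but this covers the accented-e case:

    import unicodedata

    def fold(s):
        # Decompose accented characters, drop combining marks, lowercase.
        decomposed = unicodedata.normalize("NFD", s)
        return u"".join(c for c in decomposed
                        if not unicodedata.combining(c)).lower()

    fold(u"R\u00e9sum\u00e9") == fold(u"resume")  # True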

------
simon
While you're enjoying this classic article (and I _do_ re-read it every time
it's posted), you may wish to also take in Tim Bray's - in my opinion equally
classic - article about Unicode.

<http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode>

Tim has spent more time working with bodies of text [1][2] than many of us, so
his perspective is very useful and the article is thoughtful.

[1] Oxford English Dictionary [2] XML committee

------
runn1ng
I will just add: Unicode isn't hard until you have to display or otherwise
deal with right-to-left languages and their invisible control characters.

I've found that even working as a user with text that is right-to-left in
places and left-to-right in others is _hard_. It's quite easy to deal with it
programmatically, but that only lasts until you have to _display_ it somehow.

OK, sorry for my ramblings.

------
bitstorm
There seems little point when the 2nd most popular browser (Chrome) won't
correctly display Unicode.

[http://en.wikipedia.org/wiki/Mathematical_operators_and_symb...](http://en.wikipedia.org/wiki/Mathematical_operators_and_symbols_in_Unicode)

square square square

Contrast with Firefox

------
psykotic
I can't say it qualifies as the absolute minimum everyone should know, but I
highly recommend reading the book Unicode Demystified if you're the kind of
programmer who wants to understand how everything works.

------
Rajiv_N
When I was beginning to learn HTML, this article helped me get a grasp of
encodings and the like. Thanks @spolsky. I started following your blog after
reading this :)

------
dougaitken
I thought decorum dictated that an aged post should have the year in brackets
or something similar! A lot has changed in the last 8 and a half years.

~~~
justincormack
He says

"When I discovered that the popular web development tool PHP has almost
complete ignorance of character encoding issues, blithely using 8 bits for
characters, making it darn near impossible to develop good international web
applications, I thought, enough is enough."

So that hasn't changed.

~~~
notJim
This is FUD. It's really not difficult to use the mb_* family of functions to
deal with Unicode in PHP. If you're writing a new app, this is trivial. If
you're working with an existing app, it's obviously more difficult, but far,
far from impossible.

~~~
wazoox
Of course you _can_ manipulate utf-8 in PHP, else it would have died long
ago. But as a matter of fact PHP 6 was a failure, and Unicode is still an
afterthought that you must hack around with special functions in PHP 5.

~~~
notJim
Well, there's almost no string handling built into PHP as a language at all;
it's just provided as part of the standard library. So instead of using one
part of the library (the standard string functions), you use another (the mb_*
string functions.) I don't really see how that's a hack. The one thing that
could go wrong is if you don't have the mb_* extension, but that's easily
rectified, and hasn't been something I've seen in the wild in the past few
years.

I'm not saying PHP's UTF-8 handling is great by any means, but the claim was
that it's "nearly impossible." I'm suggesting that one should instead say
"Building a UTF-8 compliant site in PHP is annoying, and requires more work
than one would prefer, but if you do a bit of research, it's not that hard."

~~~
wazoox
> _I don't really see how that's a hack._

Because if you switch encodings you need to modify the code. Languages that
actually support utf-8 use the very same functions whatever the encoding is;
at most you need to declare that you're using utf-8, and that's all.

------
xxiao
Too long; he must be enjoying writing for the purpose of writing itself.

------
BadassFractal
I love how Joel manages to come off as a total snarky douchebag in his
writing.

