
What Every Software Developer Must Know About Unicode (2003) - jervisfm
http://www.joelonsoftware.com/articles/Unicode.html#
======
gilgoomesh
This article deals mostly with Windows and is from 2003, so it doesn't
emphasise the current standard practice as much as it should:

    Use UTF-8 everywhere you can.

UTF-8 is:

* the most backwards compatible (it can be passed through many tools intended for ASCII only, with a few limitations – including avoiding composed Latin glyphs)

* most likely to give an appropriate result if the end-user incorrectly interprets it

* the most space efficient encoding (on average)

* avoids endianness problems

* de-facto encoding for most Mac and Linux C APIs

* verifiable with a high degree of accuracy (unlike many other encodings, which can't be verified at all) – see the sketch after this list
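
To make that last point concrete, here's a minimal sketch (assuming a
JavaScript environment with the standard TextDecoder API, e.g. a modern
browser or Node): the constrained bit patterns of UTF-8 lead and continuation
bytes let a strict decoder reject byte sequences that no UTF-8 encoder could
have produced.

    // A strict decoder throws on malformed input instead of substituting U+FFFD.
    const strict = new TextDecoder("utf-8", { fatal: true });

    strict.decode(new Uint8Array([0x68, 0x69])); // "hi" - valid ASCII is valid UTF-8
    try {
        strict.decode(new Uint8Array([0xFF, 0xFE])); // a UTF-16 BOM, never valid UTF-8
    } catch (e) {
        console.log("not UTF-8:", e.message); // fatal mode throws a TypeError
    }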

Specifically:

* If you have to pick an encoding, always try to use UTF-8 unless you're only storing text to pass into an API which requires something different.

* The Winapi (aka Win32) is the only commonly used API that regularly requires something other than UTF-8 (the Windows Unicode APIs use UTF-16 – not UCS-2, as indicated in the Spolsky article). Windows' UTF-16 requirement is a pain for platform independence -- be careful. However, you should still aim to use UTF-8 for all text files on Windows and only use UTF-16 for the Windows API calls (never use the locale-specific non-Unicode encodings).

* There are a few language+environment combinations that literally _can't_ open Unicode filenames. These include MinGW C++, which has no platform-independent way of opening file streams with Unicode filenames. You need to fall back to C's _wfopen and UTF-16 to open files correctly.

Note: you don't always have to choose the encoding. E.g. the Mac class
NSString and the C# String class use UTF-16 internally, but you don't normally
need to care what they do internally, since any time you access the underlying
characters you specify the desired encoding. You should usually extract
characters as UTF-8.
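
JavaScript strings are a handy illustration of the same pattern (a sketch
assuming the standard TextEncoder API; the string is arbitrary): internally
they are UTF-16 code units, and an encoding only enters the picture when you
serialize to bytes at a boundary.

    // JS strings are UTF-16 code units internally, like NSString / C# String.
    // You only commit to an encoding when converting to bytes.
    const s = "café";
    const utf8 = new TextEncoder().encode(s); // TextEncoder always emits UTF-8
    console.log(s.length);    // 4 UTF-16 code units
    console.log(utf8.length); // 5 UTF-8 bytes ("é" takes two)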

~~~
ddebernardy
Agreed with what you wrote, except "the most backwards compatible (can be
passed through many tools intended for ASCII-only)"... That amounts to
knowingly sweeping bugs under the rug.

If a tool is intended for ASCII only, don't pass it UTF-8 strings. Otherwise,
you'll probably expose yourself to malformed UTF-8 strings and potential
problems (e.g. PHP's mysql_escape_string vs mysql_real_escape_string).

~~~
lambda
Technically, it's not that it can be passed through tools intended for ASCII
only. It can be passed through tools which assume an ASCII-compatible
character set (a character set of which ASCII is a subset), which are 8-bit
clean, and which don't make incorrect assumptions about being able to truncate
strings at arbitrary points and be left with two valid strings.

That is actually generally true of any tool that has been internationalized
with legacy, pre-Unicode character sets like the ISO 8859 series.

------
pygy_
A good summary, but for one important detail: in UTF-16, some code points
(lying on the so-called "astral planes", i.e. not on the "basic multilingual
plane") take 32 bits.

Emoji, for example, lie on the first supplementary plane: 🍒🎄🐰🚴. Firefox and
Safari display them properly, Chrome doesn't; no idea about IE and Opera.

UCS-2 is a strict 16-bit encoding (a subset of UTF-16), and it cannot
represent all characters.

It is the encoding used by JavaScript, which can be problematic when
characters outside the BMP (which take two UTF-16 code units) are used. For
example, `"🐙🐚🐛🐜🐝🐞🐟".length` is 14 even though there are only seven
characters, and you could slice that string in the middle of a character.
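
A quick sketch of how that plays out (assuming an ES2015-capable engine,
where string iteration and codePointAt are code-point-aware even though
.length and slice are not):

    const s = "🐙🐚🐛🐜🐝🐞🐟";
    console.log(s.length);                      // 14 - counts UTF-16 code units
    console.log([...s].length);                 // 7  - iteration walks code points
    console.log(s.charCodeAt(0).toString(16));  // "d83d" - a lone high surrogate
    console.log(s.codePointAt(0).toString(16)); // "1f419" - the whole octopus
    console.log(s.slice(0, 1));                 // an unpaired surrogate, not a character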

~~~
shousper
I wonder how hard it'd be to get JavaScript/ECMAScript onto a better
encoding... Do we actually have a "better" encoding?

~~~
Scorponok
Depends what you mean by "better". UTF-8 generally ends up using fewer bytes
to represent the same string than UTF-16, unless you're using certain
characters a lot (e.g. for Asian languages), so it's a candidate, but it's not
like you could just flip a switch and make all JavaScript use UTF-8.
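
A rough sketch of the size trade-off (TextEncoder gives the UTF-8 byte count;
JS strings store UTF-16 at two bytes per code unit; the sample strings are
arbitrary):

    const utf8Bytes = (s) => new TextEncoder().encode(s).length;
    const utf16Bytes = (s) => s.length * 2; // 2 bytes per UTF-16 code unit

    console.log(utf8Bytes("hello"), utf16Bytes("hello"));   // 5 vs 10 - ASCII favours UTF-8
    console.log(utf8Bytes("こんにちは"), utf16Bytes("こんにちは")); // 15 vs 10 - CJK favours UTF-16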

~~~
oofabz
I think the size issue is a red herring. UTF-8 wins some, UTF-16 wins others,
but either encoding is acceptable. There is no clear winner here so we should
look at other properties.

UTF-8 is more reliable, because mishandling variable-length characters is more
obvious. In UTF-16 it's easy to write something that works with the BMP and
call it good enough. Even worse, you may not even know it fails above the BMP,
because those characters are so rare you might never test with them. But in
UTF-8, if you screw up multi-byte characters, any non-ASCII character will
trigger the bug, and you will fix your code more quickly.
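
For instance (a sketch, again using a strict TextDecoder): truncating a UTF-8
buffer in the middle of a multi-byte character yields bytes that a validating
decoder flags on any non-ASCII input, while the equivalent UTF-16 bug only
surfaces with astral-plane input.

    const bytes = new TextEncoder().encode("naïve"); // "ï" encodes to two bytes
    const cut = bytes.slice(0, 3);                   // chops "ï" in half
    try {
        new TextDecoder("utf-8", { fatal: true }).decode(cut);
    } catch (e) {
        console.log("truncation caught:", e.message); // fails fast, easy to notice
    }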

Also, UTF-8 does not suffer from endianness issues like UTF-16 does. Few
people use the BOM and no one likes it. And most importantly, UTF-8 is
compatible with ASCII.

~~~
lucian1900
There is absolutely no situation in which UTF-16 wins over UTF-8, because of
the surrogate pairs required. That makes both encodings variable length.

UTF-32 is probably what you're thinking of.

~~~
oofabz
I know that both encodings are variable-length. That is the issue I am trying
to address.

My point is that in UTF-16 it's too easy to ignore surrogate pairs. Lots of
UTF-16 software fails to handle variable-length characters because they are so
rare. But in UTF-8 you can't ignore multi-byte characters without obvious
bugs. These bugs are noticed and fixed more quickly than UTF-16 surrogate pair
bugs. This makes UTF-8 more reliable.

I am not sure why you think I am advocating UTF-16. I said almost nothing good
about it.

~~~
millstone
Bugs in UTF-8 handling of multibyte sequences need not be obvious. Google
"CAPEC-80."

UTF-16 has an advantage in that there are fewer failure modes, and fewer ways
for a string to be invalid.

edit: As for surrogate pairs, this is an issue, but I think it's overstated. A
naïve program may accidentally split a UTF-16 surrogate pair, but that same
program is just as liable to accidentally split a decomposed character
sequence in UTF-8. You have to deal with those issues regardless of encoding.
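
Both halves of that point fit in a few lines (a sketch; 0xC0 0xAF is the
classic CAPEC-80 overlong encoding of "/", which a conforming decoder must
reject):

    // CAPEC-80: an overlong two-byte encoding of "/" (U+002F).
    // Sloppy decoders accept it, letting "/" sneak past path filters.
    try {
        new TextDecoder("utf-8", { fatal: true }).decode(new Uint8Array([0xC0, 0xAF]));
    } catch (e) {
        console.log("overlong sequence rejected");
    }

    // Splitting a decomposed sequence is encoding-independent:
    const s = "e\u0301";             // "é" as base letter + combining acute
    console.log(s.normalize("NFC")); // "é" - one code point after composition
    console.log(s.slice(0, 1));      // "e" - the accent is silently lost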

~~~
lmm
> A naïve program may accidentally split a UTF-16 surrogate pair, but that
> same program is just as liable to accidentally split a decomposed character
> sequence in UTF-8. You have to deal with those issues regardless of
> encoding.

The point is that using UTF-8 makes these issues more obvious. Most
programmers these days think to test with non-ASCII characters. Fewer think to
test with astral characters.

