

Character Encodings For Modern Programmers - nkurz
http://blog.gatunka.com/2014/04/25/character-encodings-for-modern-programmers/

======
flohofwoe
For our multi-platform game projects we have settled on UTF-8, UTF-16 and
UTF-32, using the Unicode/LLVM standalone source code to convert between them.

We no longer mess around with code pages or old-school multibyte encodings
like Shift-JIS, and we don't compile Windows executables in "UNICODE
mode" either. Instead we keep all strings as UTF-8 and convert to and from
UTF-16 (on Windows) or UTF-32 (on UNIX-like operating systems). Conversion to
and from UTF-32 is hardly ever necessary though, since outside Windows
everything seems to use UTF-8 anyway.

Instead of using OS functions like MultiByteToWideChar() or the iconv library,
we have integrated the LLVM/Unicode standalone UTF conversion functions:
[http://llvm.org/docs/doxygen/html/ConvertUTF_8h_source.html](http://llvm.org/docs/doxygen/html/ConvertUTF_8h_source.html),
although with C++11 these conversions are now builtin.
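
To illustrate the "keep UTF-8 internally, convert at the boundary" approach, here is a minimal Python sketch. It is not the LLVM ConvertUTF code, and the helper names are made up for this example; it only shows the round trip and why UTF-16 needs surrogate pairs.

```python
# Hypothetical helpers: strings are stored as UTF-8 bytes and converted
# only when a platform API needs another encoding form.

def utf8_to_utf16(data: bytes) -> bytes:
    """UTF-8 -> UTF-16 little-endian without BOM (the form Windows wide APIs expect)."""
    return data.decode("utf-8").encode("utf-16-le")

def utf16_to_utf8(data: bytes) -> bytes:
    """UTF-16 little-endian -> UTF-8."""
    return data.decode("utf-16-le").encode("utf-8")

def utf8_to_utf32(data: bytes) -> bytes:
    """UTF-8 -> UTF-32 little-endian without BOM (one 4-byte unit per code point)."""
    return data.decode("utf-8").encode("utf-32-le")

text = "na\u00efve \u65e5\u672c\u8a9e \U0001F600"
stored = text.encode("utf-8")

# The round trip through UTF-16 is lossless.
round_trip = utf16_to_utf8(utf8_to_utf16(stored))

# U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 encodes it
# as a surrogate pair: two 16-bit code units, four bytes.
emoji_utf16 = utf8_to_utf16("\U0001F600".encode("utf-8"))

# UTF-32 always uses exactly one 4-byte unit per code point.
utf32_units = len(utf8_to_utf32(stored)) // 4
```

The point of the design is that only the thin layer touching OS APIs ever sees UTF-16 or UTF-32; everything else can treat strings as UTF-8 bytes.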

A few other notable points:

\- properly handling IME input for Asian languages can be tricky (for
fullscreen 3D games)

\- Arabic text rendering is tricky, not because of right-to-left, but because
the character appearance changes depending on whether a character is at the
start or end of a word, and there is nearly no sample code around which
demonstrates this behaviour (and the one we found had all Arabic comments)

\- some Asian languages require incredibly huge font textures (most 3D-game
text renderers are font-texture based, as far as I'm aware)
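
On the Arabic point above: a real renderer would normally pick contextual glyphs with a shaping engine and the font's own tables, which this does not attempt. As a small illustration only, Unicode's legacy "presentation form" characters record the positional variants of a letter, and Python's `unicodedata` module can show them for one letter (BEH is an arbitrary choice here):

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the abstract letter as stored in text

# Legacy presentation forms of BEH: isolated, final, initial, medial.
PRESENTATION_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]

def describe(ch: str) -> str:
    """Return code point, character name, and compatibility decomposition."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} <- {unicodedata.decomposition(ch)}"

for form in PRESENTATION_FORMS:
    print(describe(form))

# NFKC folds every positional form back to the single abstract letter.
folded = [unicodedata.normalize("NFKC", form) for form in PRESENTATION_FORMS]
```

Text is stored with the abstract letter only; choosing which shape to draw is the renderer's job, which is why a naive one-glyph-per-code-point font texture is not enough.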

[edit: removed redundant link]

------
sheetjs
Fun facts:

\- a few code pages (such as CP864 DOS Arabic) use the Arabic percent sign ٪
(U+066A) at code point 0x25 instead of the standard ANSI %.

\- Mozilla Firefox and Thunderbird approximate the Arabic codepages (including
ASMO-708) to ISO-8859-6, which breaks XLS parsing of files generated by
certain international versions of Excel (which actually forced me to build a
library for the various character encodings:
[https://github.com/SheetJS/js-codepage](https://github.com/SheetJS/js-codepage))

~~~
MichaelGG
It's like there should be an encoding which specifies the intent. Unless the
Arabic percent sign (which looks like ANSI but flipped?) has some other
semantics, why shouldn't Excel handle both of them? Other than that there
could be 100 different percent signs.

~~~
sheetjs
The strings are encoded based on the specified codepage, and code page 864
cannot represent the standard percent symbol "%". Excel itself can handle it
because they have their own implementation of codepages, but other
implementations are not necessarily correct (many iconv implementations make
the same mistake).
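
A quick way to see the 0x25 difference is Python's bundled `cp864` codec, used here only as one example implementation (it says nothing about how Excel or any iconv build behaves):

```python
# The same bytes decoded under two different code pages.
raw = b"100\x25"

as_cp864 = raw.decode("cp864")   # byte 0x25 becomes U+066A ARABIC PERCENT SIGN
as_ascii = raw.decode("ascii")   # byte 0x25 becomes U+0025 PERCENT SIGN

# Encoding the Arabic percent sign back to CP864 yields byte 0x25 again.
round_trip = "\u066a".encode("cp864")

print(repr(as_cp864), repr(as_ascii), round_trip)
```

A decoder that treats CP864 as "ASCII plus Arabic in the upper half" will silently produce `%` here, which is the kind of mismatch described above.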

Keep in mind that Unicode only emerged in the 1990s and Excel predates Unicode
by about a decade.

~~~
MichaelGG
Right, I'm wondering why Unicode doesn't offer some sort of meaning-equivalent
API (or does it?). Like a way to know Arabic and standard percent have the same
meaning. Or that fullwidth 'Ａ' is equivalent in meaning to 'A'.
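
The closest thing Unicode offers is compatibility normalization (NFKC/NFKD). A minimal sketch with Python's `unicodedata`, where `same_meaning` is a hypothetical helper name and "same meaning" is only approximated by NFKC equality:

```python
import unicodedata

def same_meaning(a: str, b: str) -> bool:
    """Hypothetical helper: treat strings as equivalent if they match under NFKC."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

fullwidth_a = "\uFF21"     # FULLWIDTH LATIN CAPITAL LETTER A
arabic_percent = "\u066A"  # ARABIC PERCENT SIGN

# Fullwidth 'A' has a compatibility mapping to plain 'A' ...
a_matches = same_meaning(fullwidth_a, "A")

# ... but the Arabic percent sign is a distinct character with no such mapping.
percent_matches = same_meaning(arabic_percent, "%")

print(a_matches, percent_matches)
```

So the fullwidth case is covered by the standard normalization forms, while the two percent signs stay distinct; treating them alike would need an application-level mapping.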

------
alister
Lots of nice juicy history and interesting detail, but someone who doesn't
understand Unicode _at all_ should probably start with this classic article
first:

[http://www.joelonsoftware.com/articles/Unicode.html](http://www.joelonsoftware.com/articles/Unicode.html)
(The Absolute Minimum Every Software Developer Absolutely, Positively Must
Know About Unicode and Character Sets)

and then return to read the featured article.

~~~
rspeer
I'm glad, however, that there is a more modern description. Joel's article
dates from a time (2000) when UCS-2 sounded like a good idea.

I'll observe that this article is very C-centric. If Python is how you think
about things, I recommend the Python Unicode HOWTO, which of course has two
different relevant versions at the moment:

For Python 2:
[https://docs.python.org/2.7/howto/unicode.html](https://docs.python.org/2.7/howto/unicode.html)

For Python 3:
[https://docs.python.org/3/howto/unicode.html](https://docs.python.org/3/howto/unicode.html)

------
moreentropy
Great article, thanks!

If you're bored or not feeling modern today, read
[http://en.wikipedia.org/wiki/EBCDIC](http://en.wikipedia.org/wiki/EBCDIC)

For practical Python advice (I'm sure this was on Hacker News),
[http://lucumr.pocoo.org/2013/7/2/the-updated-guide-to-unicod...](http://lucumr.pocoo.org/2013/7/2/the-updated-guide-to-unicode/)
is a great read.

------
penguindev
Talk about the Tower of Babel. This article, although totally excellent,
didn't even mention Unicode normalization problems. So basically, once
everyone starts talking Unicode, then we can start trying to figure out what
is _meant_. [And not click on some phishing string that looks exactly like
some legit string]
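
Both problems are easy to reproduce. The sketch below uses Python's `unicodedata`; `scripts_hint` is a hypothetical and very crude heuristic based on character names, not a real confusable-detection algorithm:

```python
import unicodedata

# 1. Normalization: two spellings of the same text that compare unequal as-is.
composed = "caf\u00E9"      # 'e with acute' as a single code point
decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
raw_equal = composed == decomposed
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

# 2. Lookalikes: normalization does not help, the letters really are different.
latin = "paypal"
mixed = "p\u0430yp\u0430l"  # both 'a' replaced by CYRILLIC SMALL LETTER A
lookalike_equal = (unicodedata.normalize("NFKC", latin)
                   == unicodedata.normalize("NFKC", mixed))

def scripts_hint(s: str) -> set:
    """Crude heuristic: first word of each character name (LATIN, CYRILLIC, ...)."""
    return {unicodedata.name(ch).split()[0] for ch in s}

print(raw_equal, nfc_equal, lookalike_equal, scripts_hint(mixed))
```

Normalizing before comparison fixes the first case. The second needs a separate check, such as flagging strings that mix scripts, because the two strings are genuinely different text that merely renders alike.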

------
TorKlingberg
A small addition: Some early/simple Japanese systems actually used single byte
encodings. They could only fit the katakana characters, so everything had to
be written phonetically. It is not very easy to read but works acceptably for
short strings. The characters were squished horizontally to take the same
space as English letters:
[https://en.wikipedia.org/wiki/Half-width_kana](https://en.wikipedia.org/wiki/Half-width_kana)

You can still encounter these in printed receipts from old POS systems. The
half-width katakana still exist as separate codepoints in Unicode, and are
quite popular in emoji and such.
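
A small Python illustration of both points, using the standard `shift_jis` codec and `unicodedata` (the sample word is an arbitrary choice):

```python
import unicodedata

halfwidth = "\uFF76\uFF85"   # halfwidth katakana KA + NA
fullwidth = "\u30AB\u30CA"   # the ordinary (fullwidth) katakana KA + NA

# In Shift-JIS each halfwidth katakana fits in a single byte,
# while the ordinary forms take two bytes each.
halfwidth_bytes = halfwidth.encode("shift_jis")
fullwidth_bytes = fullwidth.encode("shift_jis")

# They are separate code points, so a plain comparison fails,
# but NFKC normalization maps halfwidth forms to the ordinary ones.
plain_equal = halfwidth == fullwidth
folded = unicodedata.normalize("NFKC", halfwidth)

# A halfwidth voiced sound mark combines with the preceding kana under NFKC:
# halfwidth KA + halfwidth voiced mark becomes the single letter GA.
voiced = unicodedata.normalize("NFKC", "\uFF76\uFF9E")

print(len(halfwidth_bytes), len(fullwidth_bytes), folded, voiced)
```

If halfwidth and ordinary kana should match in a search or comparison, normalizing both sides with NFKC first is one common option.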

------
lyndonh
Oh good, another blog post on text coding. We don't have enough of those.

