

What Every Software Developer Must Know About Unicode and Character Sets (2003) - mshafrir
http://www.joelonsoftware.com/articles/Unicode.html

======
pmjordan
I'm just reaching the end of a pretty soul-destroying consulting project. The
client side is C++, and uses a lot of strings. To my horror, there still
doesn't seem to be a de facto standard way of dealing with the various Unicode
encodings in C++, even after my multi-year C++ hiatus. I ended up using the
WideCharToMultiByte() and MultiByteToWideChar() Win32 functions, which are
rather yucky. I'd fully expected Boost to have an answer to this problem by
now, but that only offers UCS-4 <-> UTF-8 conversion.

What libraries do C and C++ programmers use to hold Unicode strings and
convert between encodings these days?

~~~
ximeng
Not quite answering your question, but it's worth noting that C++0x supports
various Unicode encodings for string literals:

<http://en.wikipedia.org/wiki/C%2B%2B0x#New_string_literals>

There's also a link to a proposed boost solution here:

[http://stackoverflow.com/questions/511280/is-there-stl-and-u...](http://stackoverflow.com/questions/511280/is-there-stl-and-utf8-friendly-c-wrapper-for-icu-or-other-powerful-unicode-lib)

Doesn't quite sound like the standard way that you're looking for, but moving
closer.

Edit:

"ICU today is de-facto standard Unicode/localization library" from a mailing
list discussion of the Boost solution. And
<http://art-blog.no-ip.info/cppcms/blog/post/43> has an interesting
comparison of a few libraries, though it is not too comprehensive.

------
pistoriusp
My favorite author on the subject of Unicode is Mark Pilgrim, in his book Dive
Into Python 3 (it's relevant for any language):

<http://diveintopython3.org/strings.html>

"Some Boring Stuff You Need To Understand Before You Can Dive In."

------
wizard_2
It's worth bringing this up from time to time, if for nothing else than to
educate new developers.

------
Emore
What happens when I copy and paste?

Do I copy the code points, or the encoded characters--the bytes--along with
what encoding is used? Similarly, when I paste, is it the code points I paste
which are instantaneously encoded using the application's encoding scheme?

------
ptarjan
Just yesterday I ran into this exact problem.

In PHP, curl_exec returns data in the raw encoding of the source. Fine, some
people will want that. But I want to do things with the data, so I want it in
UTF-8.

So, I ended up writing my own curl_exec_utf8 function, which I'm sure is wrong
for many edge cases, but it is 2010! Why are there no decent ways to deal with
charsets?

Here is the function, in case any of you need it, or want to point out how it
is hopelessly broken:
[http://stackoverflow.com/questions/2510868/php-convert-web-p...](http://stackoverflow.com/questions/2510868/php-convert-web-page-to-utf8/2513938#2513938)
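
For anyone who wants the general shape of the idea without reading PHP, here is a minimal Python sketch. It is not the linked function: `to_utf8`, its parameters, and the fallback order (declared charset, then UTF-8, then Latin-1) are assumptions made up for illustration, and it ignores `<meta>` tags and byte-order marks entirely.

```python
import re


def to_utf8(body: bytes, content_type: str = "", fallback: str = "latin-1") -> bytes:
    """Re-encode a fetched page as UTF-8 (hypothetical helper, illustration only)."""
    # Look for "charset=..." in the Content-Type header value, if one was given.
    match = re.search(r"charset=[\"']?([\w.:-]+)", content_type, re.IGNORECASE)
    candidates = [match.group(1)] if match else []
    # Assumed fallback order: try UTF-8 next, then a single-byte legacy encoding.
    candidates += ["utf-8", fallback]
    for name in candidates:
        try:
            return body.decode(name).encode("utf-8")
        except (LookupError, UnicodeDecodeError):
            continue  # unknown charset name, or bytes that do not fit it
    # Last resort: decode as UTF-8 and replace undecodable bytes with U+FFFD.
    return body.decode("utf-8", errors="replace").encode("utf-8")


page = "caf\u00e9".encode("iso-8859-1")
print(to_utf8(page, "text/html; charset=ISO-8859-1"))
```

The weak spot is the same as in any such function: when the header is missing or wrong, the result is a guess, and Latin-1 will happily "decode" any byte sequence into something.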

~~~
thmz
If we are lucky they will fix it before 2011.
<http://www.php.net/~scoates/unicode/render_func_data.php?x=0> Complete 70.70

------
jamiecobbett
For Ruby, I can highly recommend this series of articles:
<http://blog.grayproductions.net/articles/understanding_m17n>

------
gjm11
From 2003; it might be worth putting that in the title.

~~~
akirk
Hm, but it's only really related to the recent "Joel is not blogging anymore"
situation.

What he says is still very much valid. It's a nice intro to Unicode. Quite
refreshing to read.

~~~
akirk
Actually, the one thing I miss is any mention of the downsides of UTF-8. For
example, you need to walk through the whole string to determine how many
characters it has, because each character has a (potentially) variable length.
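
To make the cost concrete, here is a small Python sketch of that walk. The function name `utf8_code_points` is made up for illustration, and it assumes the input is well-formed UTF-8. It counts every byte that is not a continuation byte (bit pattern `10xxxxxx`), which takes time linear in the length of the string.

```python
def utf8_code_points(data: bytes) -> int:
    """Count code points in well-formed UTF-8 by skipping continuation bytes."""
    # Continuation bytes look like 10xxxxxx; every other byte starts a code point.
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


text = "na\u00efve \u65e5\u672c\u8a9e"  # 9 code points: 5 ASCII, 1 two-byte, 3 three-byte
encoded = text.encode("utf-8")
print(len(encoded), utf8_code_points(encoded))  # prints "16 9"
```

The byte length is available immediately, while the code point count requires touching every byte.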

~~~
dfox
I'm still trying to find one valid use for the length of a string in Unicode
characters. What one usually needs to know is the length of the string as it's
rendered by some output device, which is not related to the count of Unicode
characters in any useful way. Even for fixed-width fonts you can have glyphs
that are composed from multiple Unicode characters, or characters whose glyphs
occupy two consecutive positions.
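
A rough Python sketch of both cases, using only the standard `unicodedata` module. The helper `display_columns` is hypothetical and only an approximation: it treats combining marks as zero-width and East Asian wide or fullwidth characters as two cells, and ignores everything else a real terminal or text shaper has to handle.

```python
import unicodedata


def display_columns(text: str) -> int:
    """Approximate column count in a monospaced grid (not a full wcwidth)."""
    columns = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks are drawn on top of the previous character
        # Wide (W) and fullwidth (F) characters occupy two cells.
        columns += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return columns


for sample in ("e\u0301", "\u65e5\u672c\u8a9e"):
    # "e" + combining acute: 2 code points, 1 column; three kanji: 3 code points, 6 columns
    print(len(sample), display_columns(sample))
```

In neither sample does the code point count match the number of columns on screen.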

~~~
anamax
Twitter has a limit of 140 "codepoints". Not bytes. Not glyphs.

~~~
prodigal_erik
That's weird, I thought its limit was deliberately low enough to fit into an
SMS message, which has a limit of 140 octets (160 characters in some 7-bit
encoding GSM uses). Do they actually allow, say, 140 kanji?
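
The SMS side of that arithmetic, as a quick Python sketch. The 140-octet payload and the 7-bit packing come from the comment above; the 16-bit case assumes the message falls back to UCS-2, which is what SMS uses for text outside the GSM 7-bit alphabet. It says nothing about how Twitter itself counts.

```python
SMS_PAYLOAD_OCTETS = 140

payload_bits = SMS_PAYLOAD_OCTETS * 8  # 1120 bits in a single message
gsm7_chars = payload_bits // 7         # 7 bits per character in the GSM default alphabet
ucs2_chars = payload_bits // 16        # 16 bits per character in UCS-2

print(gsm7_chars, ucs2_chars)  # prints "160 70"
```

So a single SMS would hold 70 kanji, not 140.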

~~~
anamax
[http://groups.google.com/group/twitter-development-talk/brow...](http://groups.google.com/group/twitter-development-talk/browse_thread/thread/9a106a68b3227af2#)

~~~
ricree
That post basically just says go look at this wiki page:
<https://twitterapi.pbworks.com/Counting-Characters>

Why not link to that in the first place?

------
hairsupply
And of course, all developers should be familiar with U+F8D0 through U+F8FF.

