Unicode is tricky in Java and might be impossible in C++ (2006) (blakeramsdell.com)
17 points by mblakele on June 6, 2009 | 21 comments

Aha, I have uncovered it. There are two problems. One is that apparently std::locale is horribly, horribly broken in the C++ standard library for Mac OS X. You can work around this by using C, like Real Programmers. The other problem is that without some form of setlocale, the standard library will run in "C" or "POSIX" mode, which can do nothing at all interesting with Unicode. By running setlocale, or by running std::locale::global (std::locale ("")) on a platform with a C++ library that works, it will print appropriately.

Interestingly, on my Ubuntu box, before I put the std::locale call in, it was printing "I have EUR100 to my name.", which is a pretty cool fallback.

Yes, the article title is wildly unfair. But an article titled "the g++ std::locale implementation under Mac OSX is broken" isn't nearly as interesting as claiming that C++ might not be able to support Unicode.

I mean, really...C++ can work with individual bits; it can obviously support Unicode strings. The problem here is one of making the compiler aware of the details of the platform on which it's being used. The blame here appears to lie on Apple:



Yes, that's right, for those just tuning in, this has been broken since at least 2006. Probably earlier.

I'm just speechless.

setlocale() in C and C++ affects the entire process. Setting it to anything other than the "C" locale very likely breaks code elsewhere in the process, and in completely unpredictable ways.

Realistically, for serious text processing you need to leave the entire broken POSIX locale system and C++ wchar_t alone and work with libraries such as ICU.

...which isn't the end of the world, since ICU comes with a bunch of stuff that makes life bearable when developing international software, like language-aware collation and regular expressions. If you're going to be developing real i18n software in C++, you're always going to be using ICU or something like it. There's no reason to bother with std::locale.

In fact, I'm guessing that Apple never got around to fixing this because they probably don't use std::locale for real i18n work internally.

(update: the ICU homepage says that Mac OSX uses ICU internally -- http://site.icu-project.org/)

Seems my post was very unclear and I apologize for that. What I meant to say is:

Try to avoid POSIX locales at all cost. Use ICU!

I'm guessing you're an American, because everybody else is very ill served by the C/POSIX locale. Locales are about something different from general text munging, and if setting the locale to something other than C broke a piece of software, that software was seriously stupidly designed. (Ubuntu, by the way, has set a non-C locale by default for years; it has been en_US.UTF-8 for as long as I can remember.) Locales are about making the date come out in the order the user expects, about the default text encoding, about error messages. They are fundamentally about interfacing with the user in the way that the user expects.

And yes, they are flawed, seriously so, mostly due to a lack of ambition. setlocale assumes that the locale the user is currently working in is also the locale the user has always been working in since time immemorial, and it assumes there is only one user. If you have a multilingual document, you're in trouble. If you use the locale date/time printer for anything other than immediate input and output to the user, you're in trouble (and there are circles of hell that certain Usenet newsreader authors will burn in for all eternity for this one). Any time you write a file out to disk with one locale and read it in with another you are potentially in horrible danger. This is where something like ICU comes in really handy; any time you have to deal with data from several different origins, you will be ill served by locale. So in that sense, you're quite right.

But all this is a very different problem from making printf work the way it is advertised to, and from being able to output, in UTF-8, a Unicode string that you have told the computer in every way you can that you want written as UTF-8. If you want to do simple text output and not deal with ICU, GNU recode, or the like, you need to call setlocale, or your libc will mutely suppress everything but 7-bit ASCII and you will be very, very confused.

"I'm guessing you're an American, because everybody else is very ill served by the C/POSIX locale."

And because everybody else is ill served, I recommended using ICU and not touching setlocale() for processing Unicode text. I'm not sure where we disagree. What makes you think I feel well served by the POSIX locale?

Err...this works fine for me if I use UTF-8 rather than wide strings, e.g.:

  std::string s("foo\u203D");
  std::cout << s << std::endl;

Also, he's completely wrong about the program terminating on output of a wide string; it's just that wcout fails. If he had tried producing output afterwards with, say, printf(), he'd have noticed the program was still alive. Running it in GDB would have shown this just as well. Generally speaking, it's a bad idea to test whether your program is alive using the same mechanism that you suspect of killing it!

AFAIK, when not using literal non-ASCII characters in Java source code you're supposed to use native2ascii. Running it on my Mac gives this:

  $ native2ascii Foo.java
  public class Foo {
      public static void main(String[] args) {
          System.out.println("I have \u201a\u00c7\u00a8100 to my name.");
      }
  }

The resulting program works perfectly fine in my Mac terminal, which makes the "I know Java pretty well" statement pretty suspicious...

Yet it works in cat. So you can at least program the solution in C, compile it with cpp, and then use it from inside your C++ program.

Presumably it's the difference between a text-mode filehandle and a binary-mode filehandle.

So (untested, because I'm baking a cake) printing to fdopen(1, "wb") should just send bytes, allowing printf etc to work?

or `cat cat.c` to see how those cats did it.

Old, but the MacRoman issue in Java bit me today so I thought someone else might benefit.

MacRoman? In 2009? Really?

It's even worse than the usual sin of pretending that DPI is remotely meaningful on a screen and should be platform-specific because of some shit you read.

Interesting article, but it would really help me believe the author knows the topic (text encoding) if the page didn't have goofy HTML entity encoding scattered all over the place (making the C++ code extremely hard to read) and didn't use smart quotes in the code samples.

I print Unicode strings to a Mac terminal all the time without problems. Just send a string formatted as UTF-8 to puts(), printf(), or similar.

It never would have occurred to me to use the built-in locale stuff. That's heading for a world of hurt.

A better headline would be 'Unicode is broken in OS X'. Let's just try out, for fun, how a Linux box handles this: http://codepad.org/vIa5Byya

Yeah right, Unicode might be impossible in C++ ...

... on OS X

\u20ac is not UTF-8; it's UCS-2. If you set the locale to UTF-8 then you need to printf with UTF-8 and not UCS-2 (or UTF-16).

\u20ac specifies a code point, not an encoding (specifically the code point U+20AC). The encoding of wchar_t in C89 is up to the implementor, and in particular GCC uses UCS-4 on most platforms. But this is beside the point; C99 and C++ both require that any C language processor that runs across that escape sequence immediately interpret it as the Unicode character €, or rather behave as if it did. So you can have function names like "b\u00EAte" if you were for some reason incapable of typing ê.

What setting the locale to UTF-8 does is require (well, for very weak definitions of require; the standard is pretty damned quiet on anything other than C or POSIX locales) that the Unicode character sequence in question be output in the UTF-8 encoding. Presumably you could define a locale en_US.UTF-16 that output in UTF-16, although I would point and laugh if you did.

In sum: C89 doesn't say a damn thing of any use about wide chars; just that they exist, plus some convenience functions (and those only via Amendment 1). C99 requires that you be able to specify characters using Unicode code points in wide character strings but otherwise does not specify input or output. The locales, which are standardized only by convention, deal with converting from a (usually) well specified on-disk format to whatever the library's internal representation is.

All strings in a physical machine are in some encoding. I/O almost always requires transcoding, except in the unlikely event that the destination encoding happens to match the chosen language's in-memory encoding.

If you create a wide-char string (C++'s most obvious candidate for full Unicode-level support), and print it out to the console using wcout, the programmer should reasonably expect the library to perform transcoding as necessary to match the destination (providing things are configured correctly).

The fact that \u20ac maps to a code point in UCS2 or UTF-16 when encoding a wide string in text format for the compiler to read, is all but irrelevant. So long as the final data at runtime is in a valid encoding that matches its type, the runtime library should handle everything from there.
