

Unicode is tricky in Java and might be impossible in C++ (2006) - mblakele
http://www.blakeramsdell.com/techblog/2006/06/10/unicode-is-tricky-in-java-and-might-be-impossible-in-c/

======
gchpaco
Aha, I have uncovered it. There are two problems. One is that apparently
std::locale is horribly, horribly broken in the C++ standard library for Mac
OS X. You can work around this by using C, like Real Programmers. The other
problem is that without some form of setlocale, the standard library will run
in "C" or "POSIX" mode, which can do nothing at all interesting with Unicode.
By running setlocale, or by running std::locale::global (std::locale ("")) on
a platform with a C++ library that works, it will print appropriately.

Interestingly, on my Ubuntu box, before I put the std::locale call in, it was
printing "I have EUR100 to my name.", which is a pretty cool fallback.
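
A minimal sketch of the fix described above. The try/catch around std::locale("") is for platforms where std::locale itself is broken; the "C.UTF-8" fallback is an assumption (a glibc-style system that ships that locale), used here only to make the demonstration reproducible:

    #include <clocale>
    #include <iostream>
    #include <locale>
    #include <string>
    
    int main() {
        // Adopt the environment's locale; the default "C" locale cannot
        // encode U+20AC, which is why wcout appears broken without this.
        // std::locale("") throws where std::locale itself is broken (the
        // OS X problem above), so fall back to plain C setlocale.
        try {
            std::locale::global(std::locale(""));
        } catch (const std::exception&) {
            std::setlocale(LC_ALL, "");
        }
        // If the environment supplied no UTF-8 codeset, force glibc's
        // "C.UTF-8" (assumption: the platform ships that locale).
        std::string cur = std::setlocale(LC_ALL, nullptr);
        if (cur.find("UTF-8") == std::string::npos) {
            try { std::locale::global(std::locale("C.UTF-8")); } catch (...) {}
        }
        std::wcout << L"I have \u20ac100 to my name." << std::endl;
    }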

~~~
fauigerzigerk
setlocale() in C and C++ affects the entire process. Setting it to anything
other than the "C" locale very likely breaks stuff elsewhere and it's
completely unpredictable.

Realistically, for serious text processing you need to leave the entire broken
POSIX locale system and C++ wchar_t alone and work with libraries such as ICU.

~~~
timr
...which isn't the end of the world, since ICU comes with a bunch of stuff
that makes life bearable when developing international software, like
language-aware collation and regular expressions. If you're going to be
developing real i18n software in C++, you're always going to be using ICU or
something like it. There's no reason to bother with std::locale.

In fact, I'm guessing that Apple never got around to fixing this because they
probably don't use std::locale for real i18n work internally.

(update: the ICU homepage says that Mac OSX uses ICU internally --
<http://site.icu-project.org/>)

~~~
fauigerzigerk
Seems my post was very unclear and I apologize for that. What I meant to say
is:

Try to avoid POSIX locales at all cost. Use ICU!

------
vasi
Err... this works fine for me if I use UTF-8 rather than wide strings, e.g.:

    std::string s("foo\u203D");
    std::cout << s << std::endl;

Also, he's completely wrong about the program terminating on output of a wide
string; it's just that wcout is broken. If he had tried producing output
afterwards with, say, printf(), he'd have noticed the program is still alive.
Running it in GDB would show this just as well. Generally speaking, it's a bad
idea to test whether your program is alive using the same mechanism that you
suspect of killing it!
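
Both points can be checked in one small program. This is a sketch assuming a GNU/Linux toolchain (UTF-8 narrow execution charset), where wcout's failed conversion merely sets badbit and the process keeps running:

    #include <cstdio>
    #include <iostream>
    #include <string>
    
    int main() {
        // Unsync C++ and C streams so cout/wcout keep independent buffers.
        std::ios::sync_with_stdio(false);
    
        // A narrow literal stores U+203D as its UTF-8 bytes; cout passes
        // bytes through untouched, so no locale setup is needed.
        std::string s("foo\u203D");
        std::cout << s << " (" << s.size() << " bytes)" << std::endl;
    
        // Without any locale call, wcout cannot encode U+20AC: the flush
        // fails and sets badbit, but the process is not killed.
        std::wcout << L"\u20ac" << std::endl;
        std::printf("still alive; wcout.fail() == %d\n",
                    (int)std::wcout.fail());
    }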

------
cr0bar
AFAIK, rather than putting literal non-ASCII characters in Java source code,
you're supposed to run native2ascii. Running it on my Mac gives this:

    
    
      $ native2ascii Foo.java 
      public class Foo
      {
         public static void main(String[] args)
         {
             System.out.println("I have \u201a\u00c7\u00a8100 to my name.");
         }
      }
    

The resulting program works perfectly fine in my Mac terminal, which makes
the "I know Java pretty well" statement pretty suspicious...

------
zkz
Yet it works in cat. So you can at least program the solution in C, compile it
with cpp, and then use it from inside your C++ program.

~~~
jbert
Presumably it's the difference between a text-mode filehandle and a binary-
mode filehandle.

So (untested, because I'm baking a cake) printing to fdopen(1, "wb") should
just send bytes, allowing printf etc to work?
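
The guess above is easy to sketch without reopening fd 1 at all: writing the UTF-8 bytes yourself through byte-level stdio has the same effect, since it never touches the wide-character conversion machinery (the fdopen(1, "wb") detail itself is left as the conjecture it was):

    #include <cstdio>
    
    int main() {
        // U+20AC encoded by hand as UTF-8 (0xE2 0x82 0xAC); sending raw
        // bytes sidesteps the locale conversion entirely.
        const unsigned char euro[] = {0xE2, 0x82, 0xAC};
        std::fwrite(euro, 1, sizeof euro, stdout);
        std::puts("100 to my name.");
    }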

------
mblakele
Old, but the MacRoman issue in Java bit me today so I thought someone else
might benefit.

MacRoman? In 2009? Really?

~~~
blasdel
It's even worse than the usual sin of pretending that DPI is remotely
meaningful on a screen and should be platform-specific because of some shit
you read.

------
thwarted
Interesting article, but it would really help me believe the author knows the
topic (text encoding) if there weren't goofy HTML entity encoding scattered
all over the place (making the C++ code extremely hard to read) and smart
quotes in the code samples.

------
zokier
Better headline would be 'Unicode is broken in OS X'. Let's just try out for
fun how a Linux box handles this: <http://codepad.org/vIa5Byya>

Yeah right, Unicode might be impossible in C++ ...

... on OS X

------
allenbrunson
I print Unicode strings to a Mac terminal all the time without problems. Just
send a string formatted as UTF-8 to puts(), printf(), or similar.

It never would have occurred to me to use the built-in locale stuff. That's
heading for a world of hurt.

------
TwoBit
\u20ac is not UTF8; it's UCS2. If you set the locale to UTF8 then you need to
printf with UTF8 and not UCS2 (or UTF16).

~~~
gchpaco
\u20ac specifies a code point, not an encoding (specifically, the code point
U+20AC). The encoding of wchar_t in C89 is up to the implementor, and in
particular GCC uses UCS4 on most platforms. But this is beside the point; C99
and C++ both require that any C language processor that runs across that
sequence of characters _immediately_ interpret it as the Unicode character
€, or rather behave as if it did. So you can have function names like
"b\u00EAte" if you were for some reason incapable of typing ê.
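
A small check of both claims (assuming GCC on Linux, where wchar_t is UCS-4 and the narrow execution charset is UTF-8):

    #include <cstdio>
    
    // \u00EA in an identifier is interpreted as the character ê, so this
    // declares a function named bête.
    int b\u00EAte() { return 100; }
    
    int main() {
        const wchar_t w[] = L"\u20ac";  // one code unit on UCS-4 wchar_t
        const char n[] = "\u20ac";      // three UTF-8 bytes
        std::printf("wide value: 0x%lX\n", (unsigned long)w[0]);
        std::printf("narrow bytes: %zu\n", sizeof n - 1);
        std::printf("b\\u00EAte() == %d\n", b\u00EAte());
    }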

What setting the locale to UTF-8 does is require (well, for very weak
definitions of require; the standard is pretty damned quiet on anything other
than C or POSIX locales) that the Unicode character sequence in question be
output in the UTF-8 encoding. Presumably you could define a locale
en_US.UTF-16 that output in UTF-16, although I would point and laugh if you
did.

In sum: C89 doesn't say a damn thing of any use about wide chars, just that
they exist and here are some convenience functions (and those only in Amendment 1). C99
requires that you be able to specify characters using Unicode code points in
wide character strings but otherwise does not specify input or output. The
locales, which are standardized only by convention, talk about encoding from a
well specified (usually) disk format to whatever the library's internal
representation is.

