

A rant about cross-platform programming with wchar_t - BudVVeezer
http://losingfight.com/blog/2006/07/28/wchar_t-unsafe-at-any-size/

======
prodigal_erik
ANSI C specified wchar_t in 1989, two years before Unicode 1.0. They couldn't
even be sure Unicode was going to win.

Besides, the _whole point_ of wchar_t is to not be variable width. UTF-16 in
wchar_t is an abomination that dates back to the industry building APIs that
take UCS-2 (which the author really ought to cover) before they realized UCS-2
was too narrow to do its job. So now we have a lot of code that appears to
support Unicode but may not handle it correctly, depending on whether QA knew
they should try surrogate pairs. Almost nobody realizes UTF-16 needs to be
searched and spliced as carefully as UTF-8. Each is just a compression scheme
for the million or so actual codepoints, and there aren't many reasons to
favor one over the other (in memory, at least).
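A minimal sketch of the careful splicing this implies, assuming 16-bit units in a plain array (the function name is illustrative, not from the article):

```cpp
#include <cstdint>
#include <cstddef>
#include <cassert>
#include <vector>

// Decode one code point from UTF-16, advancing i past surrogate
// pairs. Searching or splicing by raw index would cut a pair in
// half -- the same class of bug UTF-8 handling has to avoid.
uint32_t next_codepoint(const std::vector<uint16_t>& s, std::size_t& i) {
    uint16_t hi = s[i++];
    if (hi >= 0xD800 && hi <= 0xDBFF && i < s.size()) {
        uint16_t lo = s[i];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            ++i;  // consume the low surrogate as well
            return 0x10000u + ((uint32_t(hi) - 0xD800u) << 10) + (lo - 0xDC00u);
        }
    }
    return hi;  // BMP code point (or an unpaired surrogate, passed through)
}
```

For instance, U+1F600 is stored as the pair 0xD83D 0xDE00, so indexing a single 16-bit unit does not give you a character.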

What's the actual problem here: that the team made assumptions ANSI warned against making? That Apple failed to accept UCS-4 for their API?


~~~
BudVVeezer
The fact remains that wchar_t is a royal pain to use across platforms. Sure, it defines the size of a single character consistently within a platform. But it doesn't define what encoding that is. UCS-2? UCS-4? And the fact is: different compilers define it differently, which sucks in practice.
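A small sketch of that divergence (function names are illustrative): MSVC gives wchar_t 2 bytes, while GCC and Clang on Linux and macOS give it 4, so the same wide literal produces different in-memory data per platform.

```cpp
#include <cwchar>
#include <cstddef>
#include <cassert>

// MSVC defines wchar_t as 2 bytes (wide literals are UTF-16, UCS-2 in
// older toolchains); GCC/Clang on Linux and macOS use 4 bytes
// (UCS-4/UTF-32).
std::size_t wchar_bytes() { return sizeof(wchar_t); }

// Length in wchar_t units, not characters: a non-BMP code point such
// as U+1F600 is one unit where wchar_t is 32 bits, but a two-unit
// surrogate pair where it is 16 bits.
std::size_t wide_units(const wchar_t* s) { return std::wcslen(s); }
```

So `wide_units(L"\U0001F600")` is 1 on a typical Linux or macOS build and 2 on Windows, even though the source code is identical.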

------
jheriko
This is silly... not only is the article inaccurate, but the problem described is trivial to solve. As long as you know what encoding wchar_t uses and what encoding your data is stored in, this is not a big problem: use one format internally and convert your data on the way in as appropriate. Trust me, I solved it with no prior knowledge and no formal education, in less than a day, as a distraction during my day job... I did not have to rewrite the entire library from scratch.
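The convert-at-the-boundary approach described here can be sketched as follows, assuming UTF-8 is chosen as the single internal format (the helper name is hypothetical, not from the comment):

```cpp
#include <string>
#include <cstdint>
#include <cassert>

// Boundary helper: whatever the platform's wchar_t holds, decode it
// to code points at the API edge, then re-encode into one internal
// format (UTF-8 here). This encodes a single code point.
std::string encode_utf8(uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: ASCII
        out += char(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += char(0xC0 | (cp >> 6));
        out += char(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: rest of the BMP
        out += char(0xE0 | (cp >> 12));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: supplementary planes
        out += char(0xF0 | (cp >> 18));
        out += char(0x80 | ((cp >> 12) & 0x3F));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With conversion confined to the edges like this, the rest of the library never needs to care what size or encoding the platform gave wchar_t.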

~~~
BudVVeezer
Inaccurate in what way?

It may be trivial to solve, but it's certainly not an issue anyone should _have_ to solve; it wouldn't exist if the C++ standards committee had dictated size and encoding information along with the datatype.

~~~
jheriko
the MS compiler uses wchar_t as UCS-2, not UTF-16, for example

all of the C++ types have platform dependence - expecting a specific number of
bits or encoding for wchar_t would make an exception of it, and one that would
be impossible to implement in practice.

