

The flaws of thinking "Just use Unicode" - lucumo
http://www.itworld.com/print/58558

======
tseabrooks
My company's product supports about 35 languages and 25 countries. I think
'just use Unicode' is fairly accurate. The problem may be how you go about
doing it. I'm reminded of a professor who used to tell his students to use
libraries and code written by people with expertise in the relevant area
whenever possible.

We use the IBM ICU library for our product and it handles the vast majority of
the issues presented in the article. There is no reason for me to be writing
code to handle the thousands of permutations of language, country, and culture
when someone else has already done the heavy lifting. ICU solves problems such
as sorting, number representation, and so on.
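For the curious, here's the sort of thing I mean, as a minimal sketch against
ICU4C's C++ API (the locale choices are just examples):

    // Locale-aware number formatting with ICU4C: the same value renders
    // as "1,234,567.89" in en_US and "1.234.567,89" in de_DE.
    // Build with: g++ demo.cpp `pkg-config --cflags --libs icu-uc icu-i18n`
    #include <unicode/numfmt.h>
    #include <unicode/unistr.h>
    #include <unicode/locid.h>
    #include <iostream>
    #include <memory>
    #include <string>

    int main() {
        for (const char* loc : {"en_US", "de_DE", "fr_FR"}) {
            UErrorCode status = U_ZERO_ERROR;
            std::unique_ptr<icu::NumberFormat> nf(
                icu::NumberFormat::createInstance(icu::Locale(loc), status));
            if (U_FAILURE(status)) return 1;
            icu::UnicodeString formatted;
            nf->format(1234567.89, formatted);
            std::string utf8;
            formatted.toUTF8String(utf8);
            std::cout << loc << ": " << utf8 << "\n";
        }
    }

Collation, date formatting, and transliteration all follow the same pattern:
ask ICU for a locale-specific service object instead of rolling your own rules.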

The other problems should be things unrelated to Unicode - things your product
/ business already needs:

1) Pick a target market - If you haven't picked a target market, worrying
about using Unicode to support some languages is putting the cart before the
horse.

2) Get a translator (or several) who knows what they are doing - You already
needed a translator; now you need to pick one competent enough to know the
difference between Mexican Spanish and the Spanish spoken in Spain.

3) Write software that can switch look and feel on the fly at runtime - This
has already been done and is a good philosophy for companies writing
GUI-intensive applications anyway.

Really, the correct sentence should be: "Just use Unicode, with your pre-
existing good business practices."

~~~
philfreo
This may work well for 35 languages. Not 350 or 3500.

~~~
tseabrooks
Why doesn't this scale to any number of languages? Assuming you follow the
existing practice of not hard-coding strings in your software but referring to
them by some sort of ID, I see no reason this approach couldn't handle 3,000
languages.
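To be concrete, the approach is nothing fancier than this (a sketch; the IDs
and loader are hypothetical):

    // Hypothetical message-catalog lookup: UI code refers to messages by
    // stable ID, and each language ships its own catalog. Adding language
    // number 3,000 means adding a catalog, not touching the code.
    #include <iostream>
    #include <string>
    #include <unordered_map>

    using Catalog = std::unordered_map<std::string, std::string>;

    // In a real product these come from per-language resource files.
    Catalog load_catalog(const std::string& lang) {
        if (lang == "es") return {{"greeting", "Hola"}, {"farewell", "Adiós"}};
        return {{"greeting", "Hello"}, {"farewell", "Goodbye"}};
    }

    int main() {
        Catalog msgs = load_catalog("es");
        std::cout << msgs.at("greeting") << "\n";   // prints "Hola"
    }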

~~~
lucumo
We are using IDs on our website instead of hard-coded strings. It causes no
end of problems. A non-exhaustive list:

1. Using variables to put numbers in a string. Singular and plural are not
always the only two options, and the rules can get quite complex. gettext has
this solved with per-language plural rules along the lines of: n == 1? Use
form A. n%10 between 2 and 4 (with exceptions)? Use form B. Otherwise? Use
form C. That should give you an idea of the complexity (see the ngettext
sketch after this list).

2. Substituting words into a string is nearly impossible in many languages due
to declension.

3. Names. When using a different script, will the name be transliterated? Do
the names need to be inflected? Will they be handled correctly when changing
writing direction?

3a. Names have a whole load of problems anyway. The "given name" isn't always
the first name. The surname isn't always the most important one. Some people
don't have family names, some people have two or more. Some people have
double-barrelled given names. Etc.

4. Ordinals differ greatly from language to language.

5. String length can mess up your layout. A good translator can help by
choosing a different idiom, but sometimes it's just not possible.

6. Layout expectations are different when the writing system is right-to-
left. You also need to learn the details of the Unicode bidirectional
algorithm, because you _will_ encounter situations where it doesn't work
properly without guidance (a small illustration follows this list).

These are just some problems we've actually encountered when using multiple
languages.
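On point 1, this is exactly what gettext's Plural-Forms machinery encodes, and
the calling code never sees the rules. A sketch (the "myapp" domain is made
up; the Polish header shown is the standard one from the gettext docs):

    // Plural-aware lookup with GNU gettext. Each language's .po file
    // carries its own plural rule, e.g. Polish has three forms:
    //   Plural-Forms: nplurals=3; plural=(n==1 ? 0 :
    //     n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2);
    // The code just passes the count and lets the catalog decide.
    #include <libintl.h>
    #include <clocale>
    #include <cstdio>

    int main() {
        std::setlocale(LC_ALL, "");                    // use the user's locale
        bindtextdomain("myapp", "/usr/share/locale");  // hypothetical domain
        textdomain("myapp");
        for (unsigned long n : {1UL, 2UL, 5UL}) {
            // ngettext picks the right plural form for the active language.
            std::printf(ngettext("%lu file copied", "%lu files copied", n), n);
            std::printf("\n");
        }
    }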
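And on point 6: when the bidirectional algorithm guesses wrong, you steer it
with invisible direction marks such as U+200E (LEFT-TO-RIGHT MARK). A tiny
illustration (how the two lines render depends on your terminal's bidi
support):

    // Steering the bidi algorithm by hand. A parenthesized Latin token
    // after Hebrew text is a classic case where the default ordering
    // looks wrong; U+200E pins the following run to left-to-right.
    #include <iostream>
    #include <string>

    int main() {
        const std::string hebrew = "\xD7\xA9\xD7\x9C\xD7\x95\xD7\x9D"; // "shalom", UTF-8
        const std::string lrm    = "\xE2\x80\x8E";                     // U+200E, UTF-8
        std::cout << hebrew + " (3)" << "\n";        // ordering left to the algorithm
        std::cout << hebrew + lrm + " (3)" << "\n";  // ordering pinned by the mark
    }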

~~~
oikujhgfvg
You don't even need to go outside English for some of these to be a problem:

What letter does the name Van der Waals sort under?

What about D'Arcy?

Does McKay come before MacKay?

Is 7-4-09 an American holiday or a week after April Fools' Day?
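Sort order isn't even a property of the strings; it's locale policy, which is
why you hand it to a collation library instead of answering these questions
yourself. A small ICU sketch of how the answer changes with locale (a
well-known example: Swedish sorts "ö" after "z", German doesn't):

    // Locale-sensitive comparison with ICU: the same two strings sort
    // differently under German and Swedish rules.
    #include <unicode/coll.h>
    #include <unicode/locid.h>
    #include <unicode/unistr.h>
    #include <iostream>
    #include <memory>

    static void demo(const char* loc) {
        UErrorCode status = U_ZERO_ERROR;
        std::unique_ptr<icu::Collator> coll(
            icu::Collator::createInstance(icu::Locale(loc), status));
        if (U_FAILURE(status)) return;
        UCollationResult r = coll->compare(
            icu::UnicodeString::fromUTF8("öl"),
            icu::UnicodeString::fromUTF8("zebra"), status);
        std::cout << loc << ": \"öl\" sorts "
                  << (r == UCOL_LESS ? "before" : "after") << " \"zebra\"\n";
    }

    int main() {
        demo("de_DE");  // before
        demo("sv_SE");  // after
    }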

------
Nosferax
I commonly use three languages: French, English, and Russian. On Windows I can
set the language for non-Unicode programs only once, which means programs
written against different legacy encodings can't all display properly at the
same time.

Please just use Unicode.

~~~
lucumo
The article doesn't say you _shouldn't_ use Unicode. But it explains that only
using Unicode just isn't enough. You need more. You shouldn't stop thinking at
Unicode. Don't JUST use Unicode, use Unicode _and more_.

~~~
aristus
A small example: many programs do not fold accents. So when I start typing
"lo" in the To: field of most email apps, the app suggests "Lorena Foo" but
not "Fulanito López".

------
theli0nheart
I stopped taking this post seriously after I read this:

> Further east we hit ideographics. The concept of a "letter" has just flew
> out of the occidental window. Not only that but the text is laid out top to
> bottom.

Chinese and Japanese are both read from left to right (although in the past
characters were written top to bottom and right to left). The only language
that I know of that is presently laid out from top to bottom is Mongolian.
I'll be damned if I ever need to write software for people in Mongolia.

~~~
potatolicious
For me it was:

> Oh, and there are languages with unbounded sets of "characters" such as
> Chinese which literally cannot be fully described in Unicode.

I read Chinese just fine in Unicode. Is there something I'm missing? Big
dictionaries run to some 50,000 characters, maybe 100,000 counting rare and
historical ones - that certainly doesn't bust any limits within Unicode. It
makes fonts rather large, but that's another issue altogether.
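Even the rare stuff fits: Unicode runs to U+10FFFF, and the CJK extensions
live above U+FFFF in the supplementary planes, where UTF-8 simply uses four
bytes. A sketch encoding one by hand (no validation of surrogates or range, to
keep it short):

    // UTF-8 encoding of an arbitrary code point. U+20000, the first
    // ideograph of CJK Extension B, comes out as F0 A0 80 80.
    #include <cstdint>
    #include <cstdio>
    #include <string>

    std::string utf8_encode(uint32_t cp) {
        std::string out;
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        return out;
    }

    int main() {
        for (unsigned char b : utf8_encode(0x20000))
            std::printf("%02X ", b);   // F0 A0 80 80
        std::printf("\n");
    }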

The rest of the article seems to be confused about what Unicode is - I don't
think anybody seriously goes into Unicode expecting it to practically
_translate_ for them. It's an encoding, that's it. It won't convert "," to "."
depending on locale, and it certainly won't rephrase your messages to be
sensitive to local cultures. That's _your responsibility_, not the purview of
_a form of encoding_.

I think the author's point at the end was: "Unicode" as a concept won't
localize your programs for you. Well, duh?

~~~
lucumo
_> I think the author's point at the end was: "Unicode" as a concept won't
localize your programs for you. Well, duh?_

That's his point as I understand it. But I disagree with the "duh" part. It's
rare to see software that goes beyond simple translation.

------
philfreo
_There is no such thing as a definitive list of languages_

This is true. But SIL maintains what is considered the "standard" list, and
there are almost 7,000 languages on it, many of them unwritten. See
<http://www.ethnologue.com/> and ISO 639-3.

------
awolf
>>What would our software need to do to operate in, say, Europe? Well, apart
from selecting a subset of the myriad of languages (France alone has 30+ )...

Who said that you need to support every language in a region for your software
to operate there? If someone living in France speaks an obscure dialect, it's
a pretty safe bet they ALSO speak French. And if they only speak a single
obscure language, they probably aren't going to be your customer anyway.

If you hit the top 15-20 most used languages in the world then you'll be fine.

Just use Unicode.

~~~
xiaoma
I really agree with just supporting the top 15-20 most used languages, but the
problem with "just" using Unicode is that #1 on that list of languages isn't
well served by some Unicode encodings.

In fact, the PRC requires that all language-related products introduced into
the Chinese marketplace must be able to function in GB 18030.
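Worth noting that GB 18030 maps the entire Unicode repertoire, so in practice
you can still keep UTF-8 internally and transcode at the boundary, e.g. with
POSIX iconv (a sketch, with error handling trimmed):

    // UTF-8 -> GB 18030 at the system boundary. GB 18030 covers all of
    // Unicode, so the conversion is lossless.
    #include <iconv.h>
    #include <cstdio>
    #include <cstring>

    int main() {
        iconv_t cd = iconv_open("GB18030", "UTF-8");
        if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

        char in[] = "\xE4\xBD\xA0\xE5\xA5\xBD";   // "ni hao" (U+4F60 U+597D), UTF-8
        char out[64] = {0};
        char* inp = in;  char* outp = out;
        size_t inleft = strlen(in), outleft = sizeof(out) - 1;

        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
            perror("iconv");
        iconv_close(cd);

        for (char* p = out; p != outp; ++p)
            std::printf("%02X ", (unsigned char)*p);   // C4 E3 BA C3
        std::printf("\n");
    }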

------
ajross
The whole premise of the link is flawed. It's an article about "localization
is hard". Well, duh.

I don't see anything in it, at all, about character set and encoding choice.
And in that realm, yes: you should just be using UTF-8. Not even "unicode" --
you want to be in UTF-8, period. Convert at the edges of your system if you
absolutely must handle other encodings. But wherever you can, you should just
be using UTF-8.

~~~
lucumo
The premise of the article wasn't that you shouldn't use Unicode. It was that
you need to care about more than _just_ Unicode. Thinking that the answer to
localisation is to "just use Unicode" is a flaw in people's thinking that this
article points out and corrects.

I'm curious as to why you believe UTF-8 is the only encoding you should use. I
know of at least one person who says you should use UTF-16, because then it's
immediately obvious when a certain piece of data is in the wrong encoding,
whereas with UTF-8 you need to look for non-Latin characters (or accents) to
distinguish it from, for example, ISO-8859-1. That line of thinking holds some
value for me. Why do you believe otherwise?

~~~
ajross
UTF-16 can't be used with traditional tools. No grepping, no strings, no
scanning quickly with a text editor. It locks you into whatever oddball
toolchain you cooked up in development. Old code that passes a string to
strcat() will fail in mysterious ways. And your old libraries probably expect
1-byte characters anyway, which means you're going to be constantly
converting.

The inability to distinguish between UTF-8 and ASCII is a _feature_, not a
bug. If you want to know whether it's a valid string, just validate it. UTF-8
has extraordinarily strong validation properties; it's almost impossible for
strings in other encodings to parse successfully.
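To make "strong validation properties" concrete, here's a sketch of a
byte-level validator. Continuation bytes must match 10xxxxxx, overlong forms
and UTF-16 surrogates are rejected, and anything past U+10FFFF fails, so
random Latin-1 or UTF-16 data almost never validates by accident:

    // Minimal UTF-8 validator.
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    bool is_valid_utf8(const uint8_t* s, size_t len) {
        size_t i = 0;
        while (i < len) {
            uint8_t b = s[i];
            size_t n;          // continuation bytes expected
            uint32_t cp, min;  // decoded code point, minimum legal value
            if (b < 0x80)                { i++; continue; }
            else if ((b & 0xE0) == 0xC0) { n = 1; cp = b & 0x1F; min = 0x80;    }
            else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; min = 0x800;   }
            else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; min = 0x10000; }
            else return false;           // stray continuation byte or bad lead
            if (i + n >= len) return false;                   // truncated
            for (size_t k = 1; k <= n; k++) {
                if ((s[i + k] & 0xC0) != 0x80) return false;  // not 10xxxxxx
                cp = (cp << 6) | (s[i + k] & 0x3F);
            }
            if (cp < min) return false;                       // overlong form
            if (cp > 0x10FFFF) return false;                  // beyond Unicode
            if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
            i += n + 1;
        }
        return true;
    }

    int main() {
        const uint8_t latin1[] = {'c', 'a', 'f', 0xE9};        // ISO-8859-1 "café"
        const uint8_t utf8[]   = {'c', 'a', 'f', 0xC3, 0xA9};  // UTF-8 "café"
        std::printf("%d %d\n",
                    is_valid_utf8(latin1, 4), is_valid_utf8(utf8, 5));  // 0 1
    }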

~~~
lucumo
_> Old code that passes the string to a strcat() will fail in mysterious ways.
And your old libraries probably expect 1-byte characters anyway, which means
you're going to be constantly converting_

But they will have that same problem with UTF-8, right? Just as soon as you
start using non-Latin characters? And if you aren't, there's no point in going
beyond ASCII.

~~~
ajross
Uh, no, not at all. The ANSI strXXX() functions work just fine with UTF-8;
that was kind of the point. Anything that treats strings as nul-terminated or
counted byte sequences works without change. Any work involving only ASCII
substrings (e.g. parsing computer-readable data) works without change, because
no non-ASCII character is ever represented using ASCII bytes in a UTF-8
string. Everything Just Works. Which is why you should just use it.
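Concretely, since no multi-byte UTF-8 sequence ever contains a byte below
0x80, byte-oriented parsing on ASCII delimiters can't cut a character in half
(a sketch; the source file itself is assumed to be saved as UTF-8):

    // Classic C string functions on UTF-8 text: splitting on an ASCII
    // comma and searching for an ASCII substring both work unchanged.
    #include <cstdio>
    #include <cstring>

    int main() {
        char line[] = "naïve,José,Åse";   // UTF-8 data, ASCII delimiters
        for (char* tok = strtok(line, ","); tok; tok = strtok(nullptr, ","))
            std::printf("[%s]\n", tok);    // [naïve] [José] [Åse]

        const char* s = "résumé filter";
        std::printf("%s\n", strstr(s, "filter"));   // ASCII search works
    }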

The sole exceptions are algorithms that truly need random access to
_characters_ within the string (e.g. "give me the 32nd character"), which
can't interoperate with any variable-length encoding. But there are precious
few of these. All anyone cares about is substrings.

