
A Localization Horror Story: It Could Happen To You - gspyrou
http://search.cpan.org/dist/Locale-Maketext/lib/Locale/Maketext/TPJ13.pod#A_Localization_Horror_Story:_It_Could_Happen_To_You
======
jwr
That is a very good article.

One architectural takeaway I've learned over the years, which is
not obvious from reading that article:

"You should not assume that you can generate any part of any string visible to
the user without the full context."

Whenever you design a localizable application, it isn't enough to provide a
string that can be translated. You have to allow for delegation to the most
specific piece of code dealing with the string, because only that piece of
code will have the appropriate context to properly produce the string.

This means your code can't just assume it can generate strings somewhere deep
inside the guts of a library. The programmer writing the final application
that uses your library needs to be able to generate/override those strings on
a per-case basis in the actual code that displays them to the user. The
strings might be different between two UI windows.
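
A minimal Python sketch of that idea, not taken from any real library: every name here (`ScanResult`, `report`, `status_bar_text`) is made up for illustration. The library returns plain data and accepts an optional formatter, so the code that knows the UI context decides the wording.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ScanResult:
    """Plain data returned by the (hypothetical) library; no user-visible text."""
    files: int
    directories: int


def default_describe(result: ScanResult) -> str:
    # Fallback wording, deliberately terse so it is rarely grammatically wrong.
    return f"files: {result.files}, directories: {result.directories}"


def report(result: ScanResult,
           describe: Optional[Callable[[ScanResult], str]] = None) -> str:
    """Build the message, delegating to the caller's formatter when one is given."""
    formatter = describe or default_describe
    return formatter(result)


def status_bar_text(result: ScanResult) -> str:
    # Application-side override: this window only cares about the file count.
    return f"{result.files} found"


result = ScanResult(files=3, directories=1)
print(report(result))                   # library default
print(report(result, status_bar_text))  # wording chosen by the application
```

The point is only the shape of the API: the string is produced at the outermost layer that has the context, and the library default is a fallback rather than the final word.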

Trust me, I know. I'm Polish. Few languages are as insane as my native tongue.
If you don't believe me, take a peek at this concise 252-page
introduction to Polish numerals: [http://www.amazon.com/Liczebnik-grammar-
numerals-exercises-l...](http://www.amazon.com/Liczebnik-grammar-numerals-
exercises-language/dp/832420234X)

~~~
Nitramp
And Polish has the added benefit of being completely impossible to pronounce
for stupid foreigners. I remember practicing to say Hello ("Cześć") for half
an evening...

~~~
mootothemax
_And Polish has the added benefit of being completely impossible to pronounce
for stupid foreigners_

Hrm, not sure I entirely agree with you there, and I'm both stupid and living
as a foreigner in Poland :) The writing system is fantastic in that regard,
you can look at a word and have a very good idea of how it should sound.

For me, there's only one really difficult sound and that's ń: the best
description I've heard of it is the first N in onion. But it's still a bugger
to pronounce and hear the difference between, say, koń (horse) and koni
(horses).

~~~
pavel_lishin
This just means that Polish is a phonetic language - like Russian, my native
language, which is 99% phonetic - but that doesn't mean that English speakers
can easily make the required sounds.

I don't think I've ever met any American who can pronounce ы - though it's fun
watching them try.

~~~
getsat
Hah!

My Russian tutor says that my accent (which is mimicry based on Russian from
songs, films, etc.) sounds almost native EXCEPT for how I pronounce ы. I can't
quite get it down correctly. :(

------
patio11
My previous company produced a guide to internationalization/localization/etc
for engineers (this is kinda helpful to have when a mixed team of Japanese,
Koreans, Chinese, Indians, and one very out of place white guy are trying to
make multilingual software on top of business processes not designed with
diverse client populations in mind).

The guide was somewhat whimsically named bluepill.doc and subtitled Welcome To
The Real World. You have no idea how deep this rabbit hole gets. I did this
for years and I am regularly surprised by novel, hard problems. It is like
security. (It even intersects with security sometimes: since approximately no
application developers actually understand encoding issues, there are
virtually boundless _classes_ of vulnerabilities arising from their
(mis)understandings not matching technical reality.)

(I only found out later that the blue pill was the escape-back-into-
comfortable-fantasy option. Whoopsie.)

~~~
joelhaasnoot
Just wait till you add unicode.doc, text_encodings_in_all_their_formats.doc
and really unknown languages to the mix. Quite a nightmare...

------
neilk
MediaWiki is one of those websites that is translated into nearly every
language known to humans, and it has a pretty elegant system for this. Not
perfect, but good enough for almost anything. You can get away with minimal
markup in the lexicon this way:

* In the code, messages are specified abstractly, e.g.
    
    
       print getMessage('found_x_files_in_x_dirs', $fileCount, $dirCount);
    

* Languages each have their own class with a 'convertPlural' function that maps the quantity to the forms. In English, that function might be simple; for Arabic, it's complex: [http://svn.wikimedia.org/svnroot/mediawiki/trunk/phase3/lang...](http://svn.wikimedia.org/svnroot/mediawiki/trunk/phase3/languages/classes/LanguageAr.php)

* Lexicons use a simple wiki markup to define the different forms of their language. To illustrate that the arguments don't have to be used in order, I did it in the reverse of how the code passes arguments.
    
    
        'found_x_files_in_x_dirs' => 
        "I searched $2 {{PLURAL:$2|directory|directories}} 
        and found $1 {{PLURAL:$1|file|files}}"
    

So for a language like Arabic you write a similar pipe-delimited list of
forms. You just have to know how to lay down the six different forms in the
order that LanguageAr.php defined.
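
To make the mechanics concrete, here is a rough Python sketch of the scheme. This is not MediaWiki's actual code: `get_message`, `convert_plural_en` and the two regexes are my own assumptions, and only an English rule is shown.

```python
import re


def convert_plural_en(count, forms):
    # English: the first form is singular, the last form covers everything else.
    return forms[0] if count == 1 else forms[-1]


# One plural function per language; a real system would register many more.
CONVERT_PLURAL = {"en": convert_plural_en}

LEXICON = {
    "en": {
        "found_x_files_in_x_dirs":
            "I searched $2 {{PLURAL:$2|directory|directories}} "
            "and found $1 {{PLURAL:$1|file|files}}",
    },
}

_PLURAL = re.compile(r"\{\{PLURAL:\$(\d+)\|([^}]*)\}\}")
_ARG = re.compile(r"\$(\d+)")


def get_message(lang, key, *args):
    """Expand {{PLURAL:$n|...}} using the language's rule, then fill in $n."""
    convert = CONVERT_PLURAL[lang]

    def expand(match):
        count = args[int(match.group(1)) - 1]
        return convert(count, match.group(2).split("|"))

    text = _PLURAL.sub(expand, LEXICON[lang][key])
    return _ARG.sub(lambda m: str(args[int(m.group(1)) - 1]), text)


print(get_message("en", "found_x_files_in_x_dirs", 5, 1))
```

The calling code only passes the counts; which form gets picked is decided by the language's own function and the translator's list of forms.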

Note how this side-steps most (but not all) complicating issues like case or
gender, so you don't have to mark it that way in the lexicon. If the word is
used in the feminine gender, accusative case plural in the sentence, that's
what the translator writes.

All this is mediated with the amazing <http://translatewiki.net/> website, run
mostly by volunteers.

------
thomas11
A detailed, insightful and well-written article on the pitfalls and
complications of internationalization in software, written by two linguists.
Very good read, independent of Perl.

------
johkra
Gettext [1] does actually offer a way around this problem, which works fairly
well in practice.

You can define the number of plurals and rules to select a plural case. Then
you have as many translations as plural forms in your translation (po) file.

Arabic for instance has 6 plural cases with the following rules [2]:

    
    
      nplurals=6; plural= n==0 ? 0 : n==1 ? 1 : n==2 ? 2 : n%100>=3 && n%100<=10 ? 3 : n%100>=11 ? 4 : 5;
    

See also <http://wiki.amule.org/index.php/Translations#Plural_forms> for an
example of both a rule and the resulting code in the po file.
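
For readers who don't parse C ternaries easily, the Arabic rule above transcribes to Python roughly as follows. The function names are mine and the forms are placeholders, not real Arabic translations.

```python
def arabic_plural_index(n: int) -> int:
    """Return the index (0-5) of the plural form that the rule above selects."""
    if n == 0:
        return 0
    if n == 1:
        return 1
    if n == 2:
        return 2
    if 3 <= n % 100 <= 10:
        return 3
    if n % 100 >= 11:
        return 4
    return 5


def pick_form(n: int, forms: list) -> str:
    # forms holds the six msgstr[0]..msgstr[5] entries from the po file.
    return forms[arabic_plural_index(n)]


placeholder_forms = [f"<form {i}>" for i in range(6)]  # stand-ins only
for n in (0, 1, 2, 3, 11, 100, 103):
    print(n, pick_form(n, placeholder_forms))
```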

[1]
[http://www.gnu.org/software/gettext/manual/gettext.html#Plur...](http://www.gnu.org/software/gettext/manual/gettext.html#Plural-
forms) [2] <http://translate.sourceforge.net/wiki/l10n/pluralforms>

------
Nitramp
His advice of thinking of phrases as functions is spot on. However I think
he's missing the easiest way of formulating solutions that would be usable by
many translators/linguists: pattern matching.

Imagine your phrase/function is called:

"Found %n1 matching files in %n2 directories"

You could pattern match for one particular language like this:

    
    
      %n1 == 0, %n2 == 0
      %n1 == 1, %n2 == 1
      %n1 > 1, %n2 == 1
      ... and so on ...
    

With the matching being any (simple?) boolean function of the operands,
applied in order. At the point where this becomes too cumbersome, you could
fall back to proper code. I bet that this would be much easier to use for
translators, with an optional fallback to a programmer if it gets too complex
to spell out all combinations.
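
Roughly what I have in mind, sketched in Python. The rule table and `render` are illustrative only, and the English wording is my own guess at what a translator would write.

```python
# Each rule is (predicate, template); the first predicate that matches wins.
# f = number of files, d = number of directories.
RULES_EN = [
    (lambda f, d: f == 0,            "Found no matching files"),
    (lambda f, d: f == 1 and d == 1, "Found 1 matching file in 1 directory"),
    (lambda f, d: f == 1,            "Found 1 matching file in {d} directories"),
    (lambda f, d: d == 1,            "Found {f} matching files in 1 directory"),
    (lambda f, d: True,              "Found {f} matching files in {d} directories"),
]


def render(rules, f, d):
    """Apply the rules in order and format the first template that matches."""
    for matches, template in rules:
        if matches(f, d):
            return template.format(f=f, d=d)
    raise ValueError("no rule matched")  # unreachable if a catch-all rule exists


print(render(RULES_EN, 0, 3))
print(render(RULES_EN, 5, 1))
print(render(RULES_EN, 5, 2))
```

A translator would only edit the table; the predicates could come from a small expression language instead of lambdas if you don't want them writing code.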

~~~
jwr
You're simplifying things. Please remember that in many languages the form of
the numeral depends not only on the number, but also on _what_ it is you're
counting. And you don't always know that if you're deep inside a library.

In general, there is no easy way out — and you have to allow for exceptions.
Please see my other comment, about an architectural takeaway.

~~~
Natsu
> Please remember that in many languages the form of the numeral depends not
> only on the number, but also on what it is you're counting. And you don't
> always know that if you're deep inside a library.

Japanese, for example:

<http://en.wikipedia.org/wiki/Japanese_counter_word>

Take the counter 本, for a commonly-used example. That's used for: "Long, thin
objects: rivers, roads, train tracks, ties, pencils, bottles, guitars; also,
metaphorically, telephone calls, train or bus routes, movies (see also:
tsūwa), points or bounds in sports events. Although 本 also means "book", the
counter for books is satsu (冊)."

~~~
Vivtek
But doesn't the counter always follow the numeral in a non-declined manner in
Japanese? That is, a pattern like "%n-satsu book" would work, right? I thought
of Japanese, too, but I think maybe this is a pattern that is the inverse of
the problem discussed.

~~~
Natsu
So long as you know what you're counting, you're fine, yes.

But the real problem is that we have a different problem in each language.
Every single language has its own way to do things and that's where we run
into trouble, because you need generic ways of doing things and there really
isn't one for all languages. You essentially need one function to create text
for each language, even though some of them can share helper functions, like
that one for numerals.

~~~
Vivtek
Oh - I'm not doubting the original point! I'm a professional translator
myself. I _know_ how weird it can get (most of my languages are Germanic and
Romance and provide no great headaches, but Hungarian is _hard_ to translate
well).

------
metabrew
At Last.fm we used PHP+Smarty with a gettext-esque pre-compilation step (a bit
like IntSmarty) for templates that allowed translators to embed smarty
templating code, so you could write this in smarty:

    
    
      {l}Found {$d} directories{/l}
    

and the replacement for each language could have its own switch on $d to
decide how to translate it.

One (unavoidable?) downside is that the translators have to know some basic
if/then/smarty syntax, and if they mess it up your template won't compile.
Also you have to trust them somewhat, since they essentially get to execute
PHP on your webserver.

~~~
jwr
Another downside is that this simply will not work for inflected languages, as
the translation of "directories" will change depending on the number.

(unless I'm misunderstanding something in your templating scheme)

~~~
metabrew
It works because the translator can write template code that is executed by
the templating engine at runtime, so the translator can write something like
this as the translation:

    
    
      {switch $d}
       {case 1} foo {$d}...
       {case 2} {$d} bar..
       {default} ba{$d}z
      {/switch}
    

We provided the translators with some basic examples of if/then/switches and
so on; they were free to do all sorts of crazy things (especially in Polish).

~~~
jwr
Oh, ok, I understand now — that would work fine, then. It's actually a good
solution, because it lets your translators customize strings right where they
have the full context, e.g. in final templates.

------
mootothemax
This is why I'm scared to localize any of my applications. I'm in a good
position where one of my web apps is used by lots of South Americans and
Spanish people, and a couple of users have offered to translate it for free
into their local tongue. Which is great... but I'm terrified about the
potential amount of work when it comes to "You have X [nouns] set up" in table
headings and the like.

~~~
gnaffle
I think this highlights a problem. Localization _can_ be hard, but most of the
time it's not. You shouldn't be scared to localize your app.

After all, there's a vast amount of open source software that has been
localized with gettext without problems.

In my experience, most translation strings are straightforward, and the corner
cases as mentioned in the article don't happen very often. If you can tolerate
a few hacks in your code, or a few extra translation messages, you don't have
to go all the way down the rabbit hole. The perfect is the enemy of the good
in this case.

~~~
joeyh
In my experience, fully localizing a piece of software tends to lead to one of
the most complex and fragile parts of its build system (especially if its
documentation is localized too), often takes an appreciable amount of its
total build time, tends to bloat its installation size by double or more, and
leads to problems of inertia. The last inhibits e.g. minor changes to a string,
since any such change is making extra work for N translators (in my cases, N
can be as high as 50).

Also, it requires hard decisions, like should this language that has fallen
20% out of date be disabled, or is it better for the program to startle the
user with English 20% of the time? (Which 20%?)

And finally, after all this work and pain, technical computer users will
complain that they prefer the English version because their native language
has clumsy terms for computing terms, or the translation to their native
language is not idiomatic enough, or whatever. And you as the coordinator
can't begin to judge translation quality unless you're fluent in N languages.

I've been lucky to have very skilled people handling the translation
coordination in some free software projects I've been involved in, and
localization has still had most of these elements of pain in most of them.

(Or if you prefer a proper rant:
[http://kitenet.net/~joey/blog/entry/on_localization_and_prog...](http://kitenet.net/~joey/blog/entry/on_localization_and_progress/)
)

~~~
gnaffle
I agree that having translations adds inertia and resistance to change in the
code, and that this can be a problem. You should add localizations when you
"need" them and know that you are going to have to support them, not because
they're nice to have.

What I wanted to point out was that the article puts focus on a problem which
often times is not that big of an issue.

I've mostly been doing web development and localization of web applications,
and I guess this makes the build process a bit simpler than what you describe.

------
revorad
I'd take the easy route:

printf("Directories scanned: %g", $directory_count);

~~~
mootothemax
_printf("Directories scanned: %g", $directory_count);_

That's precisely the problem the article addresses :) You could just about get
away with it in English where the only number that won't work is 1, but what
about Polish where "Directories" will take a different form if the number ends
in a 2, 3 or a 4 but isn't 12, 13 or 14?
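
As a sketch in Python (the function names are mine, and I'm using "katalog" as the word for directory): the test for 12, 13 and 14 has to be on the last two digits, so 112 behaves like 12.

```python
def polish_plural_index(n: int) -> int:
    """0 = singular, 1 = form used after 2-4, 2 = form used for everything else."""
    if n == 1:
        return 0
    if n % 10 in (2, 3, 4) and n % 100 not in (12, 13, 14):
        return 1
    return 2


# The three forms of "katalog" in the order the function above expects.
KATALOG_FORMS = ["katalog", "katalogi", "katalogów"]


def count_directories_pl(n: int) -> str:
    return f"{n} {KATALOG_FORMS[polish_plural_index(n)]}"


for n in (1, 2, 5, 12, 22, 112):
    print(count_directories_pl(n))
```

And this only covers the bare count; once the noun sits inside a sentence the case can change as well.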

~~~
revorad
The whole point of that workaround is that I'm making it obviously computer-
lingo. "Directories scanned: 0" and "Directories scanned: 1" are not
grammatically correct, but they sound OK and get the point across without
sounding broken like "I found 1 directories".

~~~
mootothemax
_The whole point of that workaround is that I'm making it obviously computer-
lingo_

Thanks for the clarification. My point was that a workaround in English may
not be a workaround in every other language in the world.

~~~
hxa7241
This is the interesting question isn't it? Do other languages not have similar
'neutral' formulations? How far could such a ploy be generalised?

~~~
dunmalg
For the most part, it would work in essentially ANY language. The "noun,
colon, number" notation completely sidesteps the issue of grammar by dumping
the verb, which is usually the instigator of the worst problems (Russian, I am
looking at you!). For example, the easy way out in Russian is simply
"каталоги: 5", which is the nice and easy plural "directories", free from
abusive endings imposed by accusative case verbs.

------
arethuza
One of the things that makes large ERP systems so complex is that these
localization requirements also apply to business rules (e.g. for tax, payroll,
even some rather basic accounting processes).

~~~
StuffMaster
Oh god, large, localized ERP. This horrible thought never entered my mind
before.

------
stretchwithme
It's amazing how many different ways to do something there can be, even when it
may seem initially that the way you've been taught is the only possible one.
The something in this case being language.

Which is one reason to speak or code in more than one language.

------
JabavuAdams
And then you realize that your shiny custom font rendering assumes left-to-
right and also that subsequent characters don't modify preceding character
glyphs (cursive font).

------
cpeterso
Localizing games is even more complicated. User applications typically "talk"
to the user, but games also have a variety of characters speaking to each other.
And in MMO/sandbox games, you may not be able to anticipate which characters
will be conversing. Localizing inter-character conversations also introduces
pronouns, which are not often used for user applications.

Game translators need to know all sorts of metadata about the speaker and
listener characters' "social context", such as gender, age, and "honor". Is
the old king speaking to a young peasant girl or an entire village? Is the
young prince speaking to an old peasant woman or to his old grandmother? Is
the grandmother speaking to her son, who happens to be the king?

------
bromley
Ouch, languages are complicated. Two simple solutions that I think might work
for a small packaged-software company like mine:

1\. Just don't bother internationalizing. I think that's often not a bad
solution for small software businesses anyway. It's not much use making a
software package that works in French/German/Russian/Chinese/Swahili unless
you have the language skills or partnerships to sell and support the software
in those languages as well.

2\. Design the software such that messages are whole sentences or phrases that
stand alone and can be translated one-to-one. Nothing fancy like the %g type
stuff for number inserts. Keep it simple stupid.

~~~
cdavid
1 really only works if you work in a big market for your native language. For
example, you will not sell a lot of your software in English in Japan or
France (where translation is mandatory by law, although not applied
thoroughly).

What would you say if the Japanese were trying to sell Japanese software with
Japanese-only indications/manual? Do you think many Americans would buy it?

~~~
zokier
I think you underestimate the market for English-only software.

------
Vivtek
_And this is to say nothing of the near impossibility of finding a commercial
translator who would know even simple Perl._

Dammit, there has _got_ to be a way for me to make money there.

------
frankc
Not always appropriate, but what about simply not using proper sentences? What
would be the implications of using this kind of form:

Directories scanned: %g

Files found: %g, Directories with files: %g

------
speleding
It's a very good article. One thing that has bitten me while doing
localization is the programmer instinct to reuse code as much as possible. If
you have ten buttons labeled "Save" then you only need one translation for it,
right? Wrong! After you have had to go back through the code and split out all
those "Save" labels into different contexts the lesson is ingrained...
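
A small Python sketch of what the split looks like, assuming a catalog keyed by (context, source string). The context names and the `translate` helper are invented for the example; GNU gettext's `msgctxt` serves a similar purpose. German makes the point: saving a file is "Speichern", saving money is "Sparen".

```python
# Hypothetical catalog: the same English label maps to different German words
# depending on where it is shown.
TRANSLATIONS_DE = {
    ("file_dialog", "Save"): "Speichern",    # save a document
    ("discount_banner", "Save"): "Sparen",   # save money
}


def translate(context: str, msgid: str, catalog: dict) -> str:
    """Look up msgid within a context; fall back to the source string."""
    return catalog.get((context, msgid), msgid)


print(translate("file_dialog", "Save", TRANSLATIONS_DE))
print(translate("discount_banner", "Save", TRANSLATIONS_DE))
print(translate("toolbar", "Save", TRANSLATIONS_DE))  # untranslated fallback
```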

------
ajithvl
I have been using grasshopper [1] for some project & it already allows having
functions in translations. It was nice to see a framework in a young ecosystem
like node.js which can already handle this :)

[1]<https://github.com/virtuo/grasshopper>

------
weavejester
I'm rather of the opinion that the translations should be sandboxed scripts,
rather than strings.

------
pronik
A note to all the readers: the history part of the article is good, the
technical Perl part is very ill-advised, partly due to when it was written
(1999, prior to Gettext's plurals support). Please don't follow that advice
if you are writing Perl.

------
wlievens
In Java, I would use FreeMarker as template language for all sorts of things,
including i18n, rather than the typical sprintf-like syntax, because FM
templates can contain arbitrary complex code, but the simple template cases
work too.

------
jhrobert
OTOH:

search result: directory: NN, file: NN.

as in "search result: directory: 1, file: 0." or "search result: directory: 4,
file: 23."

ie: nouns only, singular form, no verbs, no plural, etc....

Sure, it does not look "good" but it's probably much easier to translate.

Worse is better

------
gregwebs
Here is one possible solution: using a grammar
<http://www.grammaticalframework.org/>

------
DenisM
So the answer is to have translators write perl? Really?

Is this the best we, as an industry, can do?

------
locopati
For just this situation, Java offers the MessageFormat (and the ChoiceFormat
refinement)

form.applyPattern("There {0,choice,0#are no files|1#is one file|1<are {0,number,integer} files}.");

[http://download.oracle.com/javase/6/docs/api/index.html?java...](http://download.oracle.com/javase/6/docs/api/index.html?java/text/MessageFormat.html)

~~~
julian37
You haven't actually read the article, have you?

~~~
locopati
Actually, I did give the article a quick read. He's raising interesting points
around the problems of localization and showing how to solve those problems in
Perl using logic to do something that Java lets you manage somewhat easily.
What's not included in my brief comment on MessageFormat, is the other part of
the equation - Java also provides Locale and Property files for managing
strings. My comment was intended to provide info to those Java coders out
there who may not be aware of such things, since the Java API has a lot of
nooks and crannies like that.

------
Tichy
Couldn't read it all. I guess the takeaway is that the localization system
needs to be scriptable somehow, templates of the form "bla bla %1" are not
sufficient.

~~~
Tichy
Would be great if somebody could explain what I missed?

