A Localization Horror Story: It Could Happen To You (cpan.org)
266 points by gspyrou on Jan 12, 2011 | 67 comments



That is a very good article.

One architectural takeaway suggestion I've learned over the years, which is not obvious when reading that article:

"You should not assume that you can generate any part of any string visible to the user without the full context."

Whenever you design a localizable application, it isn't enough to provide a string that can be translated. You have to allow for delegation to the most specific piece of code dealing with the string, because only that piece of code will have the appropriate context to properly produce the string.

This means your code can't just assume it can generate strings somewhere deep inside the guts of a library. The programmer writing the final application that uses your library needs to be able to generate/override those strings on a per-case basis in the actual code that displays them to the user. The strings might be different between two UI windows.
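A minimal sketch of that delegation idea, in Python. Every name here (`scan`, `default_describe`, `describe_for_status_bar`) is hypothetical and only illustrates the shape: the library reports raw counts and lets the calling code, which knows the UI context, decide the wording.

```python
from typing import Callable, Iterable

# Hypothetical library code: it counts things but does not insist on
# producing the final user-visible sentence itself.
def default_describe(file_count: int, dir_count: int) -> str:
    # Deliberately neutral fallback; callers are expected to override it.
    return f"files: {file_count}, directories: {dir_count}"

def scan(paths: Iterable[str],
         describe: Callable[[int, int], str] = default_describe) -> str:
    paths = list(paths)
    dir_count = sum(1 for p in paths if p.endswith("/"))
    file_count = len(paths) - dir_count
    # Delegate string generation to the most specific code available.
    return describe(file_count, dir_count)

# Hypothetical application code: the UI context is known here, so the
# wording is decided here.
def describe_for_status_bar(file_count: int, dir_count: int) -> str:
    noun = "file" if file_count == 1 else "files"
    return f"{file_count} {noun} found"

print(scan(["a/", "a/x.txt", "a/y.txt"]))
print(scan(["a/", "a/x.txt", "a/y.txt"], describe=describe_for_status_bar))
```

A second window could pass a different `describe` function for the same scan result, which is the point about strings differing between two UI windows.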

Trust me, I know. I'm Polish. Few languages are as insane as my native tongue. If you don't believe me, take a peek at this concise 252-page introduction to Polish numerals: http://www.amazon.com/Liczebnik-grammar-numerals-exercises-l...


> Trust me, I know. I'm Polish. Few languages are as insane as my native tongue.

I'm learning Polish and the crazy numbers business has really opened my eyes - doesn't the number 2 have something like 16 or 17 possible forms?


And Polish has the added benefit of being completely impossible to pronounce for stupid foreigners. I remember practicing to say Hello ("Cześć") for half an evening...


> And Polish has the added benefit of being completely impossible to pronounce for stupid foreigners

Hrm, not sure I entirely agree with you there, and I'm both stupid and living as a foreigner in Poland :) The writing system is fantastic in that regard, you can look at a word and have a very good idea of how it should sound.

For me, there's only one really difficult sound and that's ń: the best description I've heard of it is the first N in onion. But it's still a bugger to pronounce and hear the difference between, say, koń (horse) and koni (horses).


This just means that Polish is a phonetic language - like Russian, my native language, which is 99% phonetic - but that doesn't mean that English speakers can easily make the required sounds.

I don't think I've ever met any American who can pronounce ы - though it's fun watching them try.


Hah!

My Russian tutor says that my accent (which is mimicry based on Russian from songs, films, etc.) sounds almost native EXCEPT for how I pronounce ы. I can't quite get it down correctly. :(


Polish has the great advantage of being highly regular: you learn a few rules, you shape your mouth innards the right way, and you can read out anything.

Compare that to French, Chinese or English...


Just out of curiosity: Is the Polish numeral grammar changing over time? Has it been less complicated than it is now? Or is it slowly being simplified?


The dual form has been disappearing for the past couple of centuries.

http://en.wikipedia.org/wiki/Dual_(grammatical_number)

quoted // Of the living languages, only Slovene and Sorbian have preserved the dual number as a productive form. In all of the remaining languages, its influence is still found in the declension of nouns of which there are commonly only two: eyes, ears, shoulders, in certain fixed expressions, and the agreement of nouns when used with numbers.


My previous company produced a guide to internationalization/localization/etc for engineers (this is kinda helpful to have when a mixed team of Japanese, Koreans, Chinese, Indians, and one very out of place white guy are trying to make multilingual software on top of business processes not designed with diverse client populations in mind).

The guide was somewhat whimsically named bluepill.doc and subtitled Welcome To The Real World. You have no idea how deep this rabbit hole gets. I did this for years and I am regularly surprised by novel, hard problems. It is like security. (It even intersects with security sometimes: since approximately no application developers actually understand encoding issues, there are virtually boundless classes of vulnerabilities arising from their (mis)understandings not matching technical reality.)

(I only found out later that the blue pill was the escape-back-into-comfortable-fantasy option. Whoopsie.)


Just wait till you add unicode.doc, text_encodings_in_all_their_formats.doc and really unknown languages to the mix. Quite a nightmare...


Maybe it was a viagra reference instead? :-)


MediaWiki is one of those websites that is translated into nearly every language known to humans, and it has a pretty elegant system for this. Not perfect, but good enough for almost anything. You can get away with minimal markup in the lexicon this way:

* In the code, messages are specified abstractly, e.g.

   print getMessage('found_x_files_in_x_dirs', $fileCount, $dirCount);

* Each language has its own class with a 'convertPlural' function that maps the quantity to the right form. For English that function is simple; for Arabic it's complex: http://svn.wikimedia.org/svnroot/mediawiki/trunk/phase3/lang...

* Lexicons use a simple wiki markup to define the different forms of their language. To illustrate that the arguments don't have to be used in order, I did it in the reverse of how the code passes arguments.

    'found_x_files_in_x_dirs' => 
    "I searched $2 {{PLURAL:$2|directory|directories}} 
    and found $1 {{PLURAL:$1|file|files}}"
So for a language like Arabic you write a similar pipe-delimited list of forms. You just have to know how to lay down the six different forms in the order that LanguageAr.php defined.

Note how this side-steps most (but not all) complicating issues like case or gender, so you don't have to mark it that way in the lexicon. If the word is used in the feminine gender, accusative case plural in the sentence, that's what the translator writes.
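To make the mechanism concrete, here is a rough Python sketch of how such a lexicon entry could be expanded. This is not MediaWiki's code: `get_message`, `PLURAL_INDEX` and `LEXICON` are hypothetical names, and only an English rule is included. Other languages would plug in their own count-to-index function and list more forms.

```python
import re
from typing import Callable, Dict

# Hypothetical per-language rule: map a count to an index into the
# pipe-delimited list of forms supplied by the translator.
PLURAL_INDEX: Dict[str, Callable[[int], int]] = {
    "en": lambda n: 0 if n == 1 else 1,
}

# Hypothetical lexicon, using the markup quoted in the comment above.
LEXICON = {
    "en": {
        "found_x_files_in_x_dirs":
            "I searched $2 {{PLURAL:$2|directory|directories}} "
            "and found $1 {{PLURAL:$1|file|files}}",
    },
}

PLURAL_RE = re.compile(r"\{\{PLURAL:\$(\d+)\|([^}]*)\}\}")
ARG_RE = re.compile(r"\$(\d+)")

def get_message(lang: str, key: str, *args: int) -> str:
    template = LEXICON[lang][key]
    index_for = PLURAL_INDEX[lang]

    def expand_plural(match):
        count = args[int(match.group(1)) - 1]
        forms = match.group(2).split("|")
        # Clamp so an entry with too few forms still renders something.
        return forms[min(index_for(count), len(forms) - 1)]

    text = PLURAL_RE.sub(expand_plural, template)
    # Substitute the remaining positional arguments ($1, $2, ...).
    return ARG_RE.sub(lambda m: str(args[int(m.group(1)) - 1]), text)

# Arguments are passed as (fileCount, dirCount) but used in reverse order.
print(get_message("en", "found_x_files_in_x_dirs", 3, 1))
# prints: I searched 1 directory and found 3 files
```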

All this is mediated with the amazing http://translatewiki.net/ website, run mostly by volunteers.


A detailed, insightful and well-written article on the pitfalls and complications of internationalization in software, written by two linguists. Very good read, independent of Perl.


Gettext [1] does actually offer a way around this problem, which works fairly well in practice.

You can define the number of plurals and rules to select a plural case. Then you have as many translations as plural forms in your translation (po) file.

Arabic for instance has 6 plural cases with the following rules [2]:

  nplurals=6; plural= n==0 ? 0 : n==1 ? 1 : n==2 ? 2 : n%100>=3 && n%100<=10 ? 3 : n%100>=11 ? 4 : 5;
See also http://wiki.amule.org/index.php/Translations#Plural_forms for an example of both a rule and the resulting code in the po file.

[1] http://www.gnu.org/software/gettext/manual/gettext.html#Plur... [2] http://translate.sourceforge.net/wiki/l10n/pluralforms
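For readers who find the nested ternary hard to scan, here is the same Arabic rule transcribed into Python. It is only a restatement of the expression quoted above; the `FORMS` labels are placeholders standing in for the translator's `msgstr[0]` to `msgstr[5]` entries, not real Arabic strings.

```python
def arabic_plural_index(n: int) -> int:
    """Transcription of the C-style plural expression quoted above."""
    if n == 0:
        return 0
    if n == 1:
        return 1
    if n == 2:
        return 2
    if 3 <= n % 100 <= 10:
        return 3
    if n % 100 >= 11:
        return 4
    return 5

# Placeholders: a real .po file would hold one translated string per index.
FORMS = [f"msgstr[{i}]" for i in range(6)]

def pick_form(n: int) -> str:
    return FORMS[arabic_plural_index(n)]

for n in (0, 1, 2, 3, 11, 100, 102, 103):
    print(n, pick_form(n))
```

Note that 102 falls through to index 5 even though 2 gets index 2: the first three tests compare `n` itself, while the later ones look at `n % 100`.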


His advice of thinking of phrases as functions is spot on. However I think he's missing the easiest way of formulating solutions that would be usable by many translators/linguists: pattern matching.

Imagine your phrase/function is called:

"Found %n1 matching files in %n2 directories"

You could pattern match for one particular language like this:

  %n1 == 0, %n2 == 0
  %n1 == 1, %n2 == 1
  %n1 > 1, %n2 == 1
  ... and so on ...
With each pattern being any (simple?) boolean function of the arguments, applied in order. At the point where this becomes too cumbersome, you could fall back to proper code. I bet that this would be much easier to use for translators, with an optional fallback to a programmer if it gets too complex to spell out all combinations.
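A small Python sketch of what such an ordered rule table could look like. The table format, the `render` helper and the English wordings are all made up for illustration; the only idea taken from the comment is "try boolean conditions in order, first match wins".

```python
from typing import Callable, List, Tuple

Rule = Tuple[Callable[[int, int], bool], str]

# Hypothetical rule table for English. Rules are tried top to bottom and
# the first predicate that holds selects the template.
ENGLISH_RULES: List[Rule] = [
    (lambda n1, n2: n2 == 0, "No directories were searched"),
    (lambda n1, n2: n1 == 0 and n2 == 1, "Found no matching files in {n2} directory"),
    (lambda n1, n2: n1 == 0, "Found no matching files in {n2} directories"),
    (lambda n1, n2: n1 == 1 and n2 == 1, "Found {n1} matching file in {n2} directory"),
    (lambda n1, n2: n1 == 1, "Found {n1} matching file in {n2} directories"),
    (lambda n1, n2: n2 == 1, "Found {n1} matching files in {n2} directory"),
    (lambda n1, n2: True, "Found {n1} matching files in {n2} directories"),
]

def render(rules: List[Rule], n1: int, n2: int) -> str:
    for predicate, template in rules:
        if predicate(n1, n2):
            return template.format(n1=n1, n2=n2)
    raise LookupError("no rule matched")

print(render(ENGLISH_RULES, 5, 1))
# prints: Found 5 matching files in 1 directory
```

Even for English, two counts already need seven rows, which gives a feel for where the "too cumbersome" threshold might sit for languages with more forms.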


You're simplifying things. Please remember that in many languages the form of the numeral depends not only on the number, but also on what it is you're counting. And you don't always know that if you're deep inside a library.

In general, there is no easy way out — and you have to allow for exceptions. Please see my other comment, about an architectural takeaway.


> Please remember that in many languages the form of the numeral depends not only on the number, but also on what it is you're counting. And you don't always know that if you're deep inside a library.

Japanese, for example:

http://en.wikipedia.org/wiki/Japanese_counter_word

Take the counter 本, for a commonly-used example. That's used for: "Long, thin objects: rivers, roads, train tracks, ties, pencils, bottles, guitars; also, metaphorically, telephone calls, train or bus routes, movies (see also: tsūwa), points or bounds in sports events. Although 本 also means "book", the counter for books is satsu (冊)."


But doesn't the counter always follow the numeral in a non-declined manner in Japanese? That is, a pattern like "%n-satsu book" would work, right? I thought of Japanese, too, but I think maybe this is a pattern that is the inverse of the problem discussed.


So long as you know what you're counting, you're fine, yes.

But the real problem is that we have a different problem in each language. Every single language has its own way to do things and that's where we run into trouble, because you need generic ways of doing things and there really isn't one for all languages. You essentially need one function to create text for each language, even though some of them can share helper functions, like that one for numerals.


Oh - I'm not doubting the original point! I'm a professional translator myself. I know how weird it can get (most of my languages are Germanic and Romance and provide no great headaches, but Hungarian is hard to translate well).


Like I said, I understand and appreciate that arbitrary functions are required for a complete solution. I just think that this kind of matching would be a solution for a subset of problems (languages) large enough to make it useful.


GNU gettext has something like this, see http://www.gnu.org/software/hello/manual/gettext/Plural-form... for instance. Not sure if the article referenced this, I got the feeling it didn't since it started talking about replacing gettext pretty early on.


GNU gettext dispatches on a single number, i.e. it can handle cases like "Scanned %u directories", but not "Scanned %u files in %u directories".
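To illustrate the limitation, here is a hedged Python sketch using the standard library's `gettext` module. It uses `NullTranslations`, which has no catalog and simply applies the English-style singular/plural choice; a real application would load a catalog instead. The message strings and the `scanned_message` helper are made up for the example.

```python
import gettext

# No catalog loaded: ngettext falls back to "singular if n == 1, else plural".
_t = gettext.NullTranslations()

def scanned_message(file_count: int, dir_count: int) -> str:
    # Each ngettext call can select a form based on one number only, so a
    # two-count sentence has to be glued together from separate pieces.
    files = _t.ngettext("%d file", "%d files", file_count) % file_count
    dirs = _t.ngettext("%d directory", "%d directories", dir_count) % dir_count
    # The outer sentence is a third message whose wording cannot react to
    # either count.
    return _t.gettext("Scanned %(files)s in %(dirs)s") % {"files": files, "dirs": dirs}

print(scanned_message(1, 3))
# prints: Scanned 1 file in 3 directories
```

This gluing works for English, but it is the kind of fragment assembly that tends to break in inflected languages, where the form of each piece can depend on the sentence around it.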


At Last.fm we used PHP+Smarty with a gettext-esque pre-compilation step (a bit like IntSmarty) for templates that allowed translators to embed smarty templating code, so you could write this in smarty:

  {l}Found {$d} directories{/l}
and the replacement for each language could have its own switch on $d to decide how to translate it.

One (unavoidable?) downside is that the translators have to know some basic if/then/smarty syntax, and if they mess it up your template won't compile. Also you have to trust them somewhat, since they essentially get to execute PHP on your webserver.


Another downside is that this simply will not work for inflected languages, as the translation of "directories" will change depending on the number.

(unless I'm misunderstanding something in your templating scheme)


It works because the translator can write template code that is executed by the templating engine at runtime, so the translator can write something like this as the translation:

  {switch $d}
   {case 1} foo {$d}...
   {case 2} {$d} bar..
   {default} ba{$d}z
  {/switch}
We provided the translators with some basic examples of if/then/switches and so on; they were free to do all sorts of crazy things (especially in Polish).


Oh, ok, I understand now — that would work fine, then. It's actually a good solution, because it lets your translators customize strings right where they have the full context, e.g. in final templates.


I don't think that downside is unavoidable. In the same way that you can't expect somebody to create some or all of a web page without at least a basic understanding of some form of markup, you can't expect a linguist to handle various differing cases without an understanding of conditionals.

This is a classic case of "make things as simple as possible... but no simpler". You've achieved the first bit. To go any simpler would lose vital granularity.


This is why I'm scared to localize any of my applications. I'm in a good position where one of my web apps is used by lots of South Americans and Spanish people, and a couple of users have offered to translate it for free into their local tongue. Which is great... but I'm terrified about the potential amount of work when it comes to "You have X [nouns] set up" in table headings and the like.


Localization is complex, but it is nothing to be scared of. It can be done with proper forethought and planning, but it is by no means trivial. The translation itself can be tedious. Actually getting the translations onto your site is more difficult. Maintaining translations as your original content changes is where the fun starts.

I have had experience localizing web applications and it was not pleasant. We had to design the site from the start to account for multiple languages. It complicated the design. The alternate languages were, at best, several days behind the original English. A couple weeks was more typical. Localizing an existing site where we had not planned for it from the start was not practical.

Personally, I would recommend outsourcing localization if that is an option. It just isn't worth it to do it yourself. It is time consuming. More importantly, the time spent localizing is time not spent on your real business. Then again, I am completely biased because I make my living at a company where we provide this exact service.

(It may be a shameless plug, but it's on topic and might be useful to someone... The name of the company I work for is in my profile.)


I think this highlights a problem. Localization _can_ be hard, but most of the time it's not. You shouldn't be scared to localize your app.

After all, there's a vast amount of open source software that has been localized with gettext without problems.

In my experience, most translation strings are straightforward, and the corner cases as mentioned in the article don't happen very often. If you can tolerate a few hacks in your code, or a few extra translation messages, you don't have to go all the way down the rabbit hole. The perfect is the enemy of the good in this case.


In my experience, fully localizing a piece of software tends to lead to one of the most complex and fragile parts of its build system (especially if its documentation is localized too), often takes an appreciable amount of its total build time, tends to bloat its installation size by double or more, and leads to problems of inertia. The last inhibits, e.g., minor changes to a string, since any such change makes extra work for N translators (in my cases, N can be as high as 50).

Also, it requires hard decisions, like should this language that has fallen 20% out of date be disabled, or is it better for the program to startle the user with English 20% of the time? (Which 20%?)

And finally, after all this work and pain, technical computer users will complain that they prefer the English version because their native language has clumsy words for computing terms, or the translation to their native language is not idiomatic enough, or whatever. And you as the coordinator can't begin to judge translation quality unless you're fluent in N languages.

I've been lucky to have very skilled people handling the translation coordination in some free software projects I've been involved in, and localization has still had most of these elements of pain in most of them.

(Or if you prefer a proper rant: http://kitenet.net/~joey/blog/entry/on_localization_and_prog... )


I agree that having translations adds inertia and resistance to change in the code, and that this can be a problem. You should add localizations when you "need" them and know that you are going to have to support them, not because they're nice to have.

What I wanted to point out was that the article puts focus on a problem which often is not that big of an issue.

I've mostly been doing web development and localization of web applications, and I guess this makes the build process a bit simpler than what you describe.


Spanish and Portuguese plurals are exactly like English ones, except that the articles and adjectives also have to agree in number with the noun, not just the verb.


I'd take the easy route:

printf("Directories scanned: %g", $directory_count);


> printf("Directories scanned: %g", $directory_count);

That's precisely the problem the article addresses :) You could just about get away with it in English where the only number that won't work is 1, but what about Polish where "Directories" will take a different form if the number ends in a 2, 3 or a 4 but isn't 12, 13 or 14?
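The Polish rule described here can be written down as a short Python sketch. The function name and the `FORMS` list are just for illustration, and the three forms shown are the ones used after a bare number; in a full sentence the grammatical case could change them again.

```python
def polish_plural_index(n: int) -> int:
    # 0: singular, used for exactly 1
    # 1: numbers ending in 2, 3 or 4, except those ending in 12, 13 or 14
    # 2: everything else (0, 5-21, 25-31, ...)
    if n == 1:
        return 0
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return 1
    return 2

# "directory" in the three forms a count can require.
FORMS = ["katalog", "katalogi", "katalogów"]

def counted_directories(n: int) -> str:
    return f"{n} {FORMS[polish_plural_index(n)]}"

for n in (1, 2, 5, 12, 22, 112):
    print(counted_directories(n))
# prints:
# 1 katalog
# 2 katalogi
# 5 katalogów
# 12 katalogów
# 22 katalogi
# 112 katalogów
```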


The whole point of that workaround is that I'm making it obviously computer-lingo. "Directories scanned: 0" and "Directories scanned: 1" are not grammatically correct, but they sound OK and get the point across without sounding broken like "I found 1 directories".


> The whole point of that workaround is that I'm making it obviously computer-lingo

Thanks for the clarification. My point was that a workaround in English may not be a workaround in every other language in the world.


This is the interesting question, isn't it? Do other languages not have similar 'neutral' formulations? How far could such a ploy be generalised?


For the most part, it would work in essentially ANY language. The "noun, colon, number" notation completely sidesteps the issue of grammar by dumping the verb, which is usually the instigator of the worst problems (Russian, I am looking at you!). For example, the easy way out in Russian is simply "каталоги: 5", which is the nice and easy plural "directories", free from abusive endings imposed by accusative case verbs.


Interesting opinions: 1

Good ideas: 0

Usability is everything: True


Is that a self-referential comment?


One of the things that makes large ERP systems so complex is that these localization requirements also apply to business rules (e.g. for tax, payroll, even some rather basic accounting processes).


Oh god, large, localized ERP. This horrible thought never entered my mind before.


It's amazing how many different ways to do something there can be, even when it may seem initially that the way you've been taught is the only possible one. The something in this case being language.

Which is one reason to speak or code in more than one language.


And then you realize that your shiny custom font rendering assumes left-to-right and also that subsequent characters don't modify preceding character glyphs (cursive font).


Localizing games is even more complicated. User applications typically "talk" to the user, but games also have a variety of characters speaking to each other. And in MMO/sandbox games, you may not be able to anticipate which characters will be conversing. Localizing inter-character conversations also introduces pronouns, which are not often used for user applications.

Game translators need to know all sorts of metadata about the speaker and listener characters' "social context", such as gender, age, and "honor". Is the old king speaking to a young peasant girl or an entire village? Is the young prince speaking to an old peasant woman or to his old grandmother? Is the grandmother speaking to her son, who happens to be the king?
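As a tiny illustration of why that metadata has to reach the translation layer, here is a Python sketch for one well-known case, the French "tu"/"vous" distinction. The `Listener` type and its fields are hypothetical; how a game decides that address is formal (age, rank, kinship) would be its own design problem.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Listener:
    count: int = 1         # how many characters are being addressed
    formal: bool = False   # whether the speaker owes them formal address

def french_you(listener: Listener) -> str:
    # "vous" for any group, and for one person addressed formally;
    # "tu" for one person addressed informally.
    return "vous" if listener.count > 1 or listener.formal else "tu"

# The pronoun drags the verb form along, so the whole line is stored per
# variant rather than assembled from pieces.
WHERE_ARE_YOU_GOING = {
    "tu": "Où vas-tu ?",
    "vous": "Où allez-vous ?",
}

def ask_destination(listener: Listener) -> str:
    return WHERE_ARE_YOU_GOING[french_you(listener)]

print(ask_destination(Listener()))             # one peasant girl, informal
print(ask_destination(Listener(count=40)))     # an entire village
print(ask_destination(Listener(formal=True)))  # one person, formal address
```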


Ouch, languages are complicated. Two simple solutions that I think might work for a small packaged-software company like mine:

1. Just don't bother internationalizing. I think that's often not a bad solution for small software businesses anyway. It's not much use making a software package that works in French/German/Russian/Chinese/Swahili unless you have the language skills or partnerships to sell and support the software in those languages as well.

2. Design the software such that messages are whole sentences or phrases that stand alone and can be translated one-to-one. Nothing fancy like the %g type stuff for number inserts. Keep it simple stupid.


Option 1 really only works if you work in a big market for your native language. For example, you will not sell a lot of your software in English in Japan or France (where translation is mandatory by law, although not enforced thoroughly).

What would you say if the Japanese were trying to sell Japanese software with Japanese-only labels and manuals? Do you think many Americans would buy it?


I think you underestimate the market for English-only software.


And this is to say nothing of the near impossibility of finding a commercial translator who would know even simple Perl.

Dammit, there has got to be a way for me to make money there.


Not always appropriate, but what about simply not using proper sentences? What would be the implications of using this kind of form:

Directories scanned: %g

Files found: %g, Directories with files: %g


It's a very good article. One thing that has bitten me while doing localization is the programmer instinct to reuse code as much as possible. If you have ten buttons labeled "Save" then you only need one translation for it, right? Wrong! After you have had to go back through the code and split out all those "Save" labels into different contexts the lesson is ingrained...
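A minimal Python sketch of the "same label, different contexts" lesson. The catalog layout and the context names are hypothetical; GNU gettext offers a comparable idea through message contexts (`msgctxt`). The German strings show one way the English word splits: saving a document versus saving money on an offer.

```python
from typing import Dict, Tuple

# Hypothetical catalog keyed by (context, source text) instead of text alone.
CATALOG_DE: Dict[Tuple[str, str], str] = {
    ("file_dialog", "Save"): "Speichern",   # store a document
    ("discount_offer", "Save"): "Sparen",   # save money
}

def translate(context: str, text: str, catalog: Dict[Tuple[str, str], str]) -> str:
    # Fall back to the source text when no entry exists for this context.
    return catalog.get((context, text), text)

print(translate("file_dialog", "Save", CATALOG_DE))     # Speichern
print(translate("discount_offer", "Save", CATALOG_DE))  # Sparen
print(translate("toolbar", "Save", CATALOG_DE))         # Save (untranslated)
```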


I have been using grasshopper [1] for a project, and it already allows having functions in translations. It was nice to see a framework in a young ecosystem like node.js that can already handle this :)

[1]https://github.com/virtuo/grasshopper


I'm rather of the opinion that the translations should be sandboxed scripts, rather than strings.


A note to all the readers: the history part of the article is good, but the technical Perl part is very ill-advised, partly due to when it was written (1999, prior to Gettext's plurals support). Please don't follow that advice if you are writing Perl.


In Java, I would use FreeMarker as the template language for all sorts of things, including i18n, rather than the typical sprintf-like syntax, because FM templates can contain arbitrarily complex code, but the simple template cases work too.


OTOH:

search result: directory: NN, file: NN.

as in "search result: directory: 1, file: 0." or "search result: directory: 4, file: 23."

i.e.: nouns only, singular form, no verbs, no plural, etc.

Sure, it does not look "good" but it's probably much easier to translate.

Worse is better


Here is one possible solution: using a grammar http://www.grammaticalframework.org/




So the answer is to have translators write perl? Really?

Is this the best we, as an industry, can do?


For just this situation, Java offers the MessageFormat (and the ChoiceFormat refinement)

form.applyPattern( "There {0,choice,0#are no files|1#is one file|1<are {0,number,integer} files}.");

http://download.oracle.com/javase/6/docs/api/index.html?java...


You haven't actually read the article, have you?


Actually, I did give the article a quick read. He's raising interesting points around the problems of localization and showing how to solve those problems in Perl using logic to do something that Java lets you manage somewhat easily. What's not included in my brief comment on MessageFormat is the other part of the equation: Java also provides Locale and Property files for managing strings. My comment was intended to provide info to those Java coders out there who may not be aware of such things, since the Java API has a lot of nooks and crannies like that.


Couldn't read it all. I guess the takeaway is that the localization system needs to be scriptable somehow, templates of the form "bla bla %1" are not sufficient.


Would be great if somebody could explain what I missed?



