
A Localization Horror Story: It Could Happen to You - pmoriarty
http://search.cpan.org/dist/Locale-Maketext/lib/Locale/Maketext/TPJ13.pod?#A_Localization_Horror_Story:_It_Could_Happen_To_You
======
eloisant
When I was in Japan I did proofreading for a Japanese feature phone. A major
Japanese brand, actually. That was really comical.

There was an Australian guy for English, a German guy, an Italian lady, and
me for French. What they did prior to the meeting was:

* translate from Japanese to English by Japanese people with a poor English
level (maybe the software engineers actually)

* translate from weird English to other languages by translators who had only
the strings, absolutely no context.

In the meeting we had all the strings, and one person from the manufacturers
who had access to the "super-confidential" unreleased device.

More than half of the translations were off because of lack of context. The
French guy actually translated "Garbage day" to something like "Shitty day",
apparently he thought that was a way to mark in your calendar that you had a
really bad day.

Pretty often we had sentences like "delete one", and invariably one of us had
to ask "One what? I need to know if it's masculine/feminine/neutral". Of
course they hadn't prepared for that, it was too late to change the code, so
they made us do ugly things like "%n item(s)".

Also the Australian guy was losing faith in humanity:

\- That sentence, it's completely wrong, it just doesn't mean anything in
English. People will just go "WTF?" when they read that.

\- We're not allowed to change the English strings, they're already validated.

\- .....

~~~
TeMPOraL
I don't know why nobody seems to put information like "warning: this phone's
UI in <your local language> is total and utter crap".

Anyway, what you wrote is exactly why I stick to using all software and
webservices - OS, text editors, Facebook, et al. - in en_US instead of my
native pl_PL. Because translations are always crappy - even for big players.
Lack of context is the key here - translated text often feels out of place,
because there usually is some overarching idea behind them that isn't
communicated to translators. Then there is lack of consistency. Words in
original text often have some site-specific meaning, which tends to also be
somehow lost in the translation process. For example, on Facebook the word
"like" talks about a well-defined thing, not about the dictionary meaning, so
it's _totally not ok_ to randomly replace it with synonyms during translation
[0].

I realized at some point that I often look at a crappy translation, guess what
the English original was, and then in my mind translate to what it should have
been in the first place. Because for some strange reason I, the user, have the
context, and the paid translation team has not. I guess I'm going to put that
into my "Translation issues" file in the "Mysteries of capitalism" drawer,
right next to the "how on Earth multi-million media companies can't do a movie
translation that isn't total crap" file. I mean, seriously, you're better
off looking for pirated subtitles even if you bought the original because
pirates at least seem to have watched the movie they're translating.

</rant>

[0] - I wish more translators would use the approach Jehovah's Witnesses used
when doing their own Bible translation. Since it was designed to be studied
and analyzed, they preferred accuracy over aesthetics - therefore one of the
translation rules was "as much as possible, let's have any given word in
original text be always represented by the same word in English". Adhering to
that single rule would eliminate like half of the "context missing" problems
with software translations.

~~~
fennecfoxen
You know what multi-million movie has a translation that isn't total crap?
_Frozen_. They really put resources into that. You can look up random Disney
songs on Youtube in different languages, and then look up the Frozen songs,
and you can sort of tell that they've done a better job even if you don't
speak the language.

Even relatively obscure languages like Dutch where they usually just watch
English-language movies:
[https://www.youtube.com/watch?v=yOueN0sV2SY](https://www.youtube.com/watch?v=yOueN0sV2SY)

~~~
TeMPOraL
I agree. Frozen, and other Pixar/Disney/Dreamworks children's movies (like
Shrek) tend to be of awesome quality in all languages. But I attribute this to
the fact that those movies are not _translated_ \- they're being _localized_ ,
which by definition requires much more work and paying much closer attention.

~~~
daxelrod
The Latin American Spanish localization of Dreamworks's Shrek is a great
example.

They brought in Eugenio Derbez, a Mexican comedian, to voice Donkey (voiced in
English by Eddie Murphy). Donkey in particular speaks in colloquialisms and
pop culture references with wordplay, so Derbez wrote a bunch of new lines and
jokes that referenced Latin American colloquialisms and pop culture.

Children learn different fairy tales in different countries, so they also
managed to change the identity of some of the characters without changing
their appearances (and without altering video at all, just audio).

~~~
emcrazyone
@daxelrod: Wow, thanks for that. Quite interesting. I always wondered about
how that was done and if it was a direct conversion of sorts but I guess it's
not. Very interesting.

I just have to ask: How do you know this?

~~~
daxelrod
All of this information comes from the teacher of a Spanish class I took (we
watched the Latin American version of Shrek in the class). I wish I had some
more tangible sources to cite.

EDIT:
[http://www.imdb.com/name/nm0220240/otherworks](http://www.imdb.com/name/nm0220240/otherworks):
"Jeffrey Katzenberg and Dreamworks allowed [Eugenio Derbez] not only to dub
Donkey's voice, but to translate and adapt the script of "Shrek" and "Shrek 2"
to make it more appealing to Latin America"

I also remember that the Gingerbread Man was one of the characters who was
altered, but I don't remember the name of the Latin American replacement.

~~~
crpatino
Gingerbread Man is translated as "El Hombre de Jengibre", which was previously
unknown to Spanish speaking children.

What was relocalized was the Muffin Man nursery rhyme
([http://en.wikipedia.org/wiki/The_Muffin_Man](http://en.wikipedia.org/wiki/The_Muffin_Man)),
which was substituted by Pinpon's ronda song:

Pinpon is a puppet very handsome and made out of cardboard. He washes his
little face with soap and water. He untangles his hair with an ivory comb. And
in spite of the hair pulling he cries not nor even winces.

~~~
daxelrod
Ah! Thank you, you're absolutely right, I misremembered the character.

------
dkbrk
As far as I can tell, the best tool for localisation almost nobody is using is
[http://www.grammaticalframework.org/](http://www.grammaticalframework.org/).
Licensing is a mix of GPL, BSD and MIT pieces.

It's a high-level functional programming language with a dependent type system
specialised for operating on language ASTs. Its resource library, to quote,
"covers the morphology and basic syntax of currently 29 languages: Afrikaans,
Bulgarian, Catalan, Chinese, Danish, Dutch, English, Estonian, Finnish,
French, German, Greek, Hindi, Japanese, Italian, Latvian, Maltese, Nepali,
Norwegian bokmål, Persian, Polish, Punjabi, Romanian, Russian, Sindhi,
Spanish, Swedish, Thai, Urdu."

In essence, once it has the language-independent AST, it can produce output in
all its supported languages with the correct tenses, genders, inflections,
etc.

It also seems to have tools for assisted parsing, so you could have an English
document and interactively parse it into the correct AST. In addition, the
text can be parameterised semantically, so if you changed the gender of a
person, that could propagate to all the correct locations and update the
translations as required.

While it seems the upfront cost may be quite high in having to learn such a
complex system, I think the benefits of having reproducible, high-quality
outputs into n languages for free could make this highly advantageous in many
applications.

~~~
canjobear
I'm very skeptical that this would work outside of toy examples, though it
depends on what is meant by language-independent AST.

For example, the best way to translate Spanish "X dio un golpe a Y" would be
"X hit Y". But my naive idea of what the AST for the Spanish sentence would
look like would be something like `(GIVE X HIT Y)`, which when naively
transduced to English would be the "X gave a hit to Y", which is either
unidiomatic or means the wrong thing altogether. In order to avoid this
problem, the AST would have to be a more abstract representation of the
semantics. And coming up with a sufficiently expressive, tractable, and
neutral representation of natural language semantics is an unsolved problem
that people are still devoting their whole careers to.
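
To make the objection concrete, here is a toy sketch in Python. It is not Grammatical Framework code: the `Hit` node and both `linearize_*` functions are made up for illustration. It assumes the tree records the meaning ("X strikes Y") rather than the Spanish surface syntax, so each language is free to pick its own idiom:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical semantic node: `agent` strikes `patient`."""
    agent: str
    patient: str


def linearize_en(node: Hit) -> str:
    # English uses a plain transitive verb.
    return f"{node.agent} hit {node.patient}"


def linearize_es(node: Hit) -> str:
    # Spanish uses the light-verb idiom "dar un golpe a".
    return f"{node.agent} dio un golpe a {node.patient}"


event = Hit("X", "Y")
print(linearize_en(event))
print(linearize_es(event))
```

The hard part is deciding on nodes like `Hit` for all of natural language, which this sketch sidesteps entirely.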

I was briefly involved in a very early stage startup that was considering
using systems like this for better machine translation. We ran into problems
like the above, and also: ambiguity, and the fact that the hand-written
grammars and semantic representation systems were just very brittle and
incomplete.

~~~
rmc
> which when naively transduced to English would be the "X gave a hit to Y",
> which is either unidiomatic or means the wrong thing altogether

What's interesting is that there are dialects of English (Hiberno-English,
spoken in Ireland) where "X gave Y a hit" would be a way to say "X hit Y". :)

~~~
pimlottc
While in others, it would make X a drug dealer.

~~~
masklinn
Or a mate. A mate lets his mate toke.

------
neilk
Side note: as you might expect, Wikipedia's internationalization is the only
system that attempts to do quantities and other formatting correctly for
_every goddamn language on the planet_ , but is considerably easier for
translators to work with than the OP's examples (sorry, Sean ;)

I did some work on bringing it to JavaScript and making it HTML-aware, and
since then Santhosh Thottingal has vastly extended it and it's become
pervasive at Wikipedia. More projects should use it, or at least learn from
it.

Demo:
[http://thottingal.in/projects/js/jquery.i18n/demo/](http://thottingal.in/projects/js/jquery.i18n/demo/)

Github:
[https://github.com/wikimedia/jquery.i18n](https://github.com/wikimedia/jquery.i18n)

~~~
bmn_
The Russian demo does not work. Must be 1 котёнок, 2 котёнка.

~~~
mynegation
Ironically, this is exactly the case described in the article, and even the
article got it slightly wrong. The rules in pseudocode - for this specific
sentence! - are as follows:

    
    
      if ((n % 10 == 1) && (n % 100 != 11)) {
          // singular nominative (not accusative as the article says): 1 котенок, 101 котенок, 301 котенок
      } else if (n % 10 >= 2 && n % 10 <= 4 && (n % 100 < 12 || n % 100 > 14)) {
          // singular genitive: 2 котенка, 43 котенка, 1024 котенка
      } else {
          // all other cases plural genitive: 5 котят, 11 котят, 212 котят
      }
    

But this is true only for this sentence because kittens here are the subject,
not the object: with Russian verb "есть" (to be, to exist) the literal
translation would be "Kitten/Kittens exists/exist belonging to Harry".

Once the declension of the numbered subject(s) turns to accusative: "Гарри
гладит 1 котенка" ("Harry pets one kitten"), the rules become much simpler:

    
    
      if ((n % 10 == 1) && (n % 100 != 11)) { 
          // singular accusative - 1 котенка, 101 котенка, 301 котенка
      } else {
          // all other cases plural genitive: 2 котят, 43 котят, 5 котят, 11 котят, 212 котят
      }
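
A runnable take on the two rule sets above, in Python. The function names are mine and the word forms are the ones quoted in this comment; treat it as a sketch for this one noun in these two sentences, not a general Russian pluralizer.

```python
def kitten_as_subject(n: int) -> str:
    """Form of 'kitten' after n when the kittens are the subject."""
    if n % 10 == 1 and n % 100 != 11:
        return "котенок"   # nominative singular: 1, 101, 301
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return "котенка"   # genitive singular: 2, 43, 1024
    return "котят"         # genitive plural: 5, 11, 212


def kitten_as_object(n: int) -> str:
    """Form of 'kitten' after n when the kittens are the direct object."""
    if n % 10 == 1 and n % 100 != 11:
        return "котенка"   # accusative singular: 1, 101, 301
    return "котят"         # genitive plural: 2, 43, 5, 11, 212


for n in (1, 2, 5, 11, 12, 43, 101, 212, 1024):
    print(n, kitten_as_subject(n), kitten_as_object(n))
```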

------
reidrac
Slightly OT, my favourite localization error was in Ubuntu when they had that
nice netbook interface (that later would become Unity). The network icon label
was "Rojo" in the Spanish localization, which is the word for the color red.
What?

Well, if you translate "Net" to Spanish you get "Red"; and if you translate
that again (by mistake), you get "Rojo". There you are :)

~~~
baby
OT means Out of Topic? In French we say HS (for Hors Sujet).

~~~
Kiro
I'm really surprised that this is the first time you've seen "OT" being used.
I can only presume it's because you don't normally read English
websites/discussion boards but I still find it intriguing that you've never
encountered it before.

In Sweden we use "OT" as well but it's referring to "Off Topic" and not
something Swedish.

~~~
baby
I read mostly English websites/discussion boards (reddit, HN) and I know a
bunch of abbreviations (IIRC, INB4, AFAIK, QED, IFF, ST...) but I don't
remember seeing OT. Learning new stuff every day :D

------
bmn_
Article is from 1998 and very much out of date. Read:
[http://blogs.perl.org/users/aristotle/2011/04/stop-using-mak...](http://blogs.perl.org/users/aristotle/2011/04/stop-using-maketext.html)

~~~
EvaK_de
Should be reflected in the title, if possible, neh?

~~~
lmm
Cool example, but I think "ne" is the "correct" way to romanize ね.

~~~
blywi
Could also be from German, although this would also be written ne instead of
neh. People from the northern part of Germany use this in pretty much the same
way it is used in Japanese (at least according to what I know with my limited
knowledge of Japanese) I always thought of this as a strange quirk that the
same language construct can evolve in two unrelated languages. It's just like
parallel evolution in biology.

~~~
schoen
There's also Brazilian Portuguese "né", which is an end-of-sentence tag with
_exactly_ the same meaning. It's a contraction of "não é" ('isn't it') and is
used in a way akin to German "nicht wahr".

~~~
weinzierl
Also "isso" which in German is colloquial for "Ist so" and means "That's it"
or "Exactly". When in Brazil I always found it funny that they use "isso",
short for "isso mesmo", in much the same way.

~~~
schoen
It's funny to think that the conversation fragment

\- ... Né?

\- Isso.

could happen in either Brazil or Germany with the same meaning. :-)

------
whizzkid
It made me remember the Norwegian customer I had.

I needed to write Norwegian localization strings in a YAML file which did not
work for some reason.

After 4 hours of debugging, the problem turned out to be this:

In YAML, the "no:" key (for "Norwegian") defined in the file was parsed as a
boolean, which broke the application.
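
The trap is wider than "no": YAML 1.1 resolves a whole family of unquoted words to booleans. Here is a small Python check, standard library only, that flags locale codes which need quoting. The word list follows the YAML 1.1 boolean type; `needs_quotes` is a hypothetical helper name, and real parsers differ in how much of the list they honour.

```python
# Unquoted scalars that the YAML 1.1 bool type resolves to true/false.
YAML11_BOOL_WORDS = {
    "y", "Y", "yes", "Yes", "YES",
    "n", "N", "no", "No", "NO",
    "true", "True", "TRUE",
    "false", "False", "FALSE",
    "on", "On", "ON",
    "off", "Off", "OFF",
}


def needs_quotes(key: str) -> bool:
    """Hypothetical helper: True if a bare key may be read as a boolean."""
    return key in YAML11_BOOL_WORDS


# Emit each locale code as a YAML key, quoting only the risky ones.
for code in ("en", "no", "sv", "on", "pt-BR"):
    line = f'"{code}":' if needs_quotes(code) else f"{code}:"
    print(line)
```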

~~~
mcphage
Yeah, I've run into that, also. Really annoying. I had to quote all of the
"no": strings. Looked ugly, but what can you do?

------
gldalmaso
In order to avoid these pitfalls, usually I get out of "sentence" mode to
"label" mode. For instance: "Directories scanned: 12". Probably not well
suited for all cases, but usually good enough for mine, though actually I only
have to support pt-BR, es-ES and en-US so maybe that's not saying much.

~~~
MichaelGG
Exactly. All this work, or just restructure the message. It should be
acceptable in most languages, because charts and spreadsheets aren't going to
have per cell labels. And it has the benefit of being easier to read and
parse.

Also, it's a really terrible style to use first person in an app unless it's
actually sentient. Otherwise it's annoyingly like Clippy, or just plain
obnoxious and presumptive.

~~~
TeMPOraL
> _Also, it 's a really terrible style to use first person in an app unless
> it's actually sentient. Otherwise it's annoyingly like Clippy, or just plain
> obnoxious and presumptive._

I agree, though I found another nice use case, well demonstrated by Bret
Victor[0][1]. I played around with it for a while and I find that describing
what will happen in a normal sentence, parts of which you can tweak,
is a pretty good way of doing options pages.

[0] -
[http://worrydream.com/#!/TenBrighterIdeas](http://worrydream.com/#!/TenBrighterIdeas)
[1] - [http://worrydream.com/Tangle/](http://worrydream.com/Tangle/)

------
dmytrish
Kudos to the author of the article for his perseverance in decorating message
to fit grammar. I'd go another way, just using more formal and dry format:

    
    
        Number of scanned directories: %g
        Number of found files: %g
    

That solves the problem with Slavic languages at least. Italian aversion to 0
may be mitigated with printing 'none', I guess. Please correct me if this form
does not fit other languages.
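
A quick sketch of that label style in Python. The dictionary layout and the `report` helper are made up for the example, and only English is filled in. The point is that the number sits after a colon, outside the sentence, so nothing has to agree with it; the optional "none" covers the zero case mentioned above.

```python
# Label catalogue keyed by locale (hypothetical layout); only English is filled in.
LABELS = {
    "en": {
        "dirs_scanned": "Number of scanned directories",
        "files_found": "Number of found files",
        "none": "none",
    },
}


def report(locale: str, key: str, count: int, spell_zero: bool = False) -> str:
    """Render 'Label: value' so the number never has to agree with a noun."""
    labels = LABELS[locale]
    value = labels["none"] if spell_zero and count == 0 else str(count)
    return f"{labels[key]}: {value}"


print(report("en", "dirs_scanned", 12))
print(report("en", "files_found", 0, spell_zero=True))
```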

~~~
placebo
Just scanned the comments to see if anyone would suggest that :) Being the
lazy type, that's the first thought I had reading the article. Perhaps it's
not appropriate for all target audiences and all target languages but many
times you can find a much easier solution by going about it in a totally
different way.

------
mrfoto
Funny how Slovene seems to tick all the complications checkboxes :D We have 4
grammatical numbers (singular, dual, plural for 3 and 4, plural for 5 and
above), they repeat at mod 100 (so 101 is singular, 102 dual,…), it's an
inflectional language with 3 grammatical genders, sentence should take a
different form depending on whether the user is male or female,…
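
For the curious, the number part of that can be sketched in a few lines of Python. `slovene_plural_index` is just an illustrative name; the mod-100 rule is the one described above, and as far as I know it matches the usual gettext plural expression for Slovenian. Gender and inflection are a separate problem it does not touch.

```python
def slovene_plural_index(n: int) -> int:
    """Map a count to one of the four Slovene plural categories."""
    r = n % 100
    if r == 1:
        return 0  # singular: 1, 101, 201
    if r == 2:
        return 1  # dual: 2, 102, 202
    if r in (3, 4):
        return 2  # plural for 3 and 4: 3, 4, 103, 104
    return 3      # plural for 5 and above (and 0): 5, 11, 100


for n in (0, 1, 2, 3, 4, 5, 100, 101, 102, 104):
    print(n, slovene_plural_index(n))
```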

~~~
gambiting
I was just trying to think of how it would work out in Polish....

directory is "katalog" in Polish, and it would be:

1 katalog

2 katalogi

3 katalogi

4 katalogi

5 katalogów

6 katalogów

....(it doesn't change for any number greater than 5)

But if you wanted to say "X files were found in Y directories" then you would
have to say:

....1 katalogu

....2 katalogach

....3 katalogach

....1000 katalogów

(for 1001 even I am not sure if you should say katalogów or katalogach, both
sound correct to me)

...and again, the whole thing changes depending on the speaker being
male/female + singular/plural("I have found"/"we have found").

Compared to the grammar of Slavic languages, English is super easy.

~~~
tinganho
ICU's messageformat solves this easily. One project that supports ICU's
messageformat is L10ns [http://l10ns.org](http://l10ns.org)

~~~
ygra
This looks quite nice and powerful and indeed an elegant way of solving the
problem with multiple plural forms in a string. The only concern I have with
that syntax is that it's yet another DSL, or markup language and translators
need to know it, or could get it wrong. Granted, a program for helping
translators might do automatic linting (much as Qt's Linguist already warns if
you omit placeholders from the translated phrase that are there in the
original).

Another thing is that the mini-language grows complex enough that the
resulting text can be quite hard to read and understand:

    
    
        {people, plural, offset:1 =0{No one went.} =1{{user1} went.} =2{{user1} and {user2} went.} other{{user1} and # others went.}}
    

is just a single (or two) placeholder and it takes a while to even parse how
it's supposed to work.
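
For readers puzzling over it, here is roughly what that message computes, written out as plain Python. This is my reading of the ICU semantics, not ICU code: the `=N` branches match the raw count, and with `offset:1` the `#` stands for the count minus one. `went_message` is a made-up name.

```python
def went_message(users: list[str]) -> str:
    """Spell out the branches of the plural message with offset 1."""
    people = len(users)
    if people == 0:
        return "No one went."
    if people == 1:
        return f"{users[0]} went."
    if people == 2:
        return f"{users[0]} and {users[1]} went."
    # 'other' branch: '#' is the count minus the offset of 1.
    return f"{users[0]} and {people - 1} others went."


print(went_message([]))
print(went_message(["Ann", "Bob"]))
print(went_message(["Ann", "Bob", "Cy", "Di"]))
```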

~~~
TeMPOraL
Well, if we're going to introduce that level of complexity into a DSL, why not
go full Turing-Complete and write it in code?

    
    
        (case (length folks) (0 "No one went.")
                             (1 ((elt folks 0) " went."))
                             (2 ((elt folks 0) " and " (elt folks 1) " went."))
                             (otherwise ((elt folks 0) " and " (1- (length folks)) " others went.")))
    

You can wrap that in a lambda that concatenates resulting strings and voilà,
you have "smart" string tables. And it's not a problem to make it even more
DSL-y and translator friendly.

~~~
ygra
And then you have the exact same problem as if you'd write that logic in your
source code. Just with half a dozen layers of abstraction, a more cumbersome
way of displaying strings in your application and another programming language
on top. I'd say that's not a net positive.

~~~
TeMPOraL
I disagree. That logic has to go somewhere anyway - you can't skip it because
it's inherent in the problem of displaying a proper message. So you could at
least write it in an expressive language instead of encoding it into what
looks almost as readable as regular expressions.

~~~
jameshart
Now you need to find a translator who knows Lisp

~~~
TeMPOraL
No you don't, in much the same way that you don't need a translator that
"knows JSON or XML". Just don't tell them it's Lisp. That's how you do DSLs.

Also, I advocate closer work between translators and developers. Let the
translators give the text and explain corner cases to someone who can code up
the logic.

BTW. Lisp is only hard for people who acquired this stupid meme that "Lisp is
weird/for crazy people". You'd be hard-pressed to find something which is
simpler in terms of syntax and readability.

------
theoh
The two Turkish letters dotted and dotless i are often confused by users of
poorly localised software. Wikipedia links to a murder case allegedly caused
by this:
[http://en.wikipedia.org/wiki/Dotted_and_dotless_I](http://en.wikipedia.org/wiki/Dotted_and_dotless_I)

A real horror story.

(Less seriously, Unicode has counterintuitive case-changing behaviours with
those letters. If you are working outside the Turkish locale and uppercase a
dotless I and then lowercase it, it gains a dot. I am curious about this
design decision, since it seems like a basic error in operating at the level of
glyphs rather than symbols. Or maybe the opposite.)

~~~
lmm
Upper and lower casing can't be assumed to be inverse; there are plenty of
other cases where they will change (e.g. precomposed characters that don't
have a precomposed upper case). The correct lower-casing of "I" in English is
definitely "i"; the correct upper-casing of "ı" in English is maybe a wrong
question, because it just isn't an English letter, so I guess you could argue
for leaving it unchanged, but converting it to "I" is probably what the person
who wrote "ı" would want to happen when it was upper-cased. Maybe?

~~~
theoh
With Unicode, why aren't the two Turkish Is just treated as if they have
nothing to do with the normal Latin I? The fact that the glyph for uppercase
dotless I resembles the glyph for uppercase Latin I should be irrelevant,
surely. It's a kind of typographic false friend situation.

Maybe there's a missing level of indirection in Unicode that prevents it from
doing this, but I can't see how there could be.

~~~
lmm
One answer is that unicode had to import existing documents; I suspect that a
lot of documents are written in a Turkish codepage that would have been an
8-bit encoding with the lower half as ASCII, that wouldn't have bothered with
a different codepoint for "Turkish" I. As I said, you can't rely on
upper/lowercasing roundtripping correctly in general.

(I was about to give the example of ß, which is usually uppercased to SS. But
interestingly Unicode has now adopted a codepoint for the (disputed, and
currently lacking a typographic consensus) capital version, ẞ. So maybe a
codepoint for "uppercase Turkish I" is on the way. Turkish users will still
expect to be able to lowercase "I" to a dotless lowercase i though, since a
lot of existing documents will have "I"s in)
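
The non-roundtripping is easy to see from Python, whose `str.upper()` and `str.lower()` apply the default, locale-independent Unicode case mappings:

```python
dotless = "\u0131"  # ı, LATIN SMALL LETTER DOTLESS I

# Outside a Turkish locale, ı uppercases to plain I, which lowercases to i.
print(dotless.upper())          # I
print(dotless.upper().lower())  # i

# ß has no single-character uppercase in the default mapping.
print("ß".upper())              # SS
print("ß".upper().lower())      # ss

# The capital ẞ (U+1E9E) does lowercase to ß.
print("\u1e9e".lower())         # ß
```

Getting Turkish behaviour needs locale-aware casing, which these two methods do not offer.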

~~~
theoh
I did a bit of research on this and you're right, legacy encodings are one
problem. More seriously there seems to be no established way to manage
multilingual text which includes homoglyphs (say by using colour coding) so
you would really be replacing one problem with another.

It does seem like this Turkish I problem is the most conspicuous situation,
maybe unique, where changing locale changes the behaviour of toupper/tolower.
Unicode, on the other hand, has many homoglyphs and duplicate characters which
all need to be dealt with.

[http://en.m.wikipedia.org/wiki/Homoglyph](http://en.m.wikipedia.org/wiki/Homoglyph)
[http://en.m.wikipedia.org/wiki/Duplicate_characters_in_Unico...](http://en.m.wikipedia.org/wiki/Duplicate_characters_in_Unicode)

------
olau
Just in case people are wondering about the horror story: use ngettext which
is a function in the gettext library.

~~~
gulpahum
Indeed, this is a solved problem. Use ngettext, see its page for all different
language variations (even the Slovenian four different forms)
[https://www.gnu.org/software/gettext/manual/html_node/Plural...](https://www.gnu.org/software/gettext/manual/html_node/Plural-forms.html)

~~~
philh
How does ngettext handle "Your query matched %g files in %g directories"?

~~~
bmn_
With reordering syntax.

[https://www.gnu.org/software/gettext/manual/html_node/c_002d...](https://www.gnu.org/software/gettext/manual/html_node/c_002dformat.html#c_002dformat)

~~~
yoha
This only addresses the ordering of parameters, not the fact that you need two
different counters.
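
One workaround, sketched with Python's standard `gettext` module: give each counted phrase its own `ngettext` call and slot the results into an outer sentence. With no catalogue installed, `ngettext` falls back to the English singular/plural choice, so this runs as is. The function name and message ids are only examples, and the approach assumes each phrase can be inflected without seeing the rest of the sentence, which is not true for every language.

```python
from gettext import gettext as _, ngettext


def match_summary(files: int, dirs: int) -> str:
    """Build the sentence from two independently pluralized phrases."""
    # Illustrative message ids; each count gets its own plural lookup.
    files_part = ngettext("%d file", "%d files", files) % files
    dirs_part = ngettext("%d directory", "%d directories", dirs) % dirs
    # Named placeholders let a translator reorder the two phrases.
    return _("Your query matched %(files)s in %(dirs)s.") % {
        "files": files_part,
        "dirs": dirs_part,
    }


print(match_summary(1, 1))
print(match_summary(3, 1))
print(match_summary(10, 2))
```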

------
DangerousPie
Interesting, but I am wondering if it is really worth going through all this
trouble, just to support a few edge cases.

Personally, I usually don't even notice small mistakes like "1 directories"
(or similar mistakes in my native language). Sometimes I will see the correct
version somewhere and think "Oh, nice that they thought of that" but I
definitely don't expect it.

Are the possible returns of having a "perfect" translation really high enough
to justify investing in a much more complex system? I am sure translators who
can code functions instead of just putting values into an Excel table will
come at quite a premium as well...

~~~
PinguTS
You may not notice it. But there are other people who do.

That is the difference between a very well designed product and a product
that just does the job.

That is the reason why engineers should not design interfaces. That should be
left to UI/UX experts. Just the other day I, an engineer myself, complained
to another engineer that his product does the job nicely and looks OK, but is
missing that little twist of a finished product that I'd really like to use.
The interface was designed the way I would have done it myself, because of a
lack of UI/UX knowledge.

~~~
DangerousPie
Of course, I'm sure there are people who notice this. My question was whether
something like this makes enough of a difference to justify the investment.

As a developer you only have a limited amount of funds and time to spend on
your product. Your goal is to invest these in a way that gives you the biggest
returns. Of course it's great to not just have a translation that "does the
job" and if you can do better you definitely should. But if that takes a
lot of work and only has a negligible effect on your sales, shouldn't you be
prioritizing other things?

~~~
bloodorange
We have 100s of thousands to millions of pageviews per day in 10s of languages
on the various pages of our site. The site grows from A/B testing and I have
to say that many of these small things do add up over time (they add up to
measurable conversion in the funnel). There is the odd one that surprisingly
does nothing or does worse but generally, paying attention to language details
did prove effective for us. I work on this stuff everyday in a small team
where we know more than 30 languages between us and sometimes just two or
three of these small changes more than pay for our annual wages in just a
month.

------
Zarkonnen
Huh. I actually took a stab at solving this problem using language generation
with my final year project at university:
[https://github.com/Zarkonnen/A-Natural-Language-Generator-fo...](https://github.com/Zarkonnen/A-Natural-Language-Generator-for-Software-Translation)

------
ajuc
Another thing:

"Are you sure you want to quit?" in a Slavic language will have the gender of
the user embedded. You need to know it to address the user correctly.

You can do a stilted "To the person that uses this program - are you sure you
want to quit?", but that's insane. So everybody just uses the male version.

~~~
userulluipeste
I don't know, in Russian either "ты" (singular/informal) or "вы"
(plural/formal) works fine for both genders. Now, for your example, Google
translated result is "Вы уверены, что хотите выйти?" which seems fine to me!

~~~
ajuc
"Wy" (plural you) doesn't work for a single person in Polish; it sounds like
"communist-speech" to us (almost like you called people "tovarishch") :) It
was only used by Soviet puppet politicians during communism (as a carbon copy
of the Russian expression).

So it needs "Ty" (singular you), and with singular you "sure" translates to
"pewien" for male recipients and "pewna" for female.

In Polish it should be "Jesteś pewny(or pewien)/pewna, że chcesz wyjść?"

So, it's Polish-specific, not Slavic-specific as I thought.
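
In code, the usual compromise looks something like this Python sketch. The Polish strings are the ones from this comment; the `Gender` enum and `confirm_quit` are invented for the example, and falling back to the masculine form when the gender is unknown is simply what most software does, not a recommendation.

```python
from enum import Enum
from typing import Optional


class Gender(Enum):
    MALE = "male"
    FEMALE = "female"


def confirm_quit(gender: Optional[Gender] = None) -> str:
    """Polish 'Are you sure you want to quit?' with a gendered adjective."""
    # Unknown gender falls back to the masculine form.
    adjective = "pewna" if gender is Gender.FEMALE else "pewien"
    return f"Jesteś {adjective}, że chcesz wyjść?"


print(confirm_quit(Gender.FEMALE))
print(confirm_quit())
```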

~~~
pavel_lishin
How is that typically handled in software? I can also imagine non-software
examples where this might be hard (A sign reading "Warning: you are entering a
restricted area", etc.)

~~~
ajuc
I explained poorly: verbs in second person are gender agnostic, adjectives are
gender-dependent.

The "restricted area" one is funny. The standard version is "Nieuprawnionym
wstęp wzbroniony" ~ "For non-privileged-ones entry is forbidden". It's a plural
noun made from an adjective; in nominative it would be different for male and
mixed/female groups (uprawnieni/uprawnione), but fortunately in plural in
dative case it's "uprawnionym" for both genders so it works out OK. I guess
it's common and maybe that's why the dative case works that way?

In most dialogs in software adjectives are the problem, especially "are you
sure". Usually software that doesn't know your gender for other reasons just
uses the male version.

In non-software world in formal documents it's often written like "he/she" in
every gender-dependent place, often with passive voice to cut the amount of
"/".

------
eCa
Also, if you 'localize' something using Google Translate[1], _please_ let the
user choose language somewhere in the app.

For example, the Hostelworld ios app[2] requires the user to change language
_for the entire device_. As something of a language perfectionist it leaves
the app virtually useless.

[1] Translating to English from other languages works fine for me.

[2] [https://itunes.apple.com/us/app/hostelworld.com-hostels-budg...](https://itunes.apple.com/us/app/hostelworld.com-hostels-budget/id348890820)

~~~
dezgeg
I wanted to do that on one Android app, but apparently it's impossible to
implement with Android's built-in localization features.

------
cj
I created Localize.js ([https://localizejs.com](https://localizejs.com)), a
localization SaaS.

Pluralization is a challenge, but we're able to solve this with some pretty
simple HTML tags.

For example:

<div>I have <var pluralize="3">3</var> dogs!</div>

Localize.js identifies the <var> tag with the pluralize attribute, and
pluralizes the phrase to any language (including languages like Arabic which
can have 6 different plural forms).

~~~
mfenniak
Huh. Localize.js sounds really cool. Fantastic idea.

Can you explain your example a little more? What would translators see in this
case for Arabic; would they need to provide three translations? And if there
were two variables, nine translations?

"Localization" also implies a lot more than just translation. Does Localize.js
handle work like culture-specific number and date formatting? Different
collation of records for different languages? How does it identify
application-generated text versus user data (eg. on a blog, does it translate
blog comments entered by readers, or just text like "Please enter a comment
below")?

~~~
cj
> What would translators see in this case for Arabic

The Arabic translator would have to provide 6 different translations, one
translation for each plural form.

> And if there were two variables, nine translations?

We currently only allow for pluralization based on one variable per phrase.
We're hoping to address this soon.

> Does Localize.js handle work like culture-specific number and date
> formatting?

We usually recommend libraries like moment.js to handle date localization.
They do a fantastic job at localizing dates.

> How does it identify application-generated text versus user data

We provide a set of HTML markers that you can use to indicate to Localize.js
that certain text should be translated. For example...

<div class="blog-comments" _notranslate_ > [...] </div>

[https://localizejs.com/docs/usage/variables](https://localizejs.com/docs/usage/variables)

------
smhg
Now, I don't know the state of the gettext utilities in 1999, but the
arguments don't seem to hold up anymore (as others commented).

It just surprises me how many times gettext is discarded as "not solving the
problem" while it gets many things right.

It feels like the lack of knowledge about the complexity of i18n/l10n and
about gettext are often the real issues.

------
lmm
I've long found that "externalized" translations in po files (or any
equivalent) are more trouble than they're worth, for exactly this reason.
Translations need to be functions, so they need to be written in a format
that's good for writing functions - i.e. a programming language. What we want
is a MessageSource interface, and a bunch of language-specific
implementations.

Fortunately I work in Scala, so it's very easy to have an "embedded DSL"
that's ordinary, first-class code but not much harder for non-technical
translators to read or write than the .po format; we can write helpers for
grammatical case or numbers or similar. But having the full power of a
programming language there means that when you hit a case you haven't thought
of (and you will), you can fall back to just an if/else.
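
A rough sketch of the "MessageSource interface plus language-specific implementations" idea, written in Python rather than Scala. Every name here (MessageSource, files_found, report) is hypothetical and only meant to show the shape, not any real library.

```python
from typing import Protocol


class MessageSource(Protocol):
    """One method per message; each language supplies its own implementation."""

    def files_found(self, count: int) -> str: ...


class English:
    def files_found(self, count: int) -> str:
        if count == 0:
            return "No files found."
        noun = "file" if count == 1 else "files"
        return f"{count} {noun} found."


class French:
    def files_found(self, count: int) -> str:
        # Zero gets its own wording; the participle agrees in number otherwise.
        if count == 0:
            return "Aucun fichier trouvé."
        if count == 1:
            return "1 fichier trouvé."
        return f"{count} fichiers trouvés."


def report(messages: MessageSource, count: int) -> str:
    """Application code depends only on the interface, never on a language."""
    return messages.files_found(count)


print(report(English(), 1))
print(report(French(), 3))
```

The point of the design is the plain if/else inside each implementation: a language with an unusual rule just writes it out.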

------
luminarious
For websites, [http://l20n.org](http://l20n.org) seems the most natural
version so far. Or is there something better?

~~~
tinganho
You might want to check out [http://l10ns.org](http://l10ns.org)

------
barrystaes
I don't agree with the article. The author goes about manually implementing
localisations, and eventually throws out GNU gettext. But it DOES have
excellent plural support, and a header in your PO file allows Chinese to use
"nplurals=1; plural=0;" for example:
[http://localization-guide.readthedocs.org/en/latest/l10n/plu...](http://localization-guide.readthedocs.org/en/latest/l10n/pluralforms.html)

Or use plurals as such:
[https://www.gnu.org/software/gettext/manual/html_node/Transl...](https://www.gnu.org/software/gettext/manual/html_node/Translating-plural-forms.html)
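
For anyone who hasn't seen a Plural-Forms header: it is a count of forms plus a C expression that maps n to a form index. Below is a small Python sketch that transcribes three such expressions by hand (the Chinese one quoted above, plus the usual English and Russian ones from the gettext manual). The table and the plural_index helper are hypothetical names, not part of gettext.

```python
# Each entry mirrors a Plural-Forms header: (nplurals, function from n to index).
PLURAL_FORMS = {
    # nplurals=1; plural=0;
    "zh": (1, lambda n: 0),
    # nplurals=2; plural=(n != 1);
    "en": (2, lambda n: int(n != 1)),
    # nplurals=3; plural=(n%10==1 && n%100!=11 ? 0 :
    #   n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2);
    "ru": (3, lambda n: 0 if n % 10 == 1 and n % 100 != 11
           else 1 if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20)
           else 2),
}


def plural_index(lang: str, n: int) -> int:
    """Return which of the translator's msgstr[i] entries applies to n."""
    nplurals, rule = PLURAL_FORMS[lang]
    index = rule(n)
    if not 0 <= index < nplurals:
        raise ValueError(f"rule for {lang} returned {index}, expected < {nplurals}")
    return index


for n in (1, 2, 5, 11, 21, 22):
    print(n, plural_index("ru", n))
```

The translator only fills in nplurals strings per message; the expression does the selection.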

------
anton_gogolev
And remember the Turkey Test [1].

[1]:
[http://stackoverflow.com/a/797043/60188](http://stackoverflow.com/a/797043/60188)

------
dcposch
People here are talking about complicated, difficult solutions, like using a
library to create text in multiple languages from an AST(!)

I think the moral of the post is simple: don't try to generate natural
language. It won't sound natural. Just keep it basic.

"Directories scanned: %d" "Directories matched: %d" "Files matched: %d"

...will localize just fine

~~~
jpatte
That might be ok for technical people who are used to this kind of "machine
speak", but for end users it's often not acceptable. Consider the difference
between:

    
    
        Purchased credits: %d.
        Remaining credits: %d.
        Buy more : 5 [link] / 10 [link] / 20 [link].
    

and

    
    
       You have purchased a total of %d credits and have used %d of them so far. You can buy [dropdownlist] additional credits by clicking here [link].
    

Which one do you think will have the highest conversion rate?

------
tinganho
Please check out [http://l10ns.org](http://l10ns.org). It handles the
pluralization case pretty well. It uses ICU's messageformat which is a markup
for defining plural formatting
[http://l10ns.org/docs.html#pluralformat](http://l10ns.org/docs.html#pluralformat)

------
tux3
Qt takes care of that in a really nice way.

You write tr("I scanned %n directory.", "", count) and it takes care of
applying the correct translation with the right plurals depending on the
number.

[http://doc.qt.digia.com/4.2/qobject.html#tr](http://doc.qt.digia.com/4.2/qobject.html#tr)

~~~
tedunangst
How does that deal with the case where it should translate to "I didn't scan
any directories"? According to the documentation, "In the translated version
the variables must still appear."
[http://doc.qt.digia.com/4.2/linguist-translators.html](http://doc.qt.digia.com/4.2/linguist-translators.html)

------
pp19dd
We version (localize) projects all the time, so I thought reasonable logic for
seeing whether a string is empty is to check whether its trimmed length was
greater than two, allowing for two stray characters that show up all the time
in content. For example, someone who's not sure how to translate something
would type "??" in a field. So, {if $slide.title|length>2} ... This worked for
4-5 languages.

Then our Chinese (Mandarin) division called and asked me to look into buggy
behavior with their translations. Turns out a whole sentence got translated to
... two characters, and wasn't showing up.
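
The failure is easy to reproduce. Here is a Python sketch of the length heuristic next to one possible alternative that compares against known placeholder values instead; the placeholder list and both function names are assumptions made up for this example, not the original template code.

```python
# Hypothetical markers that translators type when a field is not done yet.
PLACEHOLDERS = {"", "?", "??", "-", "--"}


def is_blank_by_length(title: str) -> bool:
    """The heuristic from the story: two characters or fewer counts as empty."""
    return len(title.strip()) <= 2


def is_blank_by_placeholder(title: str) -> bool:
    """Alternative: only known placeholder values count as empty."""
    return title.strip() in PLACEHOLDERS


for title in ("??", "Welcome", "你好"):
    print(repr(title), is_blank_by_length(title), is_blank_by_placeholder(title))
```

With the length check, a legitimate two-character Chinese title is hidden; with the placeholder check it is shown, at the cost of having to maintain the list.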

------
js2
Previous discussion -
[https://news.ycombinator.com/item?id=2095334](https://news.ycombinator.com/item?id=2095334)

------
TeMPOraL
I once suggested an idea that maybe instead of strings in tables one could use
something better suited for the task at hand like, say, code? Maybe let the
tables store not only strings but functions as well, so that you could handle
the more complex cases directly?

I remember being hit in the head by gettext manual and told something about
translators not knowing how to code.

Heck, I still think it's a neater idea than gettext.
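
Roughly what I had in mind, as a Python sketch: a lookup table whose entries are either plain strings or functions, so the simple cases stay simple and the hard ones get code. CATALOG, translate and the message keys are invented for this example.

```python
from typing import Callable, Dict, Union

Message = Union[str, Callable[..., str]]


def _dirs_scanned(count: int) -> str:
    # A case a fixed template handles badly: zero gets its own sentence.
    if count == 0:
        return "I didn't scan any directories."
    if count == 1:
        return "I scanned 1 directory."
    return f"I scanned {count} directories."


CATALOG: Dict[str, Message] = {
    "greeting": "Hello!",            # plain string, no logic needed
    "dirs_scanned": _dirs_scanned,   # function, for the complex case
}


def translate(key: str, *args) -> str:
    """Look up a message; call it if it is a function, return it otherwise."""
    entry = CATALOG[key]
    return entry(*args) if callable(entry) else entry


print(translate("greeting"))
print(translate("dirs_scanned", 0))
print(translate("dirs_scanned", 12))
```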

------
alxndr
I wonder if Lojban[1] could serve as an unambiguous, largely-context-
independent way to store representations of the central concepts, which can
then be translated into natural human language.

[1] "a constructed, syntactically unambiguous human language based on
predicate logic",
[http://en.wikipedia.org/wiki/Lojban](http://en.wikipedia.org/wiki/Lojban)

------
codefisher
I saw most of this coming. I have studied a little Ancient Greek, which has the
same problem as Arabic, and Polish, which is similar to Russian, and
now I am living in Italy. I guess it is one of those things that we
programmers just forget about too much, and we just expect translation to
be a mechanical process in the last stage of the development cycle.

------
aardshark
iOS localization handles this with its stringsdict format:

[https://developer.apple.com/library/ios/documentation/MacOSX...](https://developer.apple.com/library/ios/documentation/MacOSX/Conceptual/BPInternational/StringsdictFileFormat/StringsdictFileFormat.html)

------
nodata
What if the text wasn't a sentence?

~~~
bmn_
If the text is not a whole sentence or paragraph, it cannot be accurately
translated anymore into a number of languages with different word order.

As a programmer with i18n responsibilities, you learn to identify and fix
fragmented text problems first.

------
pmontra
Keeping the most common localization problems in mind, I tend to
use strings like this:

Directory searched: 10. Files found: 0.

They are not shiny but a direct translation is OK in Italian and it could be
OK in the other languages of the post (Chinese, Arabic and Russian).

------
4ad
I wonder if it isn't better to generate the messages as an AST, and have a
language generator, the back-end of a compiler really, that generates strings
for each language. I'm sure there will be fewer edge cases that way.

/edit: wow, downvotes.

~~~
4ad
This compiles to "He likes 2 brown cats" in English.

The representation makes no assumption of SVO order, the grammatical category
of numbers, and what cases and tenses are available.

    
    
        (OPRESENT ((OSUBJECT (OPRONOUN (3, 0))) ((OVERB, "like") OMUL (2, (OCAST (OATTR (OCOLOR "brown")), (ONOUN (OANIMAL, "cat")))))))
    

I'm sure someone smart will tell me what assumption I made that's invalid in
some language I do not know, but this already is better, and simpler, and more
correct than a lot of text-mapping based internationalizations I've seen.

And I will be able to correct my S-expr to get rid of more assumptions, and I
will just make it longer perhaps, but I won't need to write all the convoluted
code prevalent in the article.

Also, with AST-based representation you can add whatever context you need. For
example, were I to have used "pig" instead of "cat", this would have already
worked fine, since we have (OANIMAL, "pig") and not (OPERSON, "pig"). This is
trivial, but you can add whatever amount of context required.
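
To give a feel for the back-end side, here is a deliberately tiny Python sketch of an English realizer for a structure like the S-expression above. The dataclasses, field names and the reading of the pronoun as (person, plural) are assumptions for illustration; a real generator would need a lexicon for irregular nouns and verbs, and one realizer per target language.

```python
from dataclasses import dataclass


@dataclass
class Pronoun:
    person: int   # 1, 2 or 3
    plural: bool


@dataclass
class NounPhrase:
    count: int
    adjective: str
    noun: str


@dataclass
class Clause:
    subject: Pronoun
    verb: str
    obj: NounPhrase


EN_PRONOUNS = {
    (1, False): "I", (2, False): "you", (3, False): "he",
    (1, True): "we", (2, True): "you", (3, True): "they",
}


def realize_english(clause: Clause) -> str:
    subj = clause.subject
    # English marks only third-person singular on present-tense verbs.
    third_singular = subj.person == 3 and not subj.plural
    verb = clause.verb + "s" if third_singular else clause.verb
    obj = clause.obj
    # Naive pluralization; irregular nouns would need a lexicon.
    noun = obj.noun if obj.count == 1 else obj.noun + "s"
    words = [EN_PRONOUNS[(subj.person, subj.plural)], verb,
             str(obj.count), obj.adjective, noun]
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:]


tree = Clause(Pronoun(3, False), "like", NounPhrase(2, "brown", "cat"))
print(realize_english(tree))
```

Word order, agreement and number marking all live in the realizer, so the tree itself carries none of them.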

~~~
schoen
Now I'm just wondering about the lexicon since the primitive tokens are things
like "like". One potential problem is when two languages don't have words with
enough semantic overlap to be comfortable using one as a translation for
another. Another potential problem is when

An analogy to my other comment on AST translation: in English you "like doing
something" but in German you "do something gladly" (and again in English you
"like a person" but in German you can "have a person dear", akin to English
"hold dear"). If we expect that the AST can produce a translation using the
single verb "like", we may be in for trouble if the target language doesn't do
that (although maybe code can be written that uses the AST and that's aware of
this complexity as part of the realization of the translation).

Another example could come from the problem of describing states of being or
perception, like "I'm cold", "I'm tired", "I'm hungry", "I'm thirsty", "I'm
sick", etc. In English we really like using "to be" plus adjectives for such
situations, but other languages have other preferred strategies. For example
Latin has specialized verbs for the actions of (at least) being hungry,
thirsty, or sick (like esurio, sitio, aegroto); in Romance languages people
often "have" hunger or thirst (Spanish "tengo hambre", lit. 'I have hunger';
Portuguese "estou com fome", lit. 'I am with hunger'); in German it is cold
"to" a person ("mir ist kalt", not "ich bin kalt" 'I am a cold person').

If you imagine having your AST start with Latin, you may have a challenging
story about how you could get from "sitisne?" to "are you thirsty?" "¿tienes
sed?" "está com sede?" and "hast du Durst?" -- not to deny that it may be
achievable with enough work.

------
cschneid
This is by far my favorite bit of documentation. I have to go look it up every
time a manager or client starts asking for localization to justify my high
estimates on how long it'll take.

------
btbuildem
So the solution is to design and implement your interface in a Slavic language
(presumably most complex, as we found so far) and translate down to other
languages with less demanding rules?

~~~
alxndr
Could be a perfect use for Lojban.

------
pornel
The L20n library handles all these cases:
[https://news.ycombinator.com/item?id=8892273](https://news.ycombinator.com/item?id=8892273)

------
zamalek
_Edit:_ Disregard, turns out things have improved

> The %g slots are in an order reverse to what they are in English. You wonder
> how you'll get gettext to handle that.

I learned C/++ after C# and this is one thing that _really_ got to me. String
interpolation in C++ is extremely primitive, which would be fine only if more
recent iterations of stdlib had something that wasn't so completely
incompetent.

_For those who don't use .Net, the Italian translation would have been: "In
{1:g} directories contains {0:g} files match your query." It doesn't solve all
the problems, but being able to specify indices in your template string does
solve many._
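
The same idea outside .Net, as a short Python sketch using the indexed fields of str.format. The templates and the "xx" language code are placeholders invented to show reordering; they are not real translations.

```python
TEMPLATES = {
    "en": "{0} files in {1} directories match your query.",
    # Hypothetical target language whose word order puts directories first.
    "xx": "In {1} directories, {0} files match your query.",
}


def render(lang: str, files: int, directories: int) -> str:
    """The call site always passes arguments in the same order;
    each template decides where they appear."""
    return TEMPLATES[lang].format(files, directories)


print(render("en", 10, 4))
print(render("xx", 10, 4))
```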

~~~
bmn_
The article is outdated. Gettext does have reordering syntax.

[https://www.gnu.org/software/gettext/manual/html_node/c_002d...](https://www.gnu.org/software/gettext/manual/html_node/c_002dformat.html#c_002dformat)

~~~
ygra
Makes me wonder why anyone would ever design a translation system that
_doesn't_ have the ability to reorder placeholders from the very first version.
Perhaps if the author doesn't know anything about languages and their
differences, but in that case they probably shouldn't write such a library ...

------
demarq
I don't know if I am simplifying too much, but this seems like something handlebars
could solve in a minute! Every translation could just be a template.

------
im3w1l
I loved the irony of

> [\xE9 is e-acute in Latin-1. Some pod renderers would scream if I used the
> actual character here. -- SB]

------
quotequad
This just shows the beauty of the Unix way of doing things. Simply:

12 10 4

works in all locales. :-)

~~~
ygra
Until you can display negative numbers¹. Or have native numerals². Or need to
know what the numbers even mean. :-)

__________

¹ I tend to set my minus sign to U+2212 to catch errors in code where we just
use ToString() instead of ToString(CultureInfo.InvariantCulture). Almost as
much fun as putting a Unicode character into your user name that isn't
representable in the current legacy codepage on Windows.

² ۱۲ ۱۰ ۴ probably won't work as input to the application trying to parse that
line ;)
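
Whether either footnote bites depends on the parser on the receiving end. As one data point, Python's built-in int() accepts Unicode decimal digits but rejects U+2212 as a sign. The sketch below shows both, plus a hypothetical parse_int helper that normalizes the minus sign before parsing.

```python
def parse_int(text: str) -> int:
    """Parse an integer, accepting U+2212 MINUS SIGN as well as ASCII '-'."""
    return int(text.strip().replace("\u2212", "-"))


# The three numbers from footnote 2, written with Extended Arabic-Indic digits.
line = "\u06f1\u06f2 \u06f1\u06f0 \u06f4"
print([int(field) for field in line.split()])

# A number formatted with U+2212 instead of ASCII hyphen-minus.
formatted = "\u22125"
try:
    int(formatted)
except ValueError:
    print("int() rejects U+2212")

print(parse_int(formatted))
```

A C program using atoi or strtol on the same input would behave differently, which is the point of setting the locale up this way to flush out assumptions.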

~~~
krzysz00
How do you change your negative sign on a unixoid? The only reference to
negative signs in locale(5) is under LC_MONETARY.

------
abbaselmas
suddenly love my language (Turkish) although there are special characters. 1
elma (apple) 2 elma 3 elma . . 10 elma

------
peteretep
I love this article, and used to send it to non-technical people all the time.
<3 Sean Burke, the author

------
aerovistae
That is so, so awesome.

------
donmb
This happens when you are a perfectionist.

------
eveningcoffee
TL;DR - when you are facing a localization problem - you have to parametrize
the units of measurements in your text.

