

Internationalisation for beginners - kyllikki
http://vincentsanders.blogspot.co.uk/2013/05/true-art-selects-and-paraphrases-but.html

======
luxpir
Thanks for the write-up. A very interesting read. It's an area that is ripe
for innovation, and a massively growing industry.

The XLIFF and TMX formats also offer flexibility in the handling of translated
data, as with .po files, but there are many problems still to be solved, as
contingencies mentions.

As you mention "Real people are still required to do the translations and
verify them" and the army of professional translators and agencies in the
market is on hand to do that, but developers often work in formats they are
unfamiliar with.

The bulk of a freelancer's work is in MS Office files, run through a CAT
(computer assisted translation) tool, and the resulting file (and translation
memory, TM) is delivered. When a developer needs a bunch of strings translated
they stray into unfamiliar territory for the average freelancer.

Specialists are out there, but a common format approach would help here. Most
professional CAT tools (costing from 200-1000+ of your local currency units)
can process .po files, which is a bonus, but doesn't solve many of the
remaining problems out there.

A multi-language translation memory (i.e. several source/target combinations)
would be useful in many cases, as would a simple 'export translatables' button
in the admin dashboards of apps.

I hope more HN readers dig in to the problems mentioned here, as technical
solutions could have a big influence on the future of
globalisation(-ization!).

------
gomox
Thanks for the write-up, I deal with the same issue in our company and while
we do work in Gettext with UTF-8 (that solves most basic issues just fine), it
seems every project that does i18n is cooking it up in their own way and I
have not been able to find many references online. I will probably make an
article describing our setup when I get around to polishing it.

The consensus around Transifex in #i18n@freenode seems to be that the open
source version is old and not maintained and should not be used. The SaaS
offering is much newer and packs quite a bit more features.

The "good" open source offering appears to be Pootle [0].

Honestly, I would be very worried about depending on a cloud service such as
Transifex for something that is so deeply embedded into our (pretty
continuous) development process. This requires automation, and all the time
invested in integrating with release processes and continuous integration can
easily go overboard. Of course, if Transifex were seamlessly integrated with
project management applications out of the box, then it wouldn't be such a
risky proposition.

\----

An interesting point about i18n that is quite independent from the tool
selection is how you write your message identifiers. You can basically use
labels (i.e., an ID for the string) or use the "original" string.

Here's the tradeoff: if you use an ID, you must reference the application
constantly to understand what the translation should say (and in any non
trivial application, this is a huge burden for translators), and there is
either no string reuse (because places with the same intended content have
used different IDs), or the need for an anal curator to go around chastising
developers ("the OK button should always be ACTION_BUTTON_LABEL_OK!! fix
it!!"). On the other hand, if you use original strings in English you will
find that you experience language collisions (two places where the original
string in English is the same, but the translated one is not), so you end up
resorting to introducing artificial differences to make them unique (e.g.
"Request (verb)" and "Request (substantive)" instead of just "Request").

A hack that goes a long way, if your engineering team is based in a country
that uses a Latin-derived language, is to use that instead of English for
original strings. These languages are typically more inflected than English,
so collisions are greatly reduced. Chances are your translation team is based
in that country as well, so no harm done.

\----

If you are doing branchy development, I put together a wiki page [1] on the
Mercurial wiki with a script I use to merge translation catalogs (.po)
seamlessly when doing branch merges. It can easily be used with git as well.

\----

Links

[0] <http://pootle.translatehouse.org/>

[1] <http://mercurial.selenic.com/wiki/MergeGettext>

~~~
mpessas
> An interesting point about i18n that is quite independent from the tool
> selection is how you write your message identifiers. You can basically use
> labels (i.e, an ID for the string) or use the "original" string.

> Here's the tradeoff: if you use an ID, you must reference the application
> constantly to understand what the translation should say (and in any non
> trivial application, this is a huge burden for translators), and there is
> either no string reuse (because places with the same intended content have
> used different IDs), or the need for an anal curator to go around chastising
> developers ("the OK button should always be ACTION_BUTTON_LABEL_OK!! fix
> it!!"). On the other hand, if you use original strings in English you will
> find that you experience language collisions (two places where the original
> string in English is the same, but the translated one is not), so you end up
> resorting to introducing artificial differences to make them unique (i.e
> "Request (verb)" and "Request (substantive)" instead of just "Request").

The PO format uses the field "context" to differentiate among the various uses
of a word/phrase. You should also add a comment for your translators in this
case.
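
As a rough illustration of the context field (not any particular project's setup): in a PO file the entry carries a `msgctxt` line next to `msgid`, and compiled catalogs key such entries as the context and the msgid joined by an EOT character. The sketch below mimics that lookup with a plain dictionary. The French strings and the `pgettext` helper here are made up for the example.

```python
# Sketch of context-aware lookup. The catalog contents are invented for
# illustration; the "context\x04msgid" key layout mirrors how compiled
# gettext catalogs store entries that have a msgctxt.
CONTEXT_SEPARATOR = "\x04"

FRENCH_CATALOG = {
    "verb" + CONTEXT_SEPARATOR + "Request": "Demander",
    "noun" + CONTEXT_SEPARATOR + "Request": "Demande",
}


def pgettext(catalog, context, msgid):
    """Return the translation of msgid for this context, or msgid itself."""
    return catalog.get(context + CONTEXT_SEPARATOR + msgid, msgid)


# The same English msgid resolves to two different translations.
verb_label = pgettext(FRENCH_CATALOG, "verb", "Request")
noun_label = pgettext(FRENCH_CATALOG, "noun", "Request")
# An unknown context falls back to the original string.
fallback_label = pgettext(FRENCH_CATALOG, "menu", "Request")
```

With this, the source string stays plain "Request" and the disambiguation lives in the context rather than in the string itself.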

Also, using an ID messes with the PO format itself. E.g., fallbacks in case of
a missing translation will not work.
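
To make the fallback point concrete, here is a small sketch using Python's standard `gettext` module, assuming no catalog could be found for the user's locale. `NullTranslations` returns the msgid unchanged, so what the user sees depends entirely on what you chose as the msgid.

```python
import gettext

# NullTranslations is what the standard gettext module uses when no catalog
# is available: every lookup returns the msgid unchanged.
missing_catalog = gettext.NullTranslations()

# With the original English string as msgid, the fallback is still readable.
shown_with_english_msgid = missing_catalog.gettext("OK")

# With a label as msgid, the raw label leaks into the user interface.
shown_with_label_msgid = missing_catalog.gettext("ACTION_BUTTON_LABEL_OK")
```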

But there are other formats that are ID-based, like .properties in Java.

------
adlq
Great blog post about l10n and i18n! I'm working on improving that process in
our company and have currently chosen Zanata [0] as a (Java-based)
translation platform. Compared with Transifex's community edition (no longer
maintained, how unfortunate!) and Pootle, Zanata's installation was actually
painless and the community around it is very responsive!

Too bad I didn't stumble upon Weblate [1] first though, it looks promising
(thanks onemorepassword).

I've set up an independent "localization server" that executes the following
process:

1) Regularly pulls new revisions of the code and updates to the latest
revision.

2) A mercurial hook [2] is thus called and the source strings are extracted
from the code with xgettext [3] so that new POT gettext files are generated.

3) The POT files are finally pushed to Zanata's server via its API.
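
The extraction in step 2 could look roughly like the sketch below. This is not the actual hook, just an assumption of how the xgettext call might be assembled; the file names and the `build_xgettext_command` helper are hypothetical, and the push to Zanata is left out because it depends on their client.

```python
def build_xgettext_command(source_files, pot_path, language="Python"):
    """Build (but do not run) an xgettext command line that writes a POT file."""
    command = [
        "xgettext",
        f"--language={language}",   # programming language of the sources
        "--from-code=UTF-8",        # encoding of the source files
        f"--output={pot_path}",     # where the POT template is written
    ]
    # Sort the inputs so repeated runs produce the same command line.
    command.extend(sorted(str(path) for path in source_files))
    return command


# Hypothetical file names, for illustration only.
command = build_xgettext_command(["app/views.py", "app/models.py"], "messages.pot")
```

A hook would then hand `command` to something like `subprocess.run(command, check=True)` after each pull.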

We currently do in-house translations for one locale, while others are managed
by an external translation provider. Employees in our company can just log in
(Zanata provides OpenID authentication) and collaboratively translate and
review the application strings. Zanata can also be used to export resource
files and push projects to our external translation provider's platform via
their API.

But as others have said in this thread, l10n automation currently involves a
lot of manual code gluing and adapting to your version control system.
There's definitely potential, since available solutions only address the
translation problem and haven't gone very far in covering the whole process.

I'd be more than glad to compare notes on the subject with others who have
gone through the same experience!

\---

Links

[0] <http://zanata.org/>

[1] <http://weblate.org/fr/>

[2] <http://mercurial.selenic.com/wiki/Hook>

[3]
[http://www.gnu.org/software/gettext/manual/html_node/xgettex...](http://www.gnu.org/software/gettext/manual/html_node/xgettext-Invocation.html)

------
bazzargh
"Finally the Java property file format was used (with UTF-8 encoding) which
while having bugs in the import and export escaping these could at least be
worked around."

The Java property file format is ISO-8859-1, not UTF-8. I have to wonder if
those are the bugs you hit? While you can have something that is UTF-8, there
are a couple of wrinkles with trying to use that with Java i18n.

See:
[http://docs.oracle.com/javase/6/docs/api/java/util/ResourceB...](http://docs.oracle.com/javase/6/docs/api/java/util/ResourceBundle.html#getBundle%28java.lang.String,%20java.util.Locale,%20java.lang.ClassLoader%29)

... when you load a resourcebundle, it tries to load a properties file, and it
ends up calling this method:

[http://docs.oracle.com/javase/6/docs/api/java/util/Propertie...](http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load%28java.io.InputStream%29)

... which mentions the encoding.

There are a couple of ways around this: one is to write a bunch of code to
change how resource bundles are loaded, the other is to use Java's
native2ascii tool in your build to provide files that are correctly escaped.
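
For anyone without a JDK to hand, the escaping that native2ascii performs on non-ASCII characters can be approximated as in the sketch below. It is a rough stand-in, not the tool itself: the function name is made up, and it deliberately ignores the other escaping rules of the properties format (backslashes, leading whitespace, separators).

```python
def escape_for_properties(text):
    """Escape each non-ASCII character as \\uXXXX, one escape per UTF-16 unit."""
    pieces = []
    for char in text:
        if ord(char) < 128:
            # Plain ASCII passes through untouched.
            pieces.append(char)
            continue
        # Characters outside the BMP become a surrogate pair, which is how
        # Java strings represent them.
        encoded = char.encode("utf-16-be")
        for offset in range(0, len(encoded), 2):
            unit = int.from_bytes(encoded[offset:offset + 2], "big")
            pieces.append(f"\\u{unit:04x}")
    return "".join(pieces)


escaped_line = escape_for_properties("greeting=café")
```

The result is pure ASCII, so it loads the same way whether the reader assumes ISO-8859-1 or UTF-8.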

~~~
kyllikki
Transifex have extended the format and allow resources to be UTF-8 encoded
(see
[http://help.transifex.com/features/formats.html#java-propert...](http://help.transifex.com/features/formats.html#java-property-files)
); however, the importer does not correctly cope with single quote
characters, backslash n (newline) and several other characters being encoded
when they ought to be (as per the document you referenced, which I also used
to begin with ;-)

If you look at the script Vivek wrote
[http://git.netsurf-browser.org/netsurf.git/tree/utils/split-...](http://git.netsurf-browser.org/netsurf.git/tree/utils/split-messages.pl)
he clearly documents the odd importer issues which is why we called it the
transifex resource format and not Java resource format ;-)

------
onemorepassword
Transifex looks nice, thanks for the tip, but it seems like you have to add a
lot of glue to connect your own version control to their proprietary version
control via their API.

What I would really like is something like Weblate (<http://weblate.org>),
that you can hook in directly to your code repo. Is there anything like that
out there?

~~~
nsallembien
Disclaimer: I work at Transifex.

The perl script written by the user in the article is about 100 lines of code.
Doesn't seem like a lot of glue...

Another nice thing that Transifex provides which is not described in the blog
is the Transifex client[1]. I wonder why he didn't use it.

[1][http://support.transifex.com/customer/portal/articles/960804...](http://support.transifex.com/customer/portal/articles/960804-overview)

~~~
kyllikki
I did not use it because it was an unverified Python script with a bunch of
dependencies. As a rule, I do not like executing untrusted code without at
least a basic review.

As you mentioned, the Perl script was 100 lines, and at that moment in time it
was easier to integrate than reviewing the Python would have been.

Of course, once the Python client has been reviewed, perhaps it would be a
more general solution.

------
contingencies
Whilst the key/value approach is solid, the 'industry standard' .po (GNU
gettext / <https://en.wikipedia.org/wiki/Gettext>) format supports more
features, like complex plural and ordinal/cardinal number support that is a
requirement in some languages.
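
As an example of what "complex plural" means here: Polish needs three forms, and a PO header expresses the choice with a Plural-Forms expression. The sketch below is a Python rendering of the commonly used Polish rule. The function name and the sample word are just for illustration, and a real application would let gettext evaluate the rule from the catalog header.

```python
def polish_plural_index(n):
    """Pick which of the three Polish plural forms applies to the count n."""
    if n == 1:
        return 0  # singular
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1  # counts ending in 2-4, except 12-14
    return 2      # everything else, including 0 and 5-21


# "file" in Polish: 1 plik, 2 pliki, 5 plików
FILE_FORMS = ["plik", "pliki", "plików"]


def describe_files(n):
    """Format a count with the matching plural form."""
    return f"{n} {FILE_FORMS[polish_plural_index(n)]}"
```

A plain key/value format would need three separate keys plus this logic in application code for every such language.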

In addition, some of the biggest issues with internationalization in my
experience (~exclusively i18n projects for 10+ years) are generally
missing/broken support in certain components (great reasons to contribute
resources upstream for open source projects!), managing translations over
time, cultural issues, right-to-left, differing program-level logic (eg.
maximum SMS message length variations based upon character set requirements),
differing seasons/days of operation/holidays. Calendars are of course a pain
(though a solved one), as are timezones - for which a truly synchronized,
global approach is frustratingly hard to deploy at the best of times.

~~~
kyllikki
The gettext PO file format does indeed provide many other features, I do not
disagree, but there does seem to be an over reliance on it within the
platforms I looked at.

The format does have some pretty major drawbacks too, like the msgid can
become "fuzzy" which leads to a differing set of issues related to the unique
keying between translations.

It also tends to lead to the developers' English (C locale if you like) being
selected as the default language, and it turns out developers like myself are
sloppy and sometimes produce barely parsable messages.

Your remaining points are really valuable to someone inexperienced in the
field, like myself, so thanks for pointing those out.

It is interesting you call out cultural issues, did you have any specific
examples?

~~~
mpessas
> The format does have some pretty major drawbacks too, like the msgid can
> become "fuzzy" which leads to a differing set of issues related to the
> unique keying between translations.

I am not sure how much of an issue this is in practice. The main problem of
the PO format AFAICT is that it is quite outdated. For instance, it has no
support for genders and you cannot "mix" plural rules within a phrase.

> It is interesting you call out cultural issues, did you have any specific
> examples?

The Wikipedia entry on l10n[1] has some examples.

The process of localization is not merely about translating some strings, but
adapting them to a specific language and culture, which is the hardest part.
For instance, your home page is one of the most important pages in your app
and is geared to make as many people as possible sign up. Do you think a
simple translation would have the same effect on British, French, Arab,
Japanese, etc. readers?

[1]:
[https://en.wikipedia.org/wiki/Internationalization_and_local...](https://en.wikipedia.org/wiki/Internationalization_and_localization)

