

An evaluation of automatic romanisation of Japanese text with Google Translate - quant18
http://pinyin.info/news/2009/google-translate-and-romaji/

======
quant18
Though I criticise Google's NLP efforts a lot, I actually think this guy is
too hard on it --- the word segmentation is pretty good. Most of the errors
are in the romanisations.

Which makes sense when you consider what Google does. Word segmentation is a
pretty essential part of dealing with Japanese search, so of course they've
been working on that problem for years. (And probably a lot of their training
data comes from search queries, consisting of long strings of kanji nouns,
rather than verb phrases --- people search for nouns a lot more. Which is why
they can correctly segment a godawful string of kanji like "会社役員高橋延拓容疑者" or
"二重橋前交差点", but choke on incredibly common verb conjugations like "した" or
"していた").
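A toy sketch of why noun-heavy training data produces exactly this behaviour: a greedy longest-match segmenter over a lexicon of nouns nails long kanji compounds but shatters anything it doesn't know into single characters. (Real systems use lattice/Viterbi decoding over a far larger lexicon with connection costs; the lexicon here is made up for illustration.)

```python
def segment(text, lexicon):
    """Greedy longest-match segmentation: at each position, take the
    longest lexicon entry; fall back to a single character."""
    result, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                result.append(text[i:j])
                i = j
                break
    return result

# Toy lexicon heavy on nouns, the way search-query data would skew it.
lexicon = {"会社", "役員", "会社役員", "高橋", "延拓", "容疑者",
           "二重橋", "交差点", "前"}

print(segment("会社役員高橋延拓容疑者", lexicon))  # clean split into noun units
print(segment("していた", lexicon))                # no verb entries: falls apart
```

The first call yields ["会社役員", "高橋", "延拓", "容疑者"]; the second degenerates to single characters ["し", "て", "い", "た"], the same failure mode as choking on common conjugations.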

On the other hand, they probably haven't been working on the romanisation
problem that long --- it's useful at the margins of search (some terms might
get written in hiragana rather than kanji either in queries or on pages), but
not essential. I'd imagine they started just recently, as a result of their
voice recognition/text-to-speech efforts.
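And romanisation is a genuinely harder problem than it looks, because a kanji-to-reading dictionary is one-to-many: without a context model you can only guess. A minimal sketch of that ambiguity (the reading lists here are illustrative, not exhaustive, and the tie-breaking rule is made up):

```python
# Toy reading dictionary: each entry maps to several plausible readings.
# Which one is right depends on context (surname vs. place name, on'yomi
# vs. kun'yomi), which is exactly what a naive converter lacks.
READINGS = {
    "高橋": ["takahashi", "takabashi"],
    "容疑者": ["yougisha"],
    "生": ["sei", "shou", "nama", "i", "u"],
}

def naive_romanise(word):
    # With no context model, just take the first listed reading --
    # right sometimes, wrong whenever a less common reading applies.
    readings = READINGS.get(word)
    return readings[0] if readings else word

print(naive_romanise("高橋"))    # picks "takahashi", ignoring alternatives
print(naive_romanise("容疑者"))  # unambiguous, so this one is safe
```

A segmenter only has to find word boundaries; a romaniser also has to pick the right reading for each word, which is why the latter shows far more errors even when the former is solid.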

