

Introduction to Statistical Machine Translation - genesiss
http://michaelnielsen.org/blog/introduction-to-statistical-machine-translation/

======
atgm
> The biggest single advance seems to have been a movement away from words as
> the unit of language, and towards phrase-based models, which give greatly
> improved performance.

This really struck me, as someone who both teaches and studies language. If it
works better for computers, I wonder how much better it would work for people.
If there's anything I've noticed in my time in Japan, it's that the Japanese
approach seems to be to nail down every single word with a single Japanese
meaning and stick to that meaning all the time, which leads to a lot of very
Japanese-sounding English.

------
sqrt17
> Note: English is my only language, which makes it hard for me to construct
> translation examples!

Ouch.

Building MT systems without knowing foreign languages is a bit like deaf
people building a speech recognizer. No offence meant to anyone, but it works
much better when you know what you're doing.

~~~
kurtosis
Are you really sure about this? It sounds equally plausible to me that _not_
knowing both the source and target languages would give one an advantage of
not relying on ad-hoc, hard-to-model, human judgements. Being monolingual is
more likely to enforce a discipline where one develops an algorithm which
would work effectively on _all_ natural languages.

IIRC the Candide group was (not intentionally) composed of scientists with
no knowledge of both English and French.
<http://www.cs.cmu.edu/~aberger/mt.html>

~~~
sqrt17
It does help to have at least a sketchy knowledge of some foreign language, to
see how you can figure things out when you cannot assume any knowledge of it.

Contrary to your point, early statistical machine translation only worked well
for relatively close language pairs, like English-French or English-Spanish.
It failed badly for more distant languages such as Chinese, Arabic or even
German, which is why you have so many Chinese-speaking people (including
English-Chinese bilinguals) in machine translation these days.

Parallel corpora are full of ad-hoc, hard-to-model, human judgements (from
people called "translators"). The advantage is that the translators don't come
up to you to criticize your translation model. However, doing error analysis
for an MT system (i.e., the key to actually improving things and not producing
garbage) requires at least minimal knowledge of the source language and
relatively good knowledge of the target language.

