
Zero-Shot Translation with Multilingual Neural Machine Translation System - wwilson
https://research.googleblog.com/2016/11/zero-shot-translation-with-googles.html
======
Smerity
If people are interested in the underlying architecture of Google's Neural
Machine Translation (GNMT) system, I wrote an article that builds it up piece
by piece. While it's intended for people who are likely to implement GNMT or
similar architectures, the article is descriptive enough that it should be
possible to follow along even if you're not well versed in deep learning.

[http://smerity.com/articles/2016/google_nmt_arch.html](http://smerity.com/articles/2016/google_nmt_arch.html)

The GNMT architecture is used almost as is for the zero-shot MT experiments.
We're likely to see the GNMT architecture used extensively by Google for a
variety of projects, as they spent a great deal of time and effort ensuring it is
scalable to quite large datasets. Training a neural machine translation system
with a single language pair is difficult - training it with multiple,
especially all using the same set of weights, is insanely challenging!
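The trick that makes the shared-weights setup work (described in the zero-shot
paper) is an artificial token prepended to the source sentence that tells the
single model which language to produce. Here's a minimal sketch of that data
preparation; the token spellings are illustrative, not necessarily the exact
ones Google used:

    # Sketch of the multilingual data preparation from the zero-shot paper:
    # one shared model, with the requested output language signalled by an
    # artificial token prepended to the source sentence.
    def add_target_token(source_sentence, target_lang):
        """Prepend a token telling the shared model which language to emit."""
        return f"<2{target_lang}> {source_sentence}"
    
    # Supervised training pairs for two directions...
    train = [
        (add_target_token("How are you?", "es"), "¿Cómo estás?"),  # en->es
        (add_target_token("¿Cómo estás?", "en"), "How are you?"),  # es->en
    ]
    
    # ...and at inference the same weights can be asked for a direction
    # never seen in training (zero-shot), e.g. Portuguese -> Spanish:
    print(add_target_token("Como você está?", "es"))  # "<2es> Como você está?"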

As an example, the GNMT architecture was used as the basis of "Generating Long
and Diverse Responses with Neural Conversation Models", which trains on the
entirety of Reddit (1.7 billion messages) as well as various other datasets.

[https://openreview.net/forum?id=HJDdiT9gl](https://openreview.net/forum?id=HJDdiT9gl)

~~~
alexbeloi
Can you explain the odd-looking connections to the decoder that appear in V2
and onwards? It's not entirely obvious from the description how the
attention is applied.

In the first part of the decoder rollout, the diagram connects the
(concatenated?) output of the encoder into the middle(?) of the first output
of the decoder rollout. In what sense?

For the second rollout of the decoder the diagram shows 3 arrows going into
the LSTM cell, what is that (left-facing) arrow on the right side doing?
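For reference, here's a minimal sketch of how attention is usually wired into
these decoders (simplified; the actual GNMT attention scores with a small
feed-forward network over the bottom decoder layer, and all names here are
illustrative):

    import numpy as np
    
    # Minimal sketch of the attention wiring being asked about.
    rng = np.random.default_rng(0)
    T, d = 5, 8                                # source length, hidden size
    encoder_outputs = rng.normal(size=(T, d))  # one vector per source token
    
    def attention_context(decoder_state):
        """Weighted sum of encoder outputs: the extra arrow into the decoder."""
        scores = encoder_outputs @ decoder_state  # dot-product scores (simplified)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                  # softmax over source positions
        return weights @ encoder_outputs          # context vector
    
    # At each decoder step the LSTM cell receives three things (the three
    # arrows): the previous output token's embedding, its own recurrent
    # state, and this attention context recomputed at every step.
    ctx = attention_context(rng.normal(size=d))
    print(ctx.shape)  # (8,)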

------
sgentle
This reminds me of Searle's Chinese Room Argument[0]: imagine you have a
particularly dreary job where you sit in a room filled with boxes of symbols
written on paper. Every now and again someone comes in and hands you some new
symbols. You look through your rulebook and, depending on what it says, hand
them some symbols back. It turns out that these rules actually implement a
conversational program in Chinese. And if you can implement those rules and
not understand Chinese, why would you think a computer program, implementing
its own rules, could understand anything?

The common "Systems Reply" response to this is that you're looking at the
wrong layer of abstraction. The computer hardware (or the person in the room)
doesn't understand Chinese, the computer plus the rules plus the data _forms a
system that understands Chinese_. Searle's answer to this is that, well, what
if you memorised the rules and the database? You might know all the rules, you
might be able to follow them, but you wouldn't understand Chinese.

What I think is fascinating about this is that it's vulnerable to Bayesian
Judo[1]: if you have a strong belief that computers aren't capable of true
understanding because of the Chinese Room Argument, then building an actual
Chinese Room-style computer and having it show understanding should be a
fairly strong blow to that belief.

Now, it's easy to quibble about what true understanding actually means, but
one version (used by Searle's answer) is this: "[..] he would not know the
meaning of the Chinese word for hamburger. He still cannot get semantics from
syntax." But this news is exactly that! A computer translation of the same
semantic concept from one syntax to another _without ever having been taught
the rules connecting them_. In other words, this is semantics from syntax
implemented by nothing but a computer, a database, and a set of rules.

So, by the reverse Chinese Room Argument, I would say this system exhibits a
kind of understanding. Not a very sophisticated kind, mind you, but something
that should still spook you if you believe computers are categorically
incapable of thinking like us.

[0] [http://plato.stanford.edu/entries/chinese-room/](http://plato.stanford.edu/entries/chinese-room/)

[1] [http://lesswrong.com/lw/i5/bayesian_judo/](http://lesswrong.com/lw/i5/bayesian_judo/)

~~~
timr
_" But this news is exactly that! A computer translation of the same semantic
concept from one syntax to another without ever having been taught the rules
connecting them."_

By that standard, statistical translation approaches were "understanding" a
long time ago. The new thing here isn't that systems aren't being taught "the
rules" (that wasn't happening in statistical MT either), the new thing is that
there's a different kind of classifier in the "middle" now, which is
representing a hidden state. This classifier is more flexible in a lot of
ways, but also more of a black box, and takes a _lot_ more effort to train
without overfitting. It's cool that you can translate between language pairs
that have never been explicitly trained, but let's not overstate the meaning
of it.

The blog post makes this rather breathless speculation:

 _" Within a single group, we see a sentence with the same meaning but from
three different languages. This means the network must be encoding something
about the semantics of the sentence rather than simply memorizing phrase-to-
phrase translations. We interpret this as a sign of existence of an
interlingua in the network."_

This is...a fun story, but not much else. First off, you can make
dimensionality reduction plots that "show" a lot of things. Even ignoring that
issue, in translations of short sentences involving specific concepts (i.e.
the example about the stratosphere), is it really surprising that you'd find
clusters? The words in that sentence are probably unique enough that they'd
form a distinct cluster in mappings from _any_ translation system.

Folks get caught up in the "neural" part of neural networks, and assume that
some magical quasi-human thought is happening. If the tech were called "highly
parameterized reconfigurable weighted networks of logistic classifiers",
there'd be less loopy speculation.

~~~
sgentle
Don't worry, I'm not being bamboozled by the word "neural". My argument is
that there is a definition of understanding, derived from a well-known
thought experiment, which looks to be met by this implementation of "highly
parameterized reconfigurable weighted networks of logistic classifiers".

I don't see any particular difference between training a classifier and
teaching rules; the rules are just encoded in the parameters of the
classifier. If it helps, you can just replace "taught" with "trained on" and
"rules" with "data", but there's no version of the Chinese Room Argument where
you're sitting in a room with boxes full of unsupervised learning datasets and
a book of sigmoid functions.

Perhaps this system works similarly to previous ones, but not having been
taught (trained on) any rules (data) about the specific language pairs in
question seems to be a strong argument for some kind of semantic
representation of language. You might have seen that before, but I haven't and
the article seems to imply that it's new. Again, I'm talking specifically
about the similarity between this result and an example of something "machines
can't do".

The point is that the non-magical argument goes both ways. If a brain is just
a complicated and meaty computer, then we should expect sophisticated enough
programs on powerful enough hardware to start displaying things we might
recognise as intelligent. That's not going to look particularly impressive –
our machine translator isn't going to develop a conscience or try to unionise
– but it might do something that qualifies for some definition of
understanding.

~~~
timr
But you _are_ getting into magical thinking, in that there is no reasonable
definition of "understanding" that this system meets. It cannot reason or make
deductions. It can't re-write sentences to use completely different
words/structures but imply the same meaning. In fact, there is literally no
"conceptual" representation here -- there is a vector of numbers that gets
passed between encoder and decoder, but it is no more a form of intelligence
than the "hidden" state that is maintained by an HMM.

 _" not having been taught (trained on) any rules (data) about the specific
language pairs in question seems to be a strong argument for some kind of
semantic representation of language."_

Well, yeah, there's a representation of language. But it isn't "semantic" --
it's a vector of language-independent parameters for a decoder, which can then
output symbols in a second language. Could you theoretically imagine some huge
magical network of logistic classifiers that uses this as the first stage of a
(far larger) processing machine that enables something like human
intelligence? Maybe. But this is not it. This is a bigger, far more
complicated/flexible version of a machine that is purpose-built to map between
sequences of text.

(That said, I really don't want to go down the rabbit hole of "what is AGI,
anyway?", which is about as productive/interesting as hitting the bong and
wondering if maybe _we all live in a computer simulation after all_. I'm
merely observing that this is _not_ an intelligent machine.)

~~~
sgentle
> It cannot reason or make deductions. It can't re-write sentences to use
> completely different words/structures but imply the same meaning.

I agree that it doesn't meet these definitions of understanding. I'm arguing
that it meets the definition of "semantics from syntax".

> Well, yeah, there's a representation of language. But it isn't "semantic"
> -- it's a vector of parameters for a decoder, which then outputs symbols in
> a second language.

What is it that makes a vector of parameters not semantic? Would it be
semantic if they were stored in a different format? If I tell you I have a
system in which the concept of being hungry is stored as the number 5, would
you say "that's not a concept, that's a number"? If that vector of parameters
represents being hungry in any language, what is it if not a semantic
representation of hunger?

There's no need to imagine a huge magical network that implements a grandiose
vision of intelligence. We're talking about a small, non-magical network that
implements a very modest vision of intelligence. Bacteria are still alive even
though they're a lot less complex than we are. What would you expect the
single-celled equivalent of intelligence to look like? Something with a very
minor capacity for inference? Something with rudimentary abstraction across
different representations of the same underlying idea?

------
hota_mazi
Fascinating. Maybe the next step will be to extract the interlingua that's
emerged in the neural network, map its tokens to real words, and blam, we
reinvent Esperanto!

~~~
klodolph
It would probably not look anything like Esperanto, which is a charliefoxtrot
of pidgin Spanish with some Turkish orthography thrown in.

In all seriousness, however, different languages often have radically
different concepts encoded into them. For example, in English you might be
forced to give explicit subjects to the actors in your sentences, assign
genders to your pronouns, and choose whether actions take place in the present
or the future, whether they are ongoing in general or happening at this
moment, etc. In Japanese you might be forced to choose how to express your
relationship with the person you are speaking to and the person you are
speaking about, and to decide whether a causal relationship is "if and only
if" or just plain "if-then".

There has long been a hypothesis that some kind of "universal grammar"
(Chomsky) underlies all languages, but modern linguists do not really
entertain that idea.

~~~
toomanybeersies
I think that's the fundamental problem with translating; different languages
have different information encoded in a sentence.

For instance, as you've said, Japanese encodes your relationship to the
person you're speaking to. Even French does this (to a lesser extent), with
"tu"/"vous" depending on how familiar you are with the person you're
addressing.

~~~
schoen
And as the parent commenter points out, not only do they have different
information encoded, but they may _require_ different information to be
expressed so that it's not permissible to simply omit information that the
source language omitted.

[https://en.wikipedia.org/wiki/T%E2%80%93V_distinction](https://en.wikipedia.org/wiki/T%E2%80%93V_distinction)

[https://en.wikipedia.org/wiki/Pro-drop_language](https://en.wikipedia.org/wiki/Pro-drop_language)

[https://en.wikipedia.org/wiki/Evidentiality](https://en.wikipedia.org/wiki/Evidentiality)

~~~
mack73
But there absolutely must be a common denominator. A Japanese sentence might
encode the metric "relationship to subject" which an English sentence will
encode as "NULL".

~~~
Berobero
The issue arises when you're translating from a language that doesn't
explicitly encode that information into a language that requires it: you need
some way of filling in the blanks. This is all well and good when the
information can be inferred from the explicit context provided; the real
problem is that a non-trivial number of cases require the translator to
consider the pragmatic context of what's being translated to arrive at a
correct/good translation (e.g. who's speaking/writing, whom they are
addressing, what the overall societal context of the content is, what the
purpose of the content might be, etc.). For a lot of these problems it's
entirely reasonable to get a computer to fill in the blanks for 90%+ of cases,
but the last few percent require AI equivalent to a human's.

------
ChuckMcM
Nice piece of work, and it counts as "implementing Star Trek in the present".
Now I just need a nice pair of noise-cancelling over-ear headphones that let
me hear English spoken no matter where I am :-)

~~~
agildehaus
Stick a fish in your ear. Might get lucky.

------
honkhonkpants
Pretty impressive, but even more amazingly their paper is in a single-column
format that I can actually read on my computer, instead of pretending that I
am reading printed and bound conference proceedings. Truly a giant leap for
the field.

------
YeGoblynQueenne
>> We call this “zero-shot” translation, shown by the yellow dotted lines in
the animation. To the best of our knowledge, this is the first time this type
of transfer learning has worked in Machine Translation.

I think it was last year when a friend was telling me how Google translates
the Greek word for "swallow" (the bird) into French. Back then, the
translation was the French word for "to swallow" (the verb). The bird and the
action don't sound even remotely alike in Greek, and neither are they spelled
alike (the bird is "χελιδόνι", the action is "καταπίνω"; Google Translate will
at least give their correct pronunciation). My friend figured Google can't
find enough examples between the two languages, so it goes via English ...
where the two words are homonyms.

I think that was last year, and certainly before September.

So I gave it a try again today, and this is still what I get:

    
    
      Greek                 French
      χελιδόνι              avaler
      chelidóni
    

If you omit the accent on the "o" you don't even get the mistranslation; you
get only the phonetic transcription of the Greek word in Latin characters.

Obviously the important thing here is not the one word that Google Translate
gets wrong, but the fact that it doesn't really look like this "new" system is
all that new, or that it does anything all that different from the previous
one, or indeed that it improves things at all.

------
YeGoblynQueenne
Google, and also Microsoft btw, absolutely need to be called out on this.
They keep claiming that their translation systems work well because they have
reasonably good results for some language pairs, like English/French or
English/Spanish, that are a) close linguistically, b) have a lot of examples
of translated documents and, more importantly, c) have many speakers who might
use Google Translate.

For languages where none of the above holds, however, the results continue to
be completely ridiculous, no matter what "new" technique Google (or MS)
advertises. Since those languages are not spoken by as many people as English
or Spanish, it's very hard for most users to figure out how atrociously bad
the automatic translations are.

Here's an example from my native Greek; this is a bit of news text from
yesterday [1]:

 _Λανθασμένη χαρακτηρίζει ο Κύπριος κυβερνητικός εκπρόσωπος, Νίκος
Χριστοδουλίδης, την προσέγγιση, να μπαίνουν στο «ίδιο καλάθι» η Ελλάδα με την
Τουρκία σε σχέση με το κυπριακό._

And here's Google's translation:

 _Incorrect characterizes Cypriot government spokesman Nikos Christodoulides,
the approach to be put in "one basket" by Greece and Turkey in relation to
Cyprus._

So, the Cypriot government spokesman (well done) is put in a basket by Greece
and Turkey (wait wut). Hey, maybe the guy wanted to be put in _two_ baskets?
[2]

That's very typical of the way Google translates between Greek and English.
For Google, it's Neural Networks leading us to a bright future where language
barriers are eliminated thanks to Scienz! For Greeks, it's comedy gold.

And it's the same for Russians, Poles, Finns, Swedes, Indians, Chinese,
Hungarians...

Still, Google keeps including those languages in the count of languages it
"covers", because it's good advertisement and who can really dispute them
anyway?

_____________________

[1] [http://www.kathimerini.gr/884845/article/epikairothta/politi...](http://www.kathimerini.gr/884845/article/epikairothta/politikh/xristodoylidhs-gia-kypriako-den-yphr3e-akraia-8esh-apo-thn-a8hna)

[2] What's being said is more like: "The Cypriot government spokesman said
that it's a mistake to treat Greece and Turkey in the same manner with regards
to Cyprus".

------
vurpo
I wonder how much memory this translation via an intermediate representation
of a sentence takes. It seems like representing the semantic meaning of a
sentence in a language-independent way would take a huge amount of data.
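For a rough sense of scale, a per-token vector representation is actually
tiny. A back-of-the-envelope sketch (1024 is the per-layer LSTM size reported
in the GNMT paper; the token count and float width are assumptions):

    # Rough size of the encoder's intermediate representation for one
    # sentence: one 1024-dimensional float32 vector per token.
    dims, bytes_per_float, tokens = 1024, 4, 20
    print(dims * bytes_per_float * tokens / 1024, "KiB per 20-token sentence")
    # -> 80.0 KiB per 20-token sentence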

~~~
ximeng
E.g. Japanese is 50MB to download for offline usage in Google Translate's
iPhone app.

------
glandium
The mentioned Japanese->English->Korean combination is one of the worst
possible things to do. Both Korean and Japanese are very different from
English, while being similar to each other to some extent. Direct translation
from one to the other would actually produce much better results than
translating the (likely broken) English you get from Japanese back into
Korean.

Edit: I do realize they're not talking about successive translations, but
that's essentially how the training ends up happening, isn't it?

A better example, IMHO, would have been three very different languages, like
English, Japanese and Russian.

~~~
xbmcuser
It is not translating Japanese > English > Korean. It learned how to
translate Japanese<>English and Korean<>English, and now it knows how to
translate Japanese<>Korean. In layman's terms, it understands Korean and
Japanese, so it can translate between them without needing examples of
Japanese<>Korean translations.

------
spynxic
This post seems to exaggerate a commonly known mathematical property.

Suppose I have languages X, Y, and Z. My machine currently knows how to
translate Y->X and X->Z. The goal is to turn Y into Z without direct training.
The process would be to translate Y into X and then X into Z: effectively
Y->Z.
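That pivot composition is trivial to write down; a sketch in which
translate_yx and translate_xz are hypothetical stand-ins for trained models:

    # The pivot scheme described above: two separate translation passes.
    # translate_yx and translate_xz stand in for trained Y->X and X->Z models.
    def pivot_translate(sentence_y, translate_yx, translate_xz):
        """Y -> Z by composing Y -> X and X -> Z."""
        return translate_xz(translate_yx(sentence_y))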

This isn't really transfer learning so much as logical induction... Or am I
missing something?

~~~
eridius
Presumably what you're missing is that this translates Y->Z in a single step,
rather than doing two separate translations.

~~~
mattkrause
They're certainly not doing English to Portuguese by way of Spanish or
anything like that.

However, you could almost read the paper as: "We can translate something from
a natural language into a high-dimensional representation, then turn that
high-dimensional representation back into (another, possibly different)
natural language."
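A toy version of that reading, with the high-dimensional representation
collapsed to a single vector. Everything here is illustrative: random matrices
stand in for trained weights, and the real system uses one shared decoder
steered by a target-language token rather than separate per-language heads:

    import numpy as np
    
    # Toy "encode to a shared vector, decode into any language" sketch.
    rng = np.random.default_rng(0)
    vocab, dim = 100, 16
    embed = rng.normal(size=(vocab, dim))        # shared input embeddings
    
    def encode(token_ids):
        """Mean-pool embeddings into one language-independent vector."""
        return embed[token_ids].mean(axis=0)
    
    heads = {"es": rng.normal(size=(dim, vocab)),  # per-language output heads,
             "ko": rng.normal(size=(dim, vocab))}  # separated here for clarity
    
    def decode(vector, lang):
        """Most likely output token id under that language's head."""
        return int(np.argmax(vector @ heads[lang]))
    
    h = encode([3, 14, 15])                  # the same intermediate vector...
    print(decode(h, "es"), decode(h, "ko"))  # ...decoded into two languages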

~~~
empath75
It seems similar to automatically captioning images.

~~~
mattkrause
Very much so!

The two tasks are surprisingly interchangeable. I once worked on a project
where we used a statistical MT approach to "translate" between image features
and captions -- and I don't think we were the only ones trying such things.

In a pleasing bit of symmetry, the attentional network used here looks like it
was initially developed for image captioning.

