
A novel approach to neural machine translation - snippyhollow
https://code.facebook.com/posts/1978007565818999/a-novel-approach-to-neural-machine-translation/
======
CGamesPlay
I'm a relative novice at machine learning, but here's my best attempt to
summarize what's going on in layman's terms. Please correct me if I'm wrong.

\- Encode the words in the source (aka embedding, section 3.1)

\- Feed every run of k words into a convolutional layer producing an output,
repeat this process 6 layers deep (section 3.2).

\- Decide on which input word is most important for the "current" output word
(aka attention, section 3.3).

\- The most important word is decoded into the target language (section 3.1
again).

You repeat this process with every word as the "current" word. The critical
insight of using this mechanism over an RNN is that you can do this repetition
in parallel because each "current" word does not depend on any of the previous
ones.
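
Here's a rough sketch of how I picture steps 1-3, in PyTorch with made-up
dimensions (the real model uses gated units and more machinery, so treat this
as pseudocode):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, dim, k, num_layers = 1000, 256, 3, 6

embed = nn.Embedding(vocab_size, dim)        # step 1: word embeddings
convs = nn.ModuleList(
    [nn.Conv1d(dim, dim, k, padding=k // 2) for _ in range(num_layers)]
)

src = torch.randint(0, vocab_size, (1, 10))  # a batch of one 10-word sentence
x = embed(src).transpose(1, 2)               # (batch, dim, length)
for conv in convs:                           # step 2: stacked convolutions
    x = F.relu(conv(x)) + x                  # over runs of k words, 6 layers deep

enc = x.transpose(1, 2)                      # (batch, length, dim)
dec_state = torch.randn(1, 1, dim)           # decoder state for the "current" word
scores = torch.bmm(dec_state, enc.transpose(1, 2))
attn = F.softmax(scores, dim=-1)             # step 3: weight each source word
context = torch.bmm(attn, enc)               # weighted summary for decoding
```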

Am I on the right track?

~~~
jgehring
Yes, that's pretty accurate. Step 3 (attention) is repeated multiple times,
i.e. for each layer in the decoder. With each additional layer, you
incorporate more of the previously translated text as well as information
about which parts of the source sentence representation were used to generate
it. The independence of the current word from the previous words applies to
the training phase, as a complete reference translation is provided and the
model is trained to predict only the single next word at each position. This
kind of computation would be very inefficient with an RNN: it would have to
run over each word in every layer sequentially, which prohibits efficient
batching.
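
To make that concrete, here's a hedged toy sketch of the training-time
parallelism (plain PyTorch; the trivial embed-and-project model below is just
a stand-in for the actual convolutional decoder): given the full reference,
every next-word prediction is scored in a single batched pass.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim, T = 1000, 64, 12
target = torch.randint(0, vocab, (1, T))  # complete reference translation
embed = nn.Embedding(vocab, dim)
proj = nn.Linear(dim, vocab)              # stand-in for the real decoder

dec_in = target[:, :-1]                   # positions 1..T-1 as decoder input
logits = proj(embed(dec_in))              # all next-word predictions at once
loss = F.cross_entropy(logits.reshape(-1, vocab),
                       target[:, 1:].reshape(-1))  # predict positions 2..T
```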

When generating a translation for a new sentence, the model uses classic beam
search where the decoder is evaluated on a word-by-word basis. It's still
pretty fast since the source-side network is highly parallelizable and running
the decoder for a single word is relatively cheap.
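
For readers unfamiliar with it, here is a generic beam search loop (not our
exact implementation; `next_word_logprobs` is a hypothetical stand-in for one
decoder step):

```python
def beam_search(next_word_logprobs, beam_size=5, max_len=50, eos=2):
    """next_word_logprobs(prefix) -> list of (token, logprob) candidates."""
    beams = [([], 0.0)]                  # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in next_word_logprobs(prefix):
                candidates.append((prefix + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates:
            if prefix[-1] == eos:
                finished.append((prefix, score))  # hypothesis complete
            elif len(beams) < beam_size:
                beams.append((prefix, score))     # keep top-k unfinished
        if not beams:
            break
    return max(finished + beams, key=lambda b: b[1])
```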

------
forgotmyhnacc
I really like that Facebook open sources both code and models along with the
paper. Most companies don't, e.g. Google, DeepMind, Baidu.

~~~
albertzeyer
Google has released some framework for translation:
[https://github.com/google/seq2seq/](https://github.com/google/seq2seq/)

DeepMind has also released a framework with all the building blocks for
translation:
[https://github.com/deepmind/sonnet](https://github.com/deepmind/sonnet)

~~~
sprobertson
Sure, but these don't address the parent's statement: they don't release code
with the research. Both came years after the original seq2seq paper.

~~~
denzil_correa
Google's Seq2Seq came 3-4 months after the NMT paper.

------
gavinpc
> Facebook's mission of making the world more open

That's a rather strong statement, for a company that has become one of the
world's most complained-about black boxes.

But yes, they have done a lot of good in the computer science space.

~~~
blacksmythe
> Facebook's mission of making the world more open

Like many big companies, they want to commoditize their products' complements.

"Smart companies try to commoditize their products' complements."
[https://www.joelonsoftware.com/2002/06/12/strategy-
letter-v/](https://www.joelonsoftware.com/2002/06/12/strategy-letter-v/)

~~~
beagle3
And they innovated in this space with the "facebook patent grant" \-- they
give away free stuff, with a patent grant that disappears as soon as you sue
them.

And they're better at marketing than many - heard of the amazing new zlib
replacement, Zstd? It's better in every way except one - unlike zlib
(unconditionally patent free), it is only patent free as long as you don't sue
Facebook. But almost no one is aware of that.

------
snippyhollow
paper: [https://s3.amazonaws.com/fairseq/papers/convolutional-
sequen...](https://s3.amazonaws.com/fairseq/papers/convolutional-sequence-to-
sequence-learning.pdf)

code:
[https://github.com/facebookresearch/fairseq](https://github.com/facebookresearch/fairseq)

pre-trained models: [https://github.com/facebookresearch/fairseq#evaluating-
pre-t...](https://github.com/facebookresearch/fairseq#evaluating-pre-trained-
models)

~~~
Eridrus
One logical continuation of adding more attention steps is to let the network
itself decide how many attention steps to take, à la "Adaptive Computation
Time for Recurrent Neural Networks". Are you planning to go in that direction?

~~~
ninjin
One of my students tried something along these lines for Natural Language
Inference (NLI) last year. [1] The results were not conclusive, but perhaps
Machine Translation is a better target? My reason for believing this is that
the specific dataset for NLI most likely does not require multiple steps of
inference in most cases (you can get away with simple token overlap), while
the decoder in MT might, since it is constrained to output a single token at
each step.

[1]: [https://arxiv.org/abs/1610.07647](https://arxiv.org/abs/1610.07647)
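
For reference, a loose sketch of the ACT-style halting being discussed
(PyTorch; the `step` layer is a hypothetical stand-in for one attention hop):

```python
import torch
import torch.nn as nn

class ACTLoop(nn.Module):
    def __init__(self, dim, max_steps=10, eps=0.01):
        super().__init__()
        self.step = nn.Linear(dim, dim)   # stand-in for one attention hop
        self.halt = nn.Linear(dim, 1)     # predicts halting probability
        self.max_steps, self.eps = max_steps, eps

    def forward(self, state):
        total_halt = torch.zeros(state.size(0), 1)
        out = torch.zeros_like(state)
        for _ in range(self.max_steps):
            state = torch.tanh(self.step(state))
            p = torch.sigmoid(self.halt(state))
            p = torch.min(p, 1 - total_halt)  # never exceed total prob 1
            out = out + p * state             # probability-weighted output
            total_halt = total_halt + p
            if (total_halt > 1 - self.eps).all():
                break                         # network chose to stop early
        return out
```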

------
pwaivers
As far as I understood it, Facebook put lots of research into optimizing a
certain type of neural network (CNN), while everyone else is using another
type called RNN. Up until now, CNNs were faster but less accurate. However, FB
has advanced CNNs to the point where they can compete in accuracy,
particularly in machine translation. And most importantly, they are releasing
the source code and papers. Does that sound right?

Can anyone else give us an ELI5?

~~~
mabbo
I'll give it a shot.

Traditional neural networks worked like this: you have k inputs to a layer
and j outputs, so you have O(k * j) parameters, effectively multiplying the
inputs by the parameters to get the outputs. And if you have lots of inputs to
each layer, and lots of layers, you have a _lot_ of parameters. Too many
parameters = overfitting to your training data pretty quickly. But ideally you
_want_ big networks, to get high accuracy. So the question is how to reduce
the number of parameters while still having the same 'power' in the network.

CNNs (Convolutional Neural Networks) solve this problem by tying weights
together. Instead of connecting every input to every output, you build a small
set of functions at each layer, each with a small number of parameters, and
apply them to nearby groups of inputs. Images are the best way to describe
this: a function takes as input a small (3x3 or 5x5) group of pixels in the
image and outputs a single result, but the same function is applied all over
the image. Picture a little 5x5 box moving around the image, running the
function at each stop.

This has given some pretty incredible results in the image-recognition problem
space, and they're super simple to train.
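
Here's a toy version of that sliding box in plain NumPy (sizes made up), just
to show the weight tying: one 5x5 filter is 25 parameters, reused at every
position, instead of a separate weight for every input-output pair.

```python
import numpy as np

image = np.random.rand(32, 32)   # a grayscale image
kernel = np.random.rand(5, 5)    # the single shared function (25 parameters)

out = np.zeros((28, 28))         # 32 - 5 + 1 = 28 valid positions per axis
for i in range(28):
    for j in range(28):
        patch = image[i:i+5, j:j+5]
        out[i, j] = np.sum(patch * kernel)  # same weights at every stop
```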

Another approach, Recurrent Neural Networks (RNNs), turns the model around in
a different way. Instead of taking a long list of inputs that all come at
once, it takes each input one at a time (or maybe a group at a time, same
idea) and runs the neural-network machinery to build up to a single answer. So
you might feed it one word of English input at a time, and after a few words,
it starts outputting one word at a time in French until the inputs run out and
the output says it's the end of the sentence.
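
And a matching toy sketch of the recurrent idea, one word at a time; note that
each step needs the previous hidden state, which is exactly what makes RNNs
hard to parallelize:

```python
import numpy as np

dim = 8
W_in = np.random.rand(dim, dim)             # input weights
W_h = np.random.rand(dim, dim)              # recurrent weights
h = np.zeros(dim)                           # hidden state
for word_vec in np.random.rand(5, dim):     # 5 input words, fed one by one
    h = np.tanh(W_in @ word_vec + W_h @ h)  # depends on the previous h
```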

What Facebook is doing is applying CNNs to text-sequence and translation
problems. It seems to me that what they have here is kind of an RNN-CNN
hybrid.

Caveats: I'm an idiot! I just read a lot and play around with ML, but I'm not
an expert. Please correct me if I'm wrong, smarter people, by replying.

~~~
jorgemf
> Please correct me if I'm wrong, smarter people, by replying.

You are not an idiot; maybe not an expert, but definitely not an idiot. Your
description is quite easy to understand for someone without knowledge of the
field. I would only add that RNNs are called recurrent because they have
recurrent connections with other neurons, and that is why they are hard to
parallelize. You need the output of one neuron to compute the output of
another neuron in the same layer, so you cannot parallelize that layer. This
doesn't happen in CNNs.

------
deepnotderp
As far as I understand, only the use of the attention mechanism with ConvNets
is novel, right? Convolutional encoders have been done before.

~~~
jgehring
Yes, there have been a couple of attempts to use CNNs for translation already,
but none of them outperformed big and well-tuned LSTM systems. We propose an
architecture that is fast to run, easy to optimize and can scale to big
networks, and could thus be used as a base architecture for future research.

There are a couple of contributions in the paper
([https://arxiv.org/abs/1705.03122](https://arxiv.org/abs/1705.03122)) apart
from demonstrating the feasibility of CNNs for translation, e.g. the multi-hop
attention in combination with a CNN language model, the wiring of the CNN
encoder[1], or an initialization scheme for GLUs that, when combined with
appropriate scaling for residual connections, enables the training of very
deep networks without batch normalization.

[1] In previous work
([https://arxiv.org/abs/1611.02344](https://arxiv.org/abs/1611.02344)), we
required two CNNs in the encoder: one for the keys (dot products) and one for
the values (decoder input).
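
For readers unfamiliar with GLUs, a minimal sketch of the gating (PyTorch,
illustrative sizes; see the paper for the exact block, including the residual
scaling mentioned above):

```python
import torch
import torch.nn as nn

d, k = 256, 3
conv = nn.Conv1d(d, 2 * d, k, padding=k // 2)  # emit 2*d channels

x = torch.randn(1, d, 10)                      # (batch, channels, length)
a, b = conv(x).chunk(2, dim=1)                 # split into values and gates
y = a * torch.sigmoid(b)                       # GLU; same as F.glu(conv(x), dim=1)
```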

~~~
dooxoo
> there have been a couple of attempts to use CNNs for translation already,
> but none of them outperformed big and well-tuned LSTM systems

It is true that the QRNN had results mostly on small-scale benchmarks, but it
seemed that ByteNet, especially the second version, had SOTA results both for
character-level language modeling and character-level machine translation on
the same large-scale En-De WMT task that is used in this paper.

MT with characters is potentially much harder than with words or word pieces
with regard to ordering, structure, etc., since the encoded sequences are 5 or
6 times longer on average and the meanings of words need to be built up from
individual characters.

~~~
jgehring
Yes, ByteNet v2 outperforms LSTMs on characters but not on word pieces. It
would be interesting to see how our model performs on characters, especially
when scaled up to the size of ByteNet (30+30 layers) and also how ByteNet
performs on BPE codes. I think that character-level NMT is definitely
interesting and worth investigating, but from a practical point of view it
makes sense to choose a representation that achieves the maximum translation
accuracy and speed.
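
For readers unfamiliar with the word-piece/BPE representation being compared
here, a rough sketch of the byte-pair-encoding merge loop (plain Python; real
toolkits also weight pairs by word frequency and handle vocabularies):

```python
from collections import Counter

def merge_pair(w, a, b):
    """Replace each adjacent (a, b) in the symbol list w with the merged symbol."""
    out, i = [], 0
    while i < len(w):
        if i < len(w) - 1 and w[i] == a and w[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(w[i])
            i += 1
    return out

def learn_bpe(words, num_merges):
    """words: list of symbol lists, e.g. [['l', 'o', 'w', '</w>'], ...]"""
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))       # count adjacent symbol pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]   # most frequent pair wins
        merges.append((a, b))
        words = [merge_pair(w, a, b) for w in words]
    return merges
```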

------
mrdrozdov
In this work, Convolutional Neural Nets (spatial models with a weakly ordered
context, as opposed to Recurrent Neural Nets, which are sequential models with
a strongly ordered context) are demonstrated to achieve State of the Art
results in Machine Translation.

It seems the combination of gated linear units / residual connections /
attention was the key to bringing this architecture to State of the Art.

It's worth noting that previously the QRNN and ByteNet architectures have used
Convolutional Neural Nets for machine translation also. IIRC, those models
performed well on small tasks but were not able to best SotA performance on
larger benchmark tasks.

I believe it is almost always more desirable to encode a sequence using a CNN
if possible, as many operations are embarrassingly parallel!

The BLEU scores in this work were the following:

Task (previous baseline): new baseline

\- WMT’16 English-Romanian (28.1): 29.88

\- WMT’14 English-German (24.61): 25.16

\- WMT’14 English-French (39.92): 40.46

------
londons_explore
This smells of "we built custom silicon to do fast image processing using CNNs
and fully connected networks, and now we want to use that same silicon for
translations. "

~~~
alexanderdmitri
Yesterday I was reading about SyntaxNet (I believe an RNN), developed by
Google. One interesting problem they've run into is getting the system to
properly interpret ambiguities. They use the example sentence "Alice drove
down the street in her car":

"The first [possible interpretation] corresponds to the (correct)
interpretation where Alice is driving in her car; the second [possible
interpretation] corresponds to the (absurd, but possible) interpretation where
the street is located in her car. The ambiguity arises because the preposition
in can either modify drove or street; this example is an instance of what is
called prepositional phrase attachment ambiguity."[1]

One thing I believe helps humans interpret these ambiguities is the ability to
form visuals from language. An NN that could potentially interpret/manipulate
images and decode language seems like it could help solve the above problem
and also be applied to a great deal of other things. I imagine (I know
embarrassingly little about NNs) this would also introduce a massive amount of
complexity.

[1] [https://research.googleblog.com/2016/05/announcing-
syntaxnet...](https://research.googleblog.com/2016/05/announcing-syntaxnet-
worlds-most.html)

------
shriphani
I wonder if they can combine this with ByteNet (dilated convolutions in place
of vanilla convs) - that gives you a larger field of view; add in attention
and then you probably have a new SOTA.
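
A quick sketch of why dilation buys that larger field of view (PyTorch, toy
sizes): three kernel-3 layers with dilations 1, 2, 4 cover 15 inputs, versus
7 for three undilated layers.

```python
import torch
import torch.nn as nn

layers = nn.Sequential(                        # receptive field: 15 inputs
    nn.Conv1d(1, 1, 3, dilation=1, padding=1),
    nn.Conv1d(1, 1, 3, dilation=2, padding=2),
    nn.Conv1d(1, 1, 3, dilation=4, padding=4),
)
x = torch.randn(1, 1, 32)
print(layers(x).shape)                         # length preserved: (1, 1, 32)
```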

------
pama
This is a very cool development. Has anyone written a PyTorch or Keras version
of the architecture?

------
m00x
Does this mean that we're close to being able to use CNNs for text-to-speech?

~~~
option
Yes - WaveNet:
[https://arxiv.org/pdf/1609.03499.pdf](https://arxiv.org/pdf/1609.03499.pdf)

------
esMazer
no demo?

~~~
t3rmi
Yeah, I was searching for one as well. I hope someone can link the demo page
if possible. I want to see the comparison between Systran, Google, and FB.

~~~
jgehring
There is no online demo but you can run the pre-trained models on your local
machine: [https://github.com/facebookresearch/fairseq#quick-
start](https://github.com/facebookresearch/fairseq#quick-start). CPU-only
versions of the models are available as well.

For a comparison with other translation services, keep in mind that our models
have been trained on publicly available news data exclusively, e.g. this
corpus for English-French: [http://statmt.org/wmt14/translation-
task.html#Download](http://statmt.org/wmt14/translation-task.html#Download) .

------
danielvf
TLDR: Cutting-edge accuracy, nine times faster than the previous state of the
art, with published models and source code.

But go read the article - nice animated diagrams in there.

~~~
EternalData
Haha, this really was TLDR-length. Kudos!

