- Encode the words in the source (aka embedding, section 3.1)
- Feed every run of k words into a convolutional layer producing an output, repeat this process 6 layers deep (section 3.2).
- Decide on which input word is most important for the "current" output word (aka attention, section 3.3).
- The most important word is decoded into the target language (section 3.1 again).
You repeat this process with every word as the "current" word. The critical insight of using this mechanism over an RNN is that you can do this repetition in parallel because each "current" word does not depend on any of the previous ones.
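The parallelism point can be sketched in a few lines (the names and dimensions here are made up for illustration, not the paper's actual architecture): with a convolutional encoder, each output position depends only on a fixed window of k inputs, so every position could be computed at once.

```python
import numpy as np

# Toy sketch: each output position sees only a fixed window of k inputs,
# so positions are independent of one another (unlike an RNN's hidden state).
rng = np.random.default_rng(0)
seq_len, d, k = 10, 8, 3                    # 10 words, 8-dim embeddings, window of 3
x = rng.standard_normal((seq_len, d))       # embedded source sentence
w = rng.standard_normal((k, d, d))          # one convolutional filter bank

# Pad so every position has a full window, then compute every output.
# (Written as a loop, but each iteration is independent of the others,
# which is exactly what makes it parallelizable.)
xp = np.concatenate([np.zeros((k - 1, d)), x])
y = np.stack([sum(xp[t + j] @ w[j] for j in range(k)) for t in range(seq_len)])

print(y.shape)   # (10, 8): one output vector per source position
```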
Am I on the right track?
When generating a translation for a new sentence, the model uses classic beam search where the decoder is evaluated on a word-by-word basis. It's still pretty fast since the source-side network is highly parallelizable and running the decoder for a single word is relatively cheap.
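For anyone unfamiliar with beam search, here is a minimal sketch. The scoring model below is a stand-in toy distribution, not the paper's decoder: at each step, every hypothesis is extended with every candidate word, and only the `beam_size` highest-scoring partial translations are kept.

```python
import math

def beam_search(next_log_probs, beam_size=2, max_len=2):
    """next_log_probs(prefix) -> {word: log_prob} for the next word."""
    beams = [([], 0.0)]                       # (prefix, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word, lp in next_log_probs(prefix).items():
                candidates.append((prefix + [word], score + lp))
        # Keep only the top `beam_size` partial hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

# Toy stand-in model: always prefers "le", then "chat", then "dort".
def toy_model(prefix):
    return {"le": math.log(0.6), "chat": math.log(0.3), "dort": math.log(0.1)}

best = beam_search(toy_model, beam_size=2, max_len=2)
print(best[0][0])   # ['le', 'le']: the highest-scoring two-word hypothesis
```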
DeepMind has also released a framework with all the building blocks for translation: https://github.com/deepmind/sonnet
Google and DeepMind have released a lot of stuff; I don't feel I have the right to complain about it.
Most papers are not about implementation and more about the concepts or proofs. They are rather straightforward to reimplement, and I don't think anybody is accusing them of faking their results.
That is because of their overwhelming influence, not the quality of their publications.
> They are rather straightforward to reimplement
Le and Mikolov's "Distributed Representations of Sentences and Documents", frequently cited as the original example of "doc2vec", could not be reproduced by Mikolov himself. 
> and I don't think anybody is accusing them of faking their results.
They sure aren't. That, too, is because of their overwhelming influence. You have to say very nicely that their results are wrong.
For example, here's an IBM research paper that leads and concludes with "we reimplemented doc2vec and made it work well", and whispers "but not as well as Le said". 
> Le and Mikolov's "Distributed Representations of Sentences and Documents", frequently cited as the original example of "doc2vec", could not be reproduced by Mikolov himself.

That statement is an overstatement: there was only one part that couldn't be completely reproduced.
It's true that Quoc Le's results on the dmpv version of doc2vec have been hard to reproduce. However, the very stackexchange link you cite above points out that it can be reproduced by not shuffling the data. It's likely that this was an oversight.
However - and it's an important thing - the reason this example gets some attention is because doc2vec is a very strong model even in dbow form.
> here's an IBM research paper that leads and concludes with "we reimplemented doc2vec and made it work well"
No, they took the Gensim doc2vec implementation and experimented with parameters on different datasets.
Also, Mikolov's Word2Vec work was even more important than doc2vec and was fully reproducible and was released with code and trained models, while at Google.
Not really. Very often you will find that the crucial details that make it work are missing. Not sure if things have vastly improved over the past 5-6 years.
For example, the NIPS proceedings (hundreds of papers): https://papers.nips.cc/book/advances-in-neural-information-p... Source code is available for only around 25 of them (2 in Google GitHub repos): https://www.reddit.com/r/MachineLearning/comments/5hwqeb/pro...
Most people on the Internet do.
That's a rather strong statement, for a company that has become one of the world's most complained-about black boxes.
But yes, they have done a lot of good in the computer science space.
Like many big companies, they want to commoditize their products' complements.
"Smart companies try to commoditize their products' complements."
And they're better at marketing than many - heard of the amazing new zlib replacement, Zstd? It's better in every way except one - unlike zlib (unconditionally patent free), it is only patent free as long as you don't sue Facebook. But almost no one is aware of that.
Or is the comment only tangential to OP?
pre-trained models: https://github.com/facebookresearch/fairseq#evaluating-pre-t...
Can anyone else give us an ELI5?
Traditional Neural Networks worked like this: You have k inputs to a layer, and j outputs, so you have O(k * j) parameters, effectively multiplying the inputs by the parameter to get the outputs. And if you have lots of inputs to each layer, and lots of layers, you have a lot of parameters. Too many parameters = overfitting to your training data pretty quickly. But you want big networks, ideally, to get super accuracy. So the question is how to reduce the number of parameters while still having the same 'power' in the network.
CNNs (Convolutional Neural Networks) solve this problem by tying weights together. Instead of multiplying every input by every output weight, you build a small set of functions at each layer, each with a small number of parameters, and apply them to nearby groups of inputs. Images are the best way to describe this: a function will take as inputs small (3x3 or 5x5) groups of pixels in the image, and output a single result. But the same function is applied all over the image. Picture a little 5x5 box moving around the image, running the function at each stop.
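To put rough numbers on the parameter savings (the dimensions here are made up for illustration): a fully connected layer needs one weight per input-output pair, while a conv layer reuses one small filter everywhere, so its parameter count doesn't grow with the image size.

```python
# Back-of-the-envelope parameter counts, dense vs. convolutional.
pixels = 256 * 256            # grayscale image, one input per pixel
hidden = 1024                 # number of outputs of the layer

dense_params = pixels * hidden     # every input connected to every output
conv_params = 5 * 5 * 64           # one 5x5 filter bank with 64 channels,
                                   # slid over the whole image (weights shared)

print(dense_params)   # 67108864
print(conv_params)    # 1600
```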
This has given some pretty incredible results in the image-recognition problem space, and they're super simple to train.
Another approach, Recurrent Neural Networks (RNNs), turns the model around in a different way. Instead of having a long list of inputs that all come at once, it takes each input one at a time (or maybe a group at a time, same idea) and runs the neural-network machinery to build up to a single answer. So you might feed it one word at a time of input in English, and after a few words, it starts outputting one word at a time in French until the inputs run out and the output says it's the end of the sentence.
What Facebook is doing is applying CNNs to text-sequence and translation problems. It seems to me that what they have here is kind of a RNN-CNN hybrid.
Caveats: I'm an idiot! I just read a lot and play around with ML, but I'm not an expert. Please correct me if I'm wrong, smarter people, by replying.
You are not an idiot, maybe not an expert but definitely not an idiot. Your description is quite easy to understand for someone without knowledge of the field. I would add only that RNNs are called recurrent because they have recurrent connections with other neurons, and that is why they are hard to parallelize. You need the output of one neuron to compute the output of another neuron in the same layer, so you cannot parallelize that layer. This doesn't happen in a CNN.
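That sequential dependency is easy to see in code (the tanh cell and shapes below are a generic textbook RNN, not any particular paper's model): h[t] needs h[t-1], so the time loop cannot be parallelized the way a convolution can across positions.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
W_x = rng.standard_normal((d, d))   # input-to-hidden weights
W_h = rng.standard_normal((d, d))   # hidden-to-hidden (the recurrent connection)

def rnn(inputs):
    h = np.zeros(d)
    for x in inputs:                # each step must wait for the previous h
        h = np.tanh(x @ W_x + h @ W_h)
    return h

h_final = rnn(rng.standard_normal((6, d)))   # a 6-word "sentence"
print(h_final.shape)   # (4,)
```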
Let me add this though:
Artificial neural networks were proposed to compute the probability of a sequence of words occurring; RNNs were the next step in Natural Language Processing since, unlike the previously proposed architectures, they accept variable-length sequences as input.
However, a simple RNN architecture didn't allow long-term dependencies to be captured (that is, using statistical modeling to predict a word based on an idea developed earlier in the text). So two fancier RNN architectures were developed to tackle this problem: GRUs and LSTMs. Production systems already implement these architectures, and they are yielding pretty accurate results.
But now Facebook researchers are proposing using CNNs for this task because this architecture can take more advantage of GPU parallelism.
They showed how to use a CNN with text to get a speed boost, even though that's not how it's normally been done.
There are a couple of contributions in the paper (https://arxiv.org/abs/1705.03122) apart from demonstrating the feasibility of CNNs for translation, e.g. the multi-hop attention in combination with a CNN language model, the wiring of the CNN encoder, or an initialization scheme for GLUs that, when combined with appropriate scaling for residual connections, enables the training of very deep networks without batch normalization.
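Since GLUs come up here: a gated linear unit takes a conv output with 2d channels and uses half of them to gate the other half. A quick sketch (dimensions made up for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def glu(y):
    """y: (..., 2d) conv output -> (..., d) gated output."""
    a, b = np.split(y, 2, axis=-1)
    return a * sigmoid(b)            # b acts as a per-channel gate on a

y = np.random.default_rng(3).standard_normal((10, 16))  # 10 positions, 2d = 16
out = glu(y)
print(out.shape)   # (10, 8): channel count is halved by the gating
```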
 In previous work (https://arxiv.org/abs/1611.02344), we required two CNNs in the encoder: one for the keys (dot products) and one for the values (decoder input).
It is true that QRNN had results on mostly small-scale benchmarks, but it seemed that ByteNet, especially the second version, had SOTA results both for character-level language models and for character-level machine translation on the same large-scale En-De WMT task used in this paper.
MT with characters, with regards to ordering, structure, etc, is potentially much harder than with words or word-pieces, since the encoded sequences are 5 or 6 times longer on average, and the meanings of words need to be built up from individual characters.
It seems the combination of gated linear units / residual connections / attention was the key to bringing this architecture to State of the Art.
It's worth noting that previously the QRNN and ByteNet architectures have used Convolutional Neural Nets for machine translation also. IIRC, those models performed well on small tasks but were not able to best SotA performance on larger benchmark tasks.
I believe it is almost always more desirable to encode a sequence using a CNN if possible as many operations are embarrassingly parallel!
The BLEU scores in this work were the following:
Task (previous baseline): new baseline
WMT’16 English-Romanian (28.1): 29.88
WMT’14 English-German (24.61): 25.16
WMT’14 English-French (39.92): 40.46
"The first [possible interpretation] corresponds to the (correct) interpretation where Alice is driving in her car; the second [possible interpretation] corresponds to the (absurd, but possible) interpretation where the street is located in her car. The ambiguity arises because the preposition in can either modify drove or street; this example is an instance of what is called prepositional phrase attachment ambiguity."
One thing I believe helps humans interpret these ambiguities is the ability to form visuals from language. A NN that could potentially interpret/manipulate images and decode language seems like it could help solve the above problem and also be applied to a great deal of other things. I imagine (I know embarrassingly little about NNs) this would also introduce a massive amount of complexity.
For a comparison with other translation services, keep in mind that our models have been trained on publicly available news data exclusively, e.g. this corpus for English-French: http://statmt.org/wmt14/translation-task.html#Download .
But go read the article: there are nice animated diagrams in there.