The visual saliency maps in Figure 6 are astounding and make the model's performance on this task even more impressive. They give a lot of insight into what the model is doing, and it seems to focus on the parts of the image that a regular person would use to decide on the answer. Most striking was the question "is this in the wild?", where the saliency fell on the artificial, human-made structures in the background that indicated it was in a zoo. This type of reasoning is surprising, as it requires a bit of a reversal in logic to arrive at this way of answering the question. Super impressive.
The proposed input fusion layer is pretty cool - allowing information from future sentences to condition how previous sentences are processed. This type of information combining had not really been explored before, and it makes sense that it improves performance on the bAbI-10k task so much, since back-tracking is an important tool for human reading comprehension. It's also clever that they encode each sentence as a thought vector before compositing, so the sentences can be processed both forwards and backwards with shared parameters - doing so on just one-hot words or even ordered word embeddings would require two vastly different parameter sets, since grammar changes wildly when a sentence is reversed.
Lastly, on a side note: if 2014 was the year of CNNs and 2015 the year of RNNs, it looks like 2016 is the year of neural attention mechanisms. Excited to see what new layers are explored in 2016 that will dominate 2017.
Being able to see where a neural attention mechanism is looking is fascinating and allows for far more than just introspection. Indeed, one of the earliest and coolest examples is a neural attention mechanism learning to align words between languages without any help!
New papers on neural attention seem to come out every few weeks. I devoted much of my weekend to a naive implementation of hierarchical attentive memory, which promises O(log n) lookup (important for improving the speed of attention-based algorithms on large inputs) and can be "taught" to sort in O(n log n) - really exciting stuff!
For those interested, I highly recommend reading NVIDIA's "Introduction to Neural Machine Translation with GPUs". The three-part overview is a great introduction.
I'd be really happy to see 2016 as the year of neural attention mechanisms :)
(full disclosure: I'm one of the authors of the paper)
Maybe it could be used in conjunction with a chat bot trained with RL, like AlphaGo (which used millions of human games as input data). People would say something or ask a question, the net would propose a bunch of answers, and people would rate the answers that are most human-like, giving it a good/bad signal to use in training the RL part of the system. That would make the bot natural and human-like in conversation. Couple that with the reasoning part for answering questions and we get a reasoned, intelligent chat bot.
I'm wondering how far we are from being able to have normal conversations with bots - conversations that don't quickly get derailed or weird.
The attention mechanism is a single-layer feed-forward neural network. The problem is that the input sequence is variable length. How are the outputs of the bi-RNN fed to the attention mechanism? What happens if I have an 80-word sentence as input, and what happens if I have a 10-word sentence as input?
I: input length
M: minibatch size (same for input and output)
H_in: input hidden size (arbitrary/user selected)
H_out: output hidden size (arbitrary/user selected)
C: attention feature size (arbitrary/user selected)
Looking at the decoder/generator RNN, "context" comes in at every timestep as every hidden state from the BiRNN (I, M, H_in), projected to (I, M, C). We do the normal RNN thing (tanh, LSTM, GRU) for the generator at first to get a decoder hidden state (M, H_out).
Next the "first" output hidden state gets projected to the attention size C (so now (M, C)), and using numpy broadcasting this (1, M, C)-sized thing gets summed with the (I, M, C)-sized "context", which is a projection of the input RNN hiddens. Now we have something that looks at both what the output RNN has previously done (and seen) and some context information from the input RNN.
A nonlinearity (tanh) is applied to the (I, M, C)-sized piece, partly to bound the activation to (-1, 1) and partly just because we like nonlinearity. This (I, M, C)-sized thing then gets projected to (I, M, 1), or "alpha"; the useless dimension is dropped via reshape so we now have (I, M), and a softmax is applied over the first dimension I. This means the network has effectively "weighted" each of the I timesteps of the input. At the start of training it can't know what is relevant; at the end of training this is the thing you visualize.
This (I, M) is the actual "attention" - to apply it you simply take the original hidden states from the input BiRNN and use broadcasting to multiply again. (I, M, 1) * (I, M, H_in) gives a weighting over every timestep of the input, and finally we sum over the 0 axis (I) to get the final "context" of just (M, H_in) size. This can then be projected to a new size (H_out) and combined with the output hiddens to get an RNN that has "seen" a weighted sum of the input, so it can generate conditioned on the whole input sequence.
Note that this whole messy procedure I described happens per timestep - so the neural network (given enough data) can learn what to look at in the whole input in order to generate the best output. Other attentions such as , are constrained either to only move forward or to be a hybrid of forward movement and (pooled) global lookup.  is a good summary of the different possibilities in use today.
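The shape-juggling above is easier to see in code. Here's a minimal numpy sketch of one decoder timestep of this kind of attention - all the weight matrices (W_ctx, W_dec, v) and sizes are hypothetical placeholders, not the paper's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (names follow the walkthrough above; values are arbitrary).
I, M, H_in, H_out, C = 7, 4, 6, 5, 3

h_enc = rng.standard_normal((I, M, H_in))   # BiRNN hidden states, every input timestep
s_prev = rng.standard_normal((M, H_out))    # decoder hidden state from the previous step

# Hypothetical projection weights (jointly learned in the real model).
W_ctx = rng.standard_normal((H_in, C))      # encoder states -> attention size C
W_dec = rng.standard_normal((H_out, C))     # decoder state  -> attention size C
v = rng.standard_normal((C,))               # (I, M, C) -> one scalar score per timestep

context = h_enc @ W_ctx                     # (I, M, C)
query = s_prev @ W_dec                      # (M, C); broadcasts as (1, M, C)

scores = np.tanh(context + query) @ v       # (I, M): one score per input timestep

# Softmax over the input-length axis I -> the "alpha" attention weights.
alpha = np.exp(scores - scores.max(axis=0))
alpha /= alpha.sum(axis=0)                  # each column now sums to 1 over I

# Weighted sum of the original BiRNN states -> fixed-size context vector.
weighted = (alpha[:, :, None] * h_enc).sum(axis=0)   # (M, H_in)

assert weighted.shape == (M, H_in)
assert np.allclose(alpha.sum(axis=0), 1.0)
```

Note how the output is always (M, H_in) regardless of I - that's the whole trick for handling variable-length input.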
 PPT, http://www.thespermwhale.com/jaseweston/ram/slides/session2/...
Normally you pad each minibatch to the same length (length of the longest sequence in that minibatch), then carry around an additional mask to zero out any "unnecessary" results from padding for the shorter sequences.
The BiRNN (using all hidden states) + attention mechanism is what allows variable-length context to be fed to the generative decoder RNN. A regular RNN (using all hidden states) + attention, or even just the last hidden state of an RNN, can also be used to map variable-length sequences to a fixed length in order to condition the output generator.
You will note that padding to the length of the longest sequence in a minibatch wastes computation - people often sort and shuffle the input so that sequences of approximately the same length land in the same minibatch, to minimize the waste. If you padded to the overall longest sequence (rather than per minibatch), you would pay a massive computational overhead.
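The per-minibatch padding and mask I mentioned can be sketched in a few lines of numpy (the token ids here are made up purely for illustration):

```python
import numpy as np

# Hypothetical minibatch of token-id sequences with different lengths.
batch = [[5, 9, 2], [7, 1], [4, 8, 6, 3, 2]]
lengths = np.array([len(seq) for seq in batch])
max_len = lengths.max()

# Pad every sequence to the longest one in *this* minibatch...
padded = np.zeros((len(batch), max_len), dtype=np.int64)
for i, seq in enumerate(batch):
    padded[i, :len(seq)] = seq

# ...and carry a mask so the padded positions can be zeroed out downstream
# (e.g. so they contribute nothing to the attention softmax or the loss).
mask = (np.arange(max_len)[None, :] < lengths[:, None]).astype(np.float32)

assert padded.shape == (3, 5)
assert mask.sum() == lengths.sum()
```

Sorting by length before batching just makes `max_len` closer to the true length of every sequence in the batch, so fewer of those zeroed-out positions get computed at all.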
The paper says
e_ij = a(s_{i-1}, h_j) is an alignment model which scores how well the inputs around position j and the output at position i match. The score is based on the RNN hidden state s_{i-1} (just before emitting y_i) and the j-th annotation h_j of the input sentence.
We parametrize the alignment model a as a feedforward neural network which is jointly trained with all the other components of the proposed system.
So a is a feed-forward NN. What is the input to this NN? Do we input each h_j corresponding to each word of the input sentence separately and get one number for each word? If the input sentence has 200 words, do I run this 200 times, once per input word?
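From my reading, conceptually yes - one score per word - but in practice it's a single batched matrix operation over all j, not 200 separate forward passes. A sketch of the additive form (the weight names W_a, U_a, v_a are my own placeholders, not notation from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Input length, annotation size, decoder state size, scorer width (all arbitrary).
J, H, S, C = 200, 6, 5, 3

h = rng.standard_normal((J, H))   # h_j: one annotation per input word
s_prev = rng.standard_normal(S)   # s_{i-1}: decoder state just before emitting y_i

# Hypothetical weights of the feed-forward alignment model a(s_{i-1}, h_j).
W_a = rng.standard_normal((S, C))
U_a = rng.standard_normal((H, C))
v_a = rng.standard_normal(C)

# One matrix pass scores all J positions at once; no per-word loop needed.
e = np.tanh(s_prev @ W_a + h @ U_a) @ v_a   # shape (J,): e_ij for every j

# Softmax over j turns the scores into the attention weights alpha_ij.
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

assert e.shape == (J,)
assert np.isclose(alpha.sum(), 1.0)
```

So a 200-word sentence just means J = 200 rows in that matrix multiply, repeated once per output timestep i.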
Also, have you tried running the DMN on the text QA / reading comprehension datasets from DeepMind?
Teaching Machines to Read and Comprehend
Could anybody tell me which programming language is most likely to be used for developing this kind of system? I haven't found information about it in the paper.
The visual part needs access to an efficient GPU-based implementation of convolutions (typically using cuDNN from NVIDIA or Neon from Nervana Systems).
You could write bindings for other languages or even work directly in C++, but I think an interactive programming language with a REPL, like Python or Lua, is very helpful for faster iterative model building, interactive exploration, and evaluation.
By the way, do you know anybody who uses Julia for this?