
Dynamic Memory Networks for Visual and Textual Question Answering - evc123
http://arxiv.org/abs/1603.01417
======
nicklo
MetaMind papers are always pretty awesome. Couple of great highlights from
this paper:

The visual saliency maps in Figure 6 are astounding and make the model's
performance on this task even more impressive, as they give a lot of
insight into what the model is doing: it seems to focus on the things in
the image that a regular person would use to decide on the answer. Most
striking was the question "is this in the wild?", where the saliency fell
on the artificial, human-made structures in the background that indicated
it was in a zoo. This type of reasoning is surprising, as it requires a bit
of a reversal in logic to come up with this way of answering the question.
Super impressive.

The proposed input fusion layer is pretty cool - allowing information from
future sentences to condition how previous sentences are processed. This
type of information combining was not really explored before, and it makes
sense that it improves performance on the bAbI-10k task so much, since
back-tracking is an important tool for human reading comprehension. It's
also clever that they encode each sentence as a thought vector before
composing, so the sentences can be processed both forwards and backwards
with shared parameters; doing so on raw one-hot words or even ordered word
embeddings would require two vastly different parameter sets, since grammar
is wildly different when a sentence is reversed.
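
As a rough illustration (not the paper's implementation), here is a minimal
numpy sketch of the idea: collapse each sentence to a single vector, then
run a simple RNN over those sentence vectors in both directions so every
sentence representation carries information from both earlier and later
sentences. The paper uses positional encoding and GRUs; the plain tanh RNN
and all names here are just for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
n_sentences, d = 5, 8

# Pretend these are per-sentence "thought vectors" (the paper builds them
# with a positional encoding of the word embeddings).
sentences = rng.randn(n_sentences, d)

# One set of parameters, reused in both directions - reasonable for
# sentence vectors, unlike for reversed word sequences.
W_in = rng.randn(d, d) * 0.1
W_h = rng.randn(d, d) * 0.1

def rnn_pass(xs):
    """A toy tanh RNN over a sequence of sentence vectors."""
    h = np.zeros(d)
    out = []
    for x in xs:
        h = np.tanh(x @ W_in + h @ W_h)
        out.append(h)
    return np.stack(out)

forward = rnn_pass(sentences)
backward = rnn_pass(sentences[::-1])[::-1]  # reverse, run, un-reverse

# "Fused" facts: each sentence now conditions on past AND future sentences.
facts = forward + backward
print(facts.shape)  # (5, 8)
```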

Lastly, on a side note: if 2014 was the year of CNNs and 2015 the year of
RNNs, it looks like 2016 is the year of neural attention mechanisms.
Excited to see which new layers explored in 2016 will dominate 2017.

~~~
Smerity
Thanks for the great comment! For a direct link to the visual saliency maps
from the paper (Figure 6) which @nicklo mentions:

[http://i.imgur.com/DRfaNxB.png](http://i.imgur.com/DRfaNxB.png)

Being able to see where a neural attention mechanism is looking is
fascinating and allows for far more than just introspection. Indeed, one of
the earliest and coolest examples is a neural attention mechanism learning
to align words between languages without any help[1]!

[http://i.imgur.com/J5zFZzN.png](http://i.imgur.com/J5zFZzN.png)

New papers on neural attention seem to be coming out every few weeks. I
devoted much of my weekend to a naive implementation of hierarchical
attentive memory[2], which promises O(log n) lookup (important for
improving the speed of attention-based algorithms on large inputs) and can
be "taught" to sort in O(n log n) - really exciting stuff!
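
If the O(log n) claim seems mysterious: the paper stores memory cells at
the leaves of a binary tree and learns one function to summarize subtrees
and another to steer a root-to-leaf descent, so a lookup only touches
O(log n) nodes. Here is a toy numpy sketch of just that traversal idea -
the greedy descent and the `build_tree`/`p_left` stand-ins are mine, not
the paper's actual JOIN/SEARCH networks or its training procedure:

```python
import numpy as np

rng = np.random.RandomState(0)
d = 8          # memory cell size
n_leaves = 16  # number of memory cells; a power of two gives a full tree

leaves = rng.randn(n_leaves, d)

def build_tree(leaves):
    """Bottom-up pass: each parent summarizes its two children.
    The paper learns this function; the mean is a stand-in."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append((prev[0::2] + prev[1::2]) / 2.0)
    return levels[::-1]  # levels[0] is the root, levels[-1] the leaves

# Toy stand-in for the descent network: p(go left | query, node).
W = rng.randn(2 * d) * 0.1
def p_left(query, node):
    return 1.0 / (1.0 + np.exp(-np.concatenate([query, node]) @ W))

def lookup(query, levels):
    """Greedy root-to-leaf descent: log2(n_leaves) decisions, versus
    n_leaves scores for ordinary soft attention."""
    idx = 0
    node = levels[0][0]  # start at the root
    for level in levels[1:]:
        idx = 2 * idx + (0 if p_left(query, node) > 0.5 else 1)
        node = level[idx]
    return node  # one memory cell, found in O(log n)

levels = build_tree(leaves)
cell = lookup(rng.randn(d), levels)
print(cell.shape)  # (8,)
```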

For those interested, I highly recommend reading NVIDIA's "Introduction to
Neural Machine Translation with GPUs"[3]. The three part overview is a great
introduction.

I'd be really happy to see 2016 as the year of neural attention mechanisms :)

[1]: [http://arxiv.org/abs/1409.0473](http://arxiv.org/abs/1409.0473)

[2]: [http://arxiv.org/abs/1602.03218](http://arxiv.org/abs/1602.03218)

[3]: [https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-with-gpus/](https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-with-gpus/)

(full disclosure: one of the authors of the paper)

~~~
neurangotan
Hi, thanks for a great paper. I have read it, as well as the NVIDIA
articles, many times, but I am failing to grasp some important details. If
you could shed some light, that would be great.

The attention mechanism is a single-layer feed-forward neural network. The
problem is that the input sequence is variable length. How are the outputs
of the bi-RNN fed to the attention mechanism? What happens if I have an
80-word sentence as input, and what happens if I have a 10-word sentence as
input?

~~~
kastnerkyle
It is a dense read, but you might have a look at [1]. This is how attention is
implemented in Theano. Basically the key is going "3D" per timestep (where 2D
per timestep is the norm when doing minibatch training), then taking a
weighted sum over the correct axis to get the right size to combine with the
RNN state.

Short summary:

I: input length

M: minibatch size (same for input and output)

H_in: input hidden size (arbitrary/user selected)

H_out: output hidden size (arbitrary/user selected)

C: attention feature size (arbitrary/user selected)

Looking at the decode/generator RNN, "context" comes in _at every timestep_ as
every hidden state from the BiRNN (I, M, H_in) projected to (I, M, C). We do
the normal RNN thing (tanh, LSTM, GRU) for the generator at first to get a
decode hidden state (M, H_out).

Next the "first" output hidden state gets projected to the attention size C
(so now (M, C)), and using numpy broadcasting this (1, M, C) size thing gets
summed with the (I, M, C) size "context" which is a projection of the input
RNN hiddens. Now we have something that is looking at both what the output RNN
has previously done (and seen), and some context information from the input
RNN.

A nonlinearity (tanh) is applied to the (I, M, C) sized piece, partly in
order to bound the activation to (-1, 1) and partly just because we like
nonlinearity. This (I, M, C) size thing then gets projected to (I, M, 1),
or "alpha", then the useless dimension is dropped via reshape so we now
have (I, M), then a softmax is applied over the first dimension I. This
means that the network has effectively "weighted" each of the I timesteps
of the input. At the start of training it can't know what is relevant; at
the end of training, this weighting is the thing you visualize.

This (I, M) is the actual "attention" - to apply it you simply take the
original hidden states from the input BiRNN and use broadcasting to
multiply again. (I, M, 1) * (I, M, H_in) gives a weighting over every
timestep of the input, and finally we sum over the 0 axis (I) to get the
final "context" of just (M, H_in) size. This can then be projected to a new
size (H_out) and combined with the output hiddens to get an RNN that has
"seen" a weighted sum of the input, so it can generate conditioned on the
whole input sequence.
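
If it helps to see those shapes concretely, here is a self-contained numpy
sketch of the walkthrough above for a single decoder timestep (random
matrices stand in for the learned projections; the names are mine, not from
the dl4mt code):

```python
import numpy as np

rng = np.random.RandomState(0)
I, M, H_in, H_out, C = 7, 4, 16, 12, 10

# Stand-ins for the BiRNN hidden states and one decoder hidden state.
h_enc = rng.randn(I, M, H_in)   # all input timesteps
h_dec = rng.randn(M, H_out)     # decoder state at the current timestep

# "Learned" projections (random here).
W_ctx = rng.randn(H_in, C) * 0.1   # encoder states -> attention space
W_dec = rng.randn(H_out, C) * 0.1  # decoder state -> attention space
v = rng.randn(C) * 0.1             # (I, M, C) -> (I, M, 1)

# (I, M, H_in) -> (I, M, C) and (M, H_out) -> (1, M, C); broadcasting sums.
context_proj = h_enc @ W_ctx
dec_proj = (h_dec @ W_dec)[None, :, :]
pre_alpha = np.tanh(context_proj + dec_proj)  # (I, M, C)

# One score per input timestep, softmaxed over the time axis I.
energies = pre_alpha @ v                            # (I, M)
e = energies - energies.max(axis=0, keepdims=True)  # stabilized softmax
alpha = np.exp(e) / np.exp(e).sum(axis=0, keepdims=True)

# Weight every encoder state and sum over time:
# (I, M, 1) * (I, M, H_in) -> sum over I -> (M, H_in).
context = (alpha[:, :, None] * h_enc).sum(axis=0)
print(alpha.shape, context.shape)  # (7, 4) (4, 16)
```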

Note that this whole messy procedure I described happens per timestep - so
the neural network (given enough data) can learn what to look at in the
whole input in order to generate the best output. Other attentions such as
[2], [3] are constrained either to only move forward or to be a hybrid of
forward movement and (pooled) global lookup. [4] is a good summary of the
different possibilities in use today.

[1] [https://github.com/kyunghyuncho/dl4mt-material/blob/master/session2/nmt.py#L452](https://github.com/kyunghyuncho/dl4mt-material/blob/master/session2/nmt.py#L452)

[2] [http://arxiv.org/abs/1308.0850](http://arxiv.org/abs/1308.0850)

[3] [http://arxiv.org/abs/1508.04395](http://arxiv.org/abs/1508.04395)

[4] PPT: [http://www.thespermwhale.com/jaseweston/ram/slides/session2/Smooth%20Operators-NIPS2015.pptx](http://www.thespermwhale.com/jaseweston/ram/slides/session2/Smooth%20Operators-NIPS2015.pptx)

~~~
neurangotan
Thanks. That was the info I was looking for. Still a lot to digest but it is
going to be really helpful.

------
ogrisel
Out of curiosity, is this work being submitted for a conference or journal?

Also have you tried to run DMN on the Text QA / Reading comprehension datasets
from DeepMind?

Teaching Machines to Read and Comprehend: [https://github.com/deepmind/rc-data](https://github.com/deepmind/rc-data)

------
inlineint
It's fascinating.

Could anybody say which programming language is most likely to be used for
developing this kind of system? I haven't found that information in the
paper.

~~~
ogrisel
Python and Lua both have good frameworks for developing such models
(TensorFlow, Theano possibly with Lasagne or Keras, Torch, Caffe...).

The visual part needs access to an efficient GPU-based implementation of
convolutions (typically using cuDNN from NVIDIA or Neon from Nervana
Systems).

You could write bindings for other languages or even work directly in C++,
but I think that an interactive programming language with a REPL, like
Python or Lua, is very helpful for faster iterative model building,
interactive exploration, and evaluation.

~~~
inlineint
Thank you for the detailed answer. I didn't know about the frameworks you
listed except TensorFlow.

Btw, do you know anybody who uses Julia for this?

~~~
ogrisel
There is [https://github.com/pluskid/Mocha.jl](https://github.com/pluskid/Mocha.jl),
but it seems more focused on vision tasks, at least for now.

