Dynamic Memory Networks for Visual and Textual Question Answering (arxiv.org)
122 points by evc123 on March 7, 2016 | 15 comments

MetaMind papers are always pretty awesome. Couple of great highlights from this paper:

The visual saliency maps in Figure 6 are astounding and make their performance on this task even more impressive as they give a lot of insight into what the model is doing and it seems to be focusing on the things in the image that a regular person would use to decide on the answer. Most striking was on the question "is this in the wild?", and the saliency was on the artificial, human structures in the background that indicated it was in a zoo. This type of reasoning is surprising as it requires a bit of a reversal in logic to come up with this way of answering the question. Super impressive.

The proposed input fusion layer is pretty cool - allowing information from future sentences to be used to condition how to process previous sentences. This type of information combining was previously not really explored, and it makes sense that it improves the performance on the bAbI-10k task so much, as back-tracking is an important tool for human reading comprehension. Also it's clever that they encode each sentence as a thought vector before compositing so they can be processed both forwards and backwards with shared parameters - doing so on just one-hot words or even ordered word embeddings would require two vastly different sets of parameters since grammar is wildly different when a sentence is reversed.

Lastly, on a side note, if 2014 was the year of CNNs and 2015 the year of RNNs, it looks like 2016 is the year of Neural Attention Mechanisms. Excited to see what new layers are explored in 2016 that will dominate 2017.

Thanks for the great comment! For a direct link to the visual saliency maps from the paper (Figure 6) which @nicklo mentions:


Being able to see where a neural attention mechanism is looking is fascinating and allows for far more than just introspection. Indeed, one of the earliest and coolest examples is neural attention mechanisms learning to align words between languages without any help[1]!


New papers on neural attention seem to be coming out every few weeks. I devoted much of my weekend to a naive implementation of hierarchical attentive memory[2] that promises O(log n) lookup (important for improving the speed of attention based algorithms on large input) and can be "taught" to sort in O(n log n) - really exciting stuff!

For those interested, I highly recommend reading NVIDIA's "Introduction to Neural Machine Translation with GPUs"[3]. The three part overview is a great introduction.

I'd be really happy to see 2016 as the year of neural attention mechanisms :)

[1]: http://arxiv.org/abs/1409.0473

[2]: http://arxiv.org/abs/1602.03218

[3]: https://devblogs.nvidia.com/parallelforall/introduction-neur...

(full disclosure: one of the authors of the paper)

I see this system can take in a few statements and then apply reasoning in order to answer questions. Can we scale up this method to answer questions from Wikipedia pages, scientific papers and articles?

Maybe it could be used in conjunction with a chat bot that was trained with RL like AlphaGo (which used millions of human games as input data). People would say something or ask a question, then the net would provide a bunch of answers, and people would rate the answers that are most human-like, giving it a good/bad signal to use in training the RL part of the system. That would make the bot natural and human-like in conversation. Couple that with the reasoning part for answering questions and we get a reasoned/intelligent chat bot.

I'm wondering how far we are from being able to have normal conversations with bots - conversations that don't quickly get derailed or weird.

Hi, thanks for a great paper. I have read it, as well as the NVIDIA articles, many times, but I am failing to grasp some important details. If you could shed some light, that would be great.

The attention mechanism is a feed-forward, single-layer neural network. The problem is that the input sequence is variable length. How are the outputs of the bi-RNN fed to the attention mechanism? What happens if I have an 80-word sentence as input, and what happens if I have a 10-word sentence as input?

It is a dense read, but you might have a look at [1]. This is how attention is implemented in Theano. Basically the key is going "3D" per timestep (where 2D per timestep is the norm when doing minibatch training), then taking a weighted sum over the correct axis to get the right size to combine with the RNN state.

Short summary:

I: input length

M: minibatch size (same for input and output)

H_in: input hidden size (arbitrary/user selected)

H_out: output hidden size (arbitrary/user selected)

C: attention feature size (arbitrary/user selected)

Looking at the decode/generator RNN, "context" comes in at every timestep as every hidden state from the BiRNN (I, M, H_in) projected to (I, M, C). We do the normal RNN thing (tanh, LSTM, GRU) for the generator at first to get a decode hidden state (M, H_out).

Next the "first" output hidden state gets projected to the attention size C (so now (M, C)), and using numpy broadcasting this (1, M, C) size thing gets summed with the (I, M, C) size "context" which is a projection of the input RNN hiddens. Now we have something that is looking at both what the output RNN has previously done (and seen), and some context information from the input RNN.

A nonlinearity (tanh) is applied to the (I, M, C) sized piece, partly in order to bound the activation to (-1, 1) and partly just because we like nonlinearity. This (I, M, C) size thing then gets projected to (I, M, 1), or "alpha", then the useless dimension is dropped via reshape so we now have (I, M), then a softmax is applied over the first dimension I. This means that the network has effectively "weighted" each of the I timesteps of the input, though at the start of training it can't know what is relevant, at the end of training this is the thing you visualize.

This (I, M) is the actual "attention" - to apply it you simply take the original hidden states from the input BiRNN and use broadcasting to multiply again. (I, M, 1) * (I, M, H_in) gives a weighting over every timestep of the input, and finally we sum over the 0 axis (I) to get the final "context" of just (M, H_in) size. This can then be projected to a new size (H_out) and combined with the output hiddens to get an RNN that has "seen" a weighted sum of the input, so it can generate conditioned on the whole input sequence.

Note that this whole messy procedure I described happens per timestep - so the neural network (given enough data) can learn what to look at in the whole input in order to generate the best output. Other attentions such as [2],[3] are constrained either to only move forward or to be a hybrid of forward movement and (pooled) global lookup. [4] is a good summary of the different possibilities in use today.
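The shape bookkeeping above can be sketched in plain NumPy. This is only an illustration of one decoder timestep under the shapes listed in the summary, not the code from [1]: the weight names (W_ctx, W_dec, v), the concrete sizes, and the random initialisation are all assumptions, and masking of padded positions is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes are arbitrary; the letters match the short summary above.
I, M, H_in, H_out, C = 7, 3, 5, 4, 6

# Hypothetical parameters: random stand-ins for weights that would be learned.
W_ctx = rng.normal(size=(H_in, C))   # projects BiRNN hiddens to attention size
W_dec = rng.normal(size=(H_out, C))  # projects decoder hidden to attention size
v = rng.normal(size=(C, 1))          # projects attention features to one score

def attention_step(enc_hiddens, dec_hidden):
    """One decoder timestep of additive attention.

    enc_hiddens: (I, M, H_in) hidden states of the input BiRNN
    dec_hidden:  (M, H_out)   current decoder hidden state
    returns:     alpha (I, M) and context (M, H_in)
    """
    proj_ctx = enc_hiddens @ W_ctx                        # (I, M, C)
    proj_dec = dec_hidden @ W_dec                         # (M, C)
    # Broadcast the (1, M, C) decoder projection over the I input steps.
    combined = np.tanh(proj_ctx + proj_dec[None, :, :])   # (I, M, C)
    scores = (combined @ v)[:, :, 0]                      # (I, M, 1) -> (I, M)
    # Softmax over the input axis I (shifted by the max for stability).
    scores = scores - scores.max(axis=0, keepdims=True)
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=0, keepdims=True)
    # Weighted sum of the original hiddens over axis 0.
    context = (alpha[:, :, None] * enc_hiddens).sum(axis=0)  # (M, H_in)
    return alpha, context

enc_hiddens = rng.normal(size=(I, M, H_in))
dec_hidden = rng.normal(size=(M, H_out))
alpha, context = attention_step(enc_hiddens, dec_hidden)
```

Nothing in attention_step depends on a fixed I: the learned weights only touch the feature axes, so the same parameters work for a 10-word or an 80-word input.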

[1] https://github.com/kyunghyuncho/dl4mt-material/blob/master/s...

[2] http://arxiv.org/abs/1308.0850

[3] http://arxiv.org/abs/1508.04395

[4] PPT, http://www.thespermwhale.com/jaseweston/ram/slides/session2/...

Thanks. That was the info I was looking for. Still a lot to digest but it is going to be really helpful.

Usually zero padding is used; a max_input_length is set somewhere in the code and a number of zeros equal to (max_input_length - actual_number_of_words_in_input) is appended to the array of input word_ids so that all input sentences have the same length.

This depends on implementation (unrolled RNNs vs true recurrence). Each minibatch needs to be the same length, but that is it. And that is even implementation dependent - if your core RNN had a special symbol for "EOS" it could always handle it in another way.

Normally you pad each minibatch to the same length (length of the longest sequence in that minibatch), then carry around an additional mask to zero out any "unnecessary" results from padding for the shorter sequences.
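A minimal sketch of per-minibatch padding plus a mask, in NumPy. The helper names (pad_minibatch, masked_softmax), the pad id of 0, and the (I, M) time-major layout are assumptions for illustration; applying the mask inside a softmax is just one place it can be used.

```python
import numpy as np

def pad_minibatch(sequences, pad_id=0):
    """Pad lists of word ids to the longest length in this minibatch.

    Returns ids of shape (I, M) and a float mask of the same shape, where
    mask[t, m] is 1.0 for a real token and 0.0 for padding.
    """
    max_len = max(len(seq) for seq in sequences)
    ids = np.full((max_len, len(sequences)), pad_id, dtype=np.int64)
    mask = np.zeros((max_len, len(sequences)), dtype=np.float32)
    for m, seq in enumerate(sequences):
        ids[:len(seq), m] = seq
        mask[:len(seq), m] = 1.0
    return ids, mask

def masked_softmax(scores, mask):
    """Softmax over axis 0 that gives padded positions zero weight."""
    scores = scores - scores.max(axis=0, keepdims=True)
    weights = np.exp(scores) * mask          # zero out padded timesteps
    return weights / weights.sum(axis=0, keepdims=True)

batch = [[4, 9, 2], [7, 1], [3, 8, 5, 6]]    # three sequences of word ids
ids, mask = pad_minibatch(batch)
# Uniform scores, so each sequence spreads its weight over real tokens only.
weights = masked_softmax(np.zeros(ids.shape), mask)
```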

The BiRNN (using all hidden states) + attention mechanism is the thing that allows variable length context to be fed to the generative decode RNN. A regular RNN (using all hidden states) + attention, or even just the last hidden state of an RNN can all be used to map variable length to fixed length sequences in order to condition the output generator.

You will note that even padding to the length of the longest sequence in a minibatch wastes computation - people often sort and shuffle the input so that sequences of approximately the same length are used in each minibatch, to minimize the computation wasted on padding. If you padded to the overall longest sequence (rather than per minibatch), you would pay a massive overhead computationally.
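One way to sketch the sort-then-shuffle idea is below. The function names and the toy data are made up for illustration, and real pipelines often bucket more loosely than a full sort.

```python
import random

def length_bucketed_batches(sequences, batch_size, seed=0):
    """Group sequences of similar length into minibatches, then shuffle batches."""
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    random.Random(seed).shuffle(batches)     # shuffle batch order, not contents
    return [[sequences[i] for i in batch] for batch in batches]

def padding_cost(batches):
    """Count padded positions when each batch is padded to its own longest item."""
    return sum(
        max(len(s) for s in batch) * len(batch) - sum(len(s) for s in batch)
        for batch in batches
    )

# Toy data: four short and four long sequences, interleaved.
data = [[0] * n for n in (2, 40, 3, 38, 5, 41, 4, 39)]
naive = [data[i:i + 4] for i in range(0, len(data), 4)]   # batches in input order
bucketed = length_bucketed_batches(data, batch_size=4)
```

Comparing padding_cost(naive) with padding_cost(bucketed) shows how much padding the grouping avoids on this toy data.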

Guys thanks both for the replies. I still do not understand the inputs and outputs of the attention mechanism neural network.

The paper says

e_ij = a(s_{i-1}, h_j) is an alignment model which scores how well the inputs around position j and the output at position i match. The score is based on the RNN hidden state s_{i-1} (just before emitting y_i) and the j-th annotation h_j of the input sentence.

We parametrize the alignment model a as a feedforward neural network which is jointly trained with all the other components of the proposed system.

So a is a feed-forward NN. What is the input to this NN? Do we input each h_j corresponding to each word of the input sentence separately and get one number for each word? If the input sentence has 200 words, do I run this 200 times, once for each input word?
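To make the question concrete, here is a sketch assuming the single-hidden-layer additive form e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j). The sizes and random weights are placeholders. Under that assumption, scoring each of the 200 words one at a time and scoring them all in a single matrix operation give the same 200 numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_dec, n_enc, n_att = 200, 4, 6, 5   # 200 input words; other sizes arbitrary

# Placeholder weights for the alignment model a (random, not trained).
W_a = rng.normal(size=(n_att, n_dec))
U_a = rng.normal(size=(n_att, n_enc))
v_a = rng.normal(size=n_att)

s_prev = rng.normal(size=n_dec)         # s_{i-1}, the previous decoder state
h = rng.normal(size=(T, n_enc))         # annotations h_1 .. h_T, one per word

# Reading 1: call the small network once per input word.
e_loop = np.array(
    [v_a @ np.tanh(W_a @ s_prev + U_a @ h[j]) for j in range(T)]
)

# Reading 2: the same computation for all words at once.
e_vec = np.tanh(s_prev @ W_a.T + h @ U_a.T) @ v_a   # shape (T,)
```

Either way there is one score per input word for the current output position, and the scores are recomputed for every output position i because s_{i-1} changes.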

Out of curiosity, is this work being submitted for a conference or journal?

Also have you tried to run DMN on the Text QA / Reading comprehension datasets from DeepMind?

Teaching Machines to Read and Comprehend https://github.com/deepmind/rc-data

It's fascinating.

Could anybody say which programming language is most likely to be used for developing these kinds of systems? I haven't found information about it in the paper.

Python and Lua both have good frameworks for developing such models (TensorFlow, Theano possibly with Lasagne or Keras, Torch, Caffe...).

The visual part needs access to an efficient GPU-based implementation of convolutions (typically using cuDNN from NVIDIA or neon from Nervana Systems).

You could write bindings for other languages or even work directly in C++, but I think that an interactive programming language with a REPL, like Python or Lua, is very helpful for faster iterative model building, interactive exploration and evaluation.

Thank you for the detailed answer. I didn't know about the frameworks you listed except TensorFlow.

Btw, do you know anybody who uses Julia for this?

There is https://github.com/pluskid/Mocha.jl but it seems more focused on vision tasks at least for now.

Python seems to be the go-to language for anything AI/machine learning.
