
Deep learning for visual question answering: demo with Keras code - fchollet
http://iamaaditya.github.io/2016/04/visual_question_answering_demo_notebook
======
iamaaditya
Even though this is "Question Answering", it is trained as a classification
model. Thus the model will try to come up with one of the top 1000 answers
it has seen during training. This certainly limits the range of possible
answers and sometimes returns very weird ones.

It is not for lack of trying that all the top papers in visual question
answering end up treating this as a classification task. Results are really
poor when it is framed as RNN generation, and extending it beyond the top 1000
answers does not yield any better results either: 87% of the questions in
training + validation are covered by the 1000 most frequent unique answers.
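
For readers curious what that classification formulation looks like, here is a
minimal sketch in the modern Keras functional API; the dimensions, layer sizes,
and 4096-d VGG-style image features are illustrative assumptions, not the exact
demo code:

    from keras.layers import Input, LSTM, Dense, concatenate
    from keras.models import Model

    # Illustrative dimensions: 4096-d CNN image features (e.g. VGG fc7),
    # questions padded to 30 tokens of 300-d word embeddings.
    image_input = Input(shape=(4096,))
    question_input = Input(shape=(30, 300))

    q = LSTM(512)(question_input)                   # encode the question
    v = Dense(512, activation='tanh')(image_input)  # project image features

    merged = concatenate([q, v])
    # "Answering" is just a softmax over the 1000 most frequent answers.
    answer = Dense(1000, activation='softmax')(merged)

    model = Model(inputs=[image_input, question_input], outputs=answer)
    model.compile(optimizer='adam', loss='categorical_crossentropy')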

The latest models have started using more complex forms of memory and
integrating the question vectors more tightly. One of the top models, DPPnet,
trains a separate matrix on the question vector (a chain of GRUs) to predict
weights that act on the image features; the idea is that different questions
make different areas of the image features more relevant. Yet another model,
DMN+ by MetaMind, uses a dynamic memory network, which they originally built
for textual question answering, but the extension to images works pretty well.
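
To make the DPPnet idea concrete, here is a toy numpy sketch of
question-conditioned ("dynamic") parameters; all names and dimensions are made
up, and the real paper additionally uses a hashing trick to keep the parameter
predictor tractable:

    import numpy as np

    # Toy version of the DPPnet idea: a learned matrix maps the question
    # encoding to the weights of a layer applied to the image features.
    rng = np.random.default_rng(0)
    d_q, d_img, d_out = 512, 4096, 256

    W_pred = rng.normal(scale=0.01, size=(d_out * d_img, d_q))  # parameter predictor
    q_vec = rng.normal(size=d_q)        # question encoding (e.g. last GRU state)
    img_feat = rng.normal(size=d_img)   # CNN image features

    # Predict a question-specific weight matrix, then apply it to the image.
    W_dyn = (W_pred @ q_vec).reshape(d_out, d_img)
    h = np.tanh(W_dyn @ img_feat)       # question-conditioned view of the image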

Surprisingly, the models that use visual attention are not the best, and I
think that is mostly because this kind of model requires even more data and
longer training. Just taking 10 different crops of the question image and
voting over their answers beats the attention models (based on the numbers
reported in these papers).
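
For concreteness, that voting trick could look something like the following;
the helper name is made up, and the model is assumed to be an answer classifier
like the one sketched earlier:

    import numpy as np

    # Hypothetical 10-crop voting: run the answer classifier on several
    # crops of the same image and average the answer distributions.
    def ten_crop_predict(model, crops, question):
        # crops: list of 10 preprocessed feature vectors, one per image crop
        probs = [model.predict([c[None, :], question[None, :]])[0] for c in crops]
        return np.mean(probs, axis=0)   # averaged distribution over answers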

Right now I am working on converting "End-To-End Memory Networks"
([http://arxiv.org/abs/1503.08895](http://arxiv.org/abs/1503.08895)) to this
task. I tried working on the Neural Turing Machine but I could not make it
work for this kind of task, though that was mostly due to my lack of in-depth
understanding of NTMs.
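
For anyone unfamiliar with that paper, a single memory "hop" of MemN2N,
adapted here to image regions as memory slots, can be sketched roughly like
this (the region-as-slot choice and all dimensions are my own assumptions):

    import numpy as np

    # Toy single "hop" of an end-to-end memory network (MemN2N), with one
    # memory slot per image region. Everything here is illustrative.
    rng = np.random.default_rng(0)
    d, n_slots = 512, 49                    # e.g. a 7x7 conv feature map

    memory = rng.normal(size=(n_slots, d))  # one slot per image region
    q = rng.normal(size=d)                  # question encoding

    # Soft attention over the memory slots, then a weighted readout.
    scores = memory @ q
    p = np.exp(scores - scores.max())
    p /= p.sum()                            # softmax over slots
    o = p @ memory                          # read vector
    q_next = q + o                          # input to the next hop / classifier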

Any feedback from you guys is welcome.

P.S. Thanks fchollet for writing Keras and for this post. Can't wait to try
Keras 1.0.

~~~
syllogism
Thanks for your work on this, I find the VQA task really interesting.

The classification-based approach is definitely the part I find unsatisfying
about this task. The problem, to me, is that it biases the learned models very
strongly towards the data that was collected for training and testing.

Has anyone tried outputting a vector from the model and using cosine
similarity to predict the nearest word/phrase/sentence, etc.? This seems to
work for non-visual QA.[1] Training is performed using noise contrastive
estimation. I've discussed this idea with the Virginia Tech team, but I
haven't had time to try it, and they seemed a little skeptical.[2]

[1] [https://cs.umd.edu/~miyyer/qblearn/](https://cs.umd.edu/~miyyer/qblearn/)

[2] [https://github.com/VT-vision-lab/VQA_LSTM_CNN/issues/14](https://github.com/VT-vision-lab/VQA_LSTM_CNN/issues/14)
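
A minimal sketch of that retrieval-style decoding (the embedding table and
helper are hypothetical, and the NCE training loop is omitted):

    import numpy as np

    # Hypothetical nearest-answer lookup: the model emits a vector and we
    # rank candidate answers by cosine similarity to their embeddings
    # (e.g. averaged word vectors for each answer phrase).
    def nearest_answers(pred_vec, answer_vecs, answers, k=5):
        a = answer_vecs / np.linalg.norm(answer_vecs, axis=1, keepdims=True)
        p = pred_vec / np.linalg.norm(pred_vec)
        sims = a @ p                        # cosine similarity to each candidate
        top = np.argsort(-sims)[:k]
        return [(answers[i], float(sims[i])) for i in top]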

~~~
iamaaditya
hi syllogism

Right now everyone doing this is highly focused on the competition and trying
to beat the numbers. For that purpose, they certainly want to stick to
predicting the top K answers.

For example, see this table:

    
    
      Model    Q+I [1]    Q+I+C [1]    ATT 1000    ATT Full
      Acc.     0.2678     0.2939       0.4838      0.4651
    

Here "ATT Full" means using all the answer words, and it performs worse than
"ATT 1000". Source: Chen, K., Wang, J., Chen, L. C., Gao, H., Xu, W., &
Nevatia, R. (2015). ABC-CNN: An Attention Based Convolutional Neural Network
for Visual Question Answering. arXiv preprint arXiv:1511.05960.

Once the competition is over (in roughly two months), there will be more focus
on the actual AI part, where generating the answers would be the right thing
to do. There are other papers that use an external knowledge base like
DBpedia; the "answer word" could certainly be picked up from there.

What you have suggested is a very interesting approach, and I am not aware of
any paper that has tried it. Certainly quite a few papers have tried to extend
NLP QA to visual QA, but with limited success (except the MetaMind people). I
will keep it on my list of ideas to try and will update you if I get some
results.

P.S.: Thank you for creating spaCy, I love it and I use it every day!

------
harperlee
There is a very weird behaviour here that (to me, at least) says a lot about
the (low) consistency of this method: if a sport-neutral question is asked
about the last image, the answer is:

    
    
        40.52 %  tennis
        28.45 %  soccer
    

But then "Are they playing soccer?" is asked, and the answer jumps to the
following:

    
    
        93.15 % yes
    

What would happen with tennis then? How can this make sense?

My only rationalization would be along the lines of, "Well, at least it
guessed it was a sport", but no person would answer these questions like
that...

~~~
iamaaditya
Hi harperlee,

I ran the question you asked:

"Are they playing tennis?"

and the following is the result:

        99.93 %  yes
        00.05 %  no
        000.0 %  right
        000.0 %  left
        000.0 %  black and white

~~~
harperlee
Thanks for adding more information! This is very interesting. So what is
limited in the responses is the ability to discriminate among the distinct
semantic components of the answers. The network is sure of a lot of things the
two sports have in common, but it can't communicate that tennis is perhaps
more likely than soccer...

