
Visual question answering using CNN+RNN - abhshkdz
https://github.com/abhshkdz/neural-vqa
======
n0us
Amazing. I only had a chance to read the README.md but my question is this.
What happens if you ask it questions that it could not possibly answer, as in
if it were given a picture of the man playing tennis and you asked it what the
score was? Is it capable of discerning between questions that cannot be
answered (given a particular input) and those that can?

~~~
abhshkdz
Priors from the language play a much bigger role in the answers that are
predicted than the image itself. So for example, if you ask 'What color is
...?', irrespective of the image, it is more likely to spit out colors as the
answer. The answers are usually well-aligned with the question that is being
asked. 'Yes/no' for binary questions, 'red/blue/etc' for 'What color...',
'tennis/baseball/etc' for 'What sport...' and so on.
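
(The answer comes from a softmax over a small fixed answer vocabulary, computed from fused image and question features, which is why the question's phrasing shapes the candidate answers so strongly. A minimal numpy sketch of that scoring step, with illustrative names and sizes, not the repo's actual Torch code:)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy answer vocabulary; the real model uses the most frequent dataset answers.
ANSWERS = ["yes", "no", "red", "blue", "tennis", "baseball"]

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def answer_distribution(img_feat, q_feat, W_img, W_q, b):
    # Fuse image and question representations linearly, then score each answer.
    logits = W_img @ img_feat + W_q @ q_feat + b
    return softmax(logits)

# Illustrative dimensions: a 4096-d CNN image feature and a 512-d
# question encoding (e.g. the final LSTM hidden state).
img_feat = rng.standard_normal(4096)
q_feat = rng.standard_normal(512)
W_img = rng.standard_normal((len(ANSWERS), 4096)) * 0.01
W_q = rng.standard_normal((len(ANSWERS), 512)) * 0.01
b = np.zeros(len(ANSWERS))

p = answer_distribution(img_feat, q_feat, W_img, W_q, b)
predicted = ANSWERS[int(np.argmax(p))]
```

Because the question encoding feeds directly into these logits, a question like 'What color ...?' can push probability mass onto color answers even before the image feature contributes much.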

------
KnightHawk3
Is there a catch to the effectiveness of this?

I haven't seen it before and it seems pretty magical.

~~~
abhshkdz
Although there is no catch, it's far from perfect and hardly magical. Its
accuracy goes up to ~55% on the VQA
([http://visualqa.org/](http://visualqa.org/)) dataset (which is short of
state-of-the-art by ~7%).

------
arocks
I have seen sites using captchas which ask such visual questions thinking that
only a human can answer them. This project really makes me doubt the
effectiveness of such techniques.

~~~
abhshkdz
As it stands currently, it's quite far off from cracking captchas. :-)

------
mrdrozdov
How do you measure accuracy? Is this a new baseline?

~~~
abhshkdz
Accuracy is measured as min((number of humans that provided that answer) / 3,
1), i.e. an answer counts as 100% accurate if at least 3 humans gave that
exact answer, as outlined here:
[http://visualqa.org/evaluation.html](http://visualqa.org/evaluation.html).
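
(A tiny sketch of that rule; note the official script also normalizes answers and averages the score over subsets of 9 of the 10 human annotations, which this sketch omits:)

```python
def vqa_accuracy(predicted, human_answers):
    """Core VQA accuracy rule: min(#humans who gave this answer / 3, 1)."""
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)
```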

No, it's not a new baseline; this model is from the NIPS 2015 paper by Ren et
al. ([http://arxiv.org/abs/1505.02074](http://arxiv.org/abs/1505.02074)).

