
Toward a deeper understanding of the way AI agents see things - muse900
https://code.fb.com/ai-research/ai-agents-see/
======
sgt101
I've been playing with the MAC network demo
([https://github.com/KnetML/MAC-Network](https://github.com/KnetML/MAC-Network)
for the Julia version), and this study chimes with my experience of it. I was
really impressed when I first saw it, but getting it running and experimenting
with it changed my thinking.

The dictionary shows that the numbers in the response system are labels. It's
hard to evaluate the spatial reasoning, and I (and other people looking at the
demo) tend to read "near" misses as good efforts rather than as mislabelling.

The images in CLEVR are artificial, the mappings are the grammar from which
the scene graphs are generated, and these graphs are what is actually being
learned and mapped via the images.

There are also systematic issues: large metal spheres generate reflections
that then deceive the classifier, metallic cylinders are often classified as
spheres, and blocks that don't present a "square" aspect are often ignored.

The big issue for me is that what seems like a significant capability that
might map into a useful tool turns out to be essentially an "exploit" of my
psychology; even a small step towards a "mind" in a machine ends up looking
to me like yet another set of complex transformations over the intents of the
humans that built it.

~~~
chriskanan
How well does deep learning work in Julia and how mature are the toolboxes?

There is a new VQA model for CLEVR published at Neural Information Processing
Systems this year that gets better results than MAC, but they used
specialized visual features. They didn't really explore the impact of using
those, but I wonder if some of their performance is due to using them to
overcome the systematic issues you brought up.

~~~
sgt101
Interesting - I'll have a look for that paper.

There are some good DL packages in Julia - Knet is very elegant, but I have
to admit (as a Julia fan) that there are challenges in getting things to run
properly. I had a long battle with the linker to get the MAC demos working
properly - resolved with a single line of code - but still... I used the
MXNet wrapper to build an LSTM over the summer as well, and that was much
smoother (the network itself didn't perform well on the task I had for it,
but that's another issue).

------
zamalek
> without determining, for example, that pictures of a Boston terrier and a
> Chihuahua both represent dogs.

I wonder if that is an artifact of how the training occurs. If you look at how
humans generally learn, they learn what a dog is _first._ For many years they
don't learn about breed (possibly ever). Cats and dogs eventually earn the
shared labels "pet" and "animal".

However, at least in a heterosexual family unit, we learn "mama" and "dada"
before "woman", "man" and "person". Eventually, though, we learn to apply
many labels to people: "name", "gender", "species" and other, possibly
horrible, things.

In order for networks to share representations across hierarchical labels,
assuming that humans are doing nothing novel in this problem space, they
would have to do one of the following:

* Learn general labels first.

* Provide hierarchical labels as output.

* Provide multiple labels as output ("Labrador", "dog", "animal") - see the
sketch below.

As a guess.
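
A minimal sketch of the third option, assuming a PyTorch-style setup (the
label set and dimensions are invented for illustration, not taken from the
article): treat the output as a multi-hot vector over every level of the
hierarchy and train with an independent binary loss per label.

    import torch
    import torch.nn as nn

    # Hypothetical label vocabulary spanning all hierarchy levels.
    LABELS = ["labrador", "chihuahua", "dog", "cat", "pet", "animal"]

    class MultiLabelNet(nn.Module):
        """Toy head that emits one logit per label, so an image can be
        'labrador' AND 'dog' AND 'animal' at the same time."""
        def __init__(self, in_features=512, n_labels=len(LABELS)):
            super().__init__()
            self.head = nn.Linear(in_features, n_labels)

        def forward(self, x):
            return self.head(x)  # raw logits; sigmoid lives in the loss

    model = MultiLabelNet()
    loss_fn = nn.BCEWithLogitsLoss()  # per-label binary cross-entropy

    # One fake feature vector labelled {labrador, dog, pet, animal}.
    features = torch.randn(1, 512)
    target = torch.tensor([[1., 0., 1., 0., 1., 1.]])

    loss_fn(model(features), target).backward()

With a sigmoid per label instead of a single softmax, "dog" and "labrador"
stop competing for probability mass, which is one way shared structure
between general and specific labels could be forced into the network.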

------
En_gr_Student
I'm not sure that distrust of the similarity between images of noise is
warranted. They are pseudo-random, not purely random: each image is not a set
of independent samples but a pile of sequential draws from a generator,
perhaps 16384 in a row. The NSA is thought to have shortcuts for quickly
short-circuiting encryption, and is alleged to have salted public methods so
as to make that job easier. Random-number generation and encryption are
related to each other: a properly encrypted chunk of data looks almost
exactly like pure random noise, as does the output of a good pseudo-random
number generator. I would not be surprised if there were similarities that
the mathematical methods find and the human eyeball does not.
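
As a concrete illustration of how strong the dependence between sequential
draws can be, here is the classic RANDU generator (my choice of example; the
article doesn't say what generator was used). Every output is an exact linear
combination of the two before it, something no eyeball would spot in the
rendered noise.

    # RANDU: the infamous IBM LCG, x_{k+1} = 65539 * x_k mod 2^31.
    # Since 65539 = 2^16 + 3, consecutive triples satisfy
    # x_{k+2} = 6*x_{k+1} - 9*x_k (mod 2^31) exactly.
    M = 2**31

    def randu(seed, n):
        x, out = seed, []
        for _ in range(n):
            x = (65539 * x) % M
            out.append(x)
        return out

    xs = randu(seed=1, n=16384)  # one "noise image" worth of draws
    violations = sum(
        (xs[k + 2] - 6 * xs[k + 1] + 9 * xs[k]) % M != 0
        for k in range(len(xs) - 2)
    )
    print(violations)  # 0: every draw is determined by the previous two

Modern generators hide this sort of structure far better, but the draws are
still deterministic functions of one another.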

This feels like a minimum-description-length problem. I think that if the
agents had to use hierarchical descriptors, thinking of a cat as some
assembly of tail, legs, body, head, eyes, ears, mouth, and all, an internal
hierarchy would show up in the communication, and a divergence between the
trained hierarchy and the communicated one would give a sharper contrast for
showing an inferred structure.
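
A toy of what such a hierarchical descriptor might look like (entirely my own
construction; the part names are invented), where the description length of a
concept is the number of primitive parts it expands into:

    # Hypothetical hierarchical descriptors: each concept is either a
    # primitive part or an assembly of sub-concepts.
    HIERARCHY = {
        "cat":  ["head", "body", "legs", "tail"],
        "head": ["eyes", "ears", "mouth"],
    }

    def expand(concept):
        """Flatten a concept into the primitive parts it is built from."""
        parts = HIERARCHY.get(concept)
        if parts is None:  # primitive part: nothing to expand
            return [concept]
        out = []
        for p in parts:
            out.extend(expand(p))
        return out

    print(expand("cat"))       # ['eyes', 'ears', 'mouth', 'body', ...]
    print(len(expand("cat")))  # 6, a crude description length

Comparing the hierarchy the agents actually emit against the one implied by
the training data would then be a structural diff rather than a guess from
flat labels.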

------
moneil971
Like many studies, in retrospect it makes sense that the agents would go to
the simplest method, 1:1 comparison, rather than "learning" what is actually
depicted in the image. It's like the difference between how your brain
tackles those spot-the-difference games with two nearly identical photos and
a game where you try to name all the distinct items shown in an image.
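
A sketch of why that shortcut pays off (my own toy, not from the study): a
plain per-pixel comparison localizes the discrepancy between two nearly
identical images with no notion of what any pixel depicts.

    import numpy as np

    # Two hypothetical 64x64 greyscale "photos" differing in one patch.
    rng = np.random.default_rng(0)
    img_a = rng.random((64, 64))
    img_b = img_a.copy()
    img_b[10:14, 20:24] += 0.5  # the planted "difference"

    # 1:1 comparison: subtract and look for where they disagree.
    diff = np.abs(img_a - img_b)
    ys, xs = np.nonzero(diff > 1e-6)
    print(ys.min(), ys.max(), xs.min(), xs.max())  # 10 13 20 23

Nothing in that computation knows what an object is, which is exactly why
scoring well at it says so little about "seeing".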

