Imagine how you would classify even a simple concept such as "riding": a human could be riding a horse, a monkey could be riding an elephant, and there are plenty of other cases. Riding would be much simpler to detect with a classifier that selects "objects used for rides" and "agents that can ride" and then checks the spatial relation between them. That makes it a higher-order concept that we can't simply learn from images; we need preliminary abstractions that help in classifying it.
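To make the "riding" idea concrete, here is a minimal sketch of composing it from simpler concepts. Everything in it is an assumption for illustration: the `CAN_RIDE` and `RIDEABLE` sets stand in for learned classifiers, and `Detection` and `is_on_top_of` are hypothetical names with a deliberately crude geometric test.

```python
from dataclasses import dataclass

# Hypothetical concept sets; a real system would learn these as classifiers.
CAN_RIDE = {"human", "monkey"}
RIDEABLE = {"horse", "elephant", "bicycle"}


@dataclass(frozen=True)
class Detection:
    label: str
    x: float      # horizontal center of the detected object
    y: float      # vertical center; larger y means lower in the image
    width: float


def is_on_top_of(upper: Detection, lower: Detection) -> bool:
    """Crude spatial relation: horizontally overlapping and higher in the image."""
    overlaps = abs(upper.x - lower.x) < (upper.width + lower.width) / 2
    return overlaps and upper.y < lower.y


def is_riding(agent: Detection, mount: Detection) -> bool:
    """Higher-order concept composed from two simpler concepts and one relation."""
    return (
        agent.label in CAN_RIDE
        and mount.label in RIDEABLE
        and is_on_top_of(agent, mount)
    )


monkey = Detection("monkey", x=5.0, y=2.0, width=1.0)
elephant = Detection("elephant", x=5.2, y=4.0, width=3.0)
print(is_riding(monkey, elephant))
```

The point is only that `is_riding` never looks at pixels; it needs no riding-specific training data once the lower-level concepts exist.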
I think this is the future of AI: higher-order concepts based on compositions of previously known concepts. Both language and the physical world are made of objects and relations between objects, so it will be necessary to learn to combine concepts into new concepts, even when training data is very scarce. That would solve the problem of sharing knowledge between tasks. Another benefit would be the ease of inspecting the internal state of the system, which would be a graph of language-based concepts, unlike the inscrutable internal states of neural nets. An agent that has higher-order abstractions and an object-relations graph would also be programmable in plain language and capable of reasoning over facts, which would make AI accessible to the public at large.
Another way of putting it is that up until now we have used plain vectors as if they were untyped data, but now we need to operate over strongly typed vectors with higher-order operators. We need to bring type theory into neural nets, to apply type constraints and to convert from one type to another by applying operations. Such operations are hard to learn directly from labeled sets of images.
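A rough sketch of what "typed vectors with operators" could look like, under heavy assumptions: `TypedVector` and `Operator` are hypothetical names, the type tags are plain strings, and the lambda stands in for whatever learned function a real network would apply.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class TypedVector:
    type_name: str              # the "type" attached to the vector
    values: Tuple[float, ...]   # the vector itself


@dataclass(frozen=True)
class Operator:
    name: str
    input_types: Tuple[str, ...]
    output_type: str
    fn: Callable[..., Tuple[float, ...]]  # placeholder for a learned function

    def __call__(self, *args: TypedVector) -> TypedVector:
        # Enforce the type constraint before applying the operation.
        actual = tuple(a.type_name for a in args)
        if actual != self.input_types:
            raise TypeError(f"{self.name} expects {self.input_types}, got {actual}")
        return TypedVector(self.output_type, self.fn(*(a.values for a in args)))


# Hypothetical operator converting (Agent, Object) into a Relation vector.
relate = Operator(
    name="relate",
    input_types=("Agent", "Object"),
    output_type="Relation",
    fn=lambda a, o: tuple(x - y for x, y in zip(a, o)),
)

agent = TypedVector("Agent", (1.0, 2.0))
obj = TypedVector("Object", (0.5, 0.5))
relation = relate(agent, obj)
print(relation.type_name)
```

Calling `relate(obj, agent)` raises a `TypeError`, which is the kind of constraint an untyped vector pipeline cannot express.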
Kind of like how computers work, except the system learns the code and reasons about data structures automatically.
How hard would it be to make the transition to more complex games, like Zelda or Mario for the NES?
But I still like the idea. On x86, 'INT' is just a human-readable label for 0xCD, and we instruct our processor to do a series of tasks. The difference is that on an x86 we specify the instructions and we get the reward (yay). Here you specify the meta-instructions and the agent is rewarded for completing those. The Atari game's end-goal reward is sort of irrelevant to the agent; its real goal is completing each meta-instruction we have given it. Only the instruction giver knows the end goal and the sequence of instructions to give.
Perhaps you could teach a second agent to be the instruction giver, with the first agent as the instruction performer (rewarded for completing subtasks only), and together they solve the problem. The job of the first agent still seems intractable for a game like Montezuma's Revenge.
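A toy sketch of that giver/performer split, just to show the reward structure. All of it is assumed for illustration: the one-dimensional "game", the subtask names, and `performer_step`, which is a hand-written stand-in for a learned policy. The giver here is simply a fixed list of instructions.

```python
from typing import Callable, Dict, List

State = Dict[str, int]

# Hypothetical meta-instructions: each name maps to a completion test on the state.
SUBTASKS: Dict[str, Callable[[State], bool]] = {
    "reach_key": lambda s: s["pos"] == s["key_pos"],
    "reach_door": lambda s: s["pos"] == s["door_pos"],
}

# Hypothetical lookup of which state field each instruction is aiming at.
TARGETS: Dict[str, str] = {"reach_key": "key_pos", "reach_door": "door_pos"}


def performer_step(state: State, instruction: str) -> State:
    """Stand-in for a learned policy: move one cell toward the target."""
    target = state[TARGETS[instruction]]
    step = (target > state["pos"]) - (target < state["pos"])
    return {**state, "pos": state["pos"] + step}


def run_episode(state: State, plan: List[str], max_steps: int = 50) -> int:
    """Return the performer's reward: +1 per completed instruction.

    The game's own score is never consulted; only the giver's plan encodes it.
    """
    reward = 0
    for instruction in plan:
        done = SUBTASKS[instruction]
        steps = 0
        while not done(state) and steps < max_steps:
            state = performer_step(state, instruction)
            steps += 1
        if not done(state):
            break  # the performer failed, so the rest of the plan is abandoned
        reward += 1
    return reward


start = {"pos": 0, "key_pos": 3, "door_pos": 7}
print(run_episode(start, ["reach_key", "reach_door"]))
```

The hard parts are exactly the ones this sketch skips: learning `performer_step` from pixels and learning which plan the giver should emit.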
The real natural language work will be using actual human descriptive sentences to create the subtasks, without hand-made templates.
