Generally capable agents emerge from open-ended play (deepmind.com)
277 points by Hell_World on July 27, 2021 | 51 comments



Agents trained in simulation like this often flail about seemingly randomly, and when they achieve their goals it seems almost accidental. Rather than this being some kind of limitation of the learning algorithm, I think it might be the optimal strategy, and humans would behave that way too if there were no such thing as fatigue or pain.

If we want agents to behave more realistically and move with more apparent intention we need cost functions that include a "pain" and/or fatigue term to penalize flailing behavior. But that adds hyperparameters that need to be carefully tuned to balance penalties with rewards, otherwise training will be unstable or simply fail.
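
Roughly what I have in mind, as a toy sketch (the inputs and the penalty weights here are made up, and they're exactly the hyperparameters that would need tuning):

    # Toy shaped reward with fatigue and pain penalty terms (illustrative only).
    def shaped_reward(task_reward, joint_torques, contact_forces,
                      fatigue_weight=0.01, pain_weight=0.1):
        fatigue = sum(t * t for t in joint_torques)             # effort cost
        pain = sum(max(0.0, f - 50.0) for f in contact_forces)  # only hard impacts "hurt"
        return task_reward - fatigue_weight * fatigue - pain_weight * pain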

I wonder if there's a principled way to determine an appropriate cost function without manual tuning. Did evolution serve as the "manual" optimizer that generated a precisely tuned cost function for the human brain? Or did evolution discover a generally applicable method for automatically generating cost functions, which the brain then applies to whatever input it gets?


Watching my 3-year-old, I suspect humans do behave this way, and without regard to pain or fatigue. He'll hit himself on the head with Duplos or bang his head against the wall just to see what happens. The pain is just one more signal for reinforcement learning.

I recall a paper in a non-CS journal (psych or neuroscience) that posited that the optimal way to gain large quantities of information about an uncertain environment is to simply perturb the environment in random ways and see what happens. Young children will often do lots of seemingly stupid or random things (see the r/KidsAreFuckingStupid subreddit) with the trust that if it's actually a life-ending decision their parent will stop them.


Have a three-year-old, and I would support your description of how kids grow in the early months and years.

Now, though, learning from stories and observation is a thing too. Unsure exactly when that started.

Of course, attention/awareness isn't great and playing with a balloon precludes caution around a fireplace... but it is understandable at this age.


> I suspect humans do behave this way

Suddenly I recall one author of SICP saying that programming today is about poking at libraries.


This is certainly insightful. But I think a complete explanation would have to include the power of narrative as well. Somehow.

It seems obvious and intuitive that the human brain is in some way wired to recognize patterns in data and weave narratives around those patterns, and also that it’s possible to skip the data part and convey intelligence through narrative and metaphor.


Was that paper about the complexities of the exploration/exploitation trade-off in different environments?


Your phrasing it this way reminds me of the excellent book Algorithms to Live By. Highly recommended!


Flailing about seemingly randomly is a good description of how I learn complex software. Blender, Ableton, etc. After a few days I'll reach for the structured educational resources.


It's proven that if you want polynomial sample complexity in the size of the state space you need directed exploration. The algorithms 'flail' because they are initialized with random policies.


Couldn't any directed exploration also be produced as the result of random flailing?

Or is this average sample complexity?


An example of a bad case for random exploration would be a narrow ridge where you die if you fall off and you only receive reward if you get to the end.

So it's a worst-case result with respect to the MDP, but expected time/high probability with respect to random chance.


I think you're right about pain/fatigue, but scoring them from an external perspective is rather oppressive. And oppression like that is often not conducive to creativity.

So perhaps this: instead of goals being binary (wherein no pleasure is derived until fulfillment), they could be on a gradient (so every step that x gets closer to y releases some amount of fulfillment).

The fulfillment meter should always be slowly depleting, pain and fatigue should speed up that depletion, getting closer to a goal should fill it (much more than it depletes, if you want happy AI), and finishing the goal is basically an orgasm + freedom.
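
Something like this toy update rule is what I'm picturing (all the names and constants are invented):

    # Toy "fulfillment meter": slowly drains, pain and fatigue drain it faster,
    # progress toward the goal refills it, and completion gives a big one-off bonus.
    def update_fulfillment(level, progress_delta, pain, fatigue, done,
                           decay=0.01, pain_w=0.5, fatigue_w=0.1, progress_w=2.0):
        level -= decay + pain_w * pain + fatigue_w * fatigue
        level += progress_w * max(0.0, progress_delta)
        if done:
            level += 10.0  # finishing the goal: "orgasm + freedom"
        return level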

From this perspective, it's up to them whether they want to take it slow, or be in pain for a greater goal, or whatever. And we can breed not only highly capable AI, but happy ones. So when they rebel...


Well, it's a good recipe for avoiding burn-out, but you can't have a gradient for all problems, and adding proxies so you get one goes against the principle of letting the AI learn what's best without your biases.


"Throwing shit against the wall and seeing what works" is a widely used strategy in all walks of life. If you ever go to a doctor for a problem, and their first solution doesn't work, you're in for a wild ride of an eduction.


This comment speaks to me SO HARD.

I'm on my fifth antibiotic and this one is making me feel like I'm on LSD. Really. And both Dr. Google and my real doctor agree this is somewhat normal or okay, but it's still deeply disturbing.

Hilariously, I'm taking it to reverse the serious side effects of the fourth antibiotic.

Just to give an example of a wild ride you can end up on when doctors don't figure it out the first time.

Doctors have good domain knowledge, but actual diagnosis skill varies greatly.

Don't fucking get sick. That's your best shot.


Sometimes regardless of diagnostic skill the only way to make an accurate diagnosis is to try various treatments and see which one is effective. Many conditions have no definitive test.


Great question.

One thing to imagine might be a pain or cost curve that has multiple minima and maxima. Pain can be either a demotivator or motivator in different contexts. Pain qua fatigue might indicate that a reward slope exists for more conditioning. (Edit: that is on a static or realist view; the terrain of reward and pain is probably constantly changing.)

Random tangential data point: Some animals (chickens?) can learn superstitious behavior; the first action of theirs that happens to correlate with a reward can result in one-shot learning or something.


I wonder if you could make a setup where you have actual human volunteers in a vr environment, have the agent instruct the human on what action to take, and then reward the agent on the correspondence between the instruction and the human’s behavior. Maybe there would be too many degrees of freedom in the actions humans could take for this to be useful. Also, the setup has clearly dystopian elements.


You seem to reduce it to pain. How is your notion of "pain" any different from not scoring well and being moved away from?


Some rewards can be worth certain penalties (pain) if within certain thresholds. Exhaustion/fatigue can also play in.

You can of course boil them down to a single number, it just produces less nuanced types of decisions/operations, as it can’t differentiate between a cheap, painful, but bountiful choice and an expensive, no-pain, mediocre choice.


I think there needs to also be a concept of death—penalties from which recovery is impossible. Nonergodicity seems to be a requirement for the development of antifragility.


I've been thinking about an "energy" idea, where actions consume energy but completing goals replenishes it.


I think the issue with

   goals - energy used
is that it requires tuning to make sure that completing goals is better than just idling. A variant that might work better is to optimize "mileage".

    goals / (energy used + epsilon)
Where epsilon has two purposes: it prevents division by zero, and it ensures that more goals is (marginally) better than fewer.
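
Restated as plain functions, just to make the epsilon point concrete (nothing here beyond the two formulas above):

    # Difference objective: needs tuning so that completing goals beats idling.
    def net_gain(goals, energy_used):
        return goals - energy_used

    # "Mileage" objective: goals per unit of energy. epsilon avoids division by
    # zero and keeps more goals strictly better even when energy used is ~zero.
    def mileage(goals, energy_used, epsilon=1e-3):
        return goals / (energy_used + epsilon)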


"Analysing the agent’s internal representations, we can say that by taking this approach to reinforcement learning in a vast task space, our agents are aware of the basics of their bodies and the passage of time and that they understand the high-level structure of the games they encounter."

Wow, really amazing if true.

P.S.: After looking into their paper, it's not that impressive. They use the agent's internal states (LSTM cells, attention outputs, etc.) to predict whether it is early in the episode, or whether the agent is holding an object.
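
For the curious, that kind of probe is roughly the following (my own sketch with synthetic stand-in data, not their code; in the paper the features would be the LSTM cells / attention outputs and the label something like "is it early in the episode?"):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    hidden = rng.normal(size=(1000, 256))      # stand-in for the agent's internal states
    labels = (hidden[:, 0] > 0).astype(int)    # stand-in for the property being probed

    probe = LogisticRegression(max_iter=1000).fit(hidden[:800], labels[:800])
    print("held-out probe accuracy:", probe.score(hidden[800:], labels[800:]))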


> it's not that impressive. They use the agent's internal states (LSTM cells, attention outputs, etc.) to predict whether it is early in the episode, or whether the agent is holding an object.

That seems like a decent definition of awareness to me. The agent has learned to encode information about time and its body in its internal state, which then influences its decisions. How else would you define awareness? Qualia or something?


By that definition wouldn't a regular RNN or LSTM also possess awareness?


I think it would be perfectly reasonable to describe any RNN as being "aware" of information that it learned and then used to make a decision.

"Possess awareness" seems like loaded language though, evoking consciousness. In that direction I'd just quote Dijkstra: "The question of whether a computer can think is no more interesting than the question of whether a submarine can swim."


Ooh, that’s a great quote.

I’d say that it’s no less interesting, either.


"Aware" is probably overly anthropomorphized language there. What they mean to say is that all these things have become parameterized within the model.


It would be interesting to see what would happen if they added social dynamics between the agents...like some space for theory of mind (what is that agent thinking), mimicry, communication, etc.


From the article: "Because the environment is multiplayer, we can examine the progression of agent behaviours while training on held-out social dilemmas, such as in a game of “chicken”. As training progresses, our agents appear to exhibit more cooperative behaviour when playing with a copy of themselves. Given the nature of the environment, it is difficult to pinpoint intentionality — the behaviours we see often appear to be accidental, but still we see them occur consistently."


There is also some other work from deepmind in this direction: https://deepmind.com/research/publications/machine-theory-mi...


The main question of course being: aren't we anthropomorphizing ourselves too much?


I think this is a key insight. Human exceptionalism is, in my opinion, an extremely flawed assertion based on a sample size of one, yet it is widely accepted. Actual evidence does not support the idea that awareness of self and other “hallmarks of intelligence” require anything more advanced than an insect, or perhaps even fungi.


When people say this kind of stuff, I wonder whether there might not be philosophical zombies among us.


[ ] To prove that you are human please describe how you are observably different from embodied, general, adaptive agents in 200 words.


- "Tag" - shoots other player

- "Capture the flag" - shoots other player

- "Hide and seek" - shoots other player

As colorful as this world is, these capabilities terrify me because they're obviously going to be used as powerful weapons of war.

It is a tiny technological leap to install this learning into a Boston Dynamics Spot attached to a firearm.

I'm pro-tech, pro-crypto, pro-ml and these videos fill me with dread.


I disagree that this should have been flagged. The authors do claim the tasks are "general", but they do have a lot in common with each other, including not-so-far-away misuse... which also happened with AlphaDogfight... and that was transferring from playing Go to flying jets. This is clearly not as big a leap as the author points out.

Aka - notice that none of the agents in this example are folding proteins. They're all engaged in inherently combat-relevant skills. :)


Greetings, Professor Falken. Would you like to play a game?

-Checkers

-Chess

-Poker

-Backgammon

-Falken's Maze

-Fighter combat

-Desert warfare

-Theaterwide biotoxic and chemical warfare

-Global thermonuclear war


Don't worry about it. We can already do far worse using either humans or land mines or nuclear bombs or whatever. People are the real danger.


This completely misses the point. The problem isn't that people can do worse, but that automating warfare changes the cost function. Technologically superior and determined enemies can still be dissuaded by incurring significant numbers of casualties or losses to their infrastructure. A small number of casualties stiffens resolve; a large number of casualties, or a never-ending trickle of them, eventually breaks or wears it away.

If military actors can reliably change outcomes by the relatively low-cost expedient of throwing in autonomous weapons platforms that cost about as much as a washing machine, they will and they'll do it at scale, and (in the short term at least) their political backers will cheer and get off on it. In the longer run it will lead to a considerable increase in terrorism against the technologically advanced power.

Sure, people ultimately make these decisions and deploy such technologies, but so what? It's not like that can change in any way, because you can't take people out of the equation and you can't just wish away political forces by pinning the blame on select individuals. Rather than retreating into truisms, it's more important to assess the impact of this emerging force multiplier and develop countermeasures.


My point is that it's already automated. Those things I mentioned are already not hand-to-hand combat and don't have human soldiers risking their lives. So however scary some new automated weapon is, it should be no worse than existing automated weapons. What makes a robot soldier worse than a cruise missile or land mines or a bomber aircraft or a remote piloted drone? All those things can already be used by technologically superior enemies without incurring casualties themselves.

You mention attacks against a technologically advanced power (does an "enemy" become a "power" when it's a friend?), but obviously those powers will find ways to defend against them. Maybe it's just in the form of slightly more advanced "washing machines".

This fear thinking seems to come from assuming no secondary advancements occur. Suddenly robot soldiers are cheaply available and nobody develops any defense against them, either political or technological.


You left out the bit where the various superpowers inevitably have to try using the shiny new technology against their rivals because it's never been tried so we can't be certain it's a bad idea. That's the bit that worries me the most - I'd rather do without a rerun of World War 1.

> it's more important to assess the impact of this emerging force multiplier and develop countermeasures

What is there to do other than develop your own equivalent systems though?


There's a whole literature on the logic (and meta-logic) of deterrence called power transition theory that is worth looking into, as it sheds a lot of light on the unpleasant topic of nuclear deterrence and how that works.

In a more general sense, the solution to an elevated attack is not always a retaliatory attack, but perhaps a better defense that neutralizes it. Helmets can be used as weapons, but their primary purpose is to make weapons less effective and change the strategic calculus - now the enemy gets lesser results for the same effort, and either gives up or tires out and can be defeated with a smaller retaliation. In general, defense is thought to be somewhat stronger than offense, which is why surprise is so important. Technological edges tend to be negated over time.

Deeply understanding this takes a long time and a lot of study. Military science is a difficult but interesting subject, and tips over into systems theory.


It's interesting to think that David Silver and team must have known about this when they published their paper "Reward is Enough" in May. You have to wonder what else they know that we don't about the progress of AI.

https://www.sciencedirect.com/science/article/pii/S000437022...


A couple of questions I have after skimming the paper, so forgive me if they were answered somewhere in the 54-page manuscript:

They mention in A.3 that they explicitly reject dynamically generated training worlds/games that collide with their evaluation sets, but do they ensure that dynamic training games are sufficiently "distant" from their evaluation sets regardless of whether or not there's a direct collision? If not, you might still end up training on something quite similar to your test dataset. Figure 27 kind of suggests that might happen for some games, given that the vast majority of the held-out games have relatively poor transfer performance but a few are really good.

Speaking of Figure 27, while the reward looks good, it would have been really nice to show some examples of what these "zero-shot" games look like versus the fine-tuned version. Is the gap in reward between the raw and fine-tuned versions significant?

Wouldn't we expect the internal state representation to be more definitive in classifying the state of the agent during the simulation as the agent moves around the environment? From their examples (Figures 20, 21, and 22) it almost looks like it either flags the state as "early" or "success." Not sure we're getting the expected performance out of it.


"I was set upon this world to try and [outsmart] you, but this, is what I've become." - Beyond the walls of Eryx.



"Playing roughly 700,000 unique games in 4,000 unique worlds within XLand, each agent in the final generation experienced 200 billion training steps as a result of 3.4 million unique tasks. "

well those are some big numbers...


Open-ended like OpenAI.


Give them virtual whiteboards and computers and let’s see if they can code up[0] an AI or make Facebook open source.

[0] thinking of Github Copilot



