The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games (bair.berkeley.edu)
67 points by jonbaer on July 14, 2021 | 21 comments



PPO = Proximal Policy Optimization

[https://openai.com/blog/openai-baselines-ppo/]
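For anyone who wants more than the acronym: the core of PPO is a clipped surrogate objective that keeps each policy update close to the policy that collected the data. A minimal sketch in PyTorch (variable names are mine, not taken from the baselines repo):

    import torch

    def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
        # Probability ratio between the updated policy and the one that collected the data.
        ratio = torch.exp(new_log_probs - old_log_probs)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        # Take the pessimistic minimum so the policy gets no credit for moving
        # further than the clip range allows; negate because optimizers minimize.
        return -torch.min(unclipped, clipped).mean()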



The hero we need.


Thank you!


Indeed. I looked for the definition on the whole page but couldn't find it. Even Googling initially failed. https://arxiv.org/abs/1707.06347


Yeah, did the same, then looked at the linked article. Its abstract:

     Proximal Policy Optimization (PPO) is a popular on-policy reinforcement learning algorithm but is significantly less utilized than off-policy learning algorithms in multi-agent settings. This is often due to the belief that on-policy methods are significantly less sample efficient than their off-policy counterparts in multi-agent problems. In this work, we investigate Multi-Agent PPO (MAPPO), a variant of PPO which is specialized for multi-agent settings. Using a 1-GPU desktop, we show that MAPPO achieves surprisingly strong performance in three popular multi-agent testbeds: the particle-world environments, the Starcraft multi-agent challenge, and the Hanabi challenge, with minimal hyperparameter tuning and without any domain-specific algorithmic modifications or architectures. In the majority of environments, we find that compared to off-policy baselines, MAPPO achieves strong results while exhibiting comparable sample efficiency. Finally, through ablation studies, we present the implementation and algorithmic factors which are most influential to MAPPO's practical performance.
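In case it saves anyone a read: roughly speaking, MAPPO is PPO where all agents share one actor network over their local observations, plus a centralized critic that sees the global state during training (execution stays decentralized). A rough PyTorch sketch of those two pieces, with names of my own invention rather than anything from the authors' code:

    import torch
    import torch.nn as nn

    class SharedActor(nn.Module):
        # One policy, shared by all agents, conditioned on each agent's local observation.
        def __init__(self, obs_dim, n_actions, hidden=64):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                     nn.Linear(hidden, n_actions))

        def forward(self, obs):
            return torch.distributions.Categorical(logits=self.net(obs))

    class CentralizedCritic(nn.Module):
        # Value function conditioned on the global state (e.g. all observations concatenated).
        def __init__(self, state_dim, hidden=64):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                     nn.Linear(hidden, 1))

        def forward(self, state):
            return self.net(state).squeeze(-1)

Each agent's advantages come from the centralized critic, and the standard PPO clipped loss is then applied per agent to the shared actor.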


PPO is awesome, but so is GPT-style reward-trajectory prediction! http://arxiv.org/pdf/2106.01345v1.

As an RL hobbyist, I'd love to see some sort of hybrid approach. Thoughts?
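For anyone else wondering, that link is the Decision Transformer paper: instead of learning a value function, it treats RL as sequence modeling and conditions action prediction on the return you still want to collect. A tiny numpy sketch of the data-preparation step (purely illustrative):

    import numpy as np

    def returns_to_go(rewards):
        # Suffix sums: at each timestep, the total reward still to be collected.
        rewards = np.asarray(rewards, dtype=np.float64)
        return np.cumsum(rewards[::-1])[::-1]

    # A trajectory (s_1, a_1, r_1, ..., s_T, a_T, r_T) is re-encoded as the token
    # sequence (R_1, s_1, a_1, R_2, s_2, a_2, ...) with R_t the return-to-go at step t.
    # A GPT-style model is trained to predict a_t from the prefix; at test time you
    # feed in the return you would like to achieve and let the model act accordingly.
    print(returns_to_go([1, 0, 2, 1]))  # -> [4. 3. 3. 1.]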


Are these things amazingly effective or are they simply demonstrating that Starcraft/DOTA aren't as difficult as we thought?


Neither. Once knowledgeable people get a read on these types of things, they can usually handle it. The OpenAI Dota 2 "team" was open for the public to play -- it was certainly very good, but multiple teams beat it, sometimes even multiple times in a row. It was great at cheesy stuff like superhuman Force Staff plays that humans could never reliably pull off, but could be beaten through macro pressure.


I’m not sure this is true. Humans were unable to beat the bot when playing by the rules the bot trained on. The bot was only trained with a very specific hero pool, and special circumstances such as having a courier for each hero (highly unusual in competitive dota; basically unheard of) which it used to ferry health potions to itself constantly. The humans couldn't deal with this constant source of health regeneration.

It was only after the hero pool was expanded to additional heroes that the humans were able to win. But this is like losing because 86 new heroes were just released that day; it would be unfair, to say the least, if you had to face heroes you knew nothing about, especially when your opponents were experts.

OpenAI gave up working on it after they had proven that "if you spend $X amount of money, you can win at whatever rules you decide." Unfortunately the cost of $X is very high for this method, because it requires thousands (tens of thousands?) of simulations, each of which needs to happen in real time on an actual Dota client (i.e. it's running the game). The game rules also change continuously with new patch releases, so the training quickly becomes more or less obsolete, especially if the bot overfits on a particular ruleset.

That doesn't change that the bot can overfit on a specific ruleset, which I think is a positive statement: it's quite remarkable, really, that it's possible at all.


> it requires thousands (tens of thousands?) of simulations, each of which need to happen in real time on an actual dota client

I wonder how much faster the core Dota rules engine could be made if you didn't have to worry about rendering and weren't willing to settle for "good enough for real-time performance"?


I think GP was exactly right when he said the AI is amazing at micro, but that humans could win using macro pressure. OpenAI 5 is an incredible achievement, but for Dota players it hasn't solved Dota. The best analogy is probably something like playing a fixed strategy in Poker and never varying it; your opponents will eventually figure out what your strategy is and find ways to exploit it.

What you mentioned with having a courier for each hero is actually a well-known tactic called "bottle-crow" and was popular in Filipino Dota from 2010-2015. Each support hero would buy a dedicated crow for each lane and a bottle. Then the cores would ferry the bottle to themselves constantly from base, but this tactic got nerfed to death with the change "crow flies at half speed with an empty bottle". You are right that OG didn't deal well with this constant source of health regeneration, but it was largely limited to the laning phase. After the laning phase ends, humans can use their superior macro (i.e. understanding of map movement) which is a definite weakness of the AI.

Against the simple AI that comes built into Dota 2, you can be behind by a ton and still win by splitpushing, because doing any structural damage to buildings will force the AI (who are gathered up for a push) back, and you will win if you drag the game far enough into the lategame. There are indications that similar tactics can be used against OpenAI 5, which is bad at countering invisible items/heroes and at playing against splitpush [1]. Deciding whether or not to continue a push, and which heroes to send back to defend, is one of the hardest strategic decisions to make in the game, even for humans. The final of one of the TIs, NaVi vs. Alliance, was lost on getting such a decision wrong. Eliminating some key splitpush heroes minimized the probability that OpenAI 5 would ever be forced to make such a decision.

OpenAI 5 would have lost against OG if any of the following had been satisfied:

- had the entire heropool been available (certain natural splitpushing heroes like Furion, Tinker, Phantom Lancer, etc.)

- had the series gone long enough for them to uncover AI's weaknesses (which were apparent to the viewers and later teams). OG's captain even explicitly said "Give us 5 games and we will figure it out."

- had the prizepool been big enough for OG to take the game seriously and pick a splitpushing strategy, which is considered "cowardly" in some circles.

[1] https://www.reddit.com/r/DotA2/comments/beyilz/openai_live_u...


Lmao, imagine calling AdmiralBulldog’s primary strategy “cowardly”.

Sorry for the rare persona break, but Dota happens to be my old passion. It was a bit like saying that Magnus wouldn’t play a certain chess strategy because it would be “cowardly.” I assure you, the top teams are there to win. They also saw how ungodly effective OpenAI’s 1v1 bot was, and everyone respected its skill.

You can make those claims, but I think we’ll have to agree to disagree. More specifically, I agree that OpenAI solved a limited subset of Dota. But I disagree that OG would have “figured it out” under the specific rules the bot was trained on. It’s a totally different game, and it’s nothing like bottle crow (which I was fond of back in 2010). It also wasn’t bottles, because only the mid lane buys a bottle. The other lanes were usually ferried health regen items to prevent your cores from having to go back to base - which of course the AI did here, but to a ridiculous degree that no one was expecting.

Which is more likely: that Dota happens to be the one game that AI can’t become superhuman at, or OpenAI didn’t train quite long enough to prove it?


Yep, we'll have to agree to disagree.

One reason I disagree is that Dota (and Starcraft 2) are both games of imperfect information (more like Poker than Chess or Go).

That means a lot of the time, the game revolves around deception and subterfuge. If you've ever had to hunt down an enemy splitpusher and decide whether to spend ten seconds checking a hiding place, with a chance at a potentially game-winning kill but also a chance of finding nothing and wasting your time, you'll know what I'm talking about. There's no optimal strategy, because all strategies have weaknesses to be exploited (e.g. in BW, if Terran knows you as Protoss always go greedy such as 12 Nexus, they can punish you with BBS. However, if you play safe, they can play greedy themselves with 14CC, so there are a lot of mindgames.)

After AlphaZero's victories, or Deep Blue's for that matter, professional Go and Chess players could find no weaknesses. Nada.

After OpenAI 5 became available, a professional player compiled a list of 20 weaknesses that I'm not certain OpenAI can ever fix (see link above). How do you tell the neural net when to dust or ward?

It's the same story with DeepMind's AlphaStar. Despite playing a good macro game, players online have figured out lots of ways to cheese it (e.g. sending just one unit will cause all workers to be pulled and stop mining). I understand Poker's been "solved", but these computer games with much larger action spaces and imperfect information might turn out to be significantly harder to pin down than Chess or Go. There are enough edge cases that they're a much better representation of real-life problems like self-driving cars (negotiating a merge, for example).


>How do you tell the neural net when to dust or ward?

My knowledge of what is feasible with AI is fairly limited, so maybe a proposition like this is a bit naive - but wouldn't a more effective solution be to use some sort of machine learning/neural net hybrid that also incorporates data from matches played by high-skill teams?

From what I'm reading, OpenAI trains the neural net entirely by playing games against itself, and only uses matches against humans as benchmarks - so it essentially only develops strategies of play that primarily work well against itself. In most cases those probably coincide with strategies that work against human teams, but it seems like a lot of information is going to go missing - which is probably why it's only been successful in certain subsets of the game.
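For what it's worth, that's roughly what DeepMind did for AlphaStar, which was bootstrapped by imitating human replays before self-play took over. One simple way to wire the idea in (purely illustrative, not how OpenAI Five was actually trained) is to add a behavior-cloning term on human games to whatever self-play policy loss you already have:

    import torch

    def mixed_policy_loss(self_play_loss, policy, human_obs, human_actions, bc_weight=0.1):
        # Behavior cloning: raise the likelihood of the actions humans actually took.
        # Assumes `policy` maps a batch of observations to a torch.distributions object.
        bc_loss = -policy(human_obs).log_prob(human_actions).mean()
        # bc_weight trades off imitation against whatever the self-play objective wants;
        # too high and you only imitate, too low and it's pure self-play again.
        return self_play_loss + bc_weight * bc_loss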

It reminds me of a story I read a while back about two children who grew up relatively isolated from the rest of society, but had access to musical instruments. Since they had no teacher and apparently no other training materials, they developed an entirely unique style of performance and composition. Obviously in the case of music "success" is entirely subjective so simply being different doesn't invalidate it. But the point of OpenAI is to eventually beat human players in all aspects of the game - and if it's not going to actually train against humans, then success is going to be at least partially coincidental. So I would definitely agree that OpenAI (as it stands right now) has some potentially insurmountable weaknesses.


I don't think it changes the fact that, given enough compute time, there is a "GTO" (game-theory optimal) strategy to be found for imperfect-information games like Dota, just as there is for poker. The AI will lose some proportion of games, but overall it'll win.

https://www.deepstack.ai/
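A toy version of that claim: if both players run a no-regret algorithm like regret matching in self-play, their average strategies converge to the equilibrium of a two-player zero-sum game. Rock-paper-scissors is the smallest example; a numpy sketch (names are mine):

    import numpy as np

    # Rock-paper-scissors payoffs for the row player; the column player gets the negative.
    PAYOFF = np.array([[0, -1, 1],
                       [1, 0, -1],
                       [-1, 1, 0]])

    def current_strategy(regrets):
        positive = np.maximum(regrets, 0.0)
        total = positive.sum()
        return positive / total if total > 0 else np.full(3, 1.0 / 3.0)

    def regret_matching_selfplay(iters=20000, seed=0):
        rng = np.random.default_rng(seed)
        regrets = [np.zeros(3), np.zeros(3)]
        strategy_sums = [np.zeros(3), np.zeros(3)]
        for _ in range(iters):
            strategies = [current_strategy(r) for r in regrets]
            actions = [rng.choice(3, p=s) for s in strategies]
            for p in range(2):
                strategy_sums[p] += strategies[p]
                # Counterfactual payoff of every action against the opponent's sampled action.
                payoffs = PAYOFF[:, actions[1]] if p == 0 else -PAYOFF[actions[0], :]
                regrets[p] += payoffs - payoffs[actions[p]]
        # The *average* strategies converge to the Nash equilibrium (1/3, 1/3, 1/3).
        return [s / s.sum() for s in strategy_sums]

    print(regret_matching_selfplay())

Scaling that idea up is roughly what CFR-based systems like DeepStack do for poker; whether it can be pushed to something Dota-sized is the compute question.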


Facebook is hosting a NetHack AI competition that I'm watching with interest. I want to see how AIs perform in a challenging environment that's pure decision making, with no chance for superhuman mechanical performance sneaking in like we've seen with StarCraft and Dota.


I feel like the AIs should be handicapped to have a human level of concurrency and physical reaction time.

A bot can pay attention to 50 things in a way a human never could.
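One way to approximate that: wrap the environment so the agent only ever acts on observations that are a few frames old. A hypothetical Gym-style wrapper, assuming the classic gym API and nothing from any existing benchmark:

    import collections
    import gym

    class ReactionDelayWrapper(gym.Wrapper):
        # Delay the observation stream by a fixed number of frames to mimic human reaction time.
        def __init__(self, env, delay_frames=8):  # ~200 ms at ~40 observations/sec, an assumption
            super().__init__(env)
            self.buffer = collections.deque(maxlen=delay_frames + 1)

        def reset(self, **kwargs):
            obs = self.env.reset(**kwargs)
            self.buffer.clear()
            self.buffer.append(obs)
            return obs

        def step(self, action):
            obs, reward, done, info = self.env.step(action)
            self.buffer.append(obs)
            # The agent always acts on the oldest observation still in the buffer.
            return self.buffer[0], reward, done, info

An APM cap (e.g. forcing a no-op on most frames) would address the attention-splitting point in a similar spirit.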


Research titled "surprising effectiveness" should quantify the surprise, not just the effect.


No, it should be retitled entirely, because it's a terrible verbal-tic-tier meme. I'm sick to death of every third post on Hacker News being "The ___ ___ness of ___", especially "The un___ ___ness of ___". Language is so rich, but we have this cesspit.


I upvoted you, but at the bottom of this cesspit unfortunately lie the outputs of some otherwise illuminated people like… ugh... feel free to find out for yourself if you have the stomach for such a thing:

https://en.wikipedia.org/wiki/The_Unreasonable_Effectiveness...



