Mastering Real-Time Strategy Games with Deep RL: Mere Mortal Edition (clemenswinter.com)
129 points by cwinter on March 24, 2021 | 66 comments



In case anyone misses the links, this is twinned with two other superb posts - one about general lessons the author learned over the course of the project

https://clemenswinter.com/2021/03/24/my-reinforcement-learni...

and one history of the project

https://clemenswinter.com/2021/03/24/conjuring-a-codecraft-m...


> This trend has culminated in the defeat of top human players in the complex real-time strategy (RTS) games of DoTA 2 [1] and StarCraft II [2] in 2019.

Not quite:

- OpenAI's DoTA 2 system wasn't playing the full game. I think the final version could play 17 of the 117 heroes, and the opposing human players were also restricted to playing this subset of the game.

- DeepMind's StarCraft II system reached a level "above 99.8% of officially ranked human players", so it isn't trivial to argue that this amounts to defeating top players.


> OpenAI's DoTA 2 system wasn't playing the full game. I think the final version could play 17 of the 117 heroes, and the opposing human players were also restricted to playing this subset of the game.

The bigger issue in my eyes was that while OpenAI 5 defeated the world champion team OG, when they let anyone in the world fight it, some ingenious players figured out a pretty robust method to consistently exploit and defeat the bot. As I haven't heard any buzz about OpenAI 5 since then, I think it was more or less unsuccessful unless they can show that their training method produces unexploitable bots (instead of bots that are really good against certain strategies)


They can train the bots based on those games though, right? Seems more like a flaw in the training data than the principle.

I am not sure if the training is done live or not. That is, does the algorithm learn from each game against a real, live player? Or do they just train the model offline, then allow players to play against the static model?


Sibling posters have pointed out technical issues but I'd like to point out that while chasing after each successive exploit might let them stay on top of the Starcraft rankings, it changes the accomplishment from "we made a model that understands Starcraft better than almost any human" to "we made a model that memorized a selection of very good strategies and tactics."

When it was first demonstrated, it really looked like it was doing very smart things (while also taking advantage of the fact that it doesn't have attention lapses and hand fatigue) and reacting well to different strategies on a level that was freaky to me.


This doesn't seem any different than what humans do. Whenever I've played games everyone is usually along some progression of memorizing a selection of very good strategies and tactics. Then someone figures out a new tactic that exploits a weakness. Then people start to learn from that and figure out ways to protect against it. Rinse and repeat.


I always wanted to see if the AI could get to the point where it developed a novel strategy, that then changed the meta.

Something humans hadn't thought of yet, but were also physically capable of doing.

To me that will be the true moment that the AI has really surpassed us.


This is why I stick to the definition some other HN poster provided: Cognitive Automation.

But still really damn handy!


Learning through encounter seems to parallel a lot of human learning though.

The question, I think, is whether it can then apply parts of learned techniques to create unique offenses and defenses in novel situations.


> They can train the bots based on those games though, right? Seems more like a flaw in the training data than the principle.

I guess you could phrase it that way, but that's essentially the problem statement for developing a strategy for an imperfect-information game. So I would say it is a flaw in the principle if their final output is exploitable.
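To make "exploitable" concrete, here is a minimal sketch using rock-paper-scissors as a stand-in game (my choice of toy example, not anything from OpenAI's or DeepMind's setups). It measures how much an opponent can gain by best-responding to a fixed mixed strategy; the function name and payoff table are illustrative assumptions.

```python
# Toy illustration: how exploitable is a fixed mixed strategy in rock-paper-scissors?
# PAYOFF[i][j] is the payoff to a player choosing action i against action j.
ACTIONS = ["rock", "paper", "scissors"]
PAYOFF = [
    [0, -1, 1],   # rock: ties rock, loses to paper, beats scissors
    [1, 0, -1],   # paper: beats rock, ties paper, loses to scissors
    [-1, 1, 0],   # scissors: loses to rock, beats paper, ties scissors
]


def best_response_value(strategy):
    """Return (best counter action, its expected payoff) against a fixed mixed strategy.

    `strategy` is a list of three probabilities for rock, paper, scissors.
    A value of 0 means the strategy cannot be exploited; anything above 0
    is what an opponent gains per game by always playing the counter.
    """
    values = []
    for counter in range(3):
        # Expected payoff to the opponent for always playing `counter`.
        value = sum(prob * PAYOFF[counter][mine] for mine, prob in enumerate(strategy))
        values.append(value)
    best = max(range(3), key=lambda i: values[i])
    return ACTIONS[best], values[best]


if __name__ == "__main__":
    print(best_response_value([1 / 3, 1 / 3, 1 / 3]))  # uniform play
    print(best_response_value([0.5, 0.3, 0.2]))        # rock-heavy play
```

A static policy that leans too hard on one approach behaves like the rock-heavy strategy here: once the counter is found, it keeps working every game until the policy itself changes.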


Training requires millions of games. Playing against humans is only for evaluation purposes, not for training.

In both cases, it was indeed a static model, but more recent work called MuZero is not static and achieves great results in board games and Atari.


Assuming the OpenAI model is similar to DeepMind's AlphaStar model, the model is static.

And a few games of being exploited is nowhere near enough data for the AI to be re-trained.


The problem is how quickly the AI learns. When a player gets caught off guard by a strategy in one match, they will already react to it in the very next match. The AI we have does not.

Also, did the StarCraft AI have to move the camera around? I remember watching the show matches. In those the AI lost the match that it had to handle the camera and couldn't just give impossible orders that a player's interface would not allow.


I think this is kind of pedantic. They built an AI agent to take pixel data as input, and provide mouse movements and clicks as output, and rather than just flail around like a baby it actually played the games with a sophisticated competency. This to me is such an incredible achievement that I have no doubt that it could be enhanced to defeat top players consistently and easily.

As another commenter remarks, there are holes to plug in terms of exploitable behaviours that are locked into the model, but this too I'm confident they will find a general method of preventing; on the other hand, it's not like humans aren't susceptible to similar exploits by competitors in situations where they decide to cease innovation/learning


> As another commenter remarks, there are holes to plug in terms of exploitable behaviours that are locked into the model, but this too I'm confident they will find a general method of preventing

The problem is, I don't think there is a "general method of [prevention]" because that's not how neural networks work.

It's not easy to fix things like this because you can't just say "yeah just don't do that dumb thing anymore", the network has to be re-trained to learn the exploit.

The way DeepMind tried to get around this is by having a league of AIs playing against each other which try to exploit each other and expose their weaknesses. It worked pretty damn well, but people still found ways to exploit the AI.


Isn't the general method of prevention just to train a bigger model for longer, so that all these niche edge case exploits get found and addressed during policy exploration?

If there's an exploit that's sufficiently rare and unpredictable, then that seems like the only way (and indeed it should be a sufficient way, if done right) to address it.


That is the obvious answer, but I have no idea if it's true in practice.


It worked in adversarial board games (Go, Chess) and poker. We now have unexploitable bots for these games.

It hasn't worked yet in Starcraft because the strategy space is so much larger and the action space is also much larger. The networks are too small relative to this space, and humans can still put the bot into a situation it can't handle.

I'm going to guess that Starcraft will end up like these other games once the hardware etc advances another 5-10 years, and we'll have an unexploitable bot. The main reason I'm thinking this is we have unlimited training data, unlike with self-driving. We can make the models arbitrarily good.

The bot still won't have an ounce of common sense beyond what it's trained to do. It's just that it will have been so exhaustively exposed to every nook of the search space that a human won't be able to find any exploits.


I don't think either system was trained from pixels, but I agree it's impressive nonetheless.

However, it doesn't follow that it's easy to extend to beating top players consistently. If that was the case, it probably would have been done.


Not at all. It is a computer, of course it beats humans at optimization problems and speed. Not remotely surprising or interesting, any more than a calculator doing arithmetic faster than a human.


This is not accurate.

AlphaStar was nerfed quite heavily to achieve near APM parity with humans. The early version that beat Liquid-TLO had superhuman spikes in APM (despite having the same mean APM) but they addressed that later. The bot's APM is now significantly less than the best humans' typical APM, which makes it roughly fair since the bot never misclicks.

AlphaStar is legitimately good at strategic reasoning, strategic planning, responsive build orders, responsive scouting, long macro games, etc. It's not the best in the world at these things yet (only top 1 percent+ level) and it does still rely on cheese build orders a lot, but still, what has been achieved is incredible.


> The bot's APM is now significantly less than the best humans' typical APM, which makes it roughly fair since the bot never misclicks.

This isn't really true, it still has a heavy mechanical advantage.

I'd ballpark it at low or mid masters for overall strategy and tactics, top tier for 'reading' an incoming fight and deciding whether or not it should take the fight, and superhuman on mechanics/control.


The in-game EPM ended up about 180 with spikes smaller than humans' spikes. It also had camera constraints on where it could click unlike the bot that Liquid-TLO faced. Serral has a 300+ EPM for comparison. The bot's EPM is quite a bit worse than top humans, so I'm going to conclude that its mechanics are worse too.

Also I don't agree that it was low masters for strats or tactics. You can't beat GMs with worse/same mechanics unless you have good strats. Besides, low masters players are pretty bad, and the replays show that this bot had super tight and highly optimized builds; it was a big fan of, and good at, early-to-midgame cheeses.

The main valid criticism that I remember is that it wasn't the best at long games when the utility of its sharp build wore off and it needed to think on its feet, but I don't recall the details.


AlphaStar wasn't beating top pros, though. And even the players it was beating, it was with a heavy mechanical advantage -- not just on strategy and tactics.

That AIs can beat human players via superior interface control is obvious, of course, and uninteresting from an AI standpoint. Starcraft has had AIs with perfect roach/stalker/marine/etc control for a while. The problem was that the overall strategies weren't good enough.

AlphaStar did make massive improvements there, to be sure, very impressive ones. But it still relied on out-controlling human players to get the edge against pros.


A legacy of the AlphaStar work is ongoing amateur SC2 AI play. If you're comfortable writing software in, say, Python, and can play SC2 at least a little, you can see whether the reason you aren't a world famous player is just that you were too slow or whether your strategy actually isn't that great even if executed perfectly :)

https://sc2ai.net/competitions/3/

Unlike AlphaStar, these AIs are intended primarily to play each other because, as you say, inhuman perfection in execution is not an interesting difference. This means it makes sense for them to exploit behaviour in the game itself that would be inaccessible to humans (e.g. "speed mining" by individually controlling every worker) as well as executing ludicrously multi-pronged mid-game attacks, since they can just as easily manage six individual small battles as one larger frontal assault.

That site links a Twitch channel which automatically plays random games between bots with auto-camera, but if you prefer human commentators (and as a bonus, speeding up the period when the game is clearly lost but bots rudely never resign since it's not as though politeness scores points) there's https://www.youtube.com/watch?v=oLpEzq_6_go which is the next ESChamps tournament cast later on Thursday.


At Blizzcon 2019, AlphaStar beat Serral, undeniably a top player (although he didn't get to use his own keyboard and settings or get to prepare). Serral was able to beat the Terran agent though.

https://www.youtube.com/watch?v=nbiVbd_CEIA


In StarCraft 2, if you're used to playing with specific settings (esp. graphics, keybindings, mouse speed), having to revert to standard is a huge handicap. There are also OS settings (keyboard delay and repeat rate) that if left at default basically make a "standard" game unplayable, especially for Zerg.


He didn't get to use his own hotkeys? Are we sure about that? I'm sceptical, it'd make the game unplayable and it's easy to import hotkeys.


It was at one of those public booths, not really a sanctioned showmatch. So Serral was using a public computer rather than one that he can log into his own account with.


People also cheesed the shit out of the bot and won though. None of these AIs have proved to be robust to exploitation yet.


StarCraft is built in such a way that you can't create a perfect, 100% winrate agent.

Since there is hidden information, you could always miss a corner of the map where the enemy hid some units, and you lose the game.

Is AlphaStar "perfect"? No. Is it better than 99.9% of all humans? Absolutely.

You don't need to create a perfect agent in most cases; self-driving is a classic example.

If you were to deploy an agent that drives 95% better than all humans the effects would be huge.

It would still fail in some scenarios where professional drivers wouldn't, but that doesn't really matter because most people are not professional drivers.


I know the bot will never have 100% winrate, but I think it shouldn't be able to be exploited (I.e. repeatedly beaten using the same strategy).

Let me give you an example [0]. When AlphaStar was playing on the ladder, a player in Diamond league (~70-80th percentile) beat AlphaStar easily using mass Ravens. If you're not aware of the strategy, it's a turtle strategy where the player masses air units, and it is generally terrible.

But AlphaStar was confused by the strategy, and so it lost by a large margin.

Deploying an AI which can be exploited like this is asking for trouble.

[0] https://www.reddit.com/r/starcraft/comments/cgzieq/alphastar...


But that could be fixed technically. Deepmind's goal was not to create an "unexploitable" agent but to prove that ML algorithms can cope with complex, dynamic environments such as StarCraft.

It may seem weird to you, but the same agent probably wins against GMs most of the time. Humans have weaknesses too.

The AI simply leans on its strengths just like humans do.


> It seems to you weird but the same agent probably wins against GM's most of the time. humans have weaknesses too

This is the whole problem though. AlphaStar beats GMs but can lose to weird strategies.

On the other hand, GMs will almost never lose (Most likely >99% winrate) to a Diamond player no matter how weird their strategies are.

The AI has strengths, but it also has glaring weaknesses. Imagine if you had an AI flying a plane and 99% of the time it was far better than a human pilot but 1% of the time it crashed and killed everyone. I would not fly on that plane.

Maybe a bunch more training data and time would solve this type of problem, but I'm skeptical.


You're beautifully showing the human nature which can be problematic in my opinion.

First off, no human player achieves a 99% winrate against diamond players. There are many cheeses; one misstep and you lose. GMs can lose to Diamond players.

Now for the main part, you're saying and I'm rephrasing here:

Even if the AI is statistically better than humans because it has some weaknesses I'm going to prefer the human.

But still at the end of the day, the AI does a better job on average and will be safer to use than human pilots!

We already rely heavily on software/algorithms for our most important things. All modern vehicles use electronic systems that monitor/manage several key components, and the stock market is heavily managed by bots.

If an AI can do a significantly better job than a human, I would choose the AI, even if it behaves strangely in that 0.1% of cases. Humans are not as reliable as you think.


> First of, no human player achieves 99% winrate against diamond players. there are many cheeses, one miss-step and you lose. GM's can lose to Diamond players

They definitely would. You underestimate the difference in skill. Top players almost always beat other GM players and maintain very high winrates in top GM.

See for yourself: https://www.nephest.com/sc2/?season=46&queue=LOTV_1V1&team-t...

> But still at the end of the day, the AI does a better job on average and will be safer to use than human pilots!

I agree, but only if that 1% or 0.1% or whatever is not exploitable by someone malicious.


The link includes players with vastly lower winrate and players with high winrates but for extremely low number of games.

We need enough games to claim a 99% winrate. Even the highly ranked players with 200 games (which is still a low number, since a single loss can massively affect the result) are not even close to an 80% winrate, and with more games it would probably be even lower.

Maintaining a 99% winrate is extremely hard, as you can only lose a single game out of 100. People get tired, try new stuff, simply don't pay attention or just get caught off guard by a new thing.

As for "malicious exploitation", it does pose a risk in some environments, but the question then becomes exactly the same.

Is the AI less exploitable than the average person?

If so, it doesn't matter.
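The sample-size point can be checked with a confidence interval on an observed winrate. Below is a minimal sketch using the Wilson score interval; the choice of interval, the function name and the 95% level (z = 1.96) are my assumptions, not something taken from the linked ladder data.

```python
import math


def wilson_interval(wins, games, z=1.96):
    """Wilson score interval for a winrate observed as `wins` out of `games`.

    z=1.96 corresponds to roughly 95% confidence (an assumed default).
    Returns (lower, upper) bounds on the underlying winrate.
    """
    if games <= 0:
        raise ValueError("games must be positive")
    p = wins / games
    denom = 1 + z * z / games
    center = (p + z * z / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Hypothetical record: 198 wins in 200 games (an observed 99% winrate).
    print(wilson_interval(198, 200))
```

For that hypothetical 198-2 record the interval comes out at roughly 0.96 to 0.997, so 200 games cannot distinguish a true 99% winrate from a true 97% one.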


> Is the AI less exploitable than the average person?

People are generally not exploitable in the same way an AI is because we can subjectively assess situations and learn on the fly.

This is a good example of why I think your argument doesn't hold water: https://twitter.com/nikitabier/status/1372726911105855488

On the 99% winrate, I feel like you're either being purposefully obtuse or have no experience with competitive games.

The majority of the winrates are >70%, but even 60% is insane for a competitive game, especially at the very highest level. It is ridiculously hard to maintain a winrate this high even over 30 games.

You seem to be thinking about this from a statistical perspective (I.e. moar samples) without realizing that this is baked into MMR (You're matched with opponents as close to your skill level as possible). These players have to maintain high winrates just to stay at this MMR because they can earn as low as literally 0 MMR for a win and lose up to 60 MMR for a loss.

These players are also around 3000 MMR higher than Diamond players. Using the Elo model [0], this equates to a 99.998% winrate.

100 games in a row is also not feasible. That's ~20 hours of playtime assuming 12min games.

[0] https://www.reddit.com/r/starcraft/comments/7fc30w/7_orders_...
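For anyone who wants to play with the numbers, this is a sketch of the standard Elo expected-score formula. The scale of 400 is the chess convention and is an assumption here; SC2's MMR may use a different scale, so the output for a 3000-point gap need not match the figure quoted above.

```python
def expected_winrate(rating_gap, scale=400.0):
    """Elo expected score for the higher-rated player.

    rating_gap: own rating minus the opponent's rating.
    scale: gap at which the odds become 10:1 (400 in chess Elo; assumed here,
           not confirmed for SC2 MMR).
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


if __name__ == "__main__":
    for gap in (0, 400, 1000, 3000):
        print(gap, expected_winrate(gap))
```

Whatever the exact scale, the curve saturates quickly: the predicted winrate is already above 90% at a gap of one `scale` unit, so a gap several times that size leaves very little room for upsets under this model.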


This is the difference between memorizing a good strategy and thinking strategically. DeepMind does the former. It's still impressive. It will still beat humans. It may still be a route to successfully address many real world business problems.

But it's not as noteworthy as implied on the path to AGI.


> They definitely would. You underestimate the difference in skill. Top players almost always beat other GM players and maintain very high winrates in top GM.

A diamond player that has mastered one weird cheese will absolutely take more than one in a hundred games off a GM - even off a tournament pro. Even Serral chokes in way more than one in a hundred matches.


> First of, no human player achieves 99% winrate against diamond players.

Complete bullshit. You don't know the game at all if you believe this. The person you're arguing with is well known over on the Starcraft subreddits. Maybe listen to them.

In practice, hidden information cheeses simply aren't enough for a diamond to take a game off a top pro, even one time in a hundred. They'll sniff it out sufficiently and then just outcontrol the fight every time.


It is false that AlphaStar learns like humans do.


By any chance, do you know the replay ID of this game?



> StarCraft is built in such a way that you can't create a perfect, 100% winrate agent.

The worry isn't about perfect winrate, it's about finding strategies that can consistently cause the AI to lose over and over.

In a cooperative environment, a high percentage is great.

In a competitive environment, that .1% of scenarios where it's really weak will suddenly become the majority of games it faces.
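A back-of-the-envelope way to see this, with made-up numbers: treat the overall winrate as a weighted average over "normal" games and games that hit the weak spot, then vary how often opponents steer into the weak spot. The function name and the 95%/5% winrates are purely hypothetical.

```python
def overall_winrate(weak_share, winrate_normal, winrate_weak):
    """Weighted-average winrate when a fraction `weak_share` of games hit the weak spot."""
    if not 0.0 <= weak_share <= 1.0:
        raise ValueError("weak_share must be between 0 and 1")
    return (1.0 - weak_share) * winrate_normal + weak_share * winrate_weak


if __name__ == "__main__":
    # Hypothetical agent: wins 95% of normal games, 5% of games that hit its weak spot.
    print(overall_winrate(0.001, 0.95, 0.05))  # weak spot comes up by chance
    print(overall_winrate(0.6, 0.95, 0.05))    # opponents deliberately aim for it
```

With these assumed figures the agent drops from about 95% to about 41% once opponents aim for the weak spot in 60% of games, even though the agent itself has not changed.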


Cheese is a legitimate strategy, though.

I'm not sure it's been proven that the most successful overall strategy is unbeatable. Besides perfect skill, you still need to worry about all-in strategies. In my eyes, losing to cheese could still be possible even if you're the best overall player.

I do think it's fair to say these AIs should be able to grow after losing to a strategy once.


> Cheese is a legitimate strategy, though.

Of course it is. But do the same cheese to a top player over and over and it will rapidly become ineffective, usually within a game or two. Each AlphaStar agent can just be exploited endlessly.


IIRC each individual agent is essentially built around a single strategy/playstyle, not to mention the agents relied heavily on mechanical advantages to win.

AlphaStar basically got to 'good enough' strategy, then won on control, which computers obviously have a massive advantage with.


> OpenAI's DoTA 2 system wasn't playing the full game. I think the final version could play 17 of the 117 heroes

Limiting the number of playable heroes in DoTA2 really isn't important when it comes to evaluating the skill of the AI. Most real players trying hard to win already play with a limited hero pool dictated by the current patch version.


It is important. They removed a lot of champions with complicated mechanics that could have been much harder for the AI to play against.


As someone who has played HoN & DoTA2 for over a decade I'm telling you it isn't important when evaluating the ability of an AI to actually play the game.

Drafting can be massive when deciding the outcome of games even at the lower skill levels. Opening up the entire hero pool just means you're largely evaluating the ability to draft in the current patch more than actual playing ability.


IIRC, amongst banned heroes were key splitpush heroes such as Nature's Prophet, Tinker and Phantom Lancer, which are a known counter to the early-game advantage, mid-game push strategy that the OpenAI 5 executed. Early-game laning and mid-game teamfight combination are micro-intensive, and consequently the biggest advantage the OpenAI 5 had over OG. Against the simple AI that comes built-in to Dota 2, you can be behind by a ton and exploit splitpushing against the AI, because doing any structural damage against buildings will force the AI (who are gathered up for a push) back, and you will still win if you drag the game sufficiently into the lategame. Deciding whether or not to continue to push and which heroes to send back to defend is one of the hardest strategic decisions to make in the game, even for humans. The finals of one of the TIs, NaVi vs. Alliance, was lost on getting such a decision wrong. Eliminating some key splitpush heroes minimized the probability that OpenAI 5 would have been forced into having to make such a decision.

It would not surprise me if the OpenAI 5 would have lost against OG had the entire heropool been available, had the series been long enough or had the prizepool been big enough for OG to take the game seriously enough to warrant picking a splitpushing strategy (which is considered cowardly in some circles).


> IIRC, amongst banned heroes were key splitpush heroes such as Nature's Prophet and Tinker, which are a known counter to the early-game advantage, mid-game push strategy that the OpenAI 5 executed.

Tinker is an example of a hero that took (and still takes) a lot of gold to be useful. If you're drafting Tinker to stop early-game push strats you're going to have a bad time. If anything, Tinker is actually an example of a hero that the AI would probably excel at; the same goes for any micro-dependent hero.

In general including heroes that excel with great mechanics is a poor way to evaluate how well the AI actually plays the game. No one doubts an AI can send inputs faster than humans.


okay? so what's the rebuttal if I, someone who has also played dota for over a decade, say it _is_ important?

the game is balanced around the entire roster. in a pool of 17 heroes, the answers to the locally optimal strategy likely don't even exist within said subset. drafting proper compositions, adapting to the opponent's heroes, and dynamically changing your gameplan during the match are all part of actual playing ability.


> okay? so what's the rebuttal if I, someone who has also played dota for over a decade, say it _is_ important?

No, the rebuttal is how can someone who has never played the game before claim x,y,z is important or not?


It doesn't take a Dota expert to realize that removing most of the heroes is going to have a huge impact on strategy.


It just doesn't when you're evaluating how well an AI actually plays the game. You're missing the forest for the trees by focusing on the hero pool size rather than the AI's map movements, managing power spikes, division of resources, etc


The difference between the top 0.2% and top 500 players is huge too


Interesting blog-post.

I found some similarities with what occurred with Deepmind's Alphastar AI.

One of the weaknesses that seem to manifest in this piece too is the handling of unfamiliar scenarios.

The AI is very confused once it experiences something that was rarely seen in its learning data. Destroyer's big drones confused the bot quite a bit.

DeepMind solved it by intentionally creating agents that introduce different/bizarre strategies (which they called exploiters) in order to develop robustness against such strategies.


The bot has actually never seen Destroyer's big drones during training even once, so I found it somewhat surprising that it even works as well as it does!

Completely agree that adding something like the "League" used by AlphaStar would be one of the top priorities if you wanted to push this project further. I don't think CodeCraft is sufficiently complex to really allow for several very distinct strategies in the same way as StarCraft II, but I would still expect training against a larger pool of more diverse agents to increase robustness quite a bit.
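As a rough sketch of what "training against a larger pool of more diverse agents" could look like, the snippet below samples the next opponent from a pool, favouring the ones the learner currently struggles against. The weighting rule `(1 - winrate) ** power`, the function names and the example pool are all assumptions for illustration; this is not the CodeCraft training code and only loosely echoes the league idea discussed above.

```python
import random


def opponent_weights(winrates, power=2.0):
    """Turn the learner's winrate against each opponent into sampling probabilities.

    Opponents the learner beats less often get more weight. If the learner
    beats everyone all the time, fall back to uniform sampling.
    """
    raw = [(1.0 - w) ** power for w in winrates]
    total = sum(raw)
    if total == 0.0:
        return [1.0 / len(winrates)] * len(winrates)
    return [r / total for r in raw]


def sample_opponent(names, winrates, rng=random):
    """Pick one opponent name according to the weights above."""
    return rng.choices(names, weights=opponent_weights(winrates), k=1)[0]


if __name__ == "__main__":
    # Hypothetical pool: names and winrates are invented for the example.
    pool = ["scripted_rusher", "old_checkpoint", "exploiter"]
    learner_winrates = [0.9, 0.5, 0.1]
    print(opponent_weights(learner_winrates))
    print(sample_opponent(pool, learner_winrates))
```

The design choice is that games against already-beaten opponents teach little, so the sampler spends most of the budget on whatever currently exposes a weakness.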


What amazes me at the end of the day is that brute-forcing seems to do much better than I initially thought it would.

Trying random stuff just sounds stupid but with enough compute and data, I guess it could overpower smart creatures like us.

I agree that CodeCraft is vastly simpler than StarCraft, but the idea is the same: just try random stuff (sometimes with better logic behind it) until something works, and then optimize it to perfection.


That randomness has to be massively constrained, though. Well over 99.9 percent of inputs are guaranteed to lead to bad results. For example, if we're randomly waypointing a drone, we're almost guaranteed not to be sending it somewhere useful.


This is an excellent project with a great write-up. Most articles this long would lose me but this is engaging and clear, a joy to read. And I'm in awe of the amount of work that has gone into every aspect of this.

>Seeing as my policies are currently the world’s best CodeCraft players, we’ll just have to take their word for it for the time being.

I really hope this inspires some competition! How long until there is a leaderboard? :)


Agreed, this is better than the vast majority of machine learning papers that actually get published. The ablation section is particularly nice. It is really a major failing of the field that in most papers, it's entirely unclear what aspect of the model (or which particular hacks) are really carrying the weight.


This is a fantastic project and a great blog! As games start to include RL, it will be a lot of fun and could spawn a whole new generation of interesting games (especially if games are made with an RL-first mindset as opposed to using RL later on to beat human beings).

Do you have recommendations to learn more about RL? Is CodeCraft a game?


Thank you for the kind words! I am also quite excited about the new points in game design space that RL will unlock and am planning to write another blog post on that topic.

I quite like https://karpathy.github.io/2016/05/31/rl/ as an introduction to some of the ideas behind modern RL. Beyond that, I just recently found out about https://github.com/andyljones/reinforcement-learning-discord... which lists a lot of other high-quality resources.

CodeCraft is a programming game which you can "play" by writing a Scala/Java program that controls the game units. It's not actively developed anymore but still functional: http://codecraftgame.org/



