For me, one of the most amazing things about this work is that a small group of people (admittedly well funded) can show up and do what used to be the purview of only giant corporations.
The 256 P100 optimizers are less than $400/hr. You can rent 128,000 preemptible vCPUs for another $1,280/hr. Toss in some more support GPUs and we're at maybe $2,500/hr all in. That sounds like a lot, until you realize that some of these results ran for just a weekend.
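Back-of-the-envelope in code, assuming roughly 2018-era GCP preemptible list prices (the per-unit prices here are my assumptions, not OpenAI's figures):

    # rough hourly cost sketch; unit prices are assumed, not quoted by OpenAI
    gpu_hourly = 256 * 1.50        # ~$1.50/hr per preemptible P100 optimizer (assumed)
    cpu_hourly = 128_000 * 0.01    # ~$0.01 per preemptible vCPU-hour (assumed)
    support    = 800               # support GPUs, storage, networking (guess)
    total_hourly = gpu_hourly + cpu_hourly + support
    print(total_hourly)            # ~$2,460/hr, call it $2,500 all in
    print(total_hourly * 60)       # ~$148k for a ~60-hour weekend run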
In days past, researchers would never have had access to this kind of computing unless they worked for a national lab. Now it's just a budgetary decision. We're getting closer to a (more) level playing field, and this is a wonderful example.
Determining the scale needed, fiddling with the state/action/reward model, massively parallel hyper-parameter tuning: none of that comes free. I may be overestimating, but I'd reckon that with hyper-parameter tuning and all, this was easily in the 7-8 figure range at retail cost.
This is slightly frustrating in an academic environment, where people tout results after just a few days of training (even with much smaller resources, say 16 GPUs and 512 CPUs) when the cost of getting there is just not practical, especially for timing reasons. E.g. if an experiment runs 5 days, it doesn't matter that it doesn't use large-scale resources: realistically you need 100s of runs to evaluate a new technique and get it to the point of publishing the result, so you can only do that on a reasonable time scale if you actually have at least 10x the resources needed for a single run.
Sorry, slightly off topic, but it's becoming a more and more salient point from the point of view of academic RL users.
Depending on your institution, this is precisely why we (and other providers) give out credits though. Similar to Intel/NVIDIA/Dell donating hardware historically, we understand we need to help support academia.
Amazing, indeed. That's only 5/8 of my entire travelling allowance from my PhD studentship.
Hey, I'd even have some pocket money left over to go to a conference or two!
I agree. One of the most amazing things about watching this project unfold is just how quickly it went from 0 to 100 with minimal overhead. It's amazing to watch companies and individuals push the boundaries of what is possible with just the push of a button.
Edit: I guess https://blog.openai.com/content/images/2018/06/bug-compariso... is approximately indicative (you currently need about 3 days to beat humans).
> This logic takes milliseconds per tick to execute, versus nanoseconds for Chess or Go engines.
So it's the game engine itself taking up the CPUs. Maybe the Dota code can be optimized 2x for self-play?!
IIRC AlphaZero was about 10x more efficient than AlphaGo Zero due to algorithmic improvements.
So overall, $100K for the final training run, which maybe can go down to $10K for a different domain of similar complexity.
Best case, I'd assume at least a few ms per tick, because games grow as complex as they can while still fitting in 30 fps (33 ms per frame, much of which is rendering, but plenty still happens regardless of producing pixels).
Please don't. Every time they change something, several other things break.
Ok, just kidding.
But their fix logs really make it look like the game logic is built by stacking hack upon hack with no automated testing. Everything seems to hold together through playtesting.
Getting budgetary approval isn't easy for everyone. Especially with an unproven process. And even then, there could be a mistake in the pipeline. All that money down the drain.
I will note this paragraph from the post:
> RL researchers (including ourselves) have generally believed that long time horizons would require fundamentally new advances, such as hierarchical reinforcement learning. Our results suggest that we haven’t been giving today’s algorithms enough credit — at least when they’re run at sufficient scale and with a reasonable way of exploring.
which is mostly about the challenge of longer time horizons (and therefore LSTM related). If your problem is different / has a smaller space, I think this is soon going to be very approachable. That is, we recently demonstrated training ResNet-50 for $7.50.
There certainly exist a set of problems for which RL shouldn't cost you more than the value you get out of it, and for which you can demonstrate enough likelihood of success. RL itself though is still at the bleeding edge of ML research, so I don't consider it unusual that it's unproven.
The resources used for this are almost absurd, and my suspicion is that this comes down to an incredibly expensive random search in the policy space. Or rather, I would want to see a fair bit of analysis showing otherwise.
Especially given all the work in intrinsic motivation, hierarchical learning, subtask learning, etc., the intermediate summary of most of these papers from 2015-2018 is that so many of the newer heuristics are too brittle/difficult to make work, so we resort to slightly-better-than-brute-force.
Dota is far too complex for random search (and if that weren't true, it would say something about human capability...). See our gameplay reel for an example of some of the combos that our system learns: https://www.youtube.com/watch?v=UZHTNBMAfAA&feature=youtu.be. Our system learns to generalize behaviors in a sophisticated way.
What I personally find most interesting here is that we see qualitatively different behavior from PPO at large scale. Many of the issues people pointed to as fundamental limitations of RL are not truly fundamental, and are just entering the realm of practical with modern hardware.
We are very encouraged by the algorithmic implication of this result — in fact, it mirrors closely the story of deep learning (existing algorithms at large scale solve otherwise unsolvable problems). If you have a very hard problem for which you have a simulator, our results imply there is a real, practical path towards solving it. This still needs to be proven out in real-world domains, but it will be very interesting to see the full ramifications of this finding.
Well, I guess my question regarding the expense comes down to the sample efficiency, i.e. are there not many games that share large, similar state trajectories that can be re-used? Are you using any off-policy corrections, e.g. IMPALA-style?
Or is that just a source of noise that is too difficult to deal with, and/or is the state space so large and diverse that that many samples really are needed? Maybe my intuition is just way off; it just feels like a very, very large sample size.
Reminds me slightly of the first version of the non-hierarchical TensorFlow device placement work, which needed a fair number of samples, and of the large sample-efficiency improvement in the subsequent hierarchical placer. So I recognise there is large value in knowing the limits of a non-hierarchical model now, and subsequent models should rapidly improve sample efficiency by doing similar task decomposition?
In a hard environment, your gradients will be very noisy — but effectively no more than linear in the duration you are optimizing over, provided that you have a reasonable solution for exploration. As you scale your batch size, you can decrease your variance linearly. So you can use good ol' gradient descent if you can scale up linearly in the hardness of the problem.
This is a handwavy argument admittedly, but seems to match what we are seeing in practice.
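Here's a minimal numpy sketch of that scaling intuition (toy numbers of mine, nothing to do with the actual Dota gradients): averaging a batch of B noisy per-rollout gradient estimates cuts the variance by roughly a factor of B.

    import numpy as np

    rng = np.random.default_rng(0)
    true_grad, noise_std = 1.0, 100.0   # pretend scalar gradient; very noisy rollouts

    for batch in [1, 10, 100, 1000]:
        # 1000 independent batched estimates, each averaging `batch` rollout gradients
        est = (true_grad + noise_std * rng.standard_normal((1000, batch))).mean(axis=1)
        print(f"batch={batch:>5}  variance ~ {est.var():.1f}")  # ~ noise_std**2 / batch

So if the noise grows only linearly with the horizon, a linearly bigger batch buys it back.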
Simulators are nice because it is possible to take lots of samples from them — but there's a limit to how many samples can be taken from the real world. In order to decrease the number of samples needed from the environment, we expect that ideas related to model-based RL — where you spend a huge number of neural network flops to learn a model of the environment — will be the way to go. As a community, we are just starting to get fast enough computers to test out ideas there.
They also understand all of the nuances, similar to HN. Last year when you guys beat Arteezy, everyone grokked that 5v5 was a completely different and immensely difficult problem in comparison. There's a lot of talent floating around /r/dota2, amidst all the memes and silliness. And for whatever reason, the community loves programming stories, so people really listen and pay attention.
So yeah, we're all rooting for you. Regardless of how it turns out this year, it's one of the coolest things to happen to the dota 2 scene period! Many of us grew up with the game, so it's wild to see our little mod suddenly be a decisive factor in the battle for worldwide AI dominance.
Also 1v1 me scrub
I wanted to play SF against the bot so badly, even knowing I'd get absolutely destroyed over and over again.
Will those models be introspectable / transferable? One thing I'm curious about is how AIs learn about novel actions/scenarios which are "fatal" in the real world. Humans generally spend a lot of time being taught these things (rather than finding out for themselves, obviously) and eventually come up with a fairly good set of rules about how not to die in stupid ways.
Introspectable: given that you can ask the models unlimited "what if" questions, we should be able to get a lot of insight into how they work internally. And you can often design them to be introspectable at some performance or complexity cost (if that's what you meant by introspectable).
This argument is likely accurate in the case where exploration is adequately addressed (for example, with a well chosen reward function, self play, or some kind of an exploration bonus). However, if exploration is truly hard, then it may be possible for the variance of the gradient to be huge relative to the norm of the gradient (which would be exponentially small), even though the absolute variance of the gradient is still linear in the time horizon.
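A toy way to see the exponentially-small-signal case (my example, not from the post): if reward only appears when all T independent binary decisions in an episode are correct, an untrained policy stumbles onto any reward at all, and hence any gradient signal, with probability 0.5^T, and no affordable batch size recovers that.

    # needle-in-a-haystack exploration: signal decays exponentially in horizon T
    for T in [10, 100, 1000]:
        print(T, 0.5 ** T)   # ~1e-3, ~8e-31, ~9e-302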
Why? We know that random search is smart enough to find a solution if given arbitrarily large computation. So it's not obvious that random search isn't smart enough for Dota at the computational budget you used. Maybe random search would work with 2x your resources? Maybe something slightly smarter than random search (simulated annealing) would work with 2x your resources?
> and if that weren't true, it would say something about human capability
No it would not. A human learning a game by playing a few thousand games is a very different problem than a bot using random search over billions of games. The policy space remains large, and the human is not doing a dumb search, because the human does not have billions of games to work with.
> See our gameplay reel for an example of some of the combos that our system learns
> Our system learns to generalize behaviors in a sophisticated way.
You're underestimating random search. It's ironic, because you guys did the ES paper.
Are there that many domains for which this is relevant?
Game AI seems to be the most obvious case and, on a tangent, I did find it kind of interesting that DeepMind was founded to make AI plug and play for commercial games.
But unless Sim-to-Real can be made to work it seems pretty narrow. So it sort of seems like exchanging one research problem (sample-efficient RL) for another.
Not to say these results aren't cool and interesting, but I'm not sold on the idea that this is really practical yet.
Transfer learning, which seems more widely researched, has also been making progress at least in the visual domain.
And it's clearly not solved yet either - 76% grab success doesn't really seem good enough to actually use, and that with 100k real runs.
I don't really know how to compare the difficulty of sim-to-real transfer research to sample efficient RL research, and it's good to have both research directions as viable, but neither seems solved, so I'm not really convinced that "just scaling up PPO" is that practical.
I'm hoping gdb will be able to tell me I'm missing something though.
Could you elaborate? One of the criticisms of RL and statistical machine learning in general is that models generalise extremely poorly, unless provided with unrealistic amounts of training data.
The point being that the bells and whistles of PPO and other relatively complicated algorithms (e.g. Q-Prop), namely the specific clipped objective, subsampling, and a (in my experience) very difficult-to-tune baseline using the same objective, do not significantly improve over plain gradient descent.
And I think Ben Recht's arguments expand on that a bit in terms of what we are actually doing with policy gradient (not using a likelihood-ratio model like in PPO) but still staying conceptually similar enough for the argument to hold.
So I think it comes down to two questions: how much do 'modern' policy gradient methods improve on REINFORCE, and how much better is REINFORCE really than random search? The answer thus far seemed to be: not that much better, and I am trying to get a sense of whether this intuition was wrong.
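For intuition, here's the kind of toy one-step comparison I have in mind (entirely my construction, nowhere near Dota scale): REINFORCE with a Gaussian policy versus pure random search, on the same noisy objective with the same budget of 5000 rollouts each.

    import numpy as np

    rng = np.random.default_rng(0)
    sigma = 1.0
    reward = lambda a: -(a - 3.0) ** 2            # one-step bandit; optimum at a = 3

    # REINFORCE: Gaussian policy N(theta, sigma^2), moving-average baseline
    theta, lr, baseline = 0.0, 0.02, 0.0
    for _ in range(5000):
        a = theta + sigma * rng.standard_normal()
        r = reward(a)
        baseline += 0.05 * (r - baseline)
        theta += lr * (a - theta) / sigma**2 * (r - baseline)  # score-function gradient

    # random search: 500 candidate parameters, each scored by 10 noisy rollouts
    best, best_r = 0.0, -np.inf
    for _ in range(500):
        cand = rng.uniform(-10.0, 10.0)
        r = np.mean([reward(cand + sigma * rng.standard_normal()) for _ in range(10)])
        if r > best_r:
            best, best_r = cand, r

    print(f"REINFORCE:     theta ~ {theta:.2f}")
    print(f"random search: theta ~ {best:.2f}")

Both land close to 3 here, which is kind of the point: on small problems random search is embarrassingly competitive, and the open question is how quickly that breaks down as the policy space grows.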
My takeaway from that work and Rajeswaran's earlier paper is that one can solve the MuJoCo tasks with linear policies after appropriate preprocessing, so we shouldn't take those tasks too seriously. That paper doesn't do an apples-to-apples comparison between ES and PG methods on sample complexity.
All of that said, there's not enough careful analysis comparing different policy optimization methods.
(Disclaimer: I am an author of PPO)
Which, in turn, requires you to understand the concept of what a creep is, and how blocking them contributes to creep equilibrium (and what creep equilibrium is) and how the various states of equilibrium contribute to gameplay, and how/why/when you want to manipulate that (for example, you want to block creeps at some early points in the game so your opponent has to attack uphill, but between those particular points in time you want to push your creeps in deeper to ensure you have time to complete other objectives). :)
Obviously, you don't need to know anything above, but once you start diving into the depth of things OpenAI (and human players) deal with every game, it gets pretty insane that a bot can learn at such a high level so quickly.
That it managed to learn creep blocking from scratch was really surprising for me. To creep block you need to go out of your way to stand in front of the creeps and consciously keep doing so until they reach their destination. Creep blocking just a bit is almost imperceptible and you need to do it all the way to get a big reward out of it.
I also wonder if their reward function directly rewarded good lane equilibrium, or if that came indirectly from the other reward components.
- The 1v1 bot played at The International used a special creep block reward (and a big if statement separating that part of the agent from the self-play trained part). It trained for two weeks.
- A 2v2 bot discovered creep blocking on its own, no special reward. It trained for four weeks.
- OpenAI Five does not have a creep blocking reward, but neither (to our knowledge) does it creep block currently. Trained for 19 days!
But my larger point is, the early game doesn't have a lot of strategic elements in it. You have to last hit, not die, harass the opponent, get items. You can play it by the book pretty much. The challenge in the early game is handling 5 different things at the same time. So there's never really a question of what to do, but doing it does require mechanical prowess, which we know bots can easily be better at than humans.
The team composition chosen is very early game snowball oriented. So is the bot winning simply due to mechanical superiority and early game advantage? Access to last hits @ 10 mins, gold and net worth graphs would allow us to answer that question.
How does training RL with preemptible VMs work when they can shut down at any time with no warning? A PM of that project asked me the same question awhile ago (https://news.ycombinator.com/item?id=14728476) and I'm not sure model checkpointing works as well for RL. (maybe after each episode?)
Cost efficiency is always important, regardless of your total resources.
The preemptibles are just used for the rollouts — i.e. to run copies of the model and the game. The training and parameter storage is not done with preemptibles.
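For anyone wondering what that looks like structurally, here's a minimal sketch of the pattern as I understand it (my guess, with hypothetical param_server / experience_queue / env APIs; not OpenAI's actual code). A worker holds no unique state, so preemption only costs the in-flight rollout:

    def rollout_worker(param_server, experience_queue, env):
        """Runs on a preemptible VM; nothing global is lost if it dies mid-episode."""
        while True:
            policy = param_server.latest_weights()  # pulled from non-preemptible trainers
            trajectory, done = [], False
            obs = env.reset()
            while not done:
                action = policy.act(obs)
                obs, reward, done = env.step(action)
                trajectory.append((obs, action, reward))
            experience_queue.put(trajectory)        # only completed rollouts get trained on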
Also one could look at the cost of the custom development of bots and AIs using other more specialized techniques: sure, it might require more processing power to train this network, but it will not require as much specialized human interaction to adapt this network to a different task. In which case, the human labor cost is decreased significantly, even if initial processing costs are higher. So in a way you guys do actually optimize cost efficiency.
As gdb said below, the GPUs doing the training aren't preemptible. Just the workers running the game (which don't need GPUs).
I'm surprised you felt cost isn't interesting. While OpenAI has lots of cash, that doesn't mean they shouldn't do 3-5x more computing for the same budget. The 256 "optimizers" cost less than $400/hr, while if you were using regular cores the 128k workers would be over $6k/hr. So using preemptible is just the responsible choice :).
There's lots of low-hanging fruit in any of these setups, and OpenAI is executing towards a deadline, so they need to be optimizing for their human time. That said, I did just encourage the team to consider checkpointing the DOTA state on preemption, to try to eke out even more utilization. Similarly, being tighter on the custom shapes is another 5-10% "easily".
Don't forget, they're hiring!
A bit disappointing; it would be very cool to see what kind of communication they'd develop.
Since playing a lane as a ranged hero is very different from playing the same lane as a melee hero, I wonder whether the AI has learned to play melee heroes yet.
Although it's extremely impressive, all the restrictions will definitely make this less appealing to the audience (as seen in the Reddit thread comments).
Actually, this is true on multiple levels. There is fog of war, but then there is the fact that a human player can only look at a given window of the game at a time, and has to pan the window to see the area away from their character. (The mini-map shows some level of detail for the rest of the map, but isn't high resolution and doesn't show everything that might be of interest.) Also, you can only issue orders on what is directly visible to you, so if you pan away from your character that restricts what you can do.
Is OpenAI Five modeling this aspect of the game? Otherwise it's still "cheating" in some sense vs how a human would be forced to play.
>OpenAI Five is given access to the same information as humans, but instantly sees data like positions, healths, and item inventories that humans have to check manually. Our method isn’t fundamentally tied to observing state, but just rendering pixels from the game would require thousands of GPUs.
Secondly, while it's still a simplification and abstraction, DotA's ruleset is orders-of-magnitude more similar to operating in the real world than Chess's is.
Thirdly, I'd argue that the adversarial nature of games makes it _easier_ to track progress, and to ensure that measure of progress is honest.
There's a lot of ways you can define "progress" in self-driving cars. Passengers killed per year in self-driving vs. human-driven cars? Passengers killed per passenger-mile? Average travel time per passenger-mile in a city? etc.
With games, you either win, or you don't.
One of the hard challenges of Dota is whether or not to "trust" your teammates to do the right thing. I.e. one can aggressively go for a kill knowing that their support will back them up... but one can also aggressively go for a kill while their support lets them die, and then the whole team starts blaming and tilting because the dps "threw". It's a fine balance. From personal experience, it seems like in lower leagues it's better to always assume that you're by yourself, whereas in higher leagues you can start expecting more team plays.
Another example: often many players will use their ultimate abilities at the same time, "wasting" them. It would be easy for an agent controlling all 5 players to avoid this... but how would an individual agent know whether or not to use its ult? Are the agents able to communicate with each other? If so, is there a cap on how fast they can do it? I.e. over voice, it takes a few seconds to give orders.
"OpenAI Five does not contain an explicit communication channel between the heroes’ neural networks. Teamwork is controlled by a hyperparameter we dubbed “team spirit”. Team spirit ranges from 0 to 1, putting a weight on how much each of OpenAI Five’s heroes should care about its individual reward function versus the average of the team’s reward functions. We anneal its value from 0 to 1 over training."
So pretty much like pubs.
> OpenAI Five does not contain an explicit communication channel between the heroes’ neural networks. Teamwork is controlled by a hyperparameter we dubbed “team spirit”. Team spirit ranges from 0 to 1, putting a weight on how much each of OpenAI Five’s heroes should care about its individual reward function versus the average of the team’s reward functions. We anneal its value from 0 to 1 over training.
To me, that reads as 5 individual agents, one for each character.
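The blending they describe is straightforward to write down (my formalization of the quoted paragraph, not their actual code):

    import numpy as np

    def blended_rewards(individual_rewards, team_spirit):
        # team_spirit = 0: each hero optimizes only its own reward
        # team_spirit = 1: every hero optimizes the team average
        r = np.asarray(individual_rewards, dtype=float)
        return (1 - team_spirit) * r + team_spirit * r.mean()

    # e.g. one hero gets a kill; with spirit 0.3 its teammates share some credit
    print(blended_rewards([1.0, 0.0, 0.0, 0.0, 0.0], team_spirit=0.3))
    # -> [0.76 0.06 0.06 0.06 0.06]

Annealing it from 0 to 1 then just means sliding that scalar over the course of training.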
They can't mess up helping each other in teamfights, because they see each other's intentions in the way the heroes move.
That's why Blitz is saying that the bots are perfect in a teamfight.
With the sparse reward (kills, health, etc.), it scored a better 80 and learned much faster.
Normally, "reward engineering" uses human knowledge to give more continuous, richer rewards. This was not used here.
The "dense orange graph" - uses more dense rewards - kills, health - and learns better. I referred to this as a "sparse reward" - since it is still a fairly lean and sparse function.
But this is just my opinion. Also note this is for the older 1v1 agent.
The current reward function is even more detailed, and they blend and anneal the 5-agent score, so I dunno...
This will likely dramatically simplify the problem vs. what the DeepMind/Blizzard framework does for StarCraft II, which provides a game state representation closer to what a human player would actually see. I would guess that the action API is also much more "bot-friendly" in this case, i.e., it does not need to do low-level actions such as boxing to select.
It makes sense to solve this easier problem first, as there will be more headlines faster.
Also, how does this system cope with gameplay changes that arise when the game is patched? It's not news to any experienced Dota player that even small changes can have a major impact on the metagame and on winning strategy. Would it need to be re-trained every patch?
The writeup implies that it's entirely self-play.
> Also, how does this system cope with gameplay changes that arise when the game is patched?
From the sound of it, they don't. Since it's a policy gradient method, which learns only from the last set of samples, hypothetically, they could simply swap out the DoTA binary on the fly in parallel and let it automatically update itself by continued training. (The difference between optimal pre/post-patch is a lot smaller than the difference between a random policy and an optimal policy...)
The fact that it's not seeded at all is very interesting. A lot of Dota expertise derives from knowing what the opponent is going to do at a particular time. I remember many comments from experienced Go players that AlphaGo made moves that no human player would make, so I wonder if that will appear in this case as well.
Whether these are better is hard to say. It's not superhuman, after all, unlike AlphaGo, so it's not presumptively right, and you can't double-check with a very deep tree evaluation (because Dota doesn't lend itself to tree exploration: far too many actions, over far too long a range).
Rough guesses for available actions:
32 (directions for movement)
* 20 (potential targets: heroes + nearby creeps)
Which still leaves... approximately 170,000 actions unaccounted for
For a hero like Invoker, who can cast Sunstrike anywhere on the map at any time, would you try to come up with domain heuristics (only consider locations near enemy heroes), or deal with an explosion of possible actions? (And this applies to a ton of different hero mechanics that are not in scope here.)
In the development of the Magic: The Gathering AI (Duels), one of the restrictions was "don't cast harmful spells on your targets", even though in some edge cases that is actually the optimal thing to do. They traded optimality for a smaller search space.
Sniper and Viper have non-targeted abilities that they were using to zone the enemy in a teamfight.
Based on the examples under the "Model structure" section, I'm guessing they are counting all combinations of spell and target location, including locations on the ground for ground-targetable spells? That could add up quick... e.g. 10 spells * 20 target units * 9x9 grid of locations around each = around 16,000 possibilities.
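If that guess is right, a quick enumeration shows how fast it blows up (every count below is my assumption, not from the post):

    from itertools import product

    n_spells = 10                                      # castable spells/items (guess)
    n_units  = 20                                      # nearby heroes + creeps (guess)
    offsets  = list(product(range(-4, 5), repeat=2))   # 9x9 grid of ground offsets

    ground_targeted = [(s, u, dx, dy)
                       for s in range(n_spells)
                       for u in range(n_units)
                       for dx, dy in offsets]
    print(len(ground_targeted))                        # 10 * 20 * 81 = 16,200 actions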
- Mirror match of Necrophos, Sniper, Viper, Crystal Maiden, and Lich
- No warding
- No Roshan
- No invisibility (consumables and relevant items)
- No summons/illusions
- No Divine Rapier, Bottle, Quelling Blade, Boots of Travel, Tome of Knowledge, Infused Raindrop
- 5 invulnerable couriers, no exploiting them by scouting or tanking
- No Scan
In my projects, the "world" size can change (unlike Go, Chess where the board size is fixed).
Is the DoTA board size fixed?
I guess the LSTM encodes the board history as seen by the agent. But this probably slows the learning.
Some people suggested using an auto-encoder to compress the world state and then feeding it to a regular CNN.
Any comments would be welcome.
1) "At the beginning of each training game, we randomly "assign" each hero to some subset of lanes and penalize it for straying from those lanes until a randomly-chosen time in the game...." Combining this with "team spirit" (weighted combined reward - networth, k/d/a). They were able to learn early game movement for position 4 (farming priority position). For roaming position, identifying which lane to start out with, what timing should I leave the lane to have the biggest impact, how should I gank other lanes are very difficult. I'm very surprised that very complex reasoning can be learned from this simple setup.
2) Sacrificing the safe lane to control the enemy's jungle requires overcoming a local minimum (considering the rewards) and successfully assigning credit over a very, very long horizon. I'm very surprised they were able to achieve this with PPO + LSTM. However, one asterisk here: if we look at the draft (Sniper, Lich, CM, Viper, Necro), it is very versatile, with Viper and Necro able to play any lane. It is also very strong in the laning phase and mid game. Whoever wins Sniper's lane, and the laning phase in general, is probably going to win. So this makes it a little bit less of a local optimum. (In contrast to having some safe-lane heroes that require a lot of farm.)
3) "Deviated from current playstyle in a few areas, such as giving support heroes (which usually do not take priority for resources) lots of early experience and gold." Support heroes are strong early game and doesn't require a lot items to be useful in combat. Especially with this draft, CM with enough exp (or a blink, or good positioning) can solo kill almost any hero. So it's not too surprising if CM takes some farm early game, especially when Viper and Necro are naturally strong and doesn't need too much of farm (they still do, but not as much as sniper). This observation is quite interesting, but maybe not something completely new as it might sound like.
4) "Pushed the transitions from early- to mid-game faster than its opponents. It did this by: (1) setting up successful ganks (when players move around the map to ambush an enemy hero — see animation) when players overextended in their lane, and (2) by grouping up to take towers before the opponents could organize a counterplay." I'm a little bit skeptical of this observation. I think with this draft, whoever wins the laning phase will be able to take next objectives much faster. And winning the laning phase is really 1v1 skill since both Lich and CM are not really roaming heroes. If you just look at their winning games and draw conclusion, it will be biased.
5) This draft is also very low mobility. All 5 heroes (Sniper, Lich, CM, Necro, Viper) share the weakness of low movement speed (except maybe Lich). Also, none of these heroes can go at Sniper in the mid/late game, so if you have better positioning + reaction time, you'll probably win.
Overall, I think this is a great step and a great achievement (with the caveats I noted above). As for next steps, I would love to see whether they can train a meta-learned agent so they don't have to start from scratch for a new draft. I would love to see them learn item builds and courier usage instead of using scripts. I would also love to see them learn drafting (which can be simply phrased as a supervised problem). I'm pretty excited about this project; hopefully they release a white paper with some more details so we can try to replicate it.
Human players do it in a fraction of their much smaller lifespans.