
OpenAI Five - gdb
https://blog.openai.com/openai-five/
======
boulos
Disclosure: I work on Google Cloud (and vaguely helped with this).

For me, one of the most amazing things about this work is that a small group
of people (admittedly well funded) can show up and do what used to be the
purview of only giant corporations.

The 256 P100 optimizers are less than $400/hr. You can rent 128000 preemptible
vcpus for another $1280/hr. Toss in some more support GPUs and we're at maybe
$2500/hr all in. That sounds like a lot, until you realize that some of these
results ran for just a weekend.
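
Back-of-envelope, with the unit prices assumed above (rough numbers, not a quote from the pricing page):

    # Rough cost sketch; unit prices are assumptions for illustration.
    gpus, gpu_hr = 256, 1.50          # ~$1.50/P100-hr -> < $400/hr total
    vcpus, vcpu_hr = 128_000, 0.01    # ~$0.01/preemptible-vCPU-hr
    print(gpus * gpu_hr + vcpus * vcpu_hr)  # ~$1664/hr before support GPUs
    print(2500 * 48)                  # ~$120,000 all-in for a weekend run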

In days past, researchers would never have had access to this kind of
computing unless they worked for a national lab. Now it's just a budgetary
decision. We're getting closer to a (more) level playing field, and this is a
wonderful example.

~~~
naturalgradient
I would just want to comment that while this is true in principle, it's also
slightly misleading because it does not include how much tuning and testing is
necessary before one gets to this result.

Determining the scale needed, fiddling with the state/action/reward model,
massively parallel hyper-parameter tuning.

I may be overestimating, but I would reckon that with hyper-parameter tuning
and all that, this was easily in the 7-8 figure range at retail cost.

This is slightly frustrating in an academic environment when people tout
results for just a few days of training (even with much smaller resources, say
16 GPUs and 512 CPUs) when the cost of getting there is just not practical,
especially for timing reasons. E.g. if an experiment runs 5 days, it doesn't
matter that it doesn't use large-scale resources: realistically you need 100s
of runs to evaluate a new technique and get it to the point of publishing the
result, so you can only do that on a reasonable time scale if you actually
have at least 10x the resources needed for a single run.
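
Rough arithmetic behind that 10x claim (my illustrative numbers):

    # Why evaluating a new technique needs ~10x the single-run resources
    # to finish on an academic timescale (numbers are illustrative).
    days_per_run = 5
    runs_needed = 200                 # tuning, ablations, seeds
    parallel_runs = 10                # "10x the resources"
    print(days_per_run * runs_needed / parallel_runs)  # 100 days wall-clock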

Sorry, slightly off topic, but it's becoming a more and more salient point
from the point of view of academic RL users.

~~~
boulos
I hear you. I would say that this work is tantamount to what would normally be
a giant NSF grant.

Depending on your institution, this is precisely why we (and other providers)
give out credits though. Similar to Intel/NVIDIA/Dell donating hardware
historically, we understand we need to help support academia.

~~~
naturalgradient
Yes, thank you for that by the way, did not want to diminish your efforts.
Just wanted to point out that papers are often misleading about how many
resources are needed to get to the point of running the result. I have
received significant amounts of money from Google, full disclosure.

------
naturalgradient
So as someone working in reinforcement learning who has used PPO a fair bit, I
find this quite disappointing from an algorithmic perspective.

The resources used for this are almost absurd, and my suspicion, especially
considering [0], is that this comes down to an incredibly expensive random
search in the policy space. Or rather, I would want to see a fair bit of
analysis before being convinced otherwise.

Especially given all the work in intrinsic motivation, hierarchical learning,
subtask learning, etc., the intermediate summary of most of these papers from
2015-2018 is that so many of these newer heuristics are too brittle/difficult
to make work, so we resort to slightly-better-than-brute-force.

[0] [https://arxiv.org/abs/1803.07055](https://arxiv.org/abs/1803.07055)

~~~
gdb
(I work at OpenAI on the Dota team.)

Dota is _far_ too complex for random search (and if that weren't true, it
would say something about human capability...). See our gameplay reel for an
example of some of the combos that our system learns:
[https://www.youtube.com/watch?v=UZHTNBMAfAA&feature=youtu.be](https://www.youtube.com/watch?v=UZHTNBMAfAA&feature=youtu.be).
Our system learns to generalize behaviors in a sophisticated way.

What I personally find most interesting here is that we see qualitatively
different behavior from PPO at large scale. Many of the issues people pointed
to as fundamental limitations of RL are not truly fundamental, and are just
entering the realm of practical with modern hardware.

We are very encouraged by the algorithmic implication of this result — in
fact, it mirrors closely the story of deep learning (existing algorithms at
large scale solve otherwise unsolvable problems). If you have a very hard
problem for which you have a simulator, our results imply there is a _real,
practical path_ towards solving it. This still needs to be proven out in real-
world domains, but it will be very interesting to see the full ramifications
of this finding.

~~~
naturalgradient
Thank you for taking the time to respond, I appreciate it.

Well I guess my question regarding the expense comes down to wondering about
the sample efficiency, i.e. are there not many games that share largely
similar state trajectories that could be re-used? Are you using any off-policy
corrections, e.g. IMPALA style?

Or is that just a source of noise that is too difficult to deal with and/or
the state space is so large and diverse that that many samples are really
needed? Maybe my intuition is just way off, it just _feels_ like a very very
large sample size.

Reminds me slightly of the first version of the non-hierarchical TensorFlow
device placement work, which needed a fair number of samples, and of the large
sample-efficiency improvement in the subsequent hierarchical placer. So I
recognise there is value in knowing the limits of a non-hierarchical model
now, and subsequent models should rapidly improve sample efficiency by doing
similar task decomposition?

~~~
gdb
The best way we know to think of it is in terms of variance of the gradient.

In a hard environment, your gradients will be very noisy — but effectively no
more than linear in the duration you are optimizing over, provided that you
have a reasonable solution for exploration. As you scale your batch size, you
can decrease your variance linearly. So you can use good ol' gradient descent
if you can scale up linearly in the hardness of the problem.

This is a handwavy argument admittedly, but seems to match what we are seeing
in practice.
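
As a toy illustration of the claim (nothing to do with our actual training code), averaging a batch of noisy gradient samples cuts the variance of the estimate linearly in the batch size:

    import numpy as np

    # Variance of a batch-averaged gradient estimate ~ sigma^2 / batch_size.
    rng = np.random.default_rng(0)
    true_grad, noise_std = 1.0, 10.0  # "hard environment" = noisy samples
    for batch_size in (1, 10, 100, 1000):
        grads = true_grad + noise_std * rng.standard_normal((10_000, batch_size))
        print(batch_size, grads.mean(axis=1).var())  # ~ 100 / batch_size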

Simulators are nice because it is possible to take lots of samples from them —
but there's a limit to how many samples can be taken from the real world. In
order to decrease the number of samples needed from the environment, we expect
that ideas related to model-based RL — where you spend a huge number of neural
network flops to learn a model of the environment — will be the way to go. As
a community, we are just starting to get fast enough computers to test out
ideas there.
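
A minimal sketch of that model-based idea (toy linear dynamics, purely illustrative): fit a model from a few real transitions, then roll out in the model, spending flops instead of environment samples.

    import numpy as np

    rng = np.random.default_rng(0)

    def real_step(s, a):  # unknown true dynamics; samples are expensive
        return 0.9 * s + 0.5 * a + 0.01 * rng.standard_normal()

    # A small batch of real transitions.
    S, A = rng.standard_normal(200), rng.standard_normal(200)
    S_next = np.array([real_step(s, a) for s, a in zip(S, A)])

    # Fit a linear dynamics model s' ~ w_s*s + w_a*a by least squares.
    w, *_ = np.linalg.lstsq(np.stack([S, A], axis=1), S_next, rcond=None)

    # "Imagined" rollout: each step costs model flops, not env samples.
    s = 1.0
    for _ in range(5):
        s = w[0] * s + w[1] * (-s)  # simple policy a = -s
    print(w, s)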

~~~
shawn
Yo, this probably isn't the type of HN comment you're used to, but I just
wanted to say thanks for enriching the dota community. I know that's not
really why you're doing any of this, but as someone who's deeply involved with
the community, people get _super_ hyped about what you guys have been doing.

They also understand all of the nuances, similar to HN. Last year when you
guys beat Arteezy, everyone grokked that 5v5 was a completely different and
immensely difficult problem in comparison. There's a lot of talent floating
around /r/dota2, amidst all the memes and silliness. And for whatever reason,
the community _loves_ programming stories, so people really listen and pay
attention.

[https://imgur.com/Lh29WuC](https://imgur.com/Lh29WuC)

So yeah, we're all rooting for you. Regardless of how it turns out this year,
it's one of the coolest things to happen to the dota 2 scene period! Many of
us grew up with the game, so it's wild to see our little mod suddenly be a
decisive factor in the battle for worldwide AI dominance.

Also 1v1 me scrub

~~~
oblio
Agreed! Can't wait to not have to play Dota 2 with humans :p

------
gakos
This article (like pretty much all from OpenAI) is really well done. I love
the format and supporting material - it makes it way more digestible and fun
to read in comparison to something from arxiv. The video breakdowns really
drive the results home.

~~~
andreyk
To be fair, there is very little technical content... I don't think they could
repackage this content into an arxiv-style paper if they wanted to.

~~~
gakos
Good point - but I think that the difference is valuable. If that is the
average person's first touch point with the content, then it would do a better
job of making it accessible than a technical paper. Agreed that a follow-up
detailed post or paper would be awesome!

------
ufo
This is a really interesting writeup, especially if you know a bit more about
how Dota works.

That it managed to learn creep blocking from scratch was really surprising to
me. To creep block you need to go out of your way to stand in front of the
creeps and consciously keep doing so until they reach their destination. Creep
blocking just a bit is almost imperceptible; you need to do it all the way to
get a big reward out of it.

I also wonder if their reward function directly rewarded good lane equilibrium
or if that came indirectly from the other reward functions.

~~~
ionforce
It's not really "from scratch". The bots are rewarded for the number of creeps
they block, so it's not impossible that they would find some behavior to
influence this score.

~~~
Corence
That was true for their original 1v1 bot, but in the latest blog post they
mention bots can learn it on their own if left to train longer.

~~~
backpropaganda
That's not rigorously supported. It's just an anecdote they mention off-hand.
The final version of the bot does use the creep block reward.

~~~
gdb
To be clear:

- The 1v1 bot played at The International used a special creep block reward
(and a big if statement separating that part of the agent from the self-play
trained part). It trained for two weeks.

- A 2v2 bot discovered creep blocking on its own, no special reward. It
trained for four weeks.

- OpenAI Five does not have a creep blocking reward, but neither (to our
knowledge) does it creep block currently. Trained for 19 days!

~~~
backpropaganda
I see. Thanks! So it manages to win lanes without even creep blocking? That's
quite good. Any chance you could share the last hits @ 10 mins for the games
it has played (for both bots and humans)? I think that's a crucial number to
judge how OpenAI Five is winning its games.

~~~
BirdieNZ
I believe the article said that Blitz rated the bot's last-hitting at about
average for humans, although he might overestimate how well an average human
player last-hits.

~~~
backpropaganda
Yeah, he might be overestimating 2.5k MMR players, and there's also something
to be said about the consistency with which the bot last hits. A human player
would have high variance in last-hit performance, while the bot will probably
guarantee a minimum amount, thus ensuring a minimum set of items needed for
the mid-game transition.

But my larger point is, the early game doesn't have a lot of strategic
elements in it. You have to last hit, not die, harass the opponent, get items.
You can play it by the book pretty much. The challenge in the early game is
being able to handle 5 different things at the same time. So there's never
really a question of what to do, but doing it does require mechanical prowess,
which we know bots can easily be better at than humans.

The team composition chosen is very early game snowball oriented. So is the
bot winning simply due to mechanical superiority and early game advantage?
Access to last hits @ 10 mins, gold and net worth graphs would allow us to
answer that question.

------
minimaxir
They are using preemptible CPUs/GPUs on Google Compute Engine for model
training? Interesting. The big pro of that is cost efficiency, which isn't
something I expected OpenAI to be optimizing. :P

How does training RL with preemptible VMs work when they can shut down at any
time with no warning? A PM of that project asked me the same question a while
ago
([https://news.ycombinator.com/item?id=14728476](https://news.ycombinator.com/item?id=14728476))
and I'm not sure model checkpointing works as well for RL. (maybe after each
episode?)

~~~
gdb
(I work at OpenAI on the Dota team.)

Cost efficiency is always important, regardless of your total resources.

The preemptibles are just used for the rollouts — i.e. to run copies of the
model and the game. The training and parameter storage is not done with
preemptibles.
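
A minimal sketch of that split (hypothetical names, not our actual infrastructure): rollout workers are stateless consumers of parameters, so a preemption costs at most the episode in flight.

    import queue

    class ParamStore:
        """Lives with the optimizers on non-preemptible machines."""
        def __init__(self):
            self.weights = [0.0]
        def latest(self):
            return self.weights

    def play_episode(weights):
        # Stand-in for running the game with a copy of the model.
        return {"actions": [], "reward": 0.0, "weights_used": weights}

    def rollout_worker(store, trajectories, episodes):
        # Runs on a preemptible VM. If it dies mid-episode, only that
        # episode is lost; parameters and training state live elsewhere.
        for _ in range(episodes):
            trajectories.put(play_episode(store.latest()))

    store, trajectories = ParamStore(), queue.Queue()
    rollout_worker(store, trajectories, episodes=3)
    print(trajectories.qsize())  # 3 trajectories ready for the optimizers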

~~~
Erlich_Bachman
If these (or other similar) experiments show the viability of this network
architecture, the cost could be decreased a lot by developing even more
specialized hardware.

Also one could look at the cost of the custom development of bots and AIs
using other more specialized techniques: sure, it might require more
processing power to train this network, but it will not require as much
specialized human interaction to adapt this network to a different task. In
which case, the human labor cost is decreased significantly, even if initial
processing costs are higher. So in a way you guys do actually optimize cost
efficiency.

------
bobcostas55
>OpenAI Five does not contain an explicit communication channel between the
heroes’ neural networks. Teamwork is controlled by a hyperparameter we dubbed
“team spirit”. Team spirit ranges from 0 to 1, putting a weight on how much
each of OpenAI Five’s heroes should care about its individual reward function
versus the average of the team’s reward functions. We anneal its value from 0
to 1 over training.

A bit disappointing, it would be very cool to see what kind of communication
they'd develop.
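
My reading of that paragraph as code (a sketch of the described blending, not OpenAI's implementation):

    # "Team spirit" tau blends each hero's own reward with the team
    # average; per the post, tau is annealed from 0 to 1 over training.
    def blended_rewards(rewards, tau):
        team_avg = sum(rewards) / len(rewards)
        return [(1 - tau) * r + tau * team_avg for r in rewards]

    print(blended_rewards([1.0, 0.0, 0.0, 0.0, 0.0], tau=0.0))  # fully selfish
    print(blended_rewards([1.0, 0.0, 0.0, 0.0, 0.0], tau=1.0))  # fully shared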

~~~
kamac
Would be interesting to see whether, when one agent declines to help another
several times, the other one decides against helping him when he calls. The
logical explanation would then be that the agent has come to value his own
life more than his comrade's (because he is the one helping, and his comrade
has refused several times). The human explanation would be that he refuses to
help out of spite. It could even lead to those two agents "hating" each other,
though it would be more like cold calculation.

------
hsrada
I wanted to add the observation that all the restricted heroes are ranged.
Necrophos, Sniper, Viper, Crystal Maiden, and Lich.

Since playing a lane as a ranged hero is very different from playing the same
lane as a melee hero, I wonder whether the AI has learned to play melee heroes
yet.

~~~
backpropaganda
Not only are they ranged, but this lineup is very snowball-oriented, i.e. the
optimal play style with this kind of lineup is to gain a small advantage in
the early game and then keep pushing towers together aggressively. The middle-
to-late game doesn't really matter. Whoever wins the early game wins the game.
And we do know that bots are going to be good at early game last hitting.

~~~
jwin742
The article states the bots are actually rather mediocre at last hitting.

------
foobaw
I've played DotA for over 10 years so this development is quite relevant to
me. So excited to see this next month!

Although it's extremely impressive, all the restrictions will definitely make
this less appealing to the audience (as shown in the Reddit thread comments).

~~~
gdb
Thanks! The restrictions are a WIP, and will be significantly lifted even by
our July match.

------
eslaught
> Partially-observed state. Units and buildings can only see the area around
> them. The rest of the map is covered in a fog...

Actually, this is true on multiple levels. There is fog of war, but then there
is the fact that a human player can only look at a given window of the game at
a time, and has to pan the window to see the area away from their character.
(The mini-map shows some level of detail for the rest of the map, but isn't
high resolution and doesn't show everything that might be of interest.) Also,
you can only issue orders on what is directly visible to you, so if you pan
away from your character that restricts what you can do.

Is OpenAI Five modeling this aspect of the game? Otherwise it's still
"cheating" in some sense vs how a human would be forced to play.

~~~
JD557
I'm pretty sure they are not. From
[https://blog.openai.com/openai-five#differencesversushumans](https://blog.openai.com/openai-five#differencesversushumans):

>OpenAI Five is given access to the same information as humans, but instantly
sees data like positions, healths, and item inventories that humans have to
check manually. Our method isn’t fundamentally tied to observing state, but
just rendering pixels from the game would require thousands of GPUs.

------
jakecrouch
While this is a cool result, I wonder if the focus on games rather than real-
world tasks is a mistake. It was a sign of past AI hype cycles when
researchers focused their attention on artificial worlds - SHRDLU in 1970,
Deep Blue for chess in the late 1990s. We may look back in retrospect and say
that the attention DeepMind got for winning Go signaled a similar peak. The
problem is that it's too hard to measure progress when your results don't have
economic importance. It's clearer that the progress in image processing was
important because it resulted in self-driving cars.

~~~
Yen
Firstly, research into Chess AI has had a surprising amount of beneficial
spin-off, even if we don't call the result "AI".

Secondly, while it's still a simplification and abstraction, DotA's ruleset is
orders-of-magnitude more similar to operating in the real world than Chess's
is.

Thirdly, I'd argue that the adversarial nature of games makes it _easier_ to
track progress, and to ensure that measure of progress is honest.

There's a lot of ways you can define "progress" in self-driving cars.
Passengers killed per year in self-driving vs. human-driven cars? Passengers
killed per passenger-mile? Average travel time per passenger-mile in a city?
etc.

With games, you either win, or you don't.

~~~
jjcm
Another benefit of showing off progress with games is that it allows the
everyday reader to follow and understand it as well. It works great from a
public-awareness standpoint, especially when an AI can beat a human (e.g.
Garry Kasparov vs Deep Blue). Awareness is a good thing in this space.

------
d0m
Will one agent control all 5 players, or will each agent control a single
player?

One of the hard challenges of DOTA is whether or not to "trust" your teammate
to do the right action. I.e. one can aggressively go for a kill knowing that
their support will back them up.. but one can also aggressively go for a kill
while their support lets them die, and then the whole team starts blaming and
tilting because the dps "threw". It's a fine balance.. From personal
experience, it seems like in lower leagues it's better to always assume that
you're by yourself, whereas in higher leagues you can start expecting more
team plays.

Another example: often many players will use their ultimate ability at the
same time, "wasting" it. It would be easy for an agent controlling all 5
players to avoid this.. but how would an individual agent know whether or not
to use their ult? Are the agents able to communicate with each other? If so,
is there a cap on how fast they can do it? I.e. on voice, it takes a few
seconds to give orders.

~~~
joefkelley
Seems it's five individual agents with no communication, just a reward
function that shifts towards team-based rewards:

"OpenAI Five does not contain an explicit communication channel between the
heroes’ neural networks. Teamwork is controlled by a hyperparameter we dubbed
“team spirit”. Team spirit ranges from 0 to 1, putting a weight on how much
each of OpenAI Five’s heroes should care about its individual reward function
versus the average of the team’s reward functions. We anneal its value from 0
to 1 over training."

So pretty much like pubs.

~~~
BirdieNZ
It would be pretty interesting to see one or two of the bots playing with
humans on their team.

------
obastani
I think this is quite impressive. I'm a bit confused about the section saying
that "binary rewards can give good performance". Is it saying that binary
rewards (instead of continuous rewards) work fine, but end-of-rollout rewards
(instead of intermediate rewards such as kills) work poorly?

~~~
yazr
Binary rewards (win/loss score at the end of the rollout) scored a "good" 70.

With sparse rewards (kills, health, etc.), it scored a better 80 and learned
much faster.

Normally, "reward engineering" uses human knowledge to give more continuous,
richer rewards. This was not used here.

~~~
obastani
Perhaps we are looking at a different graph, but in the one I am looking at,
blue is "sparse" (plateaus at 70) and orange is "dense" (very quickly hits
80). I believe "dense" means they are doing reward engineering.

~~~
yazr
The "sparse blue graph" is just the binary win loss outcome - learns ok-ish
but slow

The "dense orange graph" \- uses more dense rewards - kills, health - and
learns better. I referred to this as a "sparse reward" \- since it is still a
fairly lean and sparse function.

But this is just my opinion. Also note this is for the older 1v1 agent.

The current reward function is even more detailed, and they blend and anneal
the 5-agent score, so I dunno...

[https://gist.github.com/dfarhi/66ec9d760ae0c49a5c492c9fae93984a](https://gist.github.com/dfarhi/66ec9d760ae0c49a5c492c9fae93984a)

------
mooneater
I want to see this datapoint on their AI and Compute chart:
[https://blog.openai.com/ai-and-compute/](https://blog.openai.com/ai-and-compute/)

------
loser777
>Each of OpenAI Five’s networks contain a single-layer, 1024-unit LSTM that
sees the current game state (extracted from Valve’s Bot API)

This will likely dramatically simplify the problem vs. what the
DeepMind/Blizzard framework does for StarCraft II, which provides a game state
representation closer to what a human player would actually see. I would guess
that the action API is also much more "bot-friendly" in this case, i.e., it
does not need to perform low-level actions such as drawing a box to select
units.

~~~
mikkelam
It definitely reduces the scope of the problem a lot; even inside the game
itself they have a big list of restrictions on items and heroes.

It makes sense to solve this easier problem first, as there will be more
headlines faster.

~~~
Anderkent
The problem they're trying to solve is also not how to recognise actions from
pixels, it's how to outstrategise and outexecute players at the game.
Conceptual rather than mechanical advantage.

------
KPLauritzen
Wow, very excited about this. I don't know too much about RL, but for me the
"170,000 possible actions per hero" seems far too large an output space to be
feasible. What happens if the bot wants to do an invalid action? Nothing, or
some penalty for selecting something invalid?
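
One common answer in the literature (no idea whether OpenAI Five does this) is action masking: force the logits of invalid actions to -inf so the policy can never sample them, and no penalty is needed.

    import numpy as np

    def masked_softmax(logits, valid):
        masked = np.where(valid, logits, -np.inf)
        exp = np.exp(masked - masked.max())  # np.exp(-inf) == 0.0
        return exp / exp.sum()

    logits = np.array([2.0, 1.0, 0.5, -1.0])
    valid = np.array([True, False, True, True])
    print(masked_softmax(logits, valid))  # invalid action gets probability 0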

------
KillcodeX
OpenAI is cover-up AI research for the CIA. The main goal will be to kill
innocent folks with this type of AI research. These folks are working for the
CIA without noticing the involvement of The Spy Agency. They are ostensibly
private institutions and businesses which are in fact financed and controlled
by the CIA. From behind their commercial and sometimes non-profit covers, the
agency is able to carry out a multitude of clandestine activities—usually
covert-action operations. Many of the firms are legally incorporated in
Delaware because of that state's lenient regulation of corporations, but the
CIA has not hesitated to use other states when it found them more convenient.
The NSA/CIA's best-known proprietaries are Amazon, Facebook, Microsoft,
Palantir, OpenAI (cover-up AI research via non-profit) and Google... Good luck
working inside military research without decoding the source of funding.

------
nerdponx
Are those 180 years of games "seeded" by real games, or was it entirely
self-play?

Also, how does this system cope with gameplay changes that arise when the game
is patched? It's not news to any experienced Dota player that even small
changes can have a major impact on the metagame and on winning strategy. Would
it need to be re-trained every patch?

~~~
gwern
> Are those 180 years of games "seeded" by real games, or was it entirely
> self-play?

The writeup implies that it's entirely self-play.

> Also, how does this system cope with gameplay changes that arise when the
> game is patched?

From the sound of it, they don't. Since it's a policy gradient method, which
learns only from the last set of samples, hypothetically, they could simply
swap out the DoTA binary on the fly in parallel and let it automatically
update itself by continued training. (The difference between optimal pre/post-
patch is a lot smaller than the difference between a random policy and an
optimal policy...)

~~~
nerdponx
Makes sense on both counts.

The fact that it's not seeded at all is very interesting. A lot of Dota
expertise derives from knowing what the opponent is going to do at a
particular time. I remember many comments from experienced Go players that
AlphaGo made moves that no human player would make, so I wonder if that will
appear in this case as well.

~~~
gwern
They do discuss some current differences in playstyle toward the bottom, like
faster openings and more use of support heroes, which the self-play has
invented (along with rediscovering standard tactics). So it's at least a
little different.

Whether these are _better_ is hard to say. It's not superhuman, after all,
unlike AlphaGo, so it's not presumptively right, and you can't double-check by
doing a very deep tree evaluation (because DoTA doesn't lend itself to tree
exploration - far too many actions and too long a horizon).

------
ericsoderstrom
What are the 170,000 discrete actions?

Rough guesses for available actions:

      32 (directions for movement)
    + 10 (spell/item activations)
         * 20 (potential targets: heroes + nearby creeps)
    + 15 (attack commands: 5 enemy heroes and ~10 nearby creeps)

Which still leaves... approximately 170,000 actions unaccounted for.

~~~
Anderkent
You can attempt to attack / move to / cast many spells on arbitrary pixels on
the map. The bots are shown casting spells on targets that aren't visible in
the demo. The amount of available targets probably blows up the count.
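
Speculative arithmetic (my guesses, not numbers from the post): even a coarse target grid lands on that order of magnitude.

    # Pure guesswork for illustration: ~17 targetable commands, each
    # aimable at any cell of a ~100x100 discretized map.
    commands = 17
    grid_cells = 100 * 100
    print(commands * grid_cells)  # 170,000 -- the quoted order of magnitude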

~~~
jschmitz28
I figured this much as well, and I think it also begins to explain why some of
the restrictions exist, and how difficult it would be to generalize this to
the entirety of the Dota action space. I'm assuming they were pretty smart
about defining and limiting the possible action space to get down to 170K. For
example: restricting the hero pool to 5 heroes that only have a reasonably
small number of options in a reasonably small radius around them (I think
Sniper's Q ability might lead to the highest number of discretized actions
among their chosen hero pool); banning Boots of Travel (though since you have
to TP to a friendly unit, of which there are not that many, this may not add
many actions to the space - it does have strategic implications, though); etc.

For a hero like Invoker, who can cast Sunstrike anywhere on the map at any
time, would you try to come up with domain heuristics (only consider locations
on the map near enemy heroes), or deal with an explosion of possible actions
(and this applies to a ton of different hero mechanics that are not in scope
here)?

~~~
ionforce
If the goal of the project is generalization, you likely want to shy away from
opinionated heuristics like the former you mention.

In the development of the Magic: The Gathering AI (Duels), one of the
restrictions was "don't cast harmful spells on your targets", even though in
some edge cases this is actually the optimal thing to do. They traded
optimality for a smaller search space.

------
formalsystem
Any thoughts from the Dota team on how drafting heroes will work by the time
we get to TI? Am also curious if you've seen more experimental drafts in early
results that aren't as popular in the pro scene.

~~~
forgot-my-pw
The OpenAI bots are still very limited. Current set of restrictions:

- Mirror match of Necrophos, Sniper, Viper, Crystal Maiden, and Lich

- No warding

- No Roshan

- No invisibility (consumables and relevant items)

- No summons/illusions

- No Divine Rapier, Bottle, Quelling Blade, Boots of Travel, Tome of
Knowledge, Infused Raindrop

- 5 invulnerable couriers, no exploiting them by scouting or tanking

- No Scan
------
yazr
Any thoughts from the DOTA team on handling a world map which is not bounded
in size?

In my projects, the "world" size can change (unlike Go or Chess, where the
board size is fixed).

Is the DoTA board size fixed?

I guess the LSTM encodes the board history as seen by the agent. But this
probably slows the learning.

Some people have suggested using an auto-encoder to compress the world and
then feeding it into a regular CNN.

Any comments would be welcome.

~~~
LukaCEnzo
Played DotA. The map is fixed size.

------
inverse_pi
I'm a Legend dota2 player and also a Machine Learning researcher, and I'm
_fascinated_ by this result. The main message I take away is that we might
already have powerful enough methods (in terms of learning capabilities), and
we're limited by hardware (this also makes me a little sad). My thoughts:

1) "At the beginning of each training game, we randomly "assign" each hero to
some subset of lanes and penalize it for straying from those lanes until a
randomly-chosen time in the game...." Combining this with "team spirit"
(weighted combined reward - networth, k/d/a). They were able to learn early
game movement for position 4 (farming priority position). For roaming
position, identifying which lane to start out with, what timing should I leave
the lane to have the biggest impact, how should I gank other lanes are very
difficult. I'm very surprised that very complex reasoning can be learned from
this simple setup.

2) Sacrificing the safe lane to control the enemy's jungle requires overcoming
a local minimum (considering the rewards) and successfully assigning credit
over a very, very long horizon. I'm very surprised they were able to achieve
this with PPO + LSTM. However, one asterisk here: if we look at the draft -
Sniper, Lich, CM, Viper, Necro - it is very versatile, with Viper and Necro
able to play any lane. This draft is also very strong in the laning phase and
mid game. Whoever wins Sniper's lane, and the laning phase in general, is
probably going to win. So this makes it a little bit less of a local optimum.
(In contrast to having some safe lane heroes that require a lot of farm.)

3) "Deviated from current playstyle in a few areas, such as giving support
heroes (which usually do not take priority for resources) lots of early
experience and gold." Support heroes are strong early game and doesn't require
a lot items to be useful in combat. Especially with this draft, CM with enough
exp (or a blink, or good positioning) can solo kill almost any hero. So it's
not too surprising if CM takes some farm early game, especially when Viper and
Necro are naturally strong and doesn't need too much of farm (they still do,
but not as much as sniper). This observation is quite interesting, but maybe
not something completely new as it might sound like.

4) "Pushed the transitions from early- to mid-game faster than its opponents.
It did this by: (1) setting up successful ganks (when players move around the
map to ambush an enemy hero — see animation) when players overextended in
their lane, and (2) by grouping up to take towers before the opponents could
organize a counterplay." I'm a little bit skeptical of this observation. I
think with this draft, whoever wins the laning phase will be able to take next
objectives much faster. And winning the laning phase is really 1v1 skill since
both Lich and CM are not really roaming heroes. If you just look at their
winning games and draw conclusion, it will be biased.

5) This draft is also very low mobility. All 5 heroes - Sniper, Lich, CM,
Necro, Viper - share the weakness of low movement speed (except for maybe
Lich). Also, none of these heroes can go at Sniper in the mid/late game, so if
you have better positioning + reaction time, you'll probably win.

Overall, I think this is a great step and a great achievement (with some
caveats I noted above). As far as next steps, I would love to see whether they
can train a meta-learned agent, so they don't have to train from scratch for
each new draft. I would love to see it learn item building and courier usage
instead of using scripts. I would also love to see it learn drafting (which
can be phrased simply as a supervised problem). I'm pretty excited about this
project; hopefully they release a white paper with some more details so we can
try to replicate.

------
akeck
This feels like Ender's Game without Ender.

------
andreyk
Quite a good read! Impressive results, it seems. I still think it's much more
useful to research learning complex things without absurd compute, sample
inefficiency, and various hacks, e.g. reward shaping (which, let's be honest,
this seems to have a lot of), but these are still interesting results.

------
matachuan
What are the other killer applications of deep learning, besides CV and game
playing?

------
zawerf
What's the estimated cost of a project like this?

~~~
KPLauritzen
Without considering salaries, you can look up the costs for their compute:
[https://cloud.google.com/compute/pricing](https://cloud.google.com/compute/pricing)
128,000 CPUs and 256 GPUs; I think they mention training for 2 months in the
video.

~~~
brootstrap
Another commenter near the top (with some relevant experience) estimated
~$2500/hour. That's 60 grand a day to use hundreds of thousands of cores to
learn to play computer games, roughly 1.8 mill for 30 days of active learning.
It's cool, but it does seem a little bit greedy; that is still expensive as
heck yo. You need a big ol' bank to fund you. Dropping 60k/day on compute
doesn't fly for many smaller companies, if you ask me.

~~~
d0m
As our understanding of "AI" gets better, it'll cost less and less and will
start to be affordable for smaller players; but the initial R&D always costs a
lot.

------
wnevets
The live 5v5 match at TI should be great to watch.

------
lawlessone
>OpenAI Five plays 180 years worth of games against itself every day.

Human players do it in a fraction of their much smaller lifespans.

~~~
gwern
On the other hand, humans couldn't play 180 years even if they wanted to.


