
Building AI that can master complex cooperative games with hidden information - olibaw
https://ai.facebook.com/blog/building-ai-that-can-master-complex-cooperative-games-with-hidden-information/
======
bcoates
If you're wondering why this is interesting: games that AIs excel at, like
chess/checkers/go, are all two-player, zero-sum, perfect-information (everyone
knows everything), deterministic games, so you can exactly predict your
opponent's behavior by simulating "what would I do if I were them, trying to
make me lose?" The only really hard problem in this space is extreme
branching factors.

Everything gets vastly more complicated once you break any of those rules.
Non-zero-sum games create a prisoner's dilemma cooperate/defect dynamic, and
every game with three or more players is non-zero-sum (and becomes
exponentially more so with every player you add). Hidden information forces
you to manage how much you reveal to your opponent, and requires you to
simulate multiple "alternate futures" based on things you learn after making
a decision. And randomness is equivalent to an extra player that makes
irrational, unpredictable moves.
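The cooperate/defect dynamic is easy to make concrete. Here's a toy check in
Python using the textbook prisoner's dilemma payoffs (the numbers and names
are illustrative, not from any of the games discussed):

```python
# Classic prisoner's dilemma: payoffs[(my_move, their_move)] = my reward.
# "D" (defect) strictly dominates "C" (cooperate) for each player, yet
# mutual cooperation beats the (D, D) equilibrium for both -- a dynamic
# that can't arise in a two-player zero-sum game.
payoffs = {
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def best_response(their_move):
    # My move that maximizes my payoff against a fixed opponent move.
    return max("CD", key=lambda mine: payoffs[(mine, their_move)])

# Defecting is the best response whatever the other player does...
assert best_response("C") == "D" and best_response("D") == "D"
# ...but mutual defection pays less than mutual cooperation.
assert payoffs[("D", "D")] < payoffs[("C", "C")]
```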

Games like that are vastly closer to the messy real world than the
computationally expensive but near-ideal world of games like go, and they're
much more of an open problem.

~~~
JoshTriplett
In particular, _communication_ is in some ways a complex multiplayer game
requiring multi-level modeling of other participants. In the ideal case, it
can be modeled as a cooperative game. In suboptimal cases that sadly occur
often in the real world, it's a game where you're fractionally cooperating and
fractionally competing with every other player, and the degree to which you're
cooperating or competing with any given player depends on your model of them,
which you update over time based on your observations of how they "play". And
it's a helpful shortcut in modeling if you can group expectations of other
"players" into common conventions.

Hanabi's multi-level "what is my model of each other player, and what is my
model for their model of other players including me" is remarkably deeper than
it looks on the surface. Play it for long enough, and you start to handle
situations for which preconceived "conventions" don't help: "OK, of the four
players at the table, three of them understand certain common conventions, one
of them doesn't seem to understand at least one convention based on the
misfire they just had/caused (which is also consistent with their low player
rating), I can probably assume they don't understand any other conventions
commonly considered more challenging than the one they just failed at, so if I
give this hint, how will the more advanced players understand it, can I do so
without the less advanced player misunderstanding it in a harmful way, and
what will happen? And also, for future games, I should remember that this
player doesn't know these conventions (yet) until I see evidence that they've
improved. I might also consider helping them learn more common conventions.
Or, if they don't know enough conventions and don't improve, I might not want
to play with them in the future at all."

~~~
moconnor
Unfortunately Facebook’s approach sidesteps this complexity by ensuring each
player uses the same random seed and searches over policies based only on
information they can all see. It’s not really solving the problem as
intended, in my opinion.

~~~
jakobnicolaus
We are entirely focused on the self-play setting in which the goal is to learn
the highest performing policy for a team of agents all trained together. The
Hanabi Challenge also outlines an ad-hoc setting in which you need to adjust
to the diverse policies of other agents in the team on the fly.

------
noambrown
Hi! I'm one of the authors on the paper. We'd be happy to answer any
questions. Ask us anything!

~~~
gjstein
Hey Noam, this is some great work; I'll need to sit down and give the paper a
deeper read. Also, the visualizations on this blog post are incredible.

I saw a talk on the Libratus agent a while back, and one of the most
interesting takeaways was that the behavior of the bot had already started to
impact the professional players, who now spontaneously bet large amounts to
force other players out of a hand. Were there any behaviors your agent
demonstrated that surprised you in the same way? What insights might we draw
from this _cooperative_ AI system that may have more general applicability to
other planning domains?

~~~
noambrown
In terms of Hanabi, this bot arrived at conventions that are pretty different
from how humans play the game. We invited an advanced Hanabi player to play
with the bot and he pointed out a few things in particular that he'd like to
start using. For example, humans usually have a rule that if your teammate
hints multiple cards of the same color/number, you should play the newest one.
The bot uses a more complicated rule: if the card you just picked up was
hinted then play that card, otherwise play the oldest hinted card. That gives
you way more flexibility to hint playable cards that would otherwise be tough
to get played.
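The two conventions can be written as small decision rules. This sketch is my
own rendering (the hand representation, ordering, and function names are
assumptions, not from the paper):

```python
# Hands are indexed oldest-first; newly drawn cards get the highest index.
# hinted_idxs is the set of positions touched by the most recent hint.

def human_convention(hinted_idxs):
    # Common human rule: among freshly hinted cards, play the newest.
    return max(hinted_idxs)

def bot_convention(hinted_idxs, newest_idx):
    # Bot's rule as described above: if the card you just drew was
    # hinted, play it; otherwise play the *oldest* hinted card.
    if newest_idx in hinted_idxs:
        return newest_idx
    return min(hinted_idxs)

# Example: 5-card hand (indices 0..4, 4 = just drawn), hint touches 1 and 3.
assert human_convention({1, 3}) == 3
assert bot_convention({1, 3}, newest_idx=4) == 1
# If the hint also touches the just-drawn card, the bot plays it.
assert bot_convention({1, 4}, newest_idx=4) == 4
```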

I think one important general lesson is that search is really, really
important. Deep RL algorithms are making huge advancements, but Deep RL alone
can't reach superhuman performance in Go or poker without search. Here, too, we
see that search was the key to conquering this game, and I think that will
hold true in more complex real-world settings as well. Figuring out how to
extend search to more complex real-world settings will be a challenge, but
it's one worth pursuing.

~~~
JoshTriplett
> For example, humans usually have a rule that if your teammate hints multiple
> cards of the same color/number, you should play the newest one. The bot uses
> a more complicated rule: if the card you just picked up was hinted then play
> that card, otherwise play the oldest hinted card. That gives you way more
> flexibility to hint playable cards that would otherwise be tough to get
> played.

I've definitely seen advanced Hanabi players use a more subtle version of that
rule: "If your hint looks like it's telling me to play my leftmost hinted
card, how long has that card been playable? If it could have been hinted for
play a long time ago, and it's just being hinted _now_, it must not be
playable. So what else must you mean...?"

That version of the rule allows for more subtle cases. Suppose you hint that a
player's second-from-the-left and fourth-from-the-left cards are both red. If
there _hasn't_ been an opportunity to hint the second-from-the-left since it
became playable, go ahead and play the second-from-the-left. If there have
been opportunities to hint second-from-the-left, play fourth-from-the-left.

That rule requires human players to model whether the other players' actions
in the interim have been "urgent" things that needed taking care of before
hinting them, or whether those other players _would_ have hinted them sooner
if their card was playable.
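That timing-based reading can be sketched as a decision rule too (the names
and the opportunity-counting abstraction are mine, not an established
convention spec):

```python
# Sketch of the timing-based reading above. hinted_positions holds the
# positions touched by the hint, e.g. {2, 4} for second- and fourth-from-
# the-left. missed_hint_opportunities counts earlier turns on which a
# teammate could have hinted the leftmost hinted card but did something else.

def card_to_play(hinted_positions, missed_hint_opportunities):
    leftmost = min(hinted_positions)
    if missed_hint_opportunities == 0:
        # Nobody had a chance to hint it earlier, so the default reading
        # holds: play the leftmost hinted card.
        return leftmost
    # It could have been hinted before and wasn't, so the hint must mean
    # something else: play the other hinted card instead.
    return min(hinted_positions - {leftmost})

assert card_to_play({2, 4}, missed_hint_opportunities=0) == 2
assert card_to_play({2, 4}, missed_hint_opportunities=3) == 4
```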

------
hooande
They (basically) applied the ideas from a bot that plays poker to another
game. It's interesting work, though perhaps not groundbreaking.

This idea of self-play + counterfactual regret minimization does seem to be
the superior way to solve game-theoretic problems. Identifying valuable
game-theoretic problems remains a challenge...

~~~
noambrown
The search algorithm shares a lot in common with our Pluribus poker AI
(https://ai.facebook.com/blog/pluribus-first-ai-to-beat-pros-in-6-player-poker/),
but we added "retrospective belief updates", which make
it way more scalable. We also didn't use counterfactual regret minimization
(CFR) because in cooperative games you want to be as _predictable_ as
possible, whereas CFR helps make you unpredictable in a balanced way (useful
in poker).
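For intuition on that last point: CFR's core update, regret matching,
deliberately mixes over actions in proportion to positive cumulative regret.
That mixing is what keeps a poker opponent guessing, and exactly what makes a
teammate's inference harder. A minimal, standalone sketch of regret matching
(illustrative only, not the Pluribus code):

```python
# Regret matching: play each action with probability proportional to its
# positive cumulative regret (uniform if no regret is positive).

def regret_matching(cumulative_regret):
    positive = [max(r, 0.0) for r in cumulative_regret]
    total = sum(positive)
    if total == 0:
        n = len(cumulative_regret)
        return [1.0 / n] * n
    return [r / total for r in positive]

# Mixed regrets yield a *mixed* (unpredictable) strategy...
assert regret_matching([3.0, 1.0, -2.0]) == [0.75, 0.25, 0.0]
# ...which is the opposite of what you want from a teammate whose
# actions you need to decode deterministically.
```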

The most surprising takeaway is just how effective search was. People were
viewing Hanabi as a reinforcement learning challenge, but we showed that
adding even a simple search algorithm can lead to larger gains than any
existing deep RL algorithm could achieve. Of course, search and RL are
completely compatible, so you can combine them to get the best of both worlds,
but I think a lot of researchers underestimated the value of search.

~~~
hooande
I just spent three weeks going through your research. Thank you for that work,
especially the supplementary materials. I wish I'd known how much the ideas in
the Pluribus paper depended on reading the Libratus paper.

I see what you're saying about the real time search (which took me quite some
time to understand). I came up with a way to do that from disk due to memory
limitations. It limits the number of search iterations but doesn't seem to
have a huge negative impact on quality so far.

Anyway, thanks again!

------
partingshots
They should tackle StarCraft II next, like DeepMind has with AlphaStar, or at
least a similar RTS with fog of war and a partially observable state.

~~~
jakobnicolaus
There are unique challenges around learning effective communication protocols
that appear in cooperative settings, which was the focus of this work. Getting
robust superhuman performance in SC2 remains an interesting challenge, though.

