Hacker News new | comments | ask | show | jobs | submit login
The Hanabi Challenge: A New Frontier for AI Research (arxiv.org)
150 points by iron0013 11 days ago | hide | past | web | favorite | 66 comments

Context for those that don't know the game:

1. Each player has a set of tiles that each have a particular color and a particular number.

2. The tiles a player has are known only to other players (they are placed to face away from the player)

3. On your turn, you may either place one of your tiles, or give a hint to another player about their tiles.

4. Hints can only take the form of "you have exactly K tiles of C color <indicate the K tiles>" or "you have exactly K tiles of N number <indicate the K tiles>". There are some rules for how many hints can be given and ways to "earn back" more hints to give.

4b. You can make more deductions than what the hints provide. There are a fixed number of each tile. e.g. There are 2 RED-4 tiles. So if you can see that two other players both have a RED-4 tile, this means that you cannot have a RED-4 tile. So if someone gives you the hint that you have a *-4 tile, you know it is not RED. But the two players with the RED-4 tiles cannot see their own, so they cannot make this deduction. But if you act in a way that shows that you have made the deduction, and they are really smart, they can observe your act and deduce that you would have only made that act if you knew that you didn't have a RED-4 tile. And since they can only see 1 other RED-4 tile, they can deduce that they must hold the other.

5. No other communication is allowed.

6. In the center of the table there are stacks for each color that start empty. The goal of the game is to populate the stacks by playing the color tiles of that stack in increasing number. (e.g. RED-1, RED-2, RED-3 ...). If a player tries to play a tile that is invalid (e.g. the RED stack is only up to RED-2 and the player selects their RED-4 tile), a counter is incremented.

7. The game ends when the counter reaches a threshold or all tiles are played. The score is equal to the number of tiles played. Higher is better.

I have long thought that it should be possible to write an AI that does superhuman hint-giving and deduction, provided that all of the AIs can collude beforehand. That is, if all of the AIs have a giant lookup table that essentially says "If the board is in this particular state, and I give hint H, that actually means FOO" where FOO contains more bits of information than hint H would by itself.

There's a paper from just a couple years ago about turning hanabi (with >2 players) into a variant of the 100-prisoner hat color problem: http://helios.mi.parisdescartes.fr/~bouzy/publications/bouzy...

For the 5 player variant specifically, the concept is as follows: Out of the 8 moves each other player can make (play any of 4 cards, discard any of 4 cards), you figure out the best move, add them up mod 8, and make one of 8 hints (you have a color or number hint available for each player). Then each of the 4 other players deduces which move you wanted them to make by subtracting out everyone else's best move. There's some room for improvement on figuring out how to avoid one player mistaking another's best play (because it might be predicated on them playing a card they don't know they have first), or how to hint in ways which avoid that.

A friend has put up an implementation up on github: https://github.com/chikinn/hanabi/blob/master/players/hat_pl...

Nice! Any intuition on whether this is practical for humans to implement in a game?

(Side note: in practice, it's almost impossible to avoid the illegal side-channel communication of watching the other players get nervous as you start to do something stupid, so really the only rules-faithful way is to play over a computer.)

(Another side note: there's a game called The Mind[1] which is the board-game version of SleepSport -- the goal is to, without most kinds of communication, play cards from each players hand in increasing order. One way to win is to synchronize everyone's internal clock and play each card when that many seconds have passed.)

[1] https://boardgamegeek.com/boardgame/244992/mind

[2] https://stackoverflow.com/questions/6474318/what-is-the-time...

(I think you meant to refer to "the Mind". "The Game" looks similar in some ways -- but your description is for "the mind".)

For a game that - in principle - should be no fun at all, The Mind is terrific.

Good catch! Fixed!

Played The Mind today with workmates. It was uncanny how everyone independently arrived at the sleep sort strategy independently and without cooperation.

In the linked paper they actually run two bots that use such strategies (they call them "hat strategies") on their platform and show how they do against some AIs.

The WTFWThat bot achieves the rather remarkable average score of 24.89 and 91.5% perfect games in 5-player self-play (that is, all 5 seats are WTFWThat bots).

That's very clever but feels like breaking the game slightly.

In the game Bridge, which involves communication between competing pairs of players, you can communicate a suit and a number, but it is meant to have some kind of meaning within the game, so I think strategies like this would be illegal.

There's no requirement for the bidding to be natural, there are plenty of artificial bidding systems -- the only constraints are that the bidding system you are using must be known to your opponent (to enforce this, I believe that on a bid your opponent can ask your partner what this bid is supposed to communicate), and that you do eventually have to play your bid so it can't match your cards too badly.

This is why calling Hanabi a "game" is generous. It's largely just an exercise in mathematics disguised as a game. There is only one real player: the person declaring the rules of deduction for everyone to follow.

It's a game, because why do you think everyone is going to follow the same rules?

Well if you play differently from how everyone else expects, you all suffer for it. Most games reward you for thinking up something creative that the other players missed; Hanabi punishes you instead. You have to hew to how you played in previous games and play as boringly as possible.

Incorrect. The game is played by one player creating a finite state machine description. That player wins if they convince 4 people to follow it. The other players win if they realize they aren't playing a game.

One of the interesting aspects of Hanabi is that there really isn't enough turns to fully hint. So with every hint, you must ask yourself not just "Why is s/he telling me this?", but "Why are they telling me this now?" At least in the group I play with, the hint is presumed to be immediately useful, but you may have to make some deductions and take some risk to make a play.

Hints made to other players can matter, too. Suppose you play after me, and I make a hint that you can see would make the next player play an unplayable card. You know that I wouldn't make that hint unless you had some course of action that would fix the situation. Since spending a hint to make you fix things via a hint is silly, that leaves playing the card you'd otherwise discard. So you can interpret a hint for an unplayable card you can see as a hint to play the oldest card in your hand.

And it can go a level deeper, where the card you played blind is a 5 (or some other card incompatible with actually making the hinted card playable). Since they see that you thought the card wasn't playable, and then it didn't become possibly playable after your play, they now know that the card isn't playable and can be safely discarded. This isn't actually much better than just direct hinting, though, so the straightforward play is higher value. Both have advantages over straight hinting, though.

> 5. No other communication is allowed.

When playing with friends / casuals I've found that this is the one that trips people up (if you want to play strictly by the rules) - no, you can't ask if there are any reds left. You can't ask what hint Alice gave to Bob on her previous turn. You can't just say "Only one yellow 4 left, let's be careful" etc.

> You can't ask what hint Alice gave to Bob on her previous turn

This sounds like reading the rules too stringently and not by their intent if you ask me, since it is asking only about information that the person is supposed to know.

It would be like saying if someone asks what the rules of the game are you aren't allowed to answer because none of the rules explicitly state that you can convey information about the rules of the game.

The point is that the other player may wonder: "why are they warning me about that rule right now? Do they see that based on how we are playing, someone is going to misplay soon?"

That leaks the players information state. Proper Hanabi is played with AI, who can't feel how boring the game is when played correctly.

There is at least one game where knowledge of the rules is restricted from the players: https://en.m.wikipedia.org/wiki/Paranoia_(role-playing_game)

Considering we are discussing card games, Chairman Mao would have been the more logical game to bring up. Unlike paranoia, but like Hanabi, it's a game for insufferable people.

I don't know about you, but I play Hanabi with my colleagues and they aren't insufferable.

When playing with casuals I actually don't mind communicating deductions that could strictly be made with a very simple application of the rules, such as your example. If someone discards a RED-4, there isn't a deep chain of logical reasoning to arrive at the conclusion that only 1 is left.

The best to teach casuals, I have found, is to play a round where you actually invert it have the experience player announce all of their reasoning (to demonstrate to the casuals the type of deductions that can be made).

That's true, for teaching purposes it does help to simplify, as long as people know that they're not playing by the "real" rules (so when they play again without you, they play correctly). I don't remember the specific details, but the times I've seen it be an issue are when people give different answers to a question since they've each deduced different things about the board state.

It's very hard to strictly play like that though - for example, players are allowed to look through the discard pile at any time, but that act is observable to the other players and may communicate information!

Thanks for your summary of Hanabi! You can find an example of your hypothetical AI in our recent paper: https://arxiv.org/abs/1811.01458. Note that all the conventions and rules are learned though, rather than hand coded. And your 'giant lookup table', is just a feedforward neural network. We also link out 300 games in case you are curious to see how it works.

I've only Hanabbed it up a handful of times (got 25 w strict play, no mistakes w my brother over Christmas!), but I've been of the conviction that there is too much emphasis on these "conventions," which either require a pregame explanation and agreement, or more often in-game argument (which is against the rules). I prefer an emphasis on card holding that constantly displays your knowledge of your own hand. To me this isn't against the communication rule, because the cards have to be on display. They can be held sideways, higher or lower, upside down (The cards are orientable on both faces, people don't notice this!), I hold them between certain fingers to remember number values. While that's just another convention, one can observe it develop from the beginning game state, and every time a player receives a clue, how they redo their hand. In summary, card holding is very effective/information rich and can be deduced without words, and seems to me perfectly natural to the game. This also relieves a lot of the cognition devoted to remembering what other people know about their own hands. When discussing a UI for a digital version of Hanabi, my brother and I discussed mostly how customizable the card holding would be, which would be exactly reflected to the other players.

Thanks for the paper, that is really interesting.

For those who read my previous comment, an example of pre-game communication collusion that the AI in the paper invented was to decide that any hint involving red or yellow _also_ means that your most recently acquired tile is immediately playable.

"Roughly 40% of the information is obtained through conventions rather than through the grounded information and card counting"

(what I have been describing as "collusion" the paper describes as "conventions" but it amounts to the same thing -- you can pass a lot more information than the hints imply if you can plan ahead)

That is super fascinating.

yes - this was the focus of our method: Allowing agents to interpret the actions of others, while also learning to be interpretable when observed by other agents.

Deep Mind continues the venerable tradition of defining their own frontiers of AI research based on what will sound good in a PR release when they inevitably achieve superhuman performance by modests means of a single datacenter stuffed with TPUS and $10K video cards.

I guess we should just accept that in the society of the future robots will play board/computers games and draw paintings while humans will all be slavering on integration of endless badly written web services.

Before you can solve general and difficult problems, it is often useful to solve toy versions of the same problems with many of the parameters simplified, or only focusing on individual aspects of the full problem at a time.

It would be incredibly difficult to attempt to solve the general problem of having an AI making decisions in the world, including taking into account the goals and mind-states of other humans (and general agents).

Much more practical to restrict the problem space very narrowly - can you build something which can learn to strategize and predict future outcomes in a difficult, full-information competitive game? (Go) Okay, can it do something similar without complete information, and with less discrete-looking steps and game states (1v1 Dota) - can it do so whilst cooperating with other similar agents with asymmetric abilities but common knowledge to achieve a common goal? (5v5 Dota)

Okay, now can it cooperate with other agents with symmetric abilities but asymmetric knowledge, and both convey necessary knowledge to other agents via its actions, as well as deduce knowledge from the actions of other agents? (Hanobi)

These are all necessary steps on the path to creating a general AI - if your AI is not at least able to do the above, it definitely won't be able to navigate a general environment. In the absence of an obvious alternative strategy, it seems to me the most productive approach is to tackle the small, important-but-doable seeming problems, one at a time, until hopefully you are able to unify them or else have a breakthrough of some sort.

This is so cynical it is laughable. They are using games to develop algorithms as they are easily defined problems, but they have in turn used those algorithms on real world problems, notably AlphaFold and their data center improvements [1]. I'm sorry breaking world records every year isn't quite impressive enough for you.

Furthermore, their investment into TPU's is going to bring the price down in the long run, which is incredibly beneficial to everyone. And yes we should assume if AGI is possible it will require a lot of compute.

[1] https://deepmind.com/blog/deepmind-ai-reduces-google-data-ce...

I am cynical for a reason.

>They are using games to develop algorithms as they are easily defined problems

I believe they use games because it makes good PR. I believed that before they chose StarCraft, but now I am more convinced in it.

Also, because their algorithms are extremely data-hungry and it's easy to generate tons of training data for a game simply by running it. And because it is possible to evaluate performance of an algorithm over and over again.

>they have in turn used those algorithms on real world problems, notably AlphaFold

AlphaFold does not use their board game AI[1]. Naming it similarly is just a marketing tactic.

AlphaFold might be a genuinely useful application of their resources, in which case they deserve praise in credit for it. I have to do more research on it, which is made very hard by all the hyperbole and hype around their work.

>Furthermore, their investment into TPU's is going to bring the price down in the long run

Overall, they do the exact opposite of democratizing AI. They invest heavily into insanely data-hungry and hardware-hungry approaches that a normal company will not be able to use (directly, not through Google's cloud services) for decades if ever.

> And yes we should assume if AGI is possible it will require a lot of compute.

What you're saying here (overall) does not make sense. There is absolutely no reason to assume that breaking records in games without substantial improvements in AI theory makes us any closer to "AGI".

Ignoring the completely unsubstantiated statements about AGI, we are left with the question of practical applications. I am not convinced that the societal benefits of investing in complex and computationally intensive machine learning is higher than investing in improvements of simpler algorithms. Most practical applications of AI outside of a few ubercorps involve simpler, older algorithms. Even someone as big as SAP relies mostly on decision trees and SVMs.

If anything, the hype around deep learning likely means it's harder to get funding for old-school AI research that is far likely to benefit smaller companies.


[1] http://predictioncenter.org/casp13/doc/CASP13_Abstracts.pdf Page 11, A7D.

That would make for a good novel :))) we’ve always expected superhuman AIs to free us from work, but maybe we’ll be the ones working while they’ll have all the fun.

Damn, i wish general AI never acquires personnal taste !

Never thought I'd see my github repo cited in a DeepMind/Brain paper. I still have SOTA for >2 player by a large margin :)

I'd be pretty impressed if they beat my 5p strategy without Hanabi-specific stuff, but I assume that's not their goal. Good luck to DeepMind/Brain either way!

Authors: would love to chat sometime if you're reading this! If I get back to Hanabi botting, I might try to use a learned policy to replace some of the random (very non-optimized) heuristics my bot had.

Absolutely, please shoot me an email. Did I mention that we link out random games our bot played in the BAD paper? Sorry for the late reply!

From the paper:

> The main difference in Hanabi is that there is no “cheap-talk” communication channel: any signalling must be done through the actions played. It is therefore more similar in spirit to learning how to relay information to partners in the bidding round of bridge.

Jakob, why did you choose Hanabi instead of bridge itself? Bridge is obviously much more widely known and seems to present similar challenges.

To add on to the reasons mentioned by others (Hanabi is simpler, purely cooperative, etc), I'll point out that Hanabi is fairly popular with young CS types, especially the ones interested in ML and AI, much more so than Bridge which is understandably seen as old fashioned. My source for this statement being that I'm friends with a lot of such people and was roommates with someone who now works for Deep Mind, and have learned first hand and shared in the generalized Hanabi obsession. Perhaps it's just my localized experience and I'm exaggerating the popularity of the game. But I think it's telling that I wasn't the least big surprised when I saw this game mentioned as the next candidate Deep Mind would tackle.

Does it much matter that the topic being studied is one you have experience with and a passion for? Perhaps more than we might think.

Hanabi is fully cooperative and entirely focused on communication. I think it's good to have a testbed that isolates these challenges, rather confounding them with the zero-sum (competitive) aspect of Bridge. Having said this, I do believe Bridge is a good testbed to put the different methods together.

Hanabi is easier - fewer hidden states, purely cooperative, and can be played with only two players.

(I'm aware that uncontested bridge bidding is a thing, but it's still harder than 2p Hanabi)

I didn't know what Hanabi was, so I looked it up: https://en.wikipedia.org/wiki/Hanabi_(card_game)?oldformat=t...

One part of the game's description that immediately jumped out at me was the need to pass information efficiently to other players, without directly giving them the world state. This appears to be a generally similar problem to one described in a recent openai paper: https://blog.openai.com/interpretable-machine-learning-throu...

Am also working on a Hanabi game for the web. WebSocket Multiplayer, HTML5 Canvas2D, highly performant. Should play well in all modern browsers including mobile. Goal is to have a full AI platform for agents as well as competitive human play ;)

Any suggestions for implementing complex Hanabi game rules and logic are appreciated! My gut feel is that an omniscient agent would converge on an optimal policy in all current game states.

Based on my experience maintaining a Hearthstone simulator and helping contributors write AI agents for it, the authors' interest in "imperfect information" is spot on.

But that's just one major hole in what the authors describe as "superhuman" AI demos in games like chess and go. The other is that those games don't look at all like most conventional digital games people enjoy playing.

Go looks like matrix math. A neural program has a chance of learning an internal simulation of the game. There's no chance of that for DOTA 2 or StarCraft II, which is why those wins by reinforcement learning AI bots were so impressive.

However, those games (DOTA 2 and SC2) suffer from big rewards for superhuman dexterity. Just a few seconds of excess dexterity (not strictly superhuman dexterity) can tip the game to an absolute win for an AI bot. So most conventional performance diagnostics, like winrates or Elo scores, greatly understate the role of dexterity in those games.

Hanabi seems like a great reaction to where the games AI research is going. Hanabi will be easier to simulate than a game like Hearthstone, so there will be accelerated implementations or neural programs of it. However, the authors criticize something that might be vital to Hanabi research:

> "However, it is impractical for human Hanabi players to expect others to strictly abide by a preordained policy."

That seems unfair. If you're going to research an idiosyncratically cooperative game, the most competitive strategies are going to involve preordained (meaning coordinated prior to a match) policies. Why turn your back on the path towards strategy development that obviously leads to the best play?

The biggest problem with Hanabi is that, as a cooperative game, we have no real idea what great strategies for it are. I know the authors claim that they do. But for DOTA 2 and StarCraft 2, PvP games with immense popularity, we have a credible natural laboratory for strategy and skill development. Hanabi is really obscure by comparison.

The biggest problems with Hearthstone, which in my opinion is one of the best games to research, are that (1) the rules are way too complex for a neural program to learn them, so you need novel approaches for learning, (2) the rules are too complex to execute on a GPU, so you need novel approaches to parallelized computing and optimization, and (3) the deckbuilding meta and some pay-to-win design have stunted strategy development, so outperforming a human opponent may be way too easy.

There are so many problems to solve, and it's not that credible that neural networks are the path forward!

I think neural networks will be part of the solution, but they are probably not the entire answer. For an example of a method that combines Deep RL with Bayesian reasoning, you can take a look at our recent paper (https://arxiv.org/abs/1811.01458). BAD achieves best known scores in for 2 player Hanabi in self-play.

> where BAD achieves an average score of 24.174 points in the two-player setting, surpassing the best previously published results for learning agents by around 9 points and approaching the best known performance of 24.9 points for (cheating) open-hand gameplay

I didn't realize Hanabi is already that close to being solved.

My gut reaction is that the game is a lot simpler than it appears. I guess your simpler "matrix" game points to that--you already had an intuition for reducing Hanabi. Indeed, looking at the code you share for the "matrix game," it would seem that Hanabi's problem is that, like Chess and Go, it doesn't really resemble more sophisticated games as much as it resembles something that can be literally expressed in Tensorflow.

The good news is that we have open-sourced the environment, so if you think it's easy I would love to see a simple method that solves it.

So far it seems to me like research on game-playing AI has not carried over well to solving problems that people care about. Deepmind's success in applied settings doesn't really seem to have benefitted from any meaningful way from the game-playing research as far as I can tell. What are the best examples of this so far that people are aware of?

The problem is that the expectation of being able to transfer this stuff over is more hype driven than anything else. People hear "AI beat the best person in the world at something intellectually hard" and think that it should be smarter than people in other ways too.

That doesn't diminish work on playing games. If they had released a chess or go challenge where the board is just bigger or more pieces, that would be dumb. But this challenge is a game that is a tiny bit closer to real "problems that people care about." Solving this won't get us to problems that people care about either, but it'll get us closer. It's only an incremental step, but that's okay.

Yeah I agree. I just wanted to highlight that to me the idea that doing better at games is advancing AI in a meaningful way is definitely overhyped.

Sometimes I think that the progress in games seems kind of orthogonal to progress in using machine learning to solve real world problems, because anytime you have a game it automatically gets you essentially infinite labeled training data set (each game has a score/outcome, and there are essentially infinite possible games). So as long as the compute scales up enough, any game humans can play will be solvable.

I wouldn't say that makes games orthogonal to real world problems. That's what makes them good stepping stones. Risk free "cheap" testing makes for fast research.

I totally agree about the ability to just skirt sample complexity. It's a tough one, made tougher by how early stage this work really is. We want bots to be able to match human ability and match human learning. Though they're put together, they're have very separate concerns.

For matching human ability, we're just beginning to learn techniques to get bots able to master hard tasks (e.g. incomplete information games, atari games, picking objects up). Those bots mostly learn waaaaaay slower than people. But never mastering is worse than slowly mastering, so it's early days.

On the other hand, you have people working on efficient learning. This is the question you're getting at with compute scaling arbitrarily-ish. It's more impressive if it can master a game after only playing it a small number of times. People are definitely working on this too, but for even simpler tasks. There's a lot of work right now in contextual bandits on learning fast, and that's a kind of baby-RL task. Even there, simulation tasks are super important because you really need a counterfactual to say whether you're doing well compared to alternatives.

I don't think it's orthogonal. Instead, there is a bias in news reporting. If someone manages to put one factory worker out of a job due to automation, it's not worth a headline. If someone beats the top human at some game, there'll immediately be a headline. People are being put out of jobs by machines as we speak.

Yes but computers can't process infinite data. Go has more possible state than atoms in the galaxy. It is not just about the data it is about novel solutions for interpreting the data and predicting useful states.

I mentioned this in another comment but two notable applied applications would be AlphaFold [1] (greatly outscoring SOTA on major protein folding problem), and their data center efficiency [2] which has reducing their data centre cooling bill by 40%. I expect to see many more applications in the coming years as the algorithms get to a more useful point. I think (as does Demis Hassabis) there are going to nearly endless applications in biosciences, weather prediction, material design, and much much more.

[1] https://deepmind.com/blog/alphafold/ [2] https://deepmind.com/blog/deepmind-ai-reduces-google-data-ce...

OpenAI claims that their work on Dota2 has carried over to work on a robot hand: https://blog.openai.com/learning-dexterity/

(Which isn't quite practical yet but might be in the mid-term future).

Hanabi seems interesting.

I would say that Poker with 3-9 players and heterogeneous player quality also contains similar aspects. You can improve the play when you recognize the skill difference, how people play against other players, changes in impulse control, magical thinking etc.

Yes, but actively communicating with some of the other players through agreed conventions would probably count as collusion and be illegal in N-player poker..

I was not talking about collusion.

You get information of how people play just by observing their play completely legally.

Sure, but in Hanabi the point is to be as informative as possible, while in poker it should be the opposite (unless you collude).

I think that they should focus on educational games that teach basic language and math. That may be very challenging but if anyone can pull that off it could lead to some really general purpose abilities.

I tried playing Hanabi once with someone who has ADHD, a small working memory, and is borderline on the autism spectrum so doesn't have a super great theory of mind. It was excruciating.

I’ve done nearly the same and it was spectacular. Different people are different people.

Just discovering the game but the imperfect information and convergent nature of the game looks like a good fit for a monte-carlo based algorithm.

I wonder why they created their own OpenAI-like interface instead of inheriting from OpenAI's environment?

Hanabi is a multi-agent problem. Unfortunately gym doesn't natively support multi-agent action and state spaces.

Two reasons come to mind:

DeepMind and OpenAI are indirectly in competition with each other.

The Gym interface is okay but the implementation is poor and does not readily support multiple agents.

Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact