An Introduction to Counterfactual Regret Minimization

kornish · on Feb 3, 2017

A warning to the reader: only a small fraction of the readings are actually available inline as PDFs; the rest seem to come from a handout which is not provided and the textbook Essentials of Game Theory.

Otherwise, seems like a cool foray for anyone interested in game theory.

Edit: I take it back! The Introduction to Counterfactual Regret Minimization pdf is actually here: http://cs.gettysburg.edu/~tneller/modelai/2013/cfr/cfr.pdf

usgroup · on Feb 3, 2017

Does anyone else read "counterfactual regret" and think "well that pretty much sums up most of why I'm ever anxious"?

GlennS · on Feb 3, 2017

Maybe you could try the learning program's approach?

Think about the memory that triggers regret/shame, then decide whether and what you should do differently in the future to make better decisions in similar situations.

After that, maybe you you can look back on the event as a "lesson well learned" instead of "ohmygodohmygodohmygod that's so embarassing I can't believe I did that I'll never open my stupid mouth again".

(No idea if this actually works.)

meesterdude · on Feb 3, 2017

well, its healthy to reflect on our past and try to do better in the future; it is not so to reflect on our past and use it as a means to punish us in the now or in the future.

The trouble stems from thinking you should have done something, and did not, and that you are somehow infallible. In truth, there are a lot of things one COULD do, but few are absolute SHOULD's.

My cat died very recently - and I've been working through the thinkings of if I could have been a better owner, if I reciprocated his unwavering love enough, if I could have caught his sickness sooner, if I was "good enough" and so on.

And yes. There ARE things I could have done differently. But I'm also a human being, and am not a saintly creature without fault or error or oversight. So to expect this ideal from myself in response to what I see as a faultless cat, is cruel punishment to dispense to oneself.

_rjlt · on Feb 3, 2017

Oh man, tough luck on losing your cat. My baby's reaching 4 and I can't imagine being without her. Best wishes to you.

TeMPOraL · on Feb 3, 2017

Yeah. That's the reason I actually clicked on the link.

Udo · on Feb 3, 2017

It seems to me that "regret" is just a different name for an objective/loss function that reflects the merits of an action over many events. It's something that you would consider anyway when designing certain algorithms. If that hunch is correct, I wonder how many people are using "CFR" without being aware of the proprietary name.

mijoharas · on Feb 3, 2017

I've also been struggling with this after reading a little bit about these regret minimisation models.

Can someone explain how there is any difference in these CFR model from a normal reinforcement learning algorithm where you just take your loss function as the "regret" in CFR terms?

mikebjohanson · on Feb 3, 2017

Sure. Like I mentioned in a later post, I'm an author on several of the CFR papers. It is related to RL, and there are a few ways of interpreting CFR. If you have an RL background, then CFR is kind of "RL using an advantage function, but it converges in adversarial games." If you have a multiarm bandits background, then CFR is kind of "An independent multiarm bandit at each decision point, all learning at the same time."

CFR differs from standard RL in a few ways:

- Exploring the tree. CFR has many variants, many of which focus on using sampling to trade off between fast, narrow, noisy updates versus slow, broad, precise updates. The most sampling-heavy variant, called Outcome Sampling, traces a single path through the tree, as you would in RL. However, it doesn't converge nearly as quickly as other variants, like Vanilla (explore the whole tree on each iteration) or Chance Sampling (sample the cards, explore the whole action space). Since those variants consider all paths through the tree, you don't have an explore/exploit tradeoff anymore, as you normally would in RL.

- The "Counterfactual" part. Regret updates are weighted by the other agents' probability of taking you to a decision point. Or, equivalently: updates are weighted by the probability that you reach a decision, if you are trying to reach that decision (and all of your action probabilities along the way are 1). If you use Outcome Sampling (described above) then by sampling all actions that weight becomes 1, and thus less important. But for fast learning, the weighting becomes important.

- The strategy doesn't converge, but the average strategy does. Even in the limit, in an adversarial setting where all players are learning, the strategy produced from the regret values doesn't converge to Nash. If it was a single-player problem (either no opponents, or static opponents that can be treated as part of the environment), then as in RL, the strategy would converge to an optimal response to the environment. But in an adversarial setting, you also need the average strategy, which is the only part guaranteed to converge to optimal.

fryguy · on Feb 3, 2017

Have you read the book https://en.wikipedia.org/wiki/Blondie24. It seems like both of your strategies have similar high-level approaches of self-play and learning, although yours is much more nuanced. They are using an neural network that uses an evolutionary algorithm to update the neural network based on which ai's win against the other ones in self-play.

eutectic · on Feb 3, 2017

Is there a version of CFR for differentiable learners like neural networks?

gwern · on Feb 3, 2017

The averaging part makes it sound like the usual RL self-play against regular checkpoints of oneself.

VodkaHaze · on Feb 3, 2017

CFR is based on the method of "regret matching", which is a policy update method. Roughly, it says every iteration, the probability you take an action is equal to the fraction of its cumulative counterfactual regret over total cumulative counterfactual regret for all actions.

The cool thing about it is that it provably converges to nash equilibrium in zero sum game, and to correlated equilibria in non-zero sum games.

VodkaHaze · on Feb 3, 2017

Not really, because CFR is not only taking regret as the loss function, but also the method of "regret matching". That is making the mixed strategy probabilities in the next iteration equal the cumulative counterfactual regret (which you keep track of while iterating).

noelwelsh · on Feb 3, 2017

Since you seem to know what I'm talking about, is the innovation here i) a new objective function (counterfactual regret) and ii) a method to optimise that objective (regret matching)? I'm familiar with bandit algorithms and reinforcement learning, and on a very quick skim could not work out the exact difference here.

shankun · on Feb 3, 2017

When I read "counterfactual regret minimization", I thought this would be about Kellyanne Conway.

Dowwie · on Feb 3, 2017

I assume this has something to do with an AI beating world class poker champions recently..

poppingtonic · on Feb 3, 2017

Yes, I looked it up and found this source quite useful for getting a high-level idea of what the CMU team accomplished.

throwanem · on Feb 3, 2017

Darn! I was hoping for some sort of exciting new historiographical technique.

godmodus · on Feb 3, 2017

This gave and my Buddy a great laugh! Thank you! (were doing min max problems this semester! I will link this to my professor!)

CaliforniaKarl · on Feb 3, 2017

I'm sorry I read this?

BugsJustFindMe · on Feb 3, 2017

Then next time you should skip the technical details and just read Michael Johanson's lay summary, so that you can be less sorry in the future: https://www.quora.com/What-is-an-intuitive-explanation-of-co...

mikebjohanson · on Feb 3, 2017

Hey, thanks for the mention! I've read hacker news for years, this seems like a good reason do de-lurk. If anyone has questions, I'm happy to answer them. I wrote that summary for a pretty broad target audience, and the technical details are fun to explore.

semi-extrinsic · on Feb 3, 2017

This was really good, but one detail stood out to me: the average strategy found by a CFR converges to a Nash equilibrium strategy. For a two-player, zero-sum game, we know this is an optimal strategy.

But what about a game that's not two-player and zero-sum? Are CFRs still useful?

VodkaHaze · on Feb 3, 2017

In multiplayer zero sum game, it still works to Nash.

Intuitively, in a zero sum game, running CFR on one player will find him best responses for the strategies he's playing against. Doing this for multiple players finds corresponding best responses, which is the definition of a Nash Equilibrium.

In non-zero sum game CFR only converges to correlated equilibrium, which is a much weaker equilibrium than a Nash equilibrium.

mikebjohanson · on Feb 3, 2017

Not quite: it really does only go to Nash in a 2p zero sum game.

In a multiplayer zero-sum game, there's no theoretical proof that it should go to Nash. In the tiny 3p game of 3p Kuhn poker (3 players, 4 card deck, enough chips for one bet) we found that it does go to Nash (here's the paper: https://webdocs.cs.ualberta.ca/~games/poker/publications/AAM...). But in a slightly larger toy game, 3p Leduc poker, we empirically found that it does not. Like you suggested, we did this with best response calculations. If you take the strategy profile CFR produces and then compute a best response in each position, those best responses each improve over the strategy profile. If you plot that loss over time while CFR is running, it becomes clear that it does not converge to Nash: it eventually bottoms out and does not improve further.

However - in practice, the strategies produced by CFR for multiplayer games still appear to be quite strong, even if not Nash. This was the approach used by the University of Alberta for years in the Annual Computer Poker Competition, and it crushed the competition there. In 3p Limit Hold'em, the CFR strategies also held up well against human pros in informal testing by the U of A.

Nash equilibrium isn't a compelling goal in a multiplayer game anyways. In a 2p game it theoretically guarantees that you earn at least the game value (i.e., do no worse than tie if you rotate seats every game). In a multiplayer game, even if you could compute one, there's no such theoretical guarantee, when all other players at the table are independently choosing strategies. Even if everyone at the table uses an independently computed Nash equilibrium, that set of strategies is not necessarily itself a Nash equilibrium (in 2p zero sum games, it is).

To summarize - for multiplayer, it doesn't necessarily go to Nash. But you might not care if it does or not, and it works great in practice even though it doesn't go to Nash.

In a non-zero-sum game, you're right that CFR converges to a correlated equilibrium. But we can also theoretically bound how close it gets to a Nash for the non-zero-sum game, and empirically measure how close it gets by using best response analysis. The answer is - if it's only slightly non-zeo-sum (e.g., subtracting the rake in a poker game for negative sum, or pretending to add money to the pot to encourage aggressive play) then CFR still gets very close to Nash.

VodkaHaze · on Feb 4, 2017

Thanks!

ikeboy · on Feb 3, 2017

I am now super curious to have a technical understanding of why it's the average strategy that converges and not the strategy itself.