
An Introduction to Counterfactual Regret Minimization - poppingtonic
http://cs.gettysburg.edu/~tneller/modelai/2013/cfr/index.html
======
kornish
A warning to the reader: only a small fraction of the readings are actually
available inline as PDFs; the rest seem to come from a handout that isn't
provided, and from the textbook _Essentials of Game Theory_.

Otherwise, seems like a cool foray for anyone interested in game theory.

Edit: I take it back! The _Introduction to Counterfactual Regret Minimization_
pdf is actually here:
[http://cs.gettysburg.edu/~tneller/modelai/2013/cfr/cfr.pdf](http://cs.gettysburg.edu/~tneller/modelai/2013/cfr/cfr.pdf)

------
usgroup
Does anyone else read "counterfactual regret" and think "well that pretty much
sums up most of why I'm ever anxious"?

~~~
GlennS
Maybe you could try the learning program's approach?

Think about the memory that triggers regret/shame, then decide whether and
what you should do differently in the future to make better decisions in
similar situations.

After that, maybe you can look back on the event as a "lesson well
learned" instead of "ohmygodohmygodohmygod that's so embarrassing I can't
believe I did that I'll never open my stupid mouth again".

(No idea if this actually works.)

~~~
meesterdude
Well, it's healthy to reflect on our past and try to do better in the future;
it is not healthy to reflect on our past and use it as a means to punish
ourselves in the present or in the future.

The trouble stems from thinking that you should have done something, didn't,
and that you somehow ought to be infallible. In truth, there are a lot of
things one COULD do, but few are absolute SHOULDs.

My cat died very recently - and I've been working through whether I could
have been a better owner, whether I reciprocated his unwavering love enough,
whether I could have caught his sickness sooner, whether I was "good enough,"
and so on.

And yes. There ARE things I could have done differently. But I'm also a human
being, not a saintly creature without fault or error or oversight. So to
expect that ideal from myself, in response to what I see as a faultless cat,
is a cruel punishment to inflict on oneself.

~~~
mkaziz
Oh man, tough luck on losing your cat. My baby's reaching 4 and I can't
imagine being without her. Best wishes to you.

------
Udo
It seems to me that "regret" is just a different name for an objective/loss
function that reflects the merits of an action over many events. It's
something that you would consider _anyway_ when designing certain algorithms.
If that hunch is correct, I wonder how many people are using "CFR" without
knowing it by that name.

~~~
mijoharas
I've also been struggling with this after reading a little bit about these
regret minimisation models.

Can someone explain how these CFR models differ from a normal reinforcement
learning algorithm where you just take your loss function as the "regret" in
CFR terms?

~~~
mikebjohanson
Sure. Like I mentioned in a later post, I'm an author on several of the CFR
papers. It is related to RL, and there are a few ways of interpreting CFR. If
you have an RL background, then CFR is kind of "RL using an advantage
function, but it converges in adversarial games." If you have a multi-armed
bandits background, then CFR is kind of "an independent multi-armed bandit at
each decision point, all learning at the same time."

CFR differs from standard RL in a few ways (a toy code sketch follows the list):

- Exploring the tree. CFR has many variants, many of which use sampling to
trade off between fast, narrow, noisy updates and slow, broad, precise
updates. The most sampling-heavy variant, called Outcome Sampling, traces a
single path through the tree, as you would in RL. However, it doesn't converge
nearly as quickly as other variants, like Vanilla (explore the whole tree on
each iteration) or Chance Sampling (sample the cards, explore the whole action
space). Since those variants consider all paths through the tree, you don't
have an explore/exploit tradeoff anymore, as you normally would in RL.

- The "Counterfactual" part. Regret updates are weighted by the other agents'
probability of taking you to a decision point. Or, equivalently: updates are
weighted by the probability that you reach a decision _if_ you are trying to
reach that decision (i.e., all of your action probabilities along the way are
1). If you use Outcome Sampling (described above), then through the sampling
that weight effectively becomes 1, and thus less important. But for fast
learning, the weighting becomes important.

- The strategy doesn't converge, but the average strategy does. Even in the
limit, in an adversarial setting where all players are learning, the current
strategy produced from the regret values doesn't converge to Nash. If it were
a single-player problem (either no opponents, or static opponents that can be
treated as part of the environment), then, as in RL, the strategy would
converge to an optimal response to the environment. But in an adversarial
setting, you also need the average strategy, which is the only part guaranteed
to converge to optimal.
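
To make those three points concrete, here's a minimal chance-sampled CFR for
Kuhn poker in Python. This is my own toy sketch of the algorithm as described
above, not code from any of the papers (names like `Node` and `cfr` are just
illustrative): it samples the deal each iteration, walks the entire betting
tree, applies the counterfactual weighting from the second bullet, and reports
the average strategy from the third.

    import random
    from collections import defaultdict

    # Kuhn poker: 3-card deck (J=0, Q=1, K=2), one-chip ante, actions are
    # "p" (pass/check/fold) and "b" (bet/call).
    N_ACTIONS = 2

    class Node:
        def __init__(self):
            self.regret_sum = [0.0] * N_ACTIONS
            self.strategy_sum = [0.0] * N_ACTIONS

        def strategy(self):
            # Regret matching: act in proportion to positive cumulative regret.
            pos = [max(r, 0.0) for r in self.regret_sum]
            total = sum(pos)
            return [p / total for p in pos] if total > 0 else [0.5, 0.5]

        def average_strategy(self):
            # The *average* strategy is the part that converges to Nash.
            total = sum(self.strategy_sum)
            return [s / total for s in self.strategy_sum] if total > 0 else [0.5, 0.5]

    nodes = defaultdict(Node)

    def cfr(cards, history, p0, p1):
        # Returns expected utility for the player to act; p0/p1 are each
        # player's probability of playing to reach `history`.
        player = len(history) % 2
        if len(history) > 1:  # terminal histories
            higher = cards[player] > cards[1 - player]
            if history[-1] == "p":
                if history == "pp":               # check-check: showdown for 1
                    return 1 if higher else -1
                return 1                          # opponent folded to a bet
            if history[-2:] == "bb":              # bet-call: showdown for 2
                return 2 if higher else -2

        info_set = str(cards[player]) + history   # private card + public history
        node = nodes[info_set]
        strat = node.strategy()

        util = [0.0] * N_ACTIONS
        node_util = 0.0
        for a, act in enumerate("pb"):
            if player == 0:
                util[a] = -cfr(cards, history + act, p0 * strat[a], p1)
            else:
                util[a] = -cfr(cards, history + act, p0, p1 * strat[a])
            node_util += strat[a] * util[a]

        # The "counterfactual" part: regrets are weighted by the OPPONENT's
        # reach probability, the average strategy by our own.
        opp_reach = p1 if player == 0 else p0
        my_reach = p0 if player == 0 else p1
        for a in range(N_ACTIONS):
            node.regret_sum[a] += opp_reach * (util[a] - node_util)
            node.strategy_sum[a] += my_reach * strat[a]
        return node_util

    for _ in range(200_000):
        deck = [0, 1, 2]
        random.shuffle(deck)  # chance sampling: sample the deal, walk the full tree
        cfr(deck[:2], "", 1.0, 1.0)

    for info_set in sorted(nodes):
        print(info_set, [round(p, 2) for p in nodes[info_set].average_strategy()])

Swapping the full tree walk for a single sampled action at each decision
(plus an importance-sampling correction) would give you the Outcome Sampling
variant mentioned above.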

~~~
eutectic
Is there a version of CFR for differentiable learners like neural networks?

~~~
gwern
The averaging part makes it sound like the usual RL self-play against regular
checkpoints of oneself.

------
shankun
When I read "counterfactual regret minimization", I thought this would be
about Kellyanne Conway.

------
Dowwie
I assume this has something to do with an AI beating world-class poker
champions recently.

~~~
poppingtonic
Yes, I looked it up and found this source quite useful for getting a high-
level idea of what the CMU team accomplished.

------
throwanem
Darn! I was hoping for some sort of exciting new historiographical technique.

------
godmodus
This gave me and my buddy a great laugh! Thank you! (We're doing minimax
problems this semester! I will link this to my professor!)

------
CaliforniaKarl
I'm sorry I read this?

~~~
BugsJustFindMe
Then next time you should skip the technical details and just read Michael
Johanson's lay summary, so that you can be less sorry in the future:
[https://www.quora.com/What-is-an-intuitive-explanation-of-
co...](https://www.quora.com/What-is-an-intuitive-explanation-of-
counterfactual-regret-minimization/answer/Michael-Johanson-2?share=1)

~~~
semi-extrinsic
This was really good, but one detail stood out to me: the average strategy
found by CFR converges to a Nash equilibrium strategy. For a two-player,
zero-sum game, we know this is an optimal strategy.

But what about a game that's not two-player and zero-sum? Are CFRs still
useful?

~~~
VodkaHaze
In a multiplayer zero-sum game, it still converges to Nash.

Intuitively, in a zero-sum game, running CFR for one player finds best
responses to the strategies he's playing against. Doing this for all players
finds mutual best responses, which is the definition of a Nash equilibrium.

In a non-zero-sum game, CFR only converges to a correlated equilibrium, which
is a much weaker equilibrium concept than a Nash equilibrium.

~~~
mikebjohanson
Not quite: it really does only go to Nash in a 2p zero-sum game.

In a multiplayer zero-sum game, there's no theoretical proof that it should go
to Nash. In the tiny game of 3p Kuhn poker (3 players, 4-card deck, enough
chips for one bet) we found that it does go to Nash (here's the paper:
[https://webdocs.cs.ualberta.ca/~games/poker/publications/AAM...](https://webdocs.cs.ualberta.ca/~games/poker/publications/AAMAS13-3pkuhn.pdf)).
But in a slightly larger toy game, 3p Leduc poker, we empirically found that
it does not. Like you suggested, we did this with best response calculations:
if you take the strategy profile CFR produces and then compute a best response
in each position, those best responses each improve over the strategy profile.
If you plot that best-response loss over time while CFR is running, it becomes
clear that CFR does not converge to Nash: the loss eventually bottoms out and
stops improving.
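
For intuition on that measurement, here's a tiny self-contained sketch (mine,
not code from the paper): regret matching in self-play on rock-paper-scissors,
a 2p zero-sum game where the convergence guarantee holds, with the
best-response loss of the average strategies printed as training runs. Here
the loss decays toward zero; in the 3p Leduc experiment, the analogous curve
bottoms out at a nonzero value.

    import numpy as np

    # Rock-paper-scissors payoff matrix for the row player; the column
    # player's payoffs are -A (zero-sum).
    A = np.array([[ 0., -1.,  1.],
                  [ 1.,  0., -1.],
                  [-1.,  1.,  0.]])

    def rm_strategy(regrets):
        # Regret matching: mix in proportion to positive cumulative regret.
        pos = np.maximum(regrets, 0.0)
        total = pos.sum()
        return pos / total if total > 0 else np.full(len(regrets), 1.0 / len(regrets))

    r_row, r_col = np.zeros(3), np.zeros(3)
    sum_row, sum_col = np.zeros(3), np.zeros(3)

    for t in range(1, 100_001):
        x, y = rm_strategy(r_row), rm_strategy(r_col)
        ev = x @ A @ y              # row player's expected payoff
        r_row += A @ y - ev         # row regrets vs. each pure action
        r_col += ev - x @ A         # column regrets (payoffs are -A)
        sum_row += x
        sum_col += y
        if t % 20_000 == 0:
            ax, ay = sum_row / t, sum_col / t
            # Best-response loss: total gain available to a best responder
            # in each position; exactly zero at a Nash equilibrium.
            nash_conv = (A @ ay).max() - (ax @ A).min()
            print(f"iter {t}: best-response loss of avg strategies = {nash_conv:.4f}")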

However - in practice, the strategies produced by CFR for multiplayer games
still appear to be quite strong, even if not Nash. This was the approach used
by the University of Alberta for years in the Annual Computer Poker
Competition, and it crushed the competition there. In 3p Limit Hold'em, the
CFR strategies also held up well against human pros in informal testing by the
U of A.

Nash equilibrium isn't a compelling goal in a multiplayer game anyways. In a
2p zero-sum game it theoretically guarantees that you earn at least the game
value (i.e., do no worse than tie if you rotate seats every game). In a
multiplayer game, even if you could compute one, there's no such theoretical
guarantee when all other players at the table are independently choosing
strategies. Even if everyone at the table uses an independently computed Nash
equilibrium, that set of strategies is not necessarily itself a Nash
equilibrium (in 2p zero-sum games, it is).

To summarize - for multiplayer, it doesn't necessarily go to Nash. But you
might not care whether it does, because it works great in practice regardless.

In a non-zero-sum game, you're right that CFR converges to a correlated
equilibrium. But we can also theoretically bound how close it gets to a Nash
equilibrium for the non-zero-sum game, and empirically measure how close it
gets by using best response analysis. The answer is - if the game is only
slightly non-zero-sum (e.g., subtracting the rake in a poker game for a
negative sum, or pretending to add money to the pot to encourage aggressive
play), then CFR still gets very close to Nash.

~~~
VodkaHaze
Thanks!

