
Counterfactual Regret Minimization with Kuhn Poker - vpj
http://blog.varunajayasiri.com/ml/kuhn_poker.html
======
hgibbs
This is cool. I was actually asked for the optimal strategy for player 1 in
this game as an interview question for a quant research role; I didn't know
anything about game/decision theory, other than the basics, and so I only got
as far as determining what to do with an Ace and a King, and then figuring out
that you should bluff with a certain frequency if you hold the queen.

Anyhow, I came up with another question while considering this problem. Player
1 should only bluff at the optimal rate if player 2 punishes them for
deviating from this rate by changing the frequency that they call when holding
the king. Otherwise, Player 1 gains more on average by bluffing with the queen
more frequently. So, how should Player 2 figure out if Player 1 is actually
bluffing at the optimal rate (and not higher)? It could happen, with
exponentially small probability, that Player 1 decides whether to bluff
randomly, and bluffs with the first 10,000 queens before ever folding with a
queen. How does Player 2 distinguish this from Player 1 only ever bluffing
with the queen?
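
To put a rough number on that streak, here is a minimal Python sketch. It
assumes every queen hand is an independent decision with a fixed bluff
probability (1/3 is only an illustrative rate, not something stated above) and
that Player 2 somehow gets to see every queen, which real play would not
guarantee. The function names are made up for this example.

```python
import math
from fractions import Fraction


def streak_probability(n, bluff_rate):
    """Probability that n independent queen hands are all bluffed."""
    return bluff_rate ** n


def log10_streak_probability(n, bluff_rate):
    """Same quantity on a log10 scale, for n too large for a float."""
    return n * math.log10(bluff_rate)


# Ten bluffs in a row at an assumed bluff rate of 1/3.
print(streak_probability(10, Fraction(1, 3)))   # 1/59049
# 10,000 bluffs in a row: the log10 of the probability is roughly -4771.
print(log10_streak_probability(10_000, 1 / 3))
```

Under these assumptions an always-bluffing Player 1 produces the same streak
with probability 1, so the evidence in favour of "always bluffs" grows by a
factor of 3 with every additional bluffed queen. The two cases can never be
told apart with certainty, only with rapidly increasing confidence.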

But perhaps this is something that the Nash Theorem deals with? I don't know.
Any comments from people who know more than me on this are welcome!

~~~
s1t5
> Player 1 should only bluff at the optimal rate if player 2 punishes them for
> deviating from this rate by changing the frequency that they call when
> holding the king. Otherwise, Player 1 gains more on average by bluffing with
> the queen more frequently. So, how should Player 2 figure out if Player 1 is
> actually bluffing at the optimal rate (and not higher)?

You're describing exploitative strategies - that's the kind of circular
thinking of "if he knows that I know what he knows...". In a Nash equilibrium
neither side has an incentive to change their strategy even if they know what
their opponent is doing. If you're betting at the optimal frequency, there is
nothing that your opponent can do to counterexploit your strategy. If you're
deviating from the optimal frequency, your opponent can adjust, you can adjust
in return and so on.
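
The indifference behind this can be checked with a small Python sketch. It
assumes the usual Kuhn poker stakes (an ante of 1 and a single bet of 1), that
Player 1 always bets the ace, and that Player 1 bets the queen with some
probability `bluff_rate`. Those assumptions and the function names are
illustrative and not taken from the article.

```python
from fractions import Fraction

ANTE = 1  # chips each player posts before the deal (assumed)
BET = 1   # size of the single allowed bet (assumed)


def ev_call_with_king(bluff_rate):
    """Player 2's expected chips when calling a bet while holding the king.

    Assumes Player 1 always bets the ace and bets the queen with
    probability `bluff_rate`.
    """
    # Player 1 holds the ace or the queen with equal prior probability.
    weight_ace = Fraction(1, 2)
    weight_queen = Fraction(1, 2) * bluff_rate
    # Posterior probability that the bet came from a queen.
    p_queen = weight_queen / (weight_ace + weight_queen)
    win = ANTE + BET       # king beats a bluffing queen
    lose = -(ANTE + BET)   # king loses to the ace
    return p_queen * win + (1 - p_queen) * lose


def ev_fold_with_king():
    """Folding simply gives up the ante."""
    return -ANTE


for rate in (Fraction(1, 4), Fraction(1, 3), Fraction(1, 2)):
    print(rate, ev_call_with_king(rate), ev_fold_with_king())
# 1/4 -6/5 -1  -> folding is better
# 1/3 -1 -1    -> calling and folding are equally good
# 1/2 -2/3 -1  -> calling is better
```

At a bluff rate of 1/3 the king gets the same result whether it calls or
folds, so under these assumptions Player 2 has nothing to gain by adjusting.
Away from 1/3, one of the two responses becomes strictly better, and that is
the opening for exploitation.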

------
plafl
For the uninitiated like me, the following sentence in the derivation of the
Nash equilibrium is confusing at first:

"If first player has a K, because of 1. and 2., he would only lose if he bets.
So first player should pass if he has a K."

It actually means that the first player loses on average if he bets (expected
utility -1/2, if I'm not mistaken). If he passes, however, the expected
utility is clearly zero, since he wins half of the time and loses half of the
time.
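
The arithmetic can be written out as a short Python sketch. It assumes an ante
of 1 and a bet of 1, that an opponent holding the ace always calls and one
holding the queen always folds (my reading of points 1 and 2 in the quoted
sentence), and, for the passing case, that the hand goes straight to showdown.
The names are illustrative only.

```python
from fractions import Fraction

ANTE = 1  # chips each player posts before the deal (assumed)
BET = 1   # size of the single allowed bet (assumed)


def ev_bet_with_king():
    """First player's expected chips when betting the king."""
    half = Fraction(1, 2)         # opponent holds the ace or the queen
    lose_to_ace = -(ANTE + BET)   # the ace calls and wins the showdown
    win_vs_queen = ANTE           # the queen folds and gives up its ante
    return half * lose_to_ace + half * win_vs_queen


def ev_pass_with_king():
    """First player's expected chips when passing, assuming a showdown."""
    half = Fraction(1, 2)
    return half * (-ANTE) + half * ANTE


print(ev_bet_with_king())   # -1/2
print(ev_pass_with_king())  # 0
```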

I have not yet reached the description of the algorithm.

I actually find it somewhat paradoxical that the first player should fold with
a K and bluff with a Q! I remember that I loved little results like this when
I first took the game theory course on Coursera years ago. It was a little
dry, but I completed it nevertheless since I found it quite fascinating. I do
wonder what applications these kinds of algorithms might have in my own field,
machine learning. I would love to have an excuse to devote some time to it.

~~~
ThrustVectoring
> I actually find it somewhat paradoxical that the first player should fold
> with a K and bluff with a Q!

Bluffing is a strategy that wins against the middling hands you can convince
to fold, and loses against the strongest hands which just call you anyways.
The game is special in that when you have a middling hand, your opponent
_never_ has one, so they _always_ either correctly call with the best hand or
correctly fold with the worst.

