
The Markov Property, Chain, Reward Process and Decision Process - thebillkidy
https://xaviergeerinck.com/markov-property-chain-reward-decision
======
melling
Here are a couple of other Markov Chain articles that are worthwhile:

[http://setosa.io/ev/markov-chains/](http://setosa.io/ev/markov-chains/)

[https://towardsdatascience.com/a-zero-math-introduction-to-markov-chain-monte-carlo-methods-dcba889e0c50](https://towardsdatascience.com/a-zero-math-introduction-to-markov-chain-monte-carlo-methods-dcba889e0c50)

~~~
baxtr
These are great! Thank you!!

------
lossolo
> Xavier works on enabling everyone to construct products that make a life
> changing impact on world scale.

I would remove or change that sentence if I were you. I suppose it was meant
to make what you do look impactful and serious, but I had the exact opposite
impression after reading it.

~~~
thebillkidy
Hi Lossolo, thanks a lot for this feedback. I took it into account and
reworded that sentence a bit :) (the deploy will be done in ~2 min)

------
platz
There is a lot of info out there about Markov chains, but very little about
Markov decision processes (MDPs).

How popular are MDPs? What are their strengths and weaknesses?

Is there really a connection between MDPs and reinforcement learning, as the
author claims? I've not heard of this connection before.

This is a very welcome article.

~~~
clickok
Consider reading "Reinforcement Learning: An Introduction" by Rich Sutton[0].
It is very accessible (and probably the most used textbook in the field), and
it goes into detail about the connection between RL and MDPs.

If you want a high-level answer right now about how the two are related:
reinforcement learning focuses on predicting or maximizing the discounted sum
of rewards, G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ..., which is called the
return. Crucially, the process generating the rewards (the environment) is
assumed to be memoryless[1], so we can reformulate the return as a recursive
equation: G_t = R_{t+1} + γ G_{t+1}, since the return from "t+1" onwards
doesn't depend on the reward you got at the previous time step ("R_{t+1}").
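
As a rough sketch of that recursion in plain Python (my own toy illustration,
not code from the article):

    # Fold backwards over a finite list of rewards, applying
    # G_t = R_{t+1} + gamma * G_{t+1} once per step.
    def discounted_return(rewards, gamma):
        g = 0.0
        for r in reversed(rewards):
            g = r + gamma * g
        return g

    # 1 + 0.9*0 + 0.9^2 * 2 = 2.62
    print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))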

This in turn allows you to define Bellman equations (which characterize the
optimal solutions to the problem of maximizing return, which is what you want
to do). More generally, assuming the problem is Markovian makes things
substantially easier to reason about, because you don't have to worry about
the complex histories leading up to the agent's arrival in a state.
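
To be a bit more concrete (this is the standard formulation from Sutton's
book, not something spelled out in this thread): for a policy π, the Bellman
equation for the state-value function reads
v_π(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [r + γ v_π(s')]. In words, a state's
value is the expected immediate reward plus the discounted value of the next
state, and the Markov assumption is exactly what lets the transition
probabilities p(s',r|s,a) depend only on the current state and action.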

While it's honestly a pretty strong assumption, it's one that we make across
many fields, and particularly in RL. It's actually not too inaccurate; for
example, most games are Markovian (or close to it), and in physical problems
(e.g. robot locomotion), if you've got position + velocity, then controlling
the robot can be viewed as a (continuous-time) MDP.

The weaknesses are obvious: not every environment is Markovian; sometimes
there is long-term dependence on the past. For example, translating prose or
poetry in a way that gets the point across is _very_ tough[2], and not really
something you can formulate as an MDP. Other times, the weaknesses are easier
to overcome (think of a poker game, where you can combine the present state
information with the betting patterns from the past; this augmented state
should be able to tell you everything you need to know).

---

0. [http://incompleteideas.net/book/bookdraft2017nov5.pdf](http://incompleteideas.net/book/bookdraft2017nov5.pdf)

1. Another way of phrasing this is that each state provides all the useful
information, and knowing about previous states does not tell you anything new
about the environment. Think of perfect information games, like Chess. It
doesn't really matter how you got into a particular position, just the moves
you make from there.

2. Compare the poetry translations of Jerry Lettvin with what you'd get using
existing translation software
([https://sites.google.com/site/lettvingroup/Home/Projects-History/biographies/jerry-lettvin/creative-writing/gallows-songs](https://sites.google.com/site/lettvingroup/Home/Projects-History/biographies/jerry-lettvin/creative-writing/gallows-songs)).

~~~
chris_st
Wow, what a great answer -- thanks!

I have a question -- in a card game, it appears that knowing which cards have
been played is important (or can help, anyway). This seems to violate the idea
of being memoryless.

Can you "cheat" and make reason on the cards left instead? Thanks for any
clues.

~~~
platz
Card games like poker aren't perfect information games, so history does
matter, and so they're not Markovian.

~~~
chris_st
Sorry, I should have given a more concrete example of what I'm interested in.

I was thinking Hearts or Spades, which aren't perfect information, but since
the whole deck is dealt out, you can (with a lot of memory :-) know what cards
are left.

~~~
clickok
Yeah, so here we can "solve" the problem without using a POMDP by really
expanding what constitutes the state space.

Initially, your state is your hand plus the initial rules (who goes first,
which suit is trump, etc., depending on what kind of game you're playing).
After each player's turn you transition into a new state, which is <<starting
hand>> + <<history of cards played>>, so at each point in time the state has
all the available information. Obviously this leads to a combinatorially huge
state space, because there are O(52!) different possible sequences for a
standard deck of cards.
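
A toy sketch of what I mean by <<starting hand>> + <<history of cards
played>> as the state (my own encoding, nothing standardized):

    # The state bundles everything observable so far: your remaining hand
    # plus the ordered history of cards already played. Folding the history
    # into the state is what restores the Markov property.
    def make_state(hand, played):
        return (frozenset(hand), tuple(played))

    s0 = make_state({"QS", "2C", "9H"}, [])
    # ...one trick later, after 2C, KC, 3C and AC have been played:
    s1 = make_state({"QS", "9H"}, ["2C", "KC", "3C", "AC"])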

Typically you have to use some sort of approximation architecture, because
evaluating all the possible responses to a given sequence of cards would take
too long, and even if you could do it, the memory requirements would be
enormous. See for example DeepStack[0], which uses a neural net to
approximate the quality of positions and possible responses (along with other
techniques), rather than having a lookup table for what to do in each
possible game state.

So you can formulate such games as MDPs, but the question then becomes: is
this a useful way to think about it? Sometimes it is, because your approach to
the problem scales and so you can throw enough resources (training data,
compute power) at it to learn a useful strategy; other times there is a better
way of formulating the problem[1].

I have some implementations of reinforcement learning algorithms and code for
working with MDPs if you're interested in that sort of thing[2], and of course
I highly recommend Rich's book.

---

0. [https://www.deepstack.ai/](https://www.deepstack.ai/) -- note that it's
not really an RL system, but it does illustrate using neural nets to learn
complicated functions in a large state space.

1. For example, I am not sure that such an MDP for hearts is well-posed: the
convergence results that I'm familiar with rely on the MDP being ergodic and
aperiodic, but the suggested construction is obviously periodic.

2. The algorithms: [https://github.com/rldotai/rl-algorithms](https://github.com/rldotai/rl-algorithms)
An MDP library: [https://github.com/rldotai/mdpy](https://github.com/rldotai/mdpy)
(check out notebooks/example_notebook.ipynb for a basic demo).

------
glitchc
Hi Xavier, in the discount factor equation, shouldn't the last term of the
expansion be gamma^n.Rn instead of gamma^2.Rn? Or am I missing something?

~~~
thebillkidy
You are completely right! Thanks for spotting this! :)

~~~
thebillkidy
Update: Fixed it, the caching might need to expire on your side though (ctrl +
shift + r should do the trick)

------
TheCabin
For my taste, the article is a bit too high-level to build a meaningful
intuition. I think a more complex example would be beneficial.

~~~
thebillkidy
Thanks a lot for your feedback! Currently I am working on the high-level
definitions of Reinforcement Learning, so that I can build on that and go
deeper into the more advanced topics. I'd like to have a series that explains
the OpenAI examples :)

------
antman
Nice work! Any good pages for POMDPs?

------
bhaavan
Doesn't the typo "3th" annoy anyone when reading that article?

~~~
gautamdivgi
No it doesn’t.

~~~
thebillkidy
Also fixed it to "third" :) keep the feedback coming!

