I eventually checked out the leader's code and ran it myself, but with their carefully selected random seeds removed, and found that the supposedly leading solutions often failed to converge at all without those magic seeds.
I left the experience believing deep RL is still very unreliable.
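If you want to sanity-check a claimed result the same way, the idea is just to rerun training across many seeds and look at the spread rather than a single lucky run. A minimal sketch, where train_agent and env_name are hypothetical stand-ins for whatever the leaderboard code actually exposes:

    import numpy as np

    # Hypothetical sketch: rerun training across many seeds instead of
    # one hand-picked seed. train_agent is a stand-in, not a real API.
    def evaluate_over_seeds(train_agent, env_name, seeds=range(10)):
        """Train once per seed; report the spread, not a single lucky run."""
        returns = [train_agent(env_name, seed=s) for s in seeds]
        return np.mean(returns), np.std(returns), min(returns)

If the mean is fine but the minimum is a failure to converge, the method is seed-sensitive.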
This is an attempt at defining a set of minimal but meaningful benchmark RL experiments. In each experiment, we typically evaluate over a range of environment parameters and compute an aggregate score. See https://arxiv.org/abs/1908.03568 for more details.
We also include a simple starter RL codebase, and examples of evaluating agents defined in other RL frameworks.
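To make the evaluation scheme concrete, here is a hedged sketch of the "sweep an environment parameter, aggregate the score" loop; make_env and run_episode are hypothetical stand-ins, not the actual API of the paper or the starter codebase:

    import numpy as np

    # Hedged sketch of "evaluate over a range of environment parameters
    # and compute an aggregate score". make_env and run_episode are
    # hypothetical stand-ins for the benchmark's real API.
    def aggregate_score(agent, make_env, run_episode, param_values, episodes=20):
        per_param = []
        for p in param_values:
            env = make_env(p)  # e.g. pole length, gravity, noise level
            rets = [run_episode(agent, env) for _ in range(episodes)]
            per_param.append(np.mean(rets))
        return np.mean(per_param)  # one aggregate number across the sweep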
At Hiome (hiome.com), we use Q-learning to learn people's habits and automatically program their smart home, and it's insanely effective. Since no training data is required, we never have to aggregate our users' data (and violate their privacy), so everything stays local in their home. (A rough sketch of the kind of update I mean is below.)
I believe similar techniques will get us closer to true AGI than neural nets.
And of course, there's no such thing as the single perfect algorithm. My point is just that I'm surprised Q-learning isn't talked about more.
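For the curious, tabular Q-learning looks roughly like this; the states, actions, and rewards here are made up for illustration and are not Hiome's actual model:

    import random
    from collections import defaultdict

    # Toy tabular Q-learning; states/actions/rewards are invented for
    # illustration, not Hiome's actual model.
    Q = defaultdict(float)            # Q[(state, action)] -> estimated value
    alpha, gamma, eps = 0.1, 0.9, 0.1
    actions = ["lights_on", "lights_off", "noop"]

    def choose(state):
        if random.random() < eps:     # epsilon-greedy exploration
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def update(state, action, reward, next_state):
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])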
IMHO Q-learning isn't talked about because it really is not a good fit for the kind of problems people actually want to solve.
What behavior does Hiome 'learn' with Q-learning? From your site it's not obvious what actions on the world are implied, or where you'd get a feedback signal (reward or punishment) based on whether the smart home's inhabitants actually wanted those actions; and the behavior your page does show - occupancy sensing - is essentially a classification problem.
This is a little dubious; humans learn by positive and negative experiences, but also by imparted doctrine. Neural networks and other supervised methods are certainly analogous to some aspects of human learning.
But it is a good model to understand some principles and motivations behind the general ideas in AI and machine learning.
Consider a 3x3 board where each cell holds 3 bits of information (each cell can be in 2^3 states). Then for the board you have (2^3)^9 = 2^27 different states.
Edit: I just realized you might mean that the state is the combination of all the cells. In that case, you're right.
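The arithmetic checks out either way:

    states_per_cell = 2 ** 3              # 3 bits per cell -> 8 states
    board_states = states_per_cell ** 9   # 9 cells -> (2^3)^9
    assert board_states == 2 ** 27        # 134,217,728 board states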
One thing I never fully understood: why would anyone choose SARSA when they could use Q-learning? They use the same inputs and are nearly the same algorithm, but Q-learning is off-policy while SARSA is on-policy (if I remember right?).
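You remember right, and as I understand it the whole difference is one line of the update; here's a sketch, where Q maps (state, action) pairs to values and all names are illustrative:

    # Hedged sketch of the one-line difference. Q maps (state, action)
    # pairs to values; all names here are illustrative.
    def sarsa_target(Q, r, s_next, a_next, gamma=0.9):
        # on-policy: bootstrap from the action the policy actually took next
        return r + gamma * Q[(s_next, a_next)]

    def q_learning_target(Q, r, s_next, actions, gamma=0.9):
        # off-policy: bootstrap from the greedy action, whatever was taken
        return r + gamma * max(Q[(s_next, a)] for a in actions)

    # Both then apply: Q[(s, a)] += alpha * (target - Q[(s, a)])

The usual argument for SARSA is that it learns the value of the policy it actually follows, exploration included, which matters when exploratory mistakes are costly (the classic cliff-walking example).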
Apart from that, Q-factors don't scale well. If your action space is a game controller, things may still look OK, but not if your action space is multi-dimensional and continuous.
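To put a rough number on that: even a coarse discretization of a continuous action space blows up combinatorially, and every update has to take a max over all of it:

    bins_per_dim = 10                       # coarse 10-way discretization
    for dims in (1, 3, 7):
        print(dims, bins_per_dim ** dims)   # 10, 1000, 10000000 actions
    # ...and each Q-learning update takes a max over every action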
edit: hn.algolia.com says you posted this two days ago...
Maybe the HN algorithm repromoted with enough upvotes?
tosh is a busy guy. He submits about 30 articles a day and has 45,000 karma. He could probably teach us about the HN algorithm, and about ourselves.
He does a lot of Wikipedia submissions that seem to be effective.