
Q-Learning - tosh
https://en.wikipedia.org/wiki/Q-learning
======
Buttons840
I tried to learn deep Q-learning recently using OpenAI gym. I looked at
the "leaderboards" and tried to learn from their code, but I wasn't
getting results nearly as good as the "leaders".

I eventually checked out the leaders' code and ran it myself, but removed
their carefully selected random seeds, and found that the supposed leading
solutions often failed to converge at all without their magically selected
random seeds.

I left the experience believing deep RL is still very unreliable.
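
For what it's worth, here is roughly everything a hard-coded seed pins
down in a typical gym + PyTorch run (the env name and seed value are just
illustrative, and this uses the older gym seeding API). Rerunning with a
handful of different seeds is a quick way to check whether a result
actually holds up:

    import random
    import numpy as np
    import torch
    import gym

    SEED = 42  # the "magic" value leaderboard entries often hard-code

    random.seed(SEED)        # Python-level exploration randomness
    np.random.seed(SEED)     # e.g. replay-buffer sampling
    torch.manual_seed(SEED)  # network weight initialization

    env = gym.make("CartPole-v1")
    env.seed(SEED)               # environment dynamics and resets
    env.action_space.seed(SEED)  # env.action_space.sample() exploration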

~~~
thrax
I will contrast with my anecdote: I built a simulation of 10x10 city
blocks with a population of 30 pedestrians. I made a reward function based
on distance to a random "target point", with a penalty for walking in the
street vs. the sidewalk and a penalty for walking into walls. The inputs
were a 360-degree raycast of 16 samples plus "distance to target", and the
outputs were WASD keyboard inputs. I left it running overnight, and by
morning the bots were pretty efficiently walking around the city to their
random targets via sidewalks. It felt like magic. I could have coded the
behavior directly, but the learned version seemed somewhat noisier and
more organic. This was done a couple of years ago using a JavaScript deep
Q-learning library. It feels like a squishier version of A* or something.
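
Roughly, the reward shaping looked like this (sketched here in Python
rather than the original JavaScript, with made-up constants):

    import math

    # One step of the reward: progress toward the target, minus penalties
    # for being in the street or bumping into a wall.
    def step_reward(agent_xy, target_xy, prev_dist, on_sidewalk, hit_wall):
        dist = math.hypot(target_xy[0] - agent_xy[0],
                          target_xy[1] - agent_xy[1])
        r = prev_dist - dist  # positive when the bot moved closer
        if not on_sidewalk:
            r -= 0.5          # penalty for walking in the street
        if hit_wall:
            r -= 1.0          # penalty for walking into a wall
        return r, dist

    # State fed to the learner: 16 raycast distances plus distance to
    # target; actions: the four WASD moves.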

~~~
basman
The raycast probably disambiguated the state pretty well, such that it
essentially had to memorize a few hundred actions; it ended up basically
doing a sort of asynchronous, distributed Dijkstra's algorithm.

------
guptaneil
I'm always amazed how much attention supervised learning algorithms get
when Q-learning is often better suited for real-world problems that go
beyond just classifying things, works closer to how humans actually learn,
and is much easier to implement (it requires way less code and no training
data). I wonder how much of this disparity exists because big companies
like Google and Facebook, which hold all the training data, benefit from
only talking about ML techniques that nobody can compete with them on.
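
To give a sense of the "way less code" claim, the entire core of tabular
Q-learning fits in about a dozen lines. This is a generic sketch assuming
a gym-style env with discrete states and actions (it is not Hiome's
implementation):

    import random
    from collections import defaultdict

    def q_learn(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
        Q = defaultdict(float)  # (state, action) -> estimated return
        actions = list(range(env.action_space.n))
        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                if random.random() < epsilon:  # explore
                    action = random.choice(actions)
                else:                          # exploit current estimates
                    action = max(actions, key=lambda a: Q[(state, a)])
                next_state, reward, done, _ = env.step(action)
                # Q-learning update: move toward reward + discounted value
                # of the best next action
                best_next = max(Q[(next_state, a)] for a in actions)
                Q[(state, action)] += alpha * (
                    reward + gamma * best_next - Q[(state, action)])
                state = next_state
        return Q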

At Hiome (hiome.com), we use Q-learning to learn people's habits and
automatically program their smart home, and it's insanely effective. Since
there's no training data required, we don't even have to violate our users'
privacy to aggregate their data, so everything stays local in their home.

I believe similar techniques will get us closer to true AGI than neural nets.

~~~
CodiePetersen
It's not used much for ambiguous tasks like image recognition and audio
because you would need a state for every single possible waveform or image
variation. You would need a similarity-preserving hash or some other
method to reduce related states to a single common state. That's why it's
better to use Q-learning at the tail end of a DL network. If you didn't do
that, every image of the same apple with just one pixel different would be
a new state.
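
Something crude like this is the flavor of state reduction I mean
(illustrative only, not a real similarity-preserving hash; a learned
encoder does this job far better):

    import numpy as np

    # Collapse an image to a tiny quantized thumbnail so near-identical
    # frames (the same apple shifted by a pixel) usually map to one
    # Q-table key.
    def state_key(image, grid=8, levels=4):
        h, w = image.shape[:2]
        h, w = h - h % grid, w - w % grid  # crop to a multiple of the grid
        blocks = image[:h, :w].reshape(grid, h // grid, grid, w // grid, -1)
        thumb = blocks.mean(axis=(1, 3, 4))  # grid x grid mean brightness
        return tuple((thumb * levels // 256).astype(int).ravel())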

~~~
guptaneil
Yup, totally agree that Q-learning is not the right tool for classification
problems. However, it is great for an agent that needs to act on its world
where it can quickly get a reward/punishment for its choice (assuming a
relatively limited pool of possible actions).

And of course, there's no such thing as the single perfect algorithm. My point
is just that I'm surprised Q-learning isn't talked about more.

~~~
PeterisP
Can you name two real world problems with practical applications that fit all
the criteria? I.e. where (1) an agent needs to act on its world but it would
be okay for it to spend many tries exploring and failing; (2) it will be able
to quickly and cheaply (i.e. without 24/7 human supervision) get a
reward/punishment for its choice, and (3) the set of world states and possible
actions is sufficiently small so that Q-learning is tractable?

IMHO Q-learning isn't talked about because it really is not a good fit for the
kind of problems people actually want to solve.

What behavior does Hiome 'learn' with Q-learning? From your site it's not
obvious what actions on the world are implied where you could actually get
some feedback/reward/punishment depending on whether those actions were
desired by the smart home's inhabitants; and the behavior that your page
does show - occupancy sensing - _is_ essentially a classification problem.

------
CodiePetersen
Q-learners have always been too brute-forcey, IMO. Every action from any
known state gets a value, and that will explode on any meaningful task.
It's better to use it in combination with some other model that reduces
how many states the Q-learner sees.

But it is a good model to understand some principles and motivations behind
the general ideas in AI and machine learning.

~~~
leet
Yep. That's what a Deep Q-Network (DQN) does.

~~~
CodiePetersen
Yeah, that's a good example. Take AlphaGo, for instance: without the DL
network, the Q-table would be massive, and that's just for a board game.
Imagine trying to do that for a system like moving human body parts: every
single body configuration would be a state, and then you have every
possible action from each state.
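
The trick, per the DQN point above, is to replace the table with a
function approximator, so states you never saw exactly still get sensible
estimates. A rough PyTorch sketch (layer sizes and the update loop are
illustrative):

    import torch
    import torch.nn as nn

    # A small network maps an observation to one Q-value per action,
    # replacing one table entry per (state, action) pair.
    q_net = nn.Sequential(
        nn.Linear(17, 64),  # e.g. 16 raycasts + distance to target
        nn.ReLU(),
        nn.Linear(64, 4),   # one Q-value per action
    )
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

    def td_update(state, action, reward, next_state, done, gamma=0.99):
        # same target as the tabular update, applied as a regression loss
        with torch.no_grad():
            target = reward + gamma * q_net(next_state).max() * (1 - done)
        loss = (q_net(state)[action] - target) ** 2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()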

~~~
jsjolen
The table becomes unwieldy on much simpler tasks than that.

Consider a 3x3 board where each cell holds 3 bits of information (so each
cell can be in 2^3 = 8 states). Then the board as a whole has (2^3)^9 =
2^27 ≈ 134 million different states.

~~~
wannabesrevenge
You mathed wrong. A 3x3 board with 8 states per cell is 72 total states:
9x8, not 8^9.

Edit: I just considered: unless you mean that the state is the combination
of all the cells. Then you are right.

~~~
CodiePetersen
Yes, the state is the entirety of the board's configuration.

------
2bitencryption
Q-learning blew my mind when I took an AI course in college.

One thing I never fully understood: why would anyone choose SARSA when
they could use Q-learning? I believe they use the same inputs and are
nearly the same algorithm, but Q-learning is off-policy while SARSA is
on-policy (if I remember right?).

~~~
loehnsberg
SARSA follows the current policy. Suppose you're minimizing cost and the
value function of the MDP is initialized as a lower bound on the true
cost-to-go. Then SARSA will explore interesting states simply because it
underestimates the cost-to-go, and if updates of the value function keep
tightening this lower bound, SARSA will converge. In such cases, SARSA is
more efficient than Q-learning.
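
Concretely, the only difference between the two updates is what they
bootstrap off of (a generic tabular sketch):

    from collections import defaultdict

    Q = defaultdict(float)  # (state, action) -> value estimate

    def q_learning_update(s, a, r, s2, actions, alpha=0.1, gamma=0.99):
        # off-policy: bootstrap off the greedy next action, regardless of
        # what the behavior policy actually does next
        target = r + gamma * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    def sarsa_update(s, a, r, s2, a2, alpha=0.1, gamma=0.99):
        # on-policy: bootstrap off a2, the next action actually chosen,
        # so the cost of exploratory moves feeds back into the estimates
        target = r + gamma * Q[(s2, a2)]
        Q[(s, a)] += alpha * (target - Q[(s, a)])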

Apart from that, Q-factors do not scale well. If your action space is a
game controller, things may still look OK, but not if your action space is
multi-dimensional and continuous.

------
37
Didn't you post this two days ago?

[https://news.ycombinator.com/item?id=20685049](https://news.ycombinator.com/item?id=20685049)

edit: hn.algolia.com says you posted this two days ago...

[https://hn.algolia.com/?query=q%20learning&sort=byDate&prefi...](https://hn.algolia.com/?query=q%20learning&sort=byDate&prefix&page=0&dateRange=all&type=story)

~~~
tlb
That's the "second chance" mechanism described at
[https://news.ycombinator.com/item?id=11662380](https://news.ycombinator.com/item?id=11662380)

~~~
37
Oh, I see... interesting. I assumed it was something built into HN because
the post ID remained the same.

