Q-Learning (wikipedia.org)
88 points by tosh 60 days ago | 33 comments

I tried to learn deep Q-learning recently using OpenAI gym. I looked at the "leader boards" and tried to learn from their code, but I wasn't getting results nearly as good as the "leaders".

I eventually checked out the leaders' code and ran it myself, but removed their carefully selected random seeds, and found that the supposed leading solutions often failed to converge at all without their magically selected random seed.

I left the experience believing deep RL is still very unreliable.

I'll contrast with my own anecdote. I built a simulation of 10x10 city blocks with a population of 30 pedestrians. I made a reward function based on distance to a random "target point", with penalties for walking in the street rather than on the sidewalk and for walking into walls. The inputs were a 360-degree raycast of 16 samples plus "distance to target", and the outputs were WASD keyboard inputs. I left it running overnight, and by morning the bots were walking around the city to their random targets via the sidewalks fairly efficiently. It felt like magic. I could have coded the behavior directly, but the learned version seemed somewhat noisier and more organic. This was done a couple of years ago using a JavaScript deep Q-learning library. It feels like a squishier version of A* or something.

The raycast probably disambiguated the state pretty well, such that it essentially had to memorize a few hundred actions, so it ended up basically doing a sort of asynchronous distributed Dijkstra's algorithm.
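For anyone curious about the mechanics behind this kind of agent, the tabular update is only a few lines. A minimal sketch (the grid, reward shaping, and hyperparameters here are an illustrative toy, not the parent's actual setup):

```python
import random
from collections import defaultdict

random.seed(0)  # fixed seed for reproducibility (ironically apt, per this thread)

# Toy grid world: agent walks toward a target on a small grid.
# State = (x, y); actions = the four WASD-style moves.
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]
SIZE, TARGET = 5, (4, 4)

def step(state, action):
    x = min(max(state[0] + action[0], 0), SIZE - 1)
    y = min(max(state[1] + action[1], 0), SIZE - 1)
    nxt = (x, y)
    # Reward shaped by distance to target, in the spirit of the parent comment.
    reward = 10.0 if nxt == TARGET else -abs(x - TARGET[0]) - abs(y - TARGET[1])
    return nxt, reward, nxt == TARGET

Q = defaultdict(float)  # Q[(state, action)] -> estimated value
alpha, gamma, eps = 0.5, 0.9, 0.1

for episode in range(500):
    state, done = (0, 0), False
    while not done:
        # epsilon-greedy action selection
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(state, act)])
        nxt, r, done = step(state, a)
        # Q-learning update: off-policy, bootstraps from the best next action
        best_next = max(Q[(nxt, act)] for act in ACTIONS)
        Q[(state, a)] += alpha * (r + gamma * best_next - Q[(state, a)])
        state = nxt
```

After training, following the greedy policy walks the agent from (0, 0) to the target; scale the state up to raycast samples and you get something like the parent's pedestrians.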

You may be interested in our new Behaviour Suite for Reinforcement Learning project at https://github.com/deepmind/bsuite.

This is an attempt at defining a set of minimal but meaningful benchmark RL experiments. In each experiment, we typically evaluate over a range of environment parameters and compute an aggregate score. See https://arxiv.org/abs/1908.03568 for more details.

We also include a simple starter RL codebase, and examples of evaluating agents defined in other RL frameworks.

I am interested, thanks.

actually the seed is also a hyper-parameter

Uuuuummmm, I strongly disagree.

Is that a joke? Is that a hyper parameter!? I can’t google things on my phone, it takes too long!

I tried 50000 different random seeds, and my algo only works when the seed is 32451 :)
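Tongue in cheek, but the honest version of this is to sweep seeds and report the spread rather than the best run. A toy illustration (the "training" here is just a seed-dependent noisy score, purely for demonstration):

```python
import random
import statistics

def train(seed):
    """Stand-in for a stochastic RL training run; returns a final 'score'."""
    rng = random.Random(seed)
    # Pretend performance: a base value plus seed-dependent noise.
    return 100 + rng.gauss(0, 15)

scores = [train(seed) for seed in range(20)]
print(f"best: {max(scores):.1f}")
print(f"mean +/- stdev: {statistics.mean(scores):.1f} +/- {statistics.stdev(scores):.1f}")
# Reporting only max(scores) is exactly the "magic seed" trap.
```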

I'm always amazed how much attention supervised learning algorithms get when Q-learning is often better suited for real-world problems that go beyond just classifying things, works closer to how humans actually learn, and is much easier to implement (requires way less code and no training data). I wonder how much of this disparity is because big companies like Google and FB who hold all the training data benefit from only talking about ML techniques that nobody can compete with them on.

At Hiome (hiome.com), we use Q-learning to learn people's habits and automatically program their smart home, and it's insanely effective. Since there's no training data required, we don't even have to violate our users' privacy to aggregate their data, so everything stays local in their home.

I believe similar techniques will get us closer to true AGI than neural nets.

It's not used a lot in ambiguous tasks like image recognition and audio because you would need to have a state for every single possible wave form or image variation. You would have to have a similarity preserving hash or some other related method to reduce related states to a single common state. That's why it's better to use them near the ends of DL networks. If you didn't do that, every image of the same apple with just one pixel different would be a new state.
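A concrete version of that "similarity-preserving" reduction is to bucket the raw observation before it indexes the Q-table, so near-identical observations collapse to one state. A minimal sketch (the bucketing scheme is an illustrative choice, not a recommendation):

```python
def discretize(observation, bucket=32):
    """Collapse a raw pixel/waveform vector into a coarse hashable state.

    Values that differ by less than a bucket width map to the same key,
    so a one-pixel change no longer creates a brand-new table entry.
    """
    return tuple(int(v // bucket) for v in observation)

apple_a = [200, 201, 199, 50]  # toy "image": raw intensity values
apple_b = [201, 200, 198, 51]  # same apple, a couple of pixels different
print(discretize(apple_a) == discretize(apple_b))  # -> True: one shared state
```

A DL front-end does a much smarter version of this, which is why it sits at the input side and the Q-values sit near the end.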

Yup, totally agree that Q-learning is not the right tool for classification problems. However, it is great for an agent that needs to act on its world where it can quickly get a reward/punishment for its choice (assuming a relatively limited pool of possible actions).

And of course, there's no such thing as the single perfect algorithm. My point is just that I'm surprised Q-learning isn't talked about more.

Can you name two real world problems with practical applications that fit all the criteria? I.e. where (1) an agent needs to act on its world but it would be okay for it to spend many tries exploring and failing; (2) it will be able to quickly and cheaply (i.e. without 24/7 human supervision) get a reward/punishment for its choice, and (3) the set of world states and possible actions is sufficiently small so that Q-learning is tractable?

IMHO Q-learning isn't talked about because it really is not a good fit for the kind of problems people actually want to solve.

What behavior does Hiome 'learn' with Q-learning? From your site it's not obvious what actions on the world are implied where you could actually get some feedback/reward/punishment depending on whether those actions were desired by the smart-home inhabitants; and the behavior that your page does show, occupancy sensing, is essentially a classification problem.

Well, I don't know about Q-learning being talked about more, but unsupervised reinforcement learning definitely needs more attention in general.

I guess you don't necessarily need data, but if you don't have data then you'd need a simulator, and we don't have simulators for most problems that we care about.

> works closer to how humans actually learn

This is a little dubious; humans learn by positive and negative experiences, but also by imparted doctrine. Neural networks and other supervised methods are certainly analogs to an aspect of human learning.

Q-learners have always been too brute-forcey imo. Every action from any known state gets a value. That will explode on any meaningful task. It's better to use it in combination with some other model that reduces how many states are seen by the Q-learner.

But it is a good model to understand some principles and motivations behind the general ideas in AI and machine learning.

Yep. That is what a Deep Q-Network (DQN) does.

Yeah, that's a good example. Take AlphaGo, for instance: without the DL network the Q-learner would be massive, and that's just for a board game. Imagine trying to do that for systems like moving human body parts. Every single body configuration would be a state, and then you have every single possible action from each state.

The table becomes unwieldy on much simpler tasks than that.

Consider a 3x3 board where each cell holds 3 bits of information (each cell can be in 2^3 states). Then for the board you have (2^3)^9 = 2^27 different states.

Then multiply that by how many actions you have per state. We'll suppose 9 because you can only change one tile at a time. Then multiply that by 4 bytes assuming we are using a float instead of a double and you get 4.8 gigs of memory for whatever this simple problem is.

You mathed wrong. A 3x3 board with 8 states per cell is 72 total states: 9x8, not 8^9.

Edit, I just considered : Unless you mean that the state is the combination of all the cells. Then you are right

Yes, the state is the entirety of the board's configuration.
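With that reading, the arithmetic upthread checks out, and it's easy to verify:

```python
cell_states = 2 ** 3             # 3 bits per cell -> 8 states per cell
board_states = cell_states ** 9  # 9 cells -> (2^3)^9 = 2^27 board states
actions = 9                      # one tile change per move
bytes_per_entry = 4              # one 32-bit float Q-value per (state, action)

table_bytes = board_states * actions * bytes_per_entry
print(board_states)         # 134217728, i.e. 2^27
print(table_bytes / 10**9)  # ~4.8 GB for this "simple" problem
```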

Yes. So we use a neural network as a function approximator for Q-values and call it deep Q.
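In miniature, that just means replacing the table lookup Q[s][a] with a parameterized function and taking gradient steps toward the same TD target. A sketch with a linear approximator standing in for the network (illustrative only; a real DQN adds a replay buffer and a target network for stability):

```python
N_FEATURES, N_ACTIONS = 4, 2
# One weight vector per action; Q(s, a) = weights[a] . features(s)
weights = [[0.0] * N_FEATURES for _ in range(N_ACTIONS)]

def q_value(features, action):
    return sum(w * f for w, f in zip(weights[action], features))

def td_update(features, action, reward, next_features, alpha=0.1, gamma=0.9):
    """One gradient step toward the Q-learning TD target."""
    target = reward + gamma * max(q_value(next_features, a) for a in range(N_ACTIONS))
    error = target - q_value(features, action)
    # Gradient of a linear Q w.r.t. its weights is just the feature vector.
    for i in range(N_FEATURES):
        weights[action][i] += alpha * error * features[i]
```

The table never has to exist: memory scales with the number of weights, not with the number of states.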

Q-learning blew my mind when I took an AI course in college.

One thing I never fully understood, why would anyone choose SARSA when they could use Q-learning? I believe they use the same inputs, and are nearly the same algorithm, but Q-learning is off-policy while SARSA is on-policy (if I remember right?)

SARSA follows the current policy. Suppose you're minimizing cost. Then, if the value function of the MDP is a lower bound, it will explore interesting states simply because SARSA underestimates the cost-to-go. If updates of the value function then tighten this lower bound, SARSA will converge. In this case, SARSA is more efficient than Q-learning.

Apart from that, using Q-factors does not scale well. If your action space is a game controller, things may still look ok, but not if your action space is multi-dimensional and continuous.

Double Q-learning can learn from off-policy data. But it's fairly tricky to get all the tuning parameters right. It's a good choice if you can start up 100 or more runs with a range of parameters and pick the one that worked best.

From my understanding, SARSA could be more ideal when there is a greater cost associated with making a mistake whilst learning. SARSA is more conservative, as it takes into account possible large negative rewards during the exploratory phase. The classic example problem is "cliff walking."[0]

[0] https://github.com/cvhu/CliffWalking
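The mechanical difference is a single line: Q-learning bootstraps from the greedy next action, while SARSA bootstraps from the action the current policy actually takes. A side-by-side sketch (s, a, r, s2, a2 = state, action, reward, next state, next action; the dict-based Q-table is an illustrative choice):

```python
def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    # Off-policy: target uses the best next action, regardless of behavior.
    target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    # On-policy: target uses the action a2 actually taken (e.g. eps-greedy),
    # so occasional exploratory steps off the cliff drag down the values of
    # states near the edge. That is what makes SARSA's paths conservative.
    target = r + gamma * Q[(s2, a2)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

In cliff walking, Q-learning learns the risky edge-hugging path (optimal if you never explore), while SARSA learns the safer detour.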

Was it the AI Pacman? I did that project; it really was mind-blowing.

Didn't you post this two days ago?


edit: hn.algolia.com says you posted this two days ago...


That's the "second chance" mechanism described at https://news.ycombinator.com/item?id=11662380

Oh I see.... interesting. I assumed it was something built into HN because the post ID remained the same.

Although this says an hour ago, it looks like the same post.

Maybe the HN algorithm repromoted with enough upvotes?

tosh is a busy guy. He submits about 30 articles a day and has 45,000 karma. He could probably teach us about the HN algorithm, and about ourselves.

He does a lot of Wikipedia submissions that seem to be effective.
