
Focusing on the "neural network" part might be confusing you.

Classification/supervised learning is essentially about learning labels. We have some examples that have already been labeled: cats vs. dogs, suspicious transactions vs. legitimate ones, As vs. Bs vs. ... Zs. From those examples, we want to learn some way to assign new, unlabeled instances to one of those classes.
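To make "learning labels from examples" concrete, here's a deliberately simple labeler that needs no neural network at all: a 1-nearest-neighbor rule. The feature values and labels below are made up for illustration.

```python
def nearest_neighbor(labeled, point):
    """1-NN classification: give a new point the label of its
    closest already-labeled example (squared Euclidean distance)."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(labeled, key=lambda ex: dist(ex[0], point))[1]

# Made-up training examples: (features, label)
examples = [((0.2, 0.1), "cat"), ((0.9, 0.8), "dog"), ((0.1, 0.3), "cat")]
nearest_neighbor(examples, (0.8, 0.9))  # -> "dog"
```

The "learning" here is trivial (just memorize the examples), but it's the same task shape: labeled data in, a rule for labeling new data out.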

Reinforcement learning, in contrast, is fundamentally about learning how to behave. Agents learn by interacting with their environment: some combinations of states and actions eventually lead to a reward (which the agent "likes") and others do not. The reward might even be disconnected from the most recent state or action and instead depend on decisions made earlier. The goal is to learn a "policy" that describes what should be done in each state and balances learning more about the environment ("exploration", which may pay off by letting us collect more rewards later) against using what we know about the environment to maximize our current reward intake ("exploitation").

Games are a particularly good test-bed for reinforcement learning because they have fairly clear states (I have these cards in my hand, or that many lives, etc.), actions ("hit me!", "jump up") and rewards (winnings, scores, levels completed). There's also an obvious parallel with animal behavior, which is where the name originated.
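The exploration/exploitation balance is often handled with something as simple as epsilon-greedy action selection. A minimal sketch (the function name and parameters are mine, not anything standard):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Balance exploration and exploitation: with probability epsilon,
    pick a random action (explore); otherwise pick the action with the
    highest current value estimate (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0` the agent always exploits its current estimates; cranking it up makes it gamble on actions it knows less about.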

In both cases, neural networks are useful because they are universal function approximators. There's presumably some very complex function that maps data onto labels (e.g., pixels onto {"DOG", "CAT"}) for supervised learning, and states onto action sequences for reinforcement learning. However, we usually don't know what that function is and can't fit it directly, so we let neural networks learn it instead. That said, you can do both supervised learning and reinforcement learning without them (in fact, until recently, nearly everyone did).
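As a toy illustration of function approximation (a from-scratch sketch, not how you'd do it in practice): a tiny two-layer network can learn XOR, a mapping no linear model can represent. All the sizes and the learning rate below are arbitrary choices.

```python
import numpy as np

# XOR: the classic function a linear classifier cannot fit.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(0, 1, (2, 8)), np.zeros(8)   # hidden layer (8 units)
W2, b2 = rng.normal(0, 1, (8, 1)), np.zeros(1)   # output layer
lr = 0.5

for _ in range(5000):
    h = np.tanh(X @ W1 + b1)                     # hidden activations
    out = 1 / (1 + np.exp(-(h @ W2 + b2)))       # sigmoid output
    grad_out = out - y                           # cross-entropy gradient
    grad_h = (grad_out @ W2.T) * (1 - h ** 2)    # backprop through tanh
    W2 -= lr * h.T @ grad_out; b2 -= lr * grad_out.sum(0)
    W1 -= lr * X.T @ grad_h;   b1 -= lr * grad_h.sum(0)
```

After training, thresholding `out` at 0.5 recovers the XOR labels. Nothing told the network the formula for XOR; it approximated the function from labeled examples alone.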

However, the network typically doesn't get "rewritten" on the fly. Instead, it does something like estimate the value of a state or state-action pair.
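For the tabular case, the "estimate the value of a state-action pair" idea is just the one-step Q-learning update. A minimal sketch (function name and dict representation are my choices):

```python
def q_update(q, state, action, reward, next_q_values, alpha=0.1, gamma=0.9):
    """One-step Q-learning update: nudge the stored estimate for
    (state, action) toward reward + gamma * (best value reachable
    from the next state). The value estimates change as the agent
    acts; the update rule itself never gets rewritten."""
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * max(next_q_values) - old)

q = {}
q_update(q, "s0", "hit", 1.0, [0.0], alpha=0.5)
q[("s0", "hit")]  # -> 0.5 (moved halfway from 0.0 toward the target of 1.0)
```

A neural network replaces the dict when there are too many states to tabulate, but it plays the same role: a mutable estimate, not a rewritten program.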

Thanks for taking the time to explain this to me; it's definitely helped my understanding. It sounds somewhat 'static', assuming I haven't misinterpreted. Where does the 'learning' part come in? Again, this is almost certainly due to my lack of knowledge, but it sounds like RL essentially brute-forces the optimal inputs for each statically defined 'action function'. That would mean the usefulness of the model depends entirely on how well you've initially specified it, and that the problem is really solved through straightforward analysis.

(I've obviously gone wrong somewhere here... Just walking you through my thought process)

You're welcome!

The agent "learns" by stumbling around and interacting with its environment. At the beginning, its behavior is pretty random, but as it learns more and more, it refines its "policy" to collect more rewards more quickly.

Brute force is certainly possible for some situations. For example, suppose you're playing Blackjack. You can calculate the expected return from 'hitting' (taking another card) and 'standing' (keeping what you've got), based on the cards in your hand and the card the dealer shows.
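As a flavor of that Blackjack calculation, here's one piece of it: the exact probability of busting if you hit. This uses a simplified infinite-deck model (each draw independent), not real casino deck composition.

```python
from fractions import Fraction

# Infinite-deck assumption: ranks 2-9 each with probability 1/13,
# ten-valued cards (10, J, Q, K) with 4/13, the ace with 1/13.
CARD_PROBS = {v: Fraction(1, 13) for v in range(2, 10)}
CARD_PROBS[10] = Fraction(4, 13)
CARD_PROBS[11] = Fraction(1, 13)  # ace

def bust_probability(hard_total):
    """Chance that one more card pushes a hard total past 21
    (the ace counts as 1 whenever 11 would bust)."""
    p = Fraction(0)
    for v, pr in CARD_PROBS.items():
        value = 1 if (v == 11 and hard_total + 11 > 21) else v
        if hard_total + value > 21:
            p += pr
    return p

bust_probability(16)  # -> 8/13: any 6 through 10-valued card busts a hard 16
```

Extending this to full expected returns for 'hit' vs. 'stand' against each dealer up-card is exactly the kind of enumeration that's feasible for Blackjack but hopeless for chess.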

So...brute force works for simple tasks, but in a lot of situations, it's hard to enumerate all possible states (chess has something like 10^47 possible states) and state-action pairs. It's also difficult to "assign credit"--you rarely lose a chess game just because of the last move. Both of these make it difficult to brute-force a solution or find one via analysis. However, the biggest "win" for using RL is that it's applicable to "black box" scenarios where we don't necessarily know everything about the task. The programmer just needs to give it feedback (through the reward signal) when it does something good or bad.
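One standard way RL handles the credit-assignment problem is with discounted returns: a late reward is propagated backward so earlier moves get some of the credit, fading with distance. A minimal sketch:

```python
def discounted_returns(rewards, gamma=0.99):
    """Spread credit for late rewards back over earlier steps: each
    step's return is its own reward plus gamma times the return of
    everything that followed."""
    g, returns = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

# A win only at the very end still credits earlier moves, fading with distance:
discounted_returns([0, 0, 0, 1.0], gamma=0.5)  # -> [0.125, 0.25, 0.5, 1.0]
```

So instead of deciding analytically which move lost the game, the agent lets every move share in the eventual outcome, weighted by how close it was.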

Furthermore, depending on how you configure the RL agent, it can react to changes in the environment, even without being explicitly reset. For example, imagine a robot vacuum that gets "rewarded" for collecting dirt. It's possible that cleaning the room changes how people use it and thus changes the distribution of dirt. With the right discounting setup, the vacuum will adjust its behavior accordingly.
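The mechanism that lets estimates track a changing environment (strictly the step-size side of the configuration rather than the discount factor, but it's the relevant knob here) is a constant learning rate, which weights recent experience more than old history. A sketch, with an arbitrary `alpha`:

```python
def running_estimate(rewards, alpha=0.1):
    """Constant step-size value estimate: V <- V + alpha * (r - V).
    Unlike a plain average over all history, a constant alpha weights
    recent rewards more heavily, so the estimate follows an environment
    whose reward distribution drifts over time."""
    v = 0.0
    for r in rewards:
        v += alpha * (r - v)
    return v
```

If the dirt "moves" and a once-rewarding spot dries up, the estimate for that spot decays toward the new level instead of staying anchored by old history, and the vacuum's behavior shifts with it.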
