Thanks for taking the time to explain this to me, it's definitely helped my understanding. It sounds somewhat 'static', assuming I haven't misinterpreted. Where does the 'learning' part come in? Again, this is almost certainly due to my lack of knowledge, but it sounds like RL essentially brute-forces the optimal inputs for each statically defined 'action function'. That would mean the usefulness of the model depends entirely on how well you've initially specified it, so the problem is really solved through straightforward analysis. (I've obviously gone wrong somewhere here... Just walking you through my thought process.)

You're welcome!

The agent "learns" by stumbling around and interacting with its environment. At the beginning, its behavior is pretty random, but as it gains experience, it refines its "policy" to collect more reward, more quickly.

Brute force is certainly possible for some situations. For example, suppose you're playing Blackjack. You can calculate the expected return from 'hitting' (taking another card) and 'standing' (keeping what you've got), based on the cards in your hand and the card the dealer shows.

So... brute force works for simple tasks, but in a lot of situations it's hard to enumerate all possible states (chess has something like 10^47 possible states) and state-action pairs. It's also difficult to "assign credit": you rarely lose a chess game just because of the last move. Both make it difficult to brute-force a solution or find one via analysis.

However, the biggest "win" for using RL is that it's applicable to "black box" scenarios where we don't necessarily know everything about the task. The programmer just needs to give the agent feedback (through the reward signal) when it does something good or bad.

Furthermore, depending on how you configure the RL agent, it can react to changes in the environment, even without being explicitly reset. For example, imagine a robot vacuum that gets "rewarded" for collecting dirt. It's possible that cleaning the room changes how people use it and thus changes the distribution of dirt. With the right discounting setup, the vacuum will adjust its behavior accordingly.
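If it helps to see the "learning from feedback" idea in code, here is a minimal sketch using tabular Q-learning, which is just one common RL algorithm picked for illustration. Everything in it is an assumption made up for the example: the five-cell corridor task, the names `step`, `train`, and `greedy_policy`, and the hyperparameter values. The point is that the agent only ever calls `step` and sees a reward; nobody tells it that "go right" is the answer.

```python
import random

N_STATES = 5        # hypothetical toy task: corridor cells 0..4, cell 4 is the goal
ACTIONS = (-1, +1)  # step left or step right

def step(state, action):
    """Black-box environment: the agent only sees (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration (illustrative settings)."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action_index]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore at random sometimes, and whenever the estimates are tied.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, ACTIONS[a])
            # Credit assignment: the discounted value of the next state flows
            # back to earlier moves, so reward at the goal reaches cell 0.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q

def greedy_policy(q):
    """Best-known action for each non-goal cell."""
    return [ACTIONS[0 if row[0] > row[1] else 1] for row in q[:-1]]
```

Calling `greedy_policy(train())` gives the learned action for each cell. Early episodes are essentially a random walk; the reward found at the goal then propagates backward through the table one update at a time, which is the "learning" part. A table like this stops being practical once the state space gets large, which is the chess problem mentioned above.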
