
Tackling open challenges in offline reinforcement learning - theafh
https://ai.googleblog.com/2020/08/tackling-open-challenges-in-offline.html
======
RLmagic
Admittedly, I'm a layperson who skimmed the majority of this article.

That said, to me it looks like "offline" takes the magic out of RL.

When I think of RL, I think about the implications of its use combined with
do-calculus, as described in The Book of Why (Pearl).

The magic is the way an RL system mimics an organism that grows and dies
based on environmental stimuli. Adding an "evolution" algorithm keeps
improving the results.

It seems like "offline" RL is trying to use RL where it doesn't make sense,
like medical decisions, where no one in their right mind would want to be
subject to a learning coefficient. RL is akin to a medical student, not a
doctor.

To solve the "irresponsible use-cases" issue, the RL system, like a medical
student, should observe and learn while the current best-of-the-best (most
recently trained) curve-fitting algorithm (ML) is used to make any
inferences. The data from the RL system can then be fed into the next ML
iteration, where the improvement would likely come from the RL system
adjusting feature weights.
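
To make that concrete, here is a minimal, self-contained sketch (toy stubs,
all names hypothetical) of the proposed workflow: the trusted model makes
every live decision, and the learner only consumes the resulting logs. That
"learn only from someone else's logged decisions" setup is, in effect, the
offline setting the article describes.

    import random

    def deployed_model(features):
        # Stand-in for the "best-of-the-best" trained ML model.
        return max(range(len(features)), key=lambda a: features[a])

    logs = []
    for _ in range(1000):
        features = [random.random() for _ in range(3)]
        action = deployed_model(features)  # only the trusted model acts
        outcome = features[action] + random.gauss(0, 0.1)  # observed result
        logs.append((features, action, outcome))

    # The "student" never acts in the real world; it can only learn from
    # (state, action, outcome) tuples produced by the deployed policy.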

Additionally, I'd be most interested to see how these results stack up
against highly effective GANs, which to me read similarly to "offline RL".

~~~
liuliu
These use-cases are different. They are mainly designed to work around the
issue where a differentiable loss cannot be designed.

Think of it as an e2e system for autonomous driving. Your end goal is miles
driven without an accident. It is impossible to work a differentiable loss
out of that metric.

Or, a more realistic case that actually drives Google / Facebook research in
RL: an increase in ad clicks or e2e purchases is hard to integrate into a
supervised learning system as the differentiable loss. The likelihood of
clicking on an ad (in a supervised learning system) is subtly different from
getting more ad clicks (in a reinforcement learning system).
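
A minimal sketch of that distinction (PyTorch, toy tensors, all names
hypothetical): the supervised loss must be differentiable per example, while
the policy-gradient loss only needs a scalar reward that is never itself
differentiated.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    model = torch.nn.Linear(8, 4)   # toy scorer over 4 candidate ads
    x = torch.randn(32, 8)          # batch of user/context features
    logits = model(x)

    # Supervised: needs a differentiable per-example loss, e.g. "which ad
    # was clicked?" -- a label attributable to each individual example.
    clicked = torch.randint(0, 4, (32,))
    sup_loss = F.cross_entropy(logits, clicked)

    # RL (REINFORCE): sample an action, observe a scalar reward from the
    # environment (clicks, purchases); only log pi(a|s) is differentiated,
    # so the reward itself can be any black-box business metric.
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()
    rewards = torch.rand(32)        # e.g. downstream purchases per action
    rl_loss = -(dist.log_prob(actions) * rewards).mean()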

~~~
RLmagic
Forgive me if this completely misses the mark, per the disclaimer in my
first comment:

In the driving example, can the differentiable loss be a derivative of a
quality-scoring algorithm, possibly another NN, that measures episodic
quality at various time- and distance-based intervals?

For example, the RL system's loss/reward function attributes a quality score
to every 1 second or 1/10th of a mile.

My guess is that there are well known problems with this assumption. What are
they?

Thanks for the response!

~~~
morei
That's kinda just moving the problem: what is this quality-scoring algorithm
and how do we build it?

If we had such a quality score, then the problem becomes differentiable and
far easier to manage. But it's hard because we don't have (and don't know how
to build) such a metric.
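
For what it's worth, a toy sketch (PyTorch, everything hypothetical) of why
"just learn the quality score" moves the problem rather than solving it: once
a quality_net exists, the policy can indeed be trained by backpropagating
through it, but off the training distribution the scorer's predictions can be
arbitrarily wrong, so the policy learns to exploit the scorer rather than to
drive well.

    import torch

    policy = torch.nn.Linear(16, 4)       # state features -> action scores
    quality_net = torch.nn.Linear(20, 1)  # hypothetical learned scorer

    state = torch.randn(64, 16)
    action_probs = torch.softmax(policy(state), dim=-1)

    # Score (state, action-distribution) pairs with the learned quality
    # model, then do gradient ascent on the predicted quality.
    score = quality_net(torch.cat([state, action_probs], dim=-1)).mean()
    (-score).backward()

    # Failure mode: the gradient pushes the policy toward whatever inputs
    # maximize quality_net's output, including regions where the scorer was
    # never trained and its predictions are meaningless.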

------
tosh
> The central challenge arises from a distributional shift: in order to
> improve over the historical data, offline RL algorithms must learn to make
> decisions that differ from the decisions taken in the dataset. However, this
> can lead to problems when the consequences of a seemingly good decision
> cannot be deduced from the data — if no agent has taken this particular turn
> in the maze, how does one know if it leads to the goal or not? Without
> handling this distributional shift problem, offline RL methods can
> extrapolate erroneously, making over-optimistic conclusions about the
> outcomes of rarely seen actions
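
For reference, a rough sketch (PyTorch, toy tensors) of one published
mitigation, Conservative Q-Learning (CQL, Kumar et al. 2020): push Q-values
down on actions the dataset rarely contains, relative to the actions the
logging policy actually took, so the learned policy can't ride on
over-optimistic extrapolation.

    import torch

    q_net = torch.nn.Linear(10, 4)                # Q(s, .) over 4 actions
    states = torch.randn(128, 10)                 # states from offline data
    data_actions = torch.randint(0, 4, (128, 1))  # logged (dataset) actions

    q_all = q_net(states)                   # Q(s, a) for every action a
    q_data = q_all.gather(1, data_actions)  # Q on the logged actions only

    # logsumexp over actions upper-bounds the max; minimizing this gap
    # pushes Q down on unseen actions relative to actions in the dataset.
    cql_penalty = (torch.logsumexp(q_all, dim=1, keepdim=True)
                   - q_data).mean()

    # total_loss = bellman_error + alpha * cql_penalty   (alpha > 0)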

