
When Goodharting Is Optimal - laurex
https://www.lesswrong.com/posts/megKzKKsoecdYqwb7
======
comex
This is just a demonstration of how naively optimizing the expected value is
not actually "optimal" in general.

There are two fundamental strategies for the robot. One is to stay on L
forever, and have a 50.1% chance of receiving... let's call it `x` amount of
reward, and a 49.9% chance of receiving nothing. The other is to move back and
forth (L, center, R, center), which has a 100% chance of receiving `x/4`
reward. (In theory the robot could also stay on R forever, but that's strictly
inferior to staying on L, even if only slightly.)

The robot can also use a mix of these two strategies, e.g. by moving back and
forth but waiting an extra turn every time it's on L. This increases the
worst-case reward at the cost of reducing the best-case reward.

Let's rephrase in gambling terms. Suppose you have a one-time offer to bet any
amount of money and then flip a fair coin (well, a 50.1/49.9% coin). If it's
heads, the money you bet is quadrupled; if it's tails, you lose the money.
Then the first option above represents betting all your money; the second
option represents betting nothing; mixing the strategies is equivalent to
betting only some of your money.

What would you do? The expected value of the money you come away with is 2x
the money you bet, so if you're optimizing for expected value, you'll bet all
the money you own. But most people wouldn't, because the risk of coming away
with nothing is not worth the possible benefit. Personally, I might bet some
of my money, but there's a good chance I'd bet nothing, because I care much
more about the worst-case scenario than the average scenario – at least when
the chance of a worst-case scenario is so high (almost 50%!).
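To make the trade-off concrete, here's a quick sketch of the gamble described above (the numbers are illustrative, matching the 50.1/49.9 coin and 4x payout from the comment; the bet fractions are my own choices). It prints the worst case, best case, and expected value of your final wealth for a few bet sizes:

```python
# One-shot gamble: bet a fraction b of a bankroll of 1. With probability
# 0.501 the bet comes back quadrupled; otherwise the bet is lost.
P_WIN = 0.501
MULT = 4  # winning bet is quadrupled

def outcomes(b, bankroll=1.0):
    """Return (worst case, best case, expected value) of final wealth."""
    win = bankroll - b + MULT * b   # keep the unbet money, bet quadrupled
    lose = bankroll - b             # bet is gone
    ev = P_WIN * win + (1 - P_WIN) * lose
    return lose, win, ev

for b in (0.0, 0.25, 0.5, 1.0):
    lose, win, ev = outcomes(b)
    print(f"bet {b:.2f}: worst {lose:.3f}, best {win:.3f}, EV {ev:.3f}")
```

Expected value rises linearly with the bet fraction (maxed by betting everything, EV ≈ 2.004), while the worst case falls linearly to zero; that's the whole tension between the "stay on L" and "move back and forth" strategies.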

You could argue that monetary gain is not the same thing as a reward function,
and risk aversion should be modeled as part of the reward calculation, so that
you can, in fact, optimize the expected value of the reward function. But in
that case, the robot just has a bad (or non-human-like) reward function, so
it's no surprise that it makes strange decisions.

------
g82918
This really feels like an article that is trying to say something important
about a bigger issue, but the examples and the overall tone diminish it. They
allude a lot to things more important than their own work, like Goodhart's
law, mainly for cachet, while presenting a poor-quality idea of their own (a
bizarre imagined scenario far removed from, at least, my regular concerns).
If it's about AI, show a concrete example involving a robot vacuum. Otherwise
it feels like one of my juniors blowing smoke up my ass about hardware
threads.

------
carlmr
>UL and UR think that that policy is optimal!

Whatever UL and UR are, they're introduced as "reward functions". They "see",
they "prefer", they "think", etc. Is it normal to anthropomorphize reward
functions now?

I find it makes this article hard to read. Is this common in robotics and if
so, why would you describe reward functions in such a roundabout way?

