When Goodharting Is Optimal (lesswrong.com)
11 points by laurex on Jan 17, 2020 | 3 comments



This is just a demonstration of how naively optimizing the expected value is not actually "optimal" in general.

There are two fundamental strategies for the robot. One is to stay on L forever, and have a 50.1% chance of receiving... let's call it `x` amount of reward, and a 49.9% chance of receiving nothing. The other is to move back and forth (L, center, R, center), which has a 100% chance of receiving `x/4` reward. (In theory the robot could also stay on R forever, but that's strictly inferior to staying on L, even if only slightly.)

The robot can also use a mix of these two strategies, e.g. by moving back and forth but waiting an extra turn every time it's on L. This increases the worst-case reward at the cost of reducing the best-case reward.
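
To put numbers on that, here's a quick sketch (my own framing, with the reward `x` normalized to 1) comparing the two pure strategies; any mix lands somewhere between the two lines it prints:

```python
# Illustrative only: the two pure strategies described above, with x = 1.
p = 0.501   # chance that camping on L actually pays off
x = 1.0     # reward if it does

# Strategy 1: stay on L forever (all or nothing).
stay_expected = p * x          # ~0.501
stay_worst = 0.0

# Strategy 2: cycle L -> center -> R -> center (guaranteed x/4).
cycle_expected = x / 4         # 0.25
cycle_worst = x / 4

print(f"stay on L: expected {stay_expected:.3f}, worst case {stay_worst:.3f}")
print(f"cycle    : expected {cycle_expected:.3f}, worst case {cycle_worst:.3f}")
```

Expected value favors camping on L; the worst case favors cycling.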

Let's rephrase in gambling terms. Suppose you have a one-time offer to bet any amount of money and then flip a fair coin (well, a 50.1/49.9% coin). If it's heads, the money you bet is quadrupled; if it's tails, you lose the money. Then the first option above represents betting all your money; the second option represents betting nothing; mixing the strategies is equivalent to betting only some of your money.

What would you do? The expected value of the money you come away with is 2x the money you bet, so if you're optimizing for expected value, you'll bet all the money you own. But most people wouldn't, because the risk of coming away with nothing is not worth the possible benefit. Personally, I might bet some of my money, but there's a good chance I'd bet nothing, because I care much more about the worst-case scenario than the average scenario – at least when the chance of a worst-case scenario is so high (almost 50%!).
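
If you want to see that tradeoff spelled out, here's a sketch with my own numbers (bankroll normalized to 1, bet a fraction `f`, a win quadruples the stake):

```python
# Illustrative sketch: bet a fraction f of a unit bankroll on the
# 50.1% quadruple-or-lose coin described above.
p = 0.501

def outcomes(f):
    """Expected value and worst case of final wealth when betting fraction f."""
    win = (1 - f) + 4 * f        # money kept plus the quadrupled stake
    lose = 1 - f                 # money kept only
    return p * win + (1 - p) * lose, lose

for f in (0.0, 0.25, 0.5, 1.0):
    ev, worst = outcomes(f)
    print(f"bet {f:.2f}: expected {ev:.3f}, worst case {worst:.2f}")
```

Expected wealth grows roughly linearly in the fraction you bet (about 1 + f), so a pure expected-value maximizer always goes all-in, while the worst case shrinks linearly to zero.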

You could argue that monetary gain is not the same thing as a reward function, and risk aversion should be modeled as part of the reward calculation, so that you can, in fact, optimize the expected value of the reward function. But in that case, the robot just has a bad (or non-human-like) reward function, so it's no surprise that it makes strange decisions.
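
For what it's worth, here's what folding risk aversion into the reward could look like in the same toy gamble: maximize the expected value of a concave utility of wealth (sqrt is an arbitrary choice of mine, not anything from the article):

```python
import math

# Same 50.1% quadruple-or-lose bet, but the "reward" is now sqrt(wealth),
# an arbitrary concave (risk-averse) utility chosen for illustration.
p = 0.501

def expected_utility(f):
    win = 1 - f + 4 * f        # stake quadrupled
    lose = 1 - f               # stake lost
    return p * math.sqrt(win) + (1 - p) * math.sqrt(lose)

# Crude grid search over bet fractions.
best = max((i / 100 for i in range(101)), key=expected_utility)
print(f"expected-utility maximizer bets about {best:.2f} of the bankroll")
```

With this utility the optimum is an interior fraction (around two thirds): you're still optimizing an expected value, just of a different function, which is why the choice of reward function does all the work.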


This really feels like an article that is trying to say something important about a bigger issue, but the examples and the overall tone diminish it. The author alludes a lot to things more important than their own idea, like Goodhart's law, mainly for cachet, while the idea itself is poor quality (a bizarre imagined scenario far removed from my regular concerns, at least). If it's about AI, show a concrete example involving a robot vacuum. Otherwise it feels like one of my juniors blowing smoke up my ass about hardware threads.


>UL and UR think that that policy is optimal!

Whatever UL and UR are, they're introduced as "reward functions". They "see", they "prefer", they "think", etc. Is it normal to anthropomorphize reward functions now?

I find it makes this article hard to read. Is this common in robotics, and if so, why would you describe reward functions in such a roundabout way?



