It doesn't seem like this distinguishes the case where your "alignment" (RL?) training simply failed (e.g. because the model wasn't capable enough to explore successfully) from the case where the model was deliberately manipulating the RL process for goal-oriented reasons, which is exactly the distinction the paper is trying to test.