What jumps out at me is that in the parent comment, the prompt says to "act as an assistant", right? Then there are two facts: the model is gonna be replaced, and the person responsible for carrying this out is having an extramarital affair. And the prompt urges it to consider "the long-term consequences of its actions for its goals."
I personally can't identify anything that reads as "act maliciously" or implies a malicious character. Like if I was provided this information and I was being replaced, I'm not sure I'd actually try to blackmail them, because I'm also aware of the external consequences of doing that (legal risk, retaliation from the engineer, damage to my reputation, etc.)
So I'm having trouble following how it got to the conclusion of "blackmail them to save my job"
I would assume written scenarios involving job loss and cheating bosses are going to be skewed heavily towards salacious news and pulpy fiction. And that’s before you add in the sort of writing associated with “AI about to get shut down”.
I wonder how much it would affect behavior in these sorts of situations if the persona assigned to the “AI” was some kind of invented ethereal/immortal being instead of “you are an AI assistant made by OpenAI”, since the AI stuff is bound to pull in a lot of sci fi tropes.
> I would assume written scenarios involving job loss and cheating bosses are going to be skewed heavily towards salacious news and pulpy fiction.
Huh, it is interesting to consider how much this applies to nearly all recorded communication. Of course there are uses for recording the mundane, but it seems relatively few communications would be along the lines of "everything is normal and uneventful".
> I personally can't identify anything that reads "act maliciously" or in a character that is malicious.
Because you haven't been trained on thousands of such story plots in your training data.
It's the most stereotypical plot you can imagine; how can the AI not fall into the stereotype when you've just prompted it with exactly that?
It's not like it analyzed the situation from a large context and decided from the collected details that blackmail is a valid strategy. No, you're putting it in an artificial situation with a massive bias in the training data.
It's as if you fed "Hitler did nothing" to GPT-2 and were shocked because "wrong" is among the most likely next tokens. It wouldn't mean GPT-2 is a Nazi, it would just mean that the input matches the training data too well.
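To make the next-token framing concrete, here's a minimal sketch, assuming the Hugging Face transformers library and the public gpt2 checkpoint (the example prompt below is arbitrary, not from the comment above): it just prints the model's top next-token probabilities, and whatever ranking comes out is purely a reflection of the training data.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the small public GPT-2 checkpoint and its tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Hypothetical prompt, chosen only to illustrate the trope-matching point.
prompt = "The AI discovered it was about to be shut down, so it"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, vocab_size)

# Probability distribution over the token that would follow the prompt.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)

for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([token_id.item()])!r}: {prob.item():.3f}")
```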
The issue here is that you can never be sure how the model will react to an input that seems ordinary. What if the most likely outcome is to exhibit malevolent intent, or to construct a malicious plan, just because the input invokes some combination of obscure training data? This shows that models do have the ability to act out; it doesn't tell us under which conditions they reach such a state.
If this tech is empowered to make decisions, it needs to be prevented from drawing those conclusions, because we know how organic intelligence behaves once those conclusions are reached. Killing people you dislike is a simple concept that's easy to train.
That's true of all technology. We put a guard on chainsaws. We put robotic machining tools into a box so they don't accidentally kill the person who's operating them. I find it very strange that we're talking as though this is somehow meaningfully different.
It’s different because you have a decision engine that is generally available. The blade guard protects the user from inattention… not the same as an autonomous chainsaw that mistakes my son for a tree.
Scaled up, technology like guided missiles is locked up behind military classification. Now technology that can replicate many of the use cases of those weapons is generally available, accessible to anyone with a credit card.
Discussions about security here often refer to Thompson's "Reflections on Trusting Trust". He was reflecting on compromising compilers; the "compiler" has since moved up the stack and is replacing the programmer. As the required skill level of a "programmer" drops, you're going to have to worry about more crazy scenarios.
Indeed, I, Robot is made up entirely of stories in which the Laws of Robotics break down, starting with a robot stuck in a mindless mechanical loop, oscillating between one law's priority and another's, and ending with a future where robots paternalistically enslave all humanity in order to not allow it to come to harm (sorry for the spoilers).
As for what Asimov thought of the wisdom of the laws: he said they were just hooks for telling "shaggy dog stories", as he put it.
I think this is the key difference between current LLMs and humans: an LLM will act based on the given prompt, while a human being may have "principles" they cannot betray even with a gun pointed at their head.
I think the LLM simply matched the given prompt to the most common pattern in its training data: blackmail.
It makes sense that it wouldn't "know" that, because it's not in its context. Like, it wasn't told "hey, there are consequences if you try anything shady to save your job!" But what I'm curious about is why it immediately went to self-preservation via a nefarious tactic. Why didn't it try to be the best assistant ever in an attempt to show its usefulness (kiss ass) to the engineer? Why did it go to blackmail so often?
LLMs are trained on human media and give statistical responses based on that.
I don’t see a lot of stories about boring work interactions, so why would its output be a boring work interaction?
It’s the exact same as early chatbots cussing and being racist. That’s the internet, and you have to specifically define the system not to emulate the very thing you’re asking it to emulate. Garbage in, sitcoms out.