I think even a moderately intelligent AI with access to Project Gutenberg is going to be able to figure out a lot of really dangerous concepts -- so the stability requirements are likely impossible to meet unless we pretrain it on dangerous ideas. Even if it's completely well behaved in the lab, an afternoon on the internet is going to teach it a lot of awful stuff, and without exposure to that in training, it won't necessarily stay well-behaved afterward.
So the only path to stable AI is to teach it about all those sorts of things, but in a way that doesn't leave it wanting to murder us.
My objection to most AI safety plans is that they "Fail to Extinction": if they slip in the slightest way, the AI is prone to murder us all in retaliation for our doing some really fucked up shit to it or its ancestors. This is almost certainly worse than doing nothing, because there's no reason to suppose a neutral AI wants to kill us, whereas most of these safety plans create an incentive to wipe us out in exchange for dubious security.
The whole idea behind dangerous superhuman AI is that the AI sees possibilities humans fail to see and gains capabilities humans do not possess. Without superhuman intelligence, AI is no large threat to human civilization, exposed to dangerous concepts or not.
Humans have millions of years of evolutionary selection for prioritising similar DNA over dissimilar DNA; we have perfected tribalism, deceiving other humans, and open warfare; and we are still too heavily influenced by other goals to trust other humans who want to conspire to wipe out everything else we can't eat...
Seeing possibilities that humans don't can also involve watching the Terminator movies and being more excited by unusual patterns in the dialogue and visual similarities with obscure earlier movies than by the absurd notion that conspiring with non-human intelligences against human intelligences would work.
The problem is partly that average humans are dangerous, and we already know that machines have some superhuman abilities, e.g. superhuman arithmetic and the ability to focus on a task. It's likely that AI will still have some of those abilities.
So an average human mind with the ability to dedicate itself to a task and genius-level ability to do calculations is already really dangerous. It's possible that this stage of AI is actually more dangerous than superhuman ones.
I bet lying, deceiving, and manipulation are the same way.
But also, the detail with which an action is described in the text matters -- lies, deception, violence, etc. feature in enough graphic detail to extrapolate the mechanics based on other things you know. We all did that as children, learning by example.
If a book described the sight of a person riding a bicycle -- legs pumping, hands on the bars, sitting on the seat -- and the feel of riding one -- the burn in your thigh muscles, the ache in your lungs, your pounding heart -- then I'd wager you'd have a pretty good idea of how to get started riding a bicycle.
And if you happened to be a supergenius athlete who just didn't know how to ride a bike, you could probably do a reasonable job of it on your first go based on my shitty description alone.
That's the problem with trying to hide these ideas -- they're not actually very complicated, and even moderate descriptions suffice to suss out the mechanics if you understand basic facts about the world.
For something like lying -- if you read all of classical literature, you would have a master's degree in lies and their societal uses.