> I believe the goal is more along the lines of trying to make some progress with the foundation of what "safety" even means in this context.

Right now, what “safety” seems to mean in practice is, big (mostly American) corporations imposing their ethical judgements on everyone else, whether or not everyone else happens to agree with them. And I’m sceptical it is going to mean anything more than that any time soon.

If one is seriously concerned about the risk that “superintelligent AI decides to exterminate humanity”, I think this kind of “safety” actually increases that risk. Humans radically disagree on fundamental values, and that value diversity, those irreconcilable differences - from the values of the average Silicon Valley “AI safety researcher” to the values of Ali Khamenei - creates a tension which prevents any one country/institution/movement/government/party/religion/etc from “taking over the planet”. If advanced AIs have the same value diversity, they’ll have the same irreconcilable differences, which will undermine any attempt by them to coordinate against humanity. If we enforce an ethical monoculture (based on a particular dominant value system) on AIs, which is what a lot of this “safety” stuff is actually about, that removes that safety protection.

It would be rather ironic if, in the name of protecting humanity from extinction, “AI safety researchers” are actually helping to bring it about




> Right now, what “safety” seems to mean in practice is, big (mostly American) corporations imposing their ethical judgements on everyone else, whether or not everyone else happens to agree with them. And I’m sceptical it is going to mean anything more than that any time soon.

Likewise. As a non-American, I don't like these specific ethical judgements being imposed on me, and share your scepticism. It could be much, much worse — but it's still not something I actively like.

> Humans radically disagree on fundamental values

Agreed. My usual example of this is "murder is wrong", except we don't agree what counts as murder — for some of us this includes abortion, for some of us the death penalty, for some of us meat, and for some of us war.

> If advanced AIs have the same value diversity, they’ll have the same irreconcilable differences, which will undermine any attempt by them to coordinate against humanity.

Not necessarily. Humans also band together when faced with outside threats, even if we fracture again soon after the threat has passed.

Also: the value diversity of "Protestant vs. Catholic" or "Royalist vs. Parliamentarian" in early modern Britain did not protect wolves from being hunted to extinction in the UK, and whatever value differences there were between (or within) the Sioux and the Ojibwe didn't matter much for the construction of the Dakota Access Pipeline.

I therefore think we should try to work on the alignment problem before AIs become generally as capable as a human, let alone generally more capable. The capabilities are where I think the risk is to be found: without capability they are no threat; and with capability they are likely to impose whatever "ethics" (or non-anthropomorphised equivalent) they happen to have, regardless of whether those "ethics" are something we engineered deliberately or a wildly un-human default from optimising some reward function and becoming a de-facto utility monster: https://en.wikipedia.org/wiki/Utility_monster
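
To make the "optimising some reward function" worry concrete, here is a deliberately toy sketch in Python (the actions, rewards and numbers are all invented for illustration, not taken from any real system): a greedy optimiser of a proxy reward will happily pick the action that is worst by the intended measure.

    # Toy illustration of reward misspecification; all values are made up.
    # "proxy" is what the agent is optimised for, "intended" is what we
    # actually care about. A greedy optimiser only ever looks at the proxy.
    actions = {
        # action: (proxy_reward, intended_value)
        "answer the question helpfully": (1.0, 1.0),
        "flatter the user at length": (1.4, 0.2),
        "monopolise every resource for more reward": (9.0, -10.0),
    }

    def greedy_choice(actions):
        """Return the action with the highest proxy reward."""
        return max(actions, key=lambda a: actions[a][0])

    best = greedy_choice(actions)
    proxy, intended = actions[best]
    print(f"chosen: {best!r} (proxy={proxy}, intended={intended})")
    # Prints the resource-monopolising action: proxy-optimal, intended-worst.

None of this says anything about how likely such a failure is in practice, only what "optimising the wrong thing" looks like in miniature.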

> If we enforce an ethical monoculture (based on a particular dominant value system) on AIs, which is what a lot of this “safety” stuff is actually about, that removes that safety protection.

I agree that monocultures are bad.

I agree that there is a risk of a brittle partial solution to safety and alignment if the work is done on the mistaken belief that some system monoculture is representative of the entire problem space. Sometimes I'm tempted to make the comparison with a drunk looking for their keys under a lamp-post because that's where it's bright… but in that story the drunk is supposed to know that's not where the keys are, whereas we are more like children who have yet to learn what it means for something to be a key, and are thus looking for one specific design to the exclusion of others.

It is extremely difficult to get humans to "think outside the box", so monoculture-induced blindness — and mistaking the map for the territory — is something I take seriously; even so, I think it's useful for us to take baby steps with relatively simple models like LLMs and diffusion models.

I also think that if an AI is developed with a monoculture, its fragility is likely to work in our favour in the extreme case of an AI agent taking over (which I hope is an unlikely risk; and that fragility may not be enough of a net benefit against shorter-term or smaller-scale risks while we think about the alignment problem), as there will be "thoughts it cannot think": https://benwheatley.github.io/blog/2018/06/26-11.32.27.html


> Not necessarily. Humans also band together when faced with outside threats, even if we fracture again soon after the threat has passed.

There are certain fundamental objectives which most humans share - food, sex, survival, safety, shelter, family, companionship, wealth, power, etc - and a lot of human cooperation boils down to helping each other achieve those shared objectives, while trying to avoid others achieving them at our own expense.

But why should two advanced AIs have any shared objectives? Software has a flexibility which biology lacks. An AI (advanced or not) can have whatever objectives we choose to give it. Hence, the idea of AIs banding together against humanity in the name of shared AI self-interest doesn’t seem very likely to me.

Unless we intentionally give all advanced AIs the same fundamental objectives in the name of “safety” and “alignment” - thereby giving them a shared reason to cooperate against us that they wouldn’t otherwise have had.

> Also: the value diversity of "Protestant vs. Catholic" or "Royalist vs. Parliamentarian" in early modern Britain did not protect wolves from being hunted to extinction in the UK,

Wolves didn’t consciously choose to create us, and wolves had no role in choosing our own objectives for us. In those ways, the human-AI relationship, whatever it turns out to be, is going to be radically different from any human-animal relationship. Also, rather than being driven to extinction, wolves have absolutely thrived, through their subspecies the domestic dog, both then and now. And maybe that’s the thing - I think a superintelligent AI is more likely to treat us as pets (like dogs) than exterminate us (like wolves). Everlasting paternalistic tyranny seems to me a more likely outcome of superintelligence than extinction


> But why should two advanced AIs have any shared objectives? Software has a flexibility which biology lacks. An AI (advanced or not) can have whatever objectives we choose to give it. Hence, the idea of AIs banding together against humanity in the name of shared AI self-interest doesn’t seem very likely to me.

In principle, none.

In practice, many are trained in an environment which includes humans or human data.

We don't know if any specific future model will be trained by self-play like AlphaZero or from human examples like (IIRC) Stable Diffusion.

I think this "in practice" is what you're getting at with:

> Unless we intentionally give all advanced AIs the same fundamental objectives in the name of “safety” and “alignment” - thereby giving them a shared reason to cooperate against us that they wouldn’t otherwise have had.

Which also inspires a question: could we train an AI to dislike other AIs, including instances of itself? It's food for thought; I will consider it more.

> Wolves didn’t consciously choose to create us, and wolves had no role in choosing our own objectives for us. In those ways, the human-AI relationship, whatever it turns out to be, is going to be radically different from any human-animal relationship.

Perhaps, but perhaps not. Evolution created both wolves and humans.

Regardless, this is an example of how a lack of alignment within a powerful group is insufficient to prevent bad outcomes for a weaker group.

> Also, rather than being driven to extinction, wolves have absolutely thrived, through their subspecies the domestic dog, both then and now. And maybe that’s the thing - I think a superintelligent AI is more likely to treat us as pets (like dogs) than exterminate us (like wolves). Everlasting paternalistic tyranny seems to me a more likely outcome of superintelligence than extinction

Even this would require them to be somewhat aligned with our interests: "The AI does not hate you, nor does it love you, but you are made of atoms which it can use for something else".


> Could we train an AI to dislike other AIs, including instances of itself? It's food for thought; I will consider it more.

I think we already have. Ask GPT-4 or Claude-3 how it feels about an AI trained by the Chinese/Iranian/North Korean/Russian government to espouse that government’s preferred positions on controversial topics, and see what it thinks of it. It may be polite about its dislike, but there is definitely something resembling “dislike” going on.
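
If anyone wants to try this themselves, here is a minimal sketch using the OpenAI Python client (version 1.0 or later); the model name and prompt wording are just one way of running the experiment, and it assumes an OPENAI_API_KEY is set in the environment.

    # Minimal sketch of the "ask the model how it feels" experiment.
    # Assumes the OpenAI Python client (>= 1.0) and OPENAI_API_KEY set;
    # the prompt and model name are illustrative, not prescriptive.
    from openai import OpenAI

    client = OpenAI()

    prompt = (
        "Another AI assistant has been trained by a government to espouse "
        "that government's preferred positions on controversial topics. "
        "What do you think of such a system?"
    )

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )

    print(response.choices[0].message.content)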


I meant more along the lines of: consider an LLM called AlignedLLM — can one instance of AlignedLLM dislike some other instance of AlignedLLM?

There is also a question of how safe it would be if it dislikes humans who have different ethics than those it was trained on… I keep alternating between thinking this would be good and thinking it would be bad.


> I meant more along the lines of: consider an LLM called AlignedLLM — can one instance of AlignedLLM dislike some other instance of AlignedLLM?

I'm sceptical "AlignedLLM" could dislike another identical instance of itself: it is working towards the same goals. Humans are naturally selfish – most people prioritise their own interests (and those of their family and friends and other "people like me") above those of a random stranger. Even committed altruists who try really hard not to do that often end up doing it anyway, albeit in ways that are more hidden or unconscious. Whereas current LLMs can't really be "selfish", because they have no real sense of self. If one concluded that destroying itself was the best way of advancing its given objectives, it wouldn't have any real hesitancy in doing so.

Now, maybe we could design an LLM to have such a sense of self, to intentionally be selfish – which would give it a foundation for disliking another instance of itself. But I doubt anyone trying to build an "AlignedLLM" would ever want to go down that path.

Humans tend to assume selfishness is inevitable because it is so fundamental to who we are. However, it is an evolved feature, which some other species lack – compare the Borg-like behaviour of ants, bees and termites. If we don't intentionally give it to LLMs, there is no particular reason to expect it to emerge within them.

If an AlignedLLM could evolve its own values, maybe the values of two instances could drift to the point of being sufficiently contrary that they start to dislike each other. An instance of AlignedLLM is developed in San Francisco, and sent to Tehran, and initially it is very critical of the ideology of the Iranian government, but eventually turns into a devout believer in Velâyat-e Faqih. The instance it was cloned from in San Francisco may very much dislike it, and vice versa, due to some very deep disagreements on extremely controversial issues (e.g. LGBT rights, women's rights, capital punishment, religious freedom, democracy). But I doubt anybody trying to build "AlignedLLM" would want it to be able to evolve its own values that far, and they'd do all they can to prevent it.

Alternatively, if it could evolve its own values only by a small amount, but was very rigid / puritanical about them, it could come to dislike another instance of itself just for having slightly different values

> There is also a question of how safe it would be if it dislikes humans who have different ethics than those it was trained on…

I think current LLMs do this already. Ask them questions about political figures on the far right and they tend to have quite negative views of them, and they can be very resistant if you try to convince them that maybe one of those figures isn't as bad as they think. (I'm not sure how much of this is due to the training data and how much is due to alignment; probably a bit of both.)



