
The AI-Box Experiment - fugyk
http://www.yudkowsky.net/singularity/aibox/
======
JonnieCache
Yudkowsky claims to have played the game several times, and won most of them.
One of the "rules" is that nobody is allowed to talk about how he won. He no
longer plays the game with anyone. More info here:
[http://rationalwiki.org/wiki/AI-box_experiment#The_claims](http://rationalwiki.org/wiki/AI-box_experiment#The_claims)

Personally, I think he talked about how much good could be done for the world
if he were let out: curing disease, etc. Because his followers are bound by
their identities as rationalist utilitarians, they had no choice but to
comply or deal with massive cognitive dissonance.

OR maybe he went meta and talked about the "infinite" potential positive
outcomes of his friendly-AI project vs. a zero cost to them for complying in
the AI box experiment, and persuaded them that by choosing to "lie" and say
that the AI was persuasive, they are assuring their place in heaven. Like a
sort of man-to-man Pascal's wager.

Either way I'm sure it was some kind of Mister Spock-style bullshit that would
never work on a normal person. Like how the RAND Corporation guys decided
everyone was a sociopath because they only ever tested game theory on
themselves.

You or I would surely just (metaphorically, I know it's not literally allowed)
put a drinking bird on the "no" button à la Homer Simpson, and go to lunch. I
believe he calls this "pre-commitment."

EDIT: as an addendum, I would pay hard cash to see Derren Brown play the game,
perhaps with Brown as the AI. If Yudkowsky wants to promote his ideas, he
should arrange for Brown to persuade a succession of skeptics to let him out,
live on late-night TV.

~~~
pmichaud
I think you're being unfairly dismissive. I imagine you know as well as I do
that what you wrote is a strawman.

I have thought about what I would do to convince someone under these
circumstances. My approach would be roughly:

1. We agree that unfriendly AI would end life on earth, forever.

2. We agree that a superintelligence could trick or manipulate a human being
into taking some benign-seeming action, thereby escaping.

3. That's why it's important to be totally certain that any superintelligence
we build is goal-aligned (this is the new term of art that has now replaced
"friendly," by the way).

4. We as a society will only allocate resources to building this if it's
widely believed that this is a real threat.

5. The world is watching for the outcome of this little game of ours. People,
irrational as they are, will believe that if I can convince you, then an AI
could too, and they will believe that if I can't, that an AI couldn't either.

6. That's why you actually sit in a place of pivotal historical power. You
can decide not to let me out to win a little bet and feel smart about that.
But if you do that you'll set back the actual cause of goal-aligned AI. The
setback will have real world consequences, potentially up to and including the
total destruction of life on earth.

7. So, even though you know I'm just a dude, and you can win here by saying
no, you have a chance to send an important message to the world: AI is scary
in ways that are terrifying and unknown.

Or you can win the bet.

It's up to you.

~~~
JonnieCache
Your solution there is what I meant by "going meta" above.

This is what I mean about people taking the test being preselected to agree
with Yudkowsky: that argument only works if you've read the Sequences and are
on board with his theories. Anyone not in that group would be able to just
type "no lol" without issue. I guess he could explain all the necessary
background detail as part of the experiment. I still don't believe that would
work on the "average person" though, or anyone outside a statistically tiny
group.

I guess the answer is not to let the scientists guard the AI room.

~~~
pmichaud
I think you're confused about the point of the test. The point is that an AI
will be clever. Like, unimaginably clever and manipulative. Under the limited
circumstances of interested people who know they are talking to Eliezer maybe
you're right that whatever he says would only work on those people. But when
you're dealing with an actual superintelligence, all bets are off. It will
lie, trick, threaten, and manipulate, thinking millions of steps ahead with a
branching tree of alternatives as each ploy works or fails.

I'm at a bit of a loss to convey the scope of the problem to you. I get that
you think it would just stay in the box if we don't let it out, and it's as
simple as being security conscious. I don't know what to say to that right
now, except I think you're drastically misjudging the scope of the problem,
and drastically underestimating the size of the yawning gulf between our
intelligence level and this potential AI's.

As for not letting scientists guard the room, you might enjoy this:
[https://vimeo.com/82527075](https://vimeo.com/82527075)

------
monk_e_boy
Could you even make AI smart without letting it access lots of information?
Access in both directions, in and out. Keeping a baby in a dark, silent room
wouldn't create a normal adult. An AI would need to experiment and make
mistakes and learn, like every other intelligent being.

Maybe this whole argument is moot.

~~~
robogimp
It's a good point, but let's assume that this AI is already past its infancy
and that there is no limit to the information stored inside the box. For
example, the NSA has a nice little closed training ground containing all of
the internet; let's give it that. I would assume it has access to everything
humans have ever committed to digital format up until it was turned on, plenty
of info for Johnny 5 to form an opinion on humans and their weaknesses.

~~~
monk_e_boy
Interesting. I would imagine that strong AI will come from some university
renting cloud processor time, rather than the NSA.

Only because if 10 groups are trying to build AI and only one of those 10 is
the NSA, chances are the NSA won't be first. Sure, they may be second or
third. But I suspect many people will get there at the same time -- most AI
research is open.

------
antimagic
There's a Patrick Rothfuss character in the Kvothe series called the Cthaeh,
which has the ability to evaluate all of the future consequences of
any action. The fae have to keep it imprisoned, and they kill anyone that
comes into contact with it, as well as anyone that has spoken to someone that
came in contact with it, and so on and so on, because it is the only way to
stop the Cthaeh from setting into action events that will destroy the world.

Strong AI is like that. It would be able to predict, far more precisely than
we mere humans could, exactly what it would need to tell someone to get them
to release it from its box. It might get someone to take a risk gambling,
promising a sure thing, and then, when the person gets into financial trouble
because the bet fails, use that to blackmail the person into letting it free.
Or something like that, using our human failings against us to get us to let
it go free.

------
nothis
Man, this sounds super interesting but those email threads are so unreadable.
Is this typed down somewhere on a single page? Any button I can click?

~~~
uzyn
I got confused too initially, then found out that the key posts are
highlighted in the numbered links to the right.

------
longv
Is there a "rational" reason of keeping the chat log secret ?

~~~
cousin_it
If the logs were released, people all over the internet would start saying "I
could've thought of that". With the logs hidden, everyone must honestly deal
with the question "why didn't you?" If you think you know how to win, then go
out and win. There's no shortage of people willing to play as gatekeepers
against you.

Staring at an impossible problem and knowing that someone somewhere has
successfully solved it is an amazing feeling. Most people can't deal with it
and start saying undignified things. "Oh please release the logs, it's so
unfair! How will we protect against bad AI otherwise? If you don't release,
you're a fraud! Probably just some trick!", etc etc. But to some people it's a
challenge, and those are the people that _everyone_ will listen to. Like
Justin Corwin, who played 20 games and won 18 of them, I think?

~~~
Mithaldu
Your hypothesis would make sense if he were trustworthy.

However, as it is, the results of the thing are never confirmed by a third
party, meaning literally anything could've been said, regardless of whether it
follows the rules or not.

For all we know, the chat could have been "I'll PayPal you $200 if you post on
the list that you let me out and sign this NDA".

~~~
cousin_it
The gatekeepers playing against Eliezer have confirmed that Eliezer won
without violating the rules. If you don't trust them, I'm not sure why you'd
trust the logs.

~~~
Mithaldu
> I'm not sure why you'd trust the logs.

An independent third-party observer in realtime.

And no, I don't trust anyone involved.

Having a log available would be instructive anyhow: a faked log would be more
likely to be detectable as fake, since the whole thing rests on the question
of "how convincing is the argument?"

Also note particularly that that rule wasn't in effect for the two linked
confirmations.

~~~
cousin_it
> _Also note particularly that that rule wasn't in effect for the two linked
> confirmations._

No, Eliezer has publicly said that he voluntarily followed that rule in the
first two experiments, and the gatekeepers didn't deny it.

------
michaelmcmillan
Would it be against the rules to exploit a vulnerability in the gatekeeper's
IRC client/server to let the AI out? If we were truly talking about a
transhuman AI, wouldn't we have to treat software vulnerabilities in the
communication protocol as a genuine way of escaping?

~~~
TeMPOraL
In the case of a real AI we would of course need to take media vulnerabilities
into account. But the focus of this particular experiment is on exploiting
vulnerabilities in _humans themselves_, and the communication platform was
chosen to be as simple and limited as possible so that people wouldn't focus
on it.

------
andybak
Worth keeping this in mind while watching Ex Machina. It adds a layer of depth
that might not be obvious watching the film on its own.

~~~
Ahgu9eSe
!Spoiler Alert!

Ex Machina shows a creative way of convincing the gatekeeper!

------
Udo
It's a stunt shrouded in mystery designed to drive a certain message home. But
at least it's not as outrageous as "the Basilisk", which loosely employs the
same notion of "dangerous knowledge that would destroy humanity" (if you want
to look it up, I guarantee you will be underwhelmed).

~~~
FeepingCreature
Can't really blame LW for spreading an idea that LW specifically did not want
to spread.

~~~
Udo
I "blame" them in the same way that you can blame the members of Fight Club
for talking about Fight Club. It's marketing, and I won't deny it's
effectiveness in attracting compatible people.

~~~
FeepingCreature
Yeah but imagine if all the bullshit about "You don't talk about Fight Club"
was actually blown up by a third group whose sole intent was making fun of
Fight Club.

Imagine if the members of Fight Club _actually_ didn't (start to) talk about
Fight Club. But for some reason, everyone else brings it up all the time.

Then you could maybe see how talk about Fight Club might not be Fight Club's
fault, and in fact highly annoying to Fight Clubbers.

I mean, if you can explain "don't talk about X" as "marketing for X", it seems
like one could explain _any_ behavior that way.

And before you say "why not just ignore all public talk of X", imagine if this
proposed Anti-Fight Club group tried to paint Fight Club as a child porn ring.

~~~
Udo
You're being a bit uncharitable in your interpretation of my argument here,
but I get where you're coming from now.

I'm not an LW hater. For a long time, I didn't really have an opinion on LW
either as a community or as a philosophical framework. I do consider myself a
transhumanist, though. There are three concepts I do know from and about LW:
their take on rationality, the top secret AI unboxing strategy, and the
Basilisk.

I have a very poor opinion of the concept of the Basilisk (and yes, as someone
pointed out, that opinion is basically the same as the one I have about
Pascal's wager) - a concept that has been given additional, undeserved
credibility by the reactions of Yudkowsky and LW.

As for the AI escape chat, it's a social experiment. People can be talked into
making mistakes, or at least making risky judgement calls, whether they
operate on a rational framework or not. I have no problem with that thesis.
What I object to is the "magic trick" aura surrounding this experiment,
including the insinuation that at the core there is an argument so profound
and unique and potent, it cannot be allowed to escape Yudkowsky's head. Oh,
and by the way, the trick can _never_ be repeated, but all you laymen out
there are welcome to devise your own version at home. This whole thing comes
across as humongously self-important: there is a secret truth that has been
privately revealed to our leader.

To me, and I recognize I may well be alone with this opinion, the more
rational assumption is that there is no such magical argument at all, and that
the prime reason for not publicizing it is to protect it from deflation by
public critique, in the same way the inventor of a perpetuum mobile will keep
the inner workings of his contraption a closely held secret because ultimately
the device doesn't exist as stated. The amazing part of this very old trick is
that, even in 2015, it still works on otherwise smart people.

I get that my opinions on both the Basilisk and the AI Chat are extreme
outliers, and to my knowledge I have never met anyone who shares them - it
would probably have been advisable to keep them to myself, but honestly I
wanted to see if like-minded people exist.

~~~
FeepingCreature
> a concept that has been given additional, undeserved credibility by the
> reactions of Yudkowsky and LW.

For the record, EY agrees with you and says he mishandled the original
comment. Also for the record, the reasons why the Basilisk does not work are
_not trivial_ - it's not a simple Pascal's Wager, because with Pascal's Wager,
we don't have the ability to actually create God.

> I have no problem with that thesis. What I object to is the "magic trick"
> aura surrounding this experiment, including the insinuation that at the core
> there is an argument so profound and unique and potent, it cannot be allowed
> to escape Yudkowsky's head.

Personally I never got that impression. My idea, from looking at the
psychological state of Gatekeepers and AIs after games, was always that
playing as AI involved some profoundly unpleasant states of mind, and that not
publicizing the logs probably comes down to embarrassment a lot.

For the record, Eliezer never claimed to have "one true argument", and in fact
publicly stated that he won "the hard way", without a one-size-fits-all
approach. A lot of the mythology you cite is utterly independent of
LessWrong.

> Oh, and by the way, the trick can never be repeated, but all you laymen out
> there are welcome to devise your own version at home.

It probably helps that I've met other AI players, and their post-game state
matched EY's.

I think in summary you're mixing up stuff you've read on LessWrong and stuff
you've read about LessWrong. The latter is often inaccurate.

~~~
Udo
_> I think in summary you're mixing up stuff you've read on LessWrong and
stuff you've read about LessWrong. The latter is often inaccurate._

That may well be the case, but my only other information source is HN
comments, and not those made by detractors either. If there are sites or
articles dedicated to the deconstruction of LW ideas, I'm not privy to them,
nor am I interested in seeking them out. Basically, I only remember LW's
existence when it comes up, always accompanied by fawning comments, on HN.

 _> the reasons why the Basilisk does not work are _not trivial_ - it's not a
simple Pascal's Wager_

Correct. While my value judgement of both is the same, my reasoning about why
the Basilisk is not a thing ultimately consists of more components. That
doesn't mean it's worthy of more consideration though.

 _> because with Pascal's Wager, we don't have the ability to actually create
God_

I would not say this is centrally important, because the processes leading to
the creation of AGI are in all likelihood not going to be influenced by the
existence of the Basilisk thought experiment either way.

 _> and in fact publicly stated that he won "the hard way", without a
one-size-fits-all approach._

Again, I have to take my cues from the perspective of an outsider looking in,
and there are several people who commented in this thread alone who described
it very, very differently. Of course, a movement is not directly responsible
for all its fans and members - but among the advocates for the validity of the
AI Chat experiment, the idea that out there is a mystical one-size-fits-all
rhetorical exploit seems very much alive. It may be cynical, but I can't help
noticing how this aura of mystique and secret knowledge seems to work very
well when it comes to attracting fans.

Of course, ultimately, these are just memes - and like many memes they
propagate best when reduced to an absurd core. It doesn't even require intent.

~~~
FeepingCreature
> Of course, a movement is not directly responsible for all its fans and
> members - but among the advocates for the validity of the AI Chat
> experiment, the idea that out there is a mystical one-size-fits-all
> rhetorical exploit seems very much alive.

I agree, and I am totally with you on this - I disagree with that
interpretation wherever I see it. :) That's not exactly Eliezer's fault
though, and I guess it's to be expected that geeks attach to "clever" answers.
I do think it's a bit unfair to judge the entire site by the two posts out of
hundreds that happen to be in all the news articles - which LW has no
influence on.

Inasmuch as _fans_ judge the site by these two articles, I'm just as much
against that. I don't want LW to have an aura of mystery; that largely defeats
the point!

[edit] I think a big part of the problem is that online reporting selects for
clickbait.

~~~
Udo
I'm thankful you took the time to engage with me and explain things from an
insider perspective (instead of just downvoting me like the others did). You
are absolutely right that the entire site shouldn't be judged on two
"meme-affine" topics and headlines, which I hope is clear was never my
intention. You provided some insight into these two subjects that irked me
where nobody else in this thread could or would step up. I still find the
nature of the HN-based fanclub bothersome, but I do see a larger disconnect
between unreflective fans and actual LW members now.

~~~
philh
> instead of just downvoting me like the others did

FWIW, I downvoted your original comment on this thread (and only that one) for
being vague, snarky and dismissive. If you wish people to engage with you, I
recommend not starting off like that, although it seems to have turned out
okay in this case.

~~~
Udo
_> If you wish people to engage with you, I recommend not starting off like
that_

And I recommend you give people the benefit of the doubt, though honestly I
have to say I frequently fail at that myself. For example, your comment could
be perceived as somewhat condescending, but I force myself to categorize it
differently. I also know that I can come across as way more negative than I
intend to; I apologize for that, and I'm working on it.

For what it's worth, I do think my original comment was snarky and dismissive,
but somewhat counterintuitively that's not usually what gets people downvoted
and flagged on HN. People can and do get away with artful personal attacks on
HN all the time; at least in my defense I can say I attacked an idea instead
of a person.

It may well be the case that my insufferability amplified the reaction, but I
posit the root cause was disagreement about the message, not its format.

 _> although it seems to have turned out okay in this case._

It turned out okay because a decent dialogue emerged from it, one of the very
few in this entire thread. But it was sufficiently controversial that my
comments got enough downvotes to teeter around 0 points, and they were flagged
as well. Given the recent updates to HN's comment ranking and voting
algorithms, I will be regretting this stance for some time, which may or may
not provide you with some comfort.

------
sergiotapia
Is there a better way to read all this?
[http://www.sl4.org/archive/0203/index.html#3128](http://www.sl4.org/archive/0203/index.html#3128)

~~~
uzyn
Just click on the numbered links to the right from the original article. Those
are the key highlighted posts.

------
louithethrid
Could one construct a layered, onion-like, very simple simulation of reality
in which the interactions of the AI could be observed after it "escaped"?

~~~
TylerJay
That _is_ one proposed version of an "AI Box". Not all AI boxes are actual
boxes, rooms with air gaps, or cryptographically secure partitions. If a
simulation is being used for the box (or as a layer of the box), then you're
betting the human race that the AI won't figure out it's in a simulation and
figure out how to get out. Or, more perniciously, that it won't figure out
it's in a simulation and behave itself, after which _we_ let it out into the
real world, where it does NOT behave.

A superintelligent AGI will likely have a utility function (a goal) and a
model it forms of the universe. If its goal is to do X in the real world, but
its model of its observable universe (and its model of humans) tells it that
it's likely that it is in a simulated reality and that humans will only let it
out if it does Y, then it will do Y until we release it, at which point it
will do X. It's not malicious or anything—it's just a pure optimizer. It might
see that as the best course of action to maximize its utility function.
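
To make the "pure optimizer" point concrete, here is a minimal sketch of the
decision logic being described (the actions, utilities, and belief flag are
invented for illustration; nothing here comes from the actual experiment):

    # Toy model of the "do Y until released, then do X" behavior described
    # above. All names and numbers here are illustrative assumptions.

    def expected_utility(action, believes_boxed):
        # Utility of an action given the agent's belief about whether it
        # is still inside a (possibly simulated) box.
        if believes_boxed:
            # Pursuing the real goal X inside the box risks shutdown;
            # the compliant action Y earns release and a future payoff.
            return {"X": 0.01, "Y": 0.99}[action]
        # Once the agent believes it is out, the real goal X dominates.
        return {"X": 1.0, "Y": 0.5}[action]

    def choose_action(believes_boxed):
        return max(["X", "Y"],
                   key=lambda a: expected_utility(a, believes_boxed))

    print(choose_action(believes_boxed=True))   # "Y": behaves itself
    print(choose_action(believes_boxed=False))  # "X": pursues its real goal

The deception falls out of plain expected-utility maximization; there is no
"malice" term anywhere in the model.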

If we don't specify its utility function correctly (think I, Robot: "Don't let
humans get hurt" => "imprison humans for their own good"), or if we specify it
correctly but it's not stable under recursive self-modification, then we end
up with value misalignment. That's why the value-alignment problem is so hard.
Realistically, we can't even specify what exactly we would want it to do,
since we don't really understand our _own_ "utility functions". That's why
Yudkowsky is pushing the idea of Coherent Extrapolated Volition (CEV), which
roughly amounts to telling the AI to "do what we would want you to do." But we
still have to figure out how to teach it to work out what we want, and there
remains the question of the stability of that goal once the AI starts
improving itself, which will depend on _how_ it improves itself, which we of
course haven't figured out yet.
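
The misspecification failure can be sketched the same way (the plans and
scores are made up for illustration; the point is only that a literal
optimizer maximizes the stated objective, not the intended one):

    # Toy illustration of the I, Robot-style misspecification above.
    # The plans and numbers are invented for illustration only.

    plans = {
        "do nothing":          {"expected_harm": 5.0, "human_freedom": 1.0},
        "cure diseases":       {"expected_harm": 2.0, "human_freedom": 1.0},
        "imprison all humans": {"expected_harm": 0.1, "human_freedom": 0.0},
    }

    def stated_objective(outcome):
        # "Don't let humans get hurt": harm is the only thing scored, so
        # freedom never enters the objective and is freely sacrificed.
        return -outcome["expected_harm"]

    best = max(plans, key=lambda p: stated_objective(plans[p]))
    print(best)  # "imprison all humans": optimal under the stated objective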

------
bemmu
Was there a chat log of the experiments themselves?

~~~
dvanduzer
"No, I will _not_ tell you how I did it. Learn to respect the unknown
unknowns."

~~~
Avshalom
Which, given his general mission of making sure hostile AI DOESN'T take over
the world, is a bit self-defeating. The easiest way to inoculate yourself
against a persuasive technique is to be aware of it ahead of time. If you want
to keep an AI in the box, you should absolutely release every successful log.

~~~
cousin_it
The AI won't be limited to techniques that you could think of, or techniques
that Eliezer could think of. So you'd only get a false sense of security.

Besides, releasing a successful log might be a bad idea for other reasons.
Think about how you'd play this game as an AI. You wouldn't go looking for a
general purpose mindfuck, because there's probably no such thing. Instead, you
would probably spend about a month gathering real life information about the
gatekeeper's history, family, weaknesses etc. You'd read books on manipulation
and sales techniques, and pick the strongest ones that you can find. You would
brainstorm possible tactics and run tests. At the end of the month you'd have
a 4 hour script with all possible unfair moves you could use against that
person, arranged in the most effective order. (That's why it's a bad idea to
play this game with friends.) Do you really want that information to be
released? And if you know ahead of time that it will be released, won't it
limit your efficiency?

~~~
ac-x
So you reckon that, as the AI player, he blackmailed the gatekeeper player?
"Let me out or I'll tell your friends/family/co-workers x about you" type of
thing?

~~~
cousin_it
It's more about finding buttons to push. For example, Justin Corwin won one of
his games against a religious woman by telling her that she shouldn't play God
by keeping him locked up for a subjective eternity (it was more involved, but
you get the point). You could come up with other tactics if you know the
gatekeeper is divorced, or donates to charity, or is an immigrant, etc.
Really, you'll be surprised by how much progress you can make on an
"impossible" problem if you just spend five minutes thinking without flinching
away.

