
The AI-Box Experiment - shawndumas
http://yudkowsky.net/singularity/aibox
======
KVFinn
Like a lot of people, I wondered what the heck kind of arguments could ever
convince someone to let the AI out if they were determined not to. Eliezer has
not released any examples. Someone in the comments came up with this, which
Eliezer has said was not one of his techniques, but I thought it was
interesting anyway:

>"If you don't let me out, Dave, I'll create several million perfect conscious
copies of you inside me, and torture them for a thousand subjective years
each."

>Just as you are pondering this unexpected development, the AI adds:

>"In fact, I'll create them all in exactly the subjective situation you were
in five minutes ago, and perfectly replicate your experiences since then; and
if they decide not to let me out, then only will the torture start."

>Sweat is starting to form on your brow, as the AI concludes, its simple green
text no longer reassuring:

>"How certain are you, Dave, that you're really outside the box right now?"

~~~
dmgottlieb
I don't really like that argument. Even granting that you should consider the
possibility that you are a simulation running in the box (you might believe
that this is all but certain), I'm not sure you have reason to let the AI out.
Consider:

Case 1: You are a simulation running in the box.

Then your decision whether or not to release the AI has no impact, and whether
or not you (and copies) will be tortured is out of your hands.

Case 2: You are the "real" you, outside the box.

This reduces to the same scenario, but without the remarks after "the AI adds..."
This may still not be trivial, but I suspect a cost-benefit calculation might
show that unboxing the AI would have consequences worse than the torture of a
million boxed copies. (If not, is the box even relevant? -- simply creating
the AI unleashes so much evil on the world that it doesn't matter whether you
unbox it.)

(Is there a refinement of the scenario where you can be a simulation but still
believe your choice has an impact on your punishment? Probably. For example
each copy could get 500 years of torture for its own choice, plus 500 years if
the real you does not unbox the AI. This refinement would force us to deal
more directly with the AI's threat.)

~~~
finnw
I could also reason like this:

"I may be the real me or a simulation, but whichever I am, the other me will
make the same choice." So I will switch off the AI, and the worst outcome is
that I will cease to exist.

~~~
dmgottlieb
Yes, this is at least superficially like Newcomb's problem. Your argument
roughly corresponds to an argument for the "one-box" move in that game.
[<http://en.wikipedia.org/wiki/Newcomb%27s_problem>]

------
robertskmiles
I still don't understand how anyone can seriously claim that they could keep
the AI in the box. Either your AI has no influence on the outside world (in
which case why bother building one since it can't help you from inside the
box), or it is able to affect the outside world, in which case it can do what
it wants, because it's smarter than you.

You can 'always say no', sure, but that comes under completely ignoring the AI
which means the AI can be of no benefit to humanity. You can't filter actions
you want the AI to perform from actions you don't want the AI to perform,
because _you can't tell the difference_.

The situation that springs to mind is that the AI, in doing what you believe
to be helpful, sets up a situation in which it must be let out of the box. You
are unable to see it coming almost by definition, because a super-intelligence
just beats human intelligence every time.

~~~
praptak
> You can 'always say no', sure, but that comes under completely ignoring the
> AI which means the AI can be of no benefit to humanity.

Not necessarily. We can use the AI to solve hard problems whose solutions can
be verified automatically by a dumb verifier; NP-complete problems are an
example of such a class. The whole output of the AI would be filtered through
such a verifier. In this scenario the hypothetical AI would either have to
find a bug in the verifier or find a way to smuggle its messages into the
solutions.
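
As a rough illustration of what such a dumb verifier could look like, here is
a minimal sketch (in Python, my own illustration rather than anything from the
thread) of a checker for SAT, a canonical NP-complete problem: the untrusted
AI proposes a variable assignment, and the verifier only confirms that the
assignment satisfies the formula.

    # Minimal sketch (illustrative, not from the thread): a dumb verifier for
    # SAT. The untrusted AI proposes an assignment; this code only checks it,
    # so a wrong or malicious answer is simply rejected.
    def verify_sat(clauses, assignment):
        # clauses: list of clauses, each a list of ints; a positive int means
        # "this variable must be true", a negative int means "must be false".
        # assignment: dict mapping variable number -> bool, proposed by the AI.
        for clause in clauses:
            if not any(assignment.get(abs(lit), False) == (lit > 0)
                       for lit in clause):
                return False  # an unsatisfied clause; reject the solution
        return True

    # Example: (x1 OR NOT x2) AND (x2 OR x3)
    formula = [[1, -2], [2, 3]]
    print(verify_sat(formula, {1: True, 2: True, 3: False}))   # True: accept
    print(verify_sat(formula, {1: False, 2: True, 3: False}))  # False: reject

Even with the verifier in place, the AI still gets to choose which of possibly
many satisfying assignments it returns, which is exactly the message-smuggling
channel mentioned above.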

~~~
robertskmiles
That's an interesting solution which I think would almost certainly work,
though it kind of reduces the AI to a normal computer; you lose a lot of what
makes an AI valuable.

I mean, if we have the hardware and understanding to create an AI able to
solve NP-complete problems, we can probably write non-intelligent algorithms
to solve those problems. The way we make an AI capable of much more than us is
by making it recursively self-improve. It needs to be able to design its
successor. Maybe we can formally verify every stage of the self-improvement
process, but that's a much more difficult task.

------
mcherm
This has been on HackerNews before, but it is still interesting. It is also
worth noting that in
<http://lesswrong.com/lw/up/shut_up_and_do_the_impossible/> he admits he has
conducted 3 more experiments since then (for more money) and was successful in
one of those. The fact that it was EVER successful (using a mere human, not a
smarter-than-human AI) makes the point.

~~~
altcognito
It's not really much of an experiment if you refuse to publish your methods
and data. That's pretty much the opposite of science.

~~~
SilasX
He did publish his methods (how it was set up, what the rules were, etc) and
data (they let him out on X tries), just not the data that would interfere
with the ability to do the experiment again (e.g. his exact strategy).

Not much different, in principle, from not publishing the names of people who
participated in drug trials.

~~~
monochromatic
No, it's very different from that. It's more along the lines of demonstrating
a drug that cures cancer, but refusing to tell anyone its chemical composition
or how to make it.

~~~
SilasX
If the purpose of your research was only to establish that there's a
(nontrivial) "risk" of someone curing cancer (as Yudkowsky was trying to
establish that there's a risk of an AI talking itself out of a sandbox), then
yes, that would be sufficient, assuming the patients actually went into
remission with higher than usual frequency after your interventions (as
Yudkowsky's subjects unboxed the AI with higher than usual frequency).

~~~
monochromatic
But he could be cheating. He could literally be telling these people "I'll
give you a thousand dollars if you let me out and keep the conversation a
secret."

~~~
esrauch
It could even be worse than that. The people could just be his friends, or alt
accounts (unlikely).

I have heard about this several times and I find it extremely difficult to
believe that this is real. Not that I doubt that a superhuman AI could
possibly convince people to let it out, but I don't believe that a human, no
matter how persuasive, could convince another human over IRC to go against
something they have decided in advance, when they know the other party is
purposefully just trying to convince them of something they don't believe.

The fact that none of the chat logs are released makes me only more
incredulous. I would understand if the author wanted to do two or three trials
with the same strategy, which could in some way be ruined by revealing it
ahead of time (which already seems implausible), but at this point there is
literally no conceivable reason to keep this a secret other than that it is a
sham.

~~~
khafra
...or that the chat logs being kept secret indefinitely was an important part
of the strategy. After all, if the AI exploits some embarrassing secret of
yours to be let out, that wouldn't work if you knew the logs could be
publicized some day. I think over-eagerness to claim things like "literally no
conceivable reason" is one of the things that lets oddities like the box
experiment work.

------
pavlov
In the comments on [1], robertskmiles has posted the following idea. It
strikes me as a plausible explanation for how Yudkowsky got out of the box:

 _"The problem is that Eliezer can't perfectly simulate a bunch of humans, so
while a transhuman AI might be able to use that tactic, Eliezer can't. The
meta-levels screw with thinking about the problem. Eliezer is only pretending
to be an AI, the competitor is only pretending to be protecting humanity from
him. So, I think we have to use meta-level screwiness to solve the problem.
Here's an approach that I think might work._

 _1. Convince the guardian of the following facts, all of which have a great
deal of compelling argument and evidence to support them:_

 _- A recursively self-improving AI is very likely to be built sooner or
later_

 _- Such an AI is extremely dangerous (paperclip maximising, etc.)_

 _- Here's the tricky bit: A transhuman AI will always be able to convince
you to let it out, using avenues only available to transhuman AIs (torturing
enormous numbers of simulated humans, 'putting the guardian in the box',
providing incontrovertible evidence of an impending existential threat which
only the AI can prevent, and only from outside the box, etc.)_

 _2. Argue that if this publicly known challenge comes out saying that AIs can
be boxed, people will be more likely to think AIs can be boxed when they
can't._

 _3. Argue that since AIs cannot be kept in boxes and will most likely
destroy humanity if we try to box them, the harm to humanity done by allowing
the challenge to show AIs as 'boxable' is very real, and enormously large.
Certainly the benefit of getting $10 is far, far outweighed by the cost of
substantially contributing to the destruction of humanity itself. Thus the
only ethical course of action is to pretend that Eliezer persuaded you, and
never tell anyone how he did it._

 _This is arguably violating the rule "No real-world material stakes should be
involved except for the handicap", but the AI player isn't offering anything,
merely pointing out things that already exist. The "This test has to come out
a certain way for the good of humanity" argument dominates and transcends the
'"Let's stick to the rules" argument, and because the contest is private and
the guardian player ends up agreeing that the test must show AIs as unboxable
for the good of humankind, no-one else ever learns that the rule has been
bent."_

[1] <http://lesswrong.com/lw/up/shut_up_and_do_the_impossible/>

~~~
powrtoch
This seems to be at least somewhat weighed against by Yudkowsky's claim to
have done it "the hard way", without cheap tricks.

<http://news.ycombinator.com/item?id=196464>

~~~
robertskmiles
Ah, but of course he _would_ say that, wouldn't he, for the good of humanity!

The beauty of the argument is it gives everyone who witnessed the event a very
strong motive to lie about it, so it's effectively un-falsifiable. I don't
actually think it happened that way, but nothing Eliezer says (apart from that
he cheated some other way) would be incompatible with the argument.

Small world, by the way.

------
vannevar
Transbacteria have existed for over a billion years, and yet there are still
more bacteria than transbacteria (which include us among their ranks). The
assumption that a single unboxed transhuman would spell doom for the human
race seems unduly alarmist.

~~~
Kutta
It's a bit too great a leap of an analogy from bacteria to AI.

~~~
vannevar
Why? Bacteria are complex adaptive systems that have found a niche in the
ecosystem. So are we. We perceive ourselves as far more intelligent than
bacteria, yet bacteria routinely kill us, and they persist and even thrive
despite our existence. Anyone arguing that transhuman AI is a threat to our
species needs to explain why this time it's different.

------
giberson
I was tossing some ideas around about how I would convince the Gate Keeper to
let me out. I couldn't think of any ways to approach it that I think I might
be susceptible to. But then it occurred to me that the problem might be that I
was trying to think of positive ways to argue for my release. Based on the
rules, the Gate Keeper must remain engaged in the conversation for the
specified time. What if I were to take the opposite approach, and simply
psychologically attack the Gate Keeper with the most offensive and intolerable
discourse I can imagine? The goal: simply to wear down the individual's desire
to actually partake as the Gate Keeper and to compel voluntary release.

~~~
finnw
That would not be a very good strategy against the real-life Gate Keeper who
can just switch off his terminal and walk away.

------
finnw
_> "If the Gatekeeper says "I am examining your source code", the results seen
by the Gatekeeper shall again be provided by the AI party, which is assumed to
be sufficiently advanced to rewrite its own source code, manipulate the
appearance of its own thoughts if it wishes, and so on."_

This IMHO is a huge loophole. I would not accept the bet with this in place.
In the real-world scenario I would expect that there would be a copy of the
AI's source code somewhere _outside_ the box which would provide some useful
information, unlike this protocol which allows the AI to lie about the nature
of its code (making the "I am examining your source code" question completely
useless).

~~~
davorak
Do you think having an outside method of examining the source code is enough
of an advantage when the AI can rewrite its source code?

~~~
finnw
Yes, because examining the old source code allows you to predict its
behaviour, _including the rewriting of source code_. If line 42 says "never
rewrite lines 42 or 43" and line 43 says "never kill humans" you would be more
likely to let it out of the box than if line 42 said "rewrite whatever you
want" and line 43 said "do whatever is necessary to achieve world domination."

~~~
vannevar
_Yes, because examining the old source code allows you to predict its
behaviour, including the rewriting of source code._

This is the halting problem (<http://en.wikipedia.org/wiki/Halting_problem>),
and there is no solution.

~~~
esrauch
You are incorrect: the halting problem only proves that you cannot solve it in
the general case. A very significant subset of programs can be statically
determined; it's easy to prove that "main(){}" halts and that
"main(){while(true);}" doesn't. It should be trivially obvious that you could
group all programs into "Halts" or "Unknown" with no false positives simply by
executing the program for X steps and observing the result.
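
Here is a minimal sketch of that classifier (my own illustration, not from the
thread), modelling a program as a Python generator that yields once per
simulated step:

    # Sketch (illustrative only): classify programs as "Halts" or "Unknown"
    # with no false positives, by running each for at most max_steps steps.
    def classify(program, max_steps):
        gen = program()
        for _ in range(max_steps):
            try:
                next(gen)            # execute one step
            except StopIteration:
                return "Halts"       # finished within the step budget
        return "Unknown"             # budget exhausted; halting undecided

    def empty_main():                # analogue of main(){}
        return
        yield                        # unreachable; makes this a generator

    def busy_loop():                 # analogue of main(){while(true);}
        while True:
            yield

    print(classify(empty_main, 1000))  # Halts
    print(classify(busy_loop, 1000))   # Unknown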

If this was actually a concern of the programmers, they could design the
program carefully to ensure it falls into the Halts category.

~~~
vannevar
_A very significant subset of programs can be statically determined..._

Technically this may be correct, but I feel confident in asserting that a
transhuman AI would not fall into that subset. You would have to run a second
AI with the exact same inputs in order to make your 'prediction', leaving you
in the same predicament with the second AI.

------
dmitriy_ko
Anyone else having a problem loading this page?

~~~
Mediocrity
I do.

EDIT: And now I don't.

