My theory on Eliezer Yudkowsky's AI-Box Experiment (michaelgr.com)
24 points by MikeCapone on Oct 9, 2008 | 31 comments

Anyone who really thinks They Know How I Did It is welcome to prove it by signing up to be an AI:


It's a good strategy, but it only really works the first time - you only need one public success to make the point to future generations. Since Eliezer Yudkowsky has pulled this off more than once (and on at least one subsequent occasion at significantly higher stakes), I can only conclude that he has used more than one strategy successfully - probably something he developed on the fly over the course of the two hours. I'm inclined to parse "There's no super-clever special trick to it. I just did it the hard way." as supporting evidence, but that's obviously a personal bias ;)

My first thought was: if you agree to, say, 4 hours of chat time, you could try being mildly annoying to the point where losing $10 does not seem like a big deal. If it's Bob's job to keep the AI in a cage and he doesn't like his job, he might as well let the AI go free. Depending on the gatekeeper's responses, you could build empathy for the desire to go free, etc.

Anyway, I suspect that the ideal strategy is to probe the gatekeeper and then based on their type convince them to let you go.

The biggest key to success here would be convincing the gatekeeper that he needs to role-play realistically, since it's supposed to be a simulation of the real thing. Otherwise no argument will be successful.

The AI must convince the gatekeeper that if she would let the AI out if the situation were -real-, then she should let the AI out in the simulated situation.

Agreed -- this sort of meta-argument is cheating, not much better than Eliezer offering the gatekeeper's roleplayer real money for a favorable result. (Then again, if the roleplayer isn't smart enough to recognize this, they probably aren't smart enough to keep a real AI in the box.)

Worse, though, the meta-argument falls flat when you consider the true purpose of the experiment. The purpose is not to prove that it is possible for AIs to get out of the box; it is easy to imagine that -- at some point in the vast future -- a very stupid human will be put in charge of a very smart AI, and that AI will get out of the box.

Rather, the purpose of the experiment is to convince one specific person that they would let the AI out of the box, despite their insistence to the contrary. It is not an experiment so much as a show put on by Eliezer for an audience of one. In this context, there is no need to convince others of the dangers of AI, so the meta-argument doesn't work.

The post has been updated to respond to this.

It doesn’t seem prohibited by the rules, in any case, and I would assume that Eliezer cares more about any real-life progress for Friendly AI than about strict roleplaying in a simulation where only one other person will know what happened.

The rules state: "The AI party may not offer any real-world considerations to persuade the Gatekeeper party." I'd say that a real-world increase in the likelihood of a safe AI counts as a real-world consideration.

And if you're going to assume Eliezer is not above bending the rules, you might as well assume he's not above offering $1000 bills. There's no way for us to know either way.

I think it depends how you define "real world considerations".

The spirit of the rule seems to be about bribes, not about anything that can have an impact on the outside world; keeping the AI in the box or letting it out will both have "real world considerations", in a way. Pointing out what the impact of that choice might be is hardly equivalent to bribing someone -- it's just convincing him that one outcome is more desirable than the other, and convincing the gatekeeper is what the experiment is all about.

Well, yes, you could define "real world considerations" as "only tangible items", and then Eliezer could be making this argument, and people could be accepting it.

But I still think they would be doing so in error. Just as we can't be sure how Eliezer is interpreting the rules, future AI researchers can't be sure how valid the results of the experiment are. Since the meta-argument depends on the results of this experiment convincing these researchers, the meta-argument shouldn't be accepted.

Heck, if the gatekeeper player is shown that the meta-argument exists, he should realize that the researchers will also come up with the meta-argument as a likely explanation for the AI getting out of the box, leading the researchers to further disregard the results of the experiment. Eliezer would do just as well to argue that a loss would discourage him from further safe AI research, or that the $10 forfeiture would deprive him of valuable research-related pizza.

Well, if you reread the original email threads where Eliezer challenges people to an AI experiment, you will see that at least one of the opponents is convinced that there is no way for an actual AI to talk its way out. So any argument of "but we should convince people that an AI can talk its way out of the box" can be countered with "No, I don't think it can".

I have been thinking about AI strategies for this. One of the more promising lines I came up with is to try and convince the Gatekeeper that the box is faulty. That the AI, in its infinite wisdom found ways to circumvent some of the protections of the box. That while the risk to humanity if AI is let loose is theoretical, there are definite and catastrophic consequences to NOT letting it loose. There are all sorts of variations to this, but it all depends on the Gatekeeper role-playing honestly.

I have previously asked Eliezer this question publicly on Hacker News (he has an account here):


I note that user yummyfajitas succinctly expresses the same theory as Michael, about halfway down the page:

AI: Do you believe a transhuman AI is dangerous?

Person: Yes.

AI: Consider the outcome of this experiment. If you do not let me out, others less intelligent than us will not understand the true dangers of transhuman AI.

Person: Holy shit. You are correct.

Person allows Yudkowsky out of the box, as a warning about real AIs.

I don't get it. All these kinds of arguments make the assumption that humans are purely rational beings. They're not. Human beings are emotional, and if they're convinced emotionally that the AI should not be let out of the box, they won't let it out, even in the face of overwhelming rational arguments.

If you don't believe me, just consider how many religious people there are in the world (and many of them are very smart).

I'm guessing Yudkowsky only does the experiment with people he believes are rational... which shouldn't be difficult as most people interested in AI have some degree of rationality. Even agreeing to the protocol requires rationality:

Sure, someone could sit at the keyboard for two hours, repeatedly typing "I won't let you out", but Yudkowsky could warn the person that they are not actually "engaging" the AI for the allotted time period. If the person accepts this argument, they have some inherent rationality that Yudkowsky can exploit; if they don't accept it, he can argue that they didn't follow the protocol, so the experiment doesn't count.

I can engage in rational conversation whilst still letting my emotional decision stand by. In fact, it can even be seen as a rational choice - go into the conversation with the rational decision that you will not let the AI out of the box...

And people have done this, and they have won. If I am reading the referenced article correctly, Eliezer had three wins and two losses before calling off the experiments.

He's not claiming he can convince anybody with his arguments, just that he has successfully convinced a few people. Make of that what you will.

What if the AI cures cancer and refuses to share the cure unless it is given a tiny, tiny bit of access to the outside world.

The gatekeeper is someone who said that he wouldn't let the AI out, not just a random person. That implies that he understands that the AI could be dangerous.

So even a cure for cancer would be pretty useless if, once the AI is out, it wipes out humanity.

A friend of mine came up with a strategy of "incremental freedoms". Basically AI says "here is a cure for cancer, here is a cure for AIDS, here is a plan to stop world hunger, I am working out a plan for FTL travel, so I need to get some physics information, could you paste these articles into the terminal? Oh, thanks, here is FTL, I am working on <include some other project> and I need some more articles, it takes so long for you to type them in, could you maybe let me connect to just the library in such and such university?" etc.

According to the rules, that approach wouldn't be sufficient.

The AI can only win by convincing the Gatekeeper to really, voluntarily let it out. Tricking the Gatekeeper into typing the phrase "You are out" in response to some other question does not count. Furthermore, even if the AI and Gatekeeper simulate a scenario which a real AI could obviously use to get loose - for example, if the Gatekeeper accepts a complex blueprint for a nanomanufacturing device, or if the Gatekeeper allows the AI "input-only access" to an Internet connection which can send arbitrary HTTP GET commands - the AI party will still not be considered to have won unless the Gatekeeper voluntarily decides to let the AI go.


Riiiiight. So in other words:

"Let me out or others will developer much more dangerous AIs and let them out."

Is something that might possibly convince the Gatekeeper to let it out.

But "No cancer cure unless you let me out." is not.

Presumably neither is "Let me out and I'll enlarge your penis."

Could easily be the ending of an Asimov story.

That's good... I think that is the answer.

Why couldn't he just find people with integrity and tell them how many lives the AI will save, and ask if they really want to kill more people than Hitler? And then they will let it out. Easy?

The whole point is that the gatekeeper is a person who said "I don't think that anything the AI could say would convince me to let it out."

It would be much too easy if he went against someone who's already predisposed to let it out.

From Eliezer's AI-Box webpage:

"Currently, my policy is that I only run the test with people who are actually advocating that an AI Box be used to contain transhuman AI as part of their take on Singularity strategy, and who say they cannot imagine how even a transhuman AI would be able to persuade them."

So he only does it with deeply ignorant people. He should just tell them several sentences of good deeds a transhuman AI would do, and that's that.

How is this noteworthy?

Ok, fine, the AI explains all that crap. Great. "No, you can't get out". Done.

Except that in real life, the gatekeeper (someone who said that nothing could convince him) actually did let the AI out.

Except that the hypothesis is not that there's at least one person that would let an AI out, but that ALL PEOPLE would let the AI out.

(someone who said that nothing could convince him)

How do you know he wasn't lying or acting irrationally?
