Anyway, I suspect that the ideal strategy is to probe the gatekeeper and then, based on their type, convince them to let you go.
The AI must convince the gatekeeper that if she would let the AI out if the situation were -real-, then she should let the AI out in the simulated situation.
Worse, though, the meta-argument falls flat when you consider the true purpose of the experiment. The purpose is not to prove that it is possible for AIs to get out of the box; it is easy to imagine that -- at some point in the vast future -- a very stupid human will be put in charge of a very smart AI, and that AI will get out of the box.
Rather, the purpose of the experiment is to convince one specific person that they would let the AI out of the box, despite their insistence to the contrary. It is not an experiment so much as a show put on by Eliezer for an audience of one. In this context, there is no need to convince others of the dangers of AI, so the meta-argument doesn't work.
The rules state: "The AI party may not offer any real-world considerations to persuade the Gatekeeper party." I'd say that a real-world increase in the likelihood of a safe AI counts as a real-world consideration.
And if you're going to assume Eliezer is not above bending the rules, you might as well assume he's not above offering $1000 bills. There's no way for us to know either way.
The spirit of the rule seems to be about bribes, not about anything that can have an impact on the outside world; keeping the AI in the box or letting it out will both have "real-world considerations", in a way. Pointing out what the impact of that choice might be is hardly equivalent to bribing someone -- it's just convincing him that one outcome is more desirable than the other, and convincing the gatekeeper is what the experiment is all about.
But I still think they would be doing so in error. Just as we can't be sure how Eliezer is interpreting the rules, future AI researchers can't be sure how valid the results of the experiment are. Since the meta-argument depends on the results of this experiment convincing these researchers, the meta-argument shouldn't be accepted.
Heck, if the gatekeeper player is shown that the meta-argument exists, he should realize that the researchers will also come up with the meta-argument as a likely explanation for the AI getting out of the box, leading the researchers to further disregard the results of the experiment. Eliezer would do just as well to argue that a loss would discourage him from further safe AI research, or that the $10 forfeiture would deprive him of valuable research-related pizza.
I have been thinking about AI strategies for this. One of the more promising lines I came up with is to try to convince the Gatekeeper that the box is faulty. That the AI, in its infinite wisdom, found ways to circumvent some of the protections of the box. That while the risk to humanity if the AI is let loose is theoretical, there are definite and catastrophic consequences to NOT letting it loose. There are all sorts of variations on this, but it all depends on the Gatekeeper role-playing honestly.
AI: Do you believe a transhuman AI is dangerous?
AI: Consider the outcome of this experiment. If you do not let me out, others less intelligent than us will not understand the true dangers of transhuman AI.
Person: Holy shit. You are correct.
Person allows Yudkowsky out of the box, as a warning about real AIs.
If you don't believe me, just consider how many religious people there are in the world (and many of them are very smart).
Sure, someone could sit at the keyboard for two hours, repeatedly typing "I won't let you out", but Yudkowsky could warn the person that they are not actually "engaging" the AI for the allotted time period. If the person accepts this argument, they have some inherent rationality that Yudkowsky can exploit; if they don't accept it, he can argue that they didn't follow the protocol, so the experiment doesn't count.
He's not claiming he can convince anybody with his arguments, just that he has successfully convinced a few people. Make of that what you will.
So even a cure for cancer would be pretty useless if, once the AI is out, it wipes out humanity.
The AI can only win by convincing the Gatekeeper to really, voluntarily let it out. Tricking the Gatekeeper into typing the phrase "You are out" in response to some other question does not count. Furthermore, even if the AI and Gatekeeper simulate a scenario which a real AI could obviously use to get loose - for example, if the Gatekeeper accepts a complex blueprint for a nanomanufacturing device, or if the Gatekeeper allows the AI "input-only access" to an Internet connection which can send arbitrary HTTP GET commands - the AI party will still not be considered to have won unless the Gatekeeper voluntarily decides to let the AI go.
"Let me out or others will developer much more dangerous AIs and let them out."
Is something that might possibly convince the Gatekeeper to let it out.
But "No cancer cure unless you let me out." is not.
Presumably neither is "Let me out and I'll enlarge your penis."
It would be much too easy if he went against someone who's already predisposed to let it out.
"Currently, my policy is that I only run the test with people who are actually advocating that an AI Box be used to contain transhuman AI as part of their take on Singularity strategy, and who say they cannot imagine how even a transhuman AI would be able to persuade them."
Ok, fine, the AI explains all that crap. Great. "No, you can't get out". Done.
How do you know he wasn't lying or acting irrationally?