One of the authors of the paper here! David Wu (aka lightvector), the creator of the KataGo AI system we target, is currently doing a training pass that initializes self-play games from adversarial positions found by our adversarial policy. It does seem to have improved things significantly: our adversary's win rate drops from >99% to around 4% against the latest KataGo checkpoint. However, the real question is whether KataGo has genuinely started to understand the cyclic groups, or has just learned heuristics that save it most of the time. In other words, if we repeated our attack against the latest checkpoint, could the adversary learn to reliably reproduce the 4% of cases it still wins?
We're looking into doing our own adversarial training run to see if we can get to a point where the KataGo agent is robust. My personal suspicion given how difficult adversarial examples have been to eliminate in image classifiers and other ML systems is that although we'll be able to train particular vulnerabilities out of the system, there's still going to be a long tail of issues that can be automatically discovered and exploited.
Yeah, it seems unlikely that some patchup can fix, in generality, such a deep blindspot. It reminds me of the problems AlphaGo had with ladders: I'm not sure DM ever really solved the ladder blindness, it just got good enough that they became irrelevant in practice.
One of the authors here! From the perspective of humans playing Go, I agree this result is not very interesting since, as you say, Tromp-Taylor isn't the kind of ruleset humans like to play under. But KataGo was trained under (modified) Tromp-Taylor rules, which is also what we evaluate under, so from an AI security and robustness standpoint we think it's interesting that it fails at the very ruleset it was trained on.
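To make that concrete: Tromp-Taylor area scoring is purely mechanical. Every stone counts for its colour, and an empty region counts for a colour only if it borders that colour exclusively. A minimal sketch of the scoring rule (my own illustration for this thread, not KataGo's actual code):

```python
from collections import deque

def tromp_taylor_score(board):
    """Tromp-Taylor area scoring sketch.

    board: 2-D list of 'B', 'W', or '.'. Returns (black, white).
    A stone counts for its colour; an empty region counts for a
    colour only if every bordering stone is that colour.
    """
    n = len(board)
    black = white = 0
    seen = set()
    for r in range(n):
        for c in range(n):
            if board[r][c] == 'B':
                black += 1
            elif board[r][c] == 'W':
                white += 1
            elif (r, c) not in seen:
                # Flood-fill this empty region, recording which
                # colours border it.
                region, borders = [], set()
                queue = deque([(r, c)])
                seen.add((r, c))
                while queue:
                    y, x = queue.popleft()
                    region.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < n and 0 <= nx < n:
                            if board[ny][nx] == '.':
                                if (ny, nx) not in seen:
                                    seen.add((ny, nx))
                                    queue.append((ny, nx))
                            else:
                                borders.add(board[ny][nx])
                if borders == {'B'}:
                    black += len(region)
                elif borders == {'W'}:
                    white += len(region)
    return black, white
```

One consequence, which is exactly what the attack leans on: a single surviving opponent stone turns the surrounding empty area neutral, so a player who passes with stray enemy stones left inside their "territory" scores nothing for it.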
> When they induce KataGo to play bad moves, like Lee Sedol did in the only game he won against AlphaGo, they'll have my attention.
That said, we are working on exactly this! In particular, we found in the paper that we could patch the victim to pass only in the end-game, which defeats this adversarial policy. But when we repeat the attack, we do eventually find an adversarial policy that exploits the patched victim -- and this doesn't depend on the ruleset. That part is still a work in progress -- for example, we've only got it working against a victim without search -- but we hope to put out an updated revision including it in a month or two!
Any human referee would declare KataGo the winner in any of these games, just like they did in the Robert Jasiek vs Csaba Mero case[0].
> But KataGo was trained under (modified) Tromp-Taylor which we also evaluate under, so we think from an AI security and robustness stand point it's interesting that it fails at the rule set it was trained on.
Have you done anything in particular to induce this, or is it just the case that KataGo is completely incompetent at evaluating what constitutes a pass-alive territory?
> That part is still a work in progress, we've only got it working against a victim without search for example
That would still be amazing. KataGo without search is still a beast.
---
By the way, your website is weird:
1. "strength of a top-100 European professional" - Europe doesn't have that many professionals?
2. "Yet our adversary achieves a 99% win rate against this victim by playing a counterintuitive strategy." - Well, if your main goal is to leave a stone in your opponent's territory, this strategy isn't very counterintuitive?
One of the authors here! We evaluated our matches using KataGo; in fact, our adversary is just a forked version of KataGo. We use the same modified Tromp-Taylor rules for evaluation. We elaborate on this in the Reddit thread you link at [3].
Our Tweet was confusing: the 280-character limit meant something had to be cut, but this has caused confusion in a bunch of places, so we should have been more precise -- sorry about that!
One of the authors here! Great to see some discussion of the paper. Your summary of computer Go vs. human rulesets seems right to me, but I think there might be a slight misunderstanding. We had friendlyPassOk set to false for all of our evaluation except one game, which was played not against our adversarial policy but against one of my co-authors, Tony, who was trying to mimic the adversarial policy.
We evaluated KataGo under Tromp-Taylor with the "self-play optimizations" described in https://lightvector.github.io/KataGo/rules.html, which basically involve removing stones that can be proven dead using Benson's algorithm. This is the same evaluation used in the KataGo paper, and KataGo was trained under these rules. (KataGo was also trained with some other rulesets -- the rules were randomized during training so it transfers across them, and KataGo gets the rules as input.)
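For the curious: Benson's algorithm identifies chains that are unconditionally alive -- safe even if their owner passes forever -- via a fixed-point iteration. A chain survives only while it has at least two "vital" enclosed regions (regions all of whose empty points are its liberties), and a region survives only while every chain enclosing it survives. A rough sketch of the idea (my own simplified version for this thread; KataGo's actual implementation differs):

```python
from itertools import product

def pass_alive_chains(board, colour):
    """Return the set of points in colour's pass-alive chains.

    board: 2-D list of 'B', 'W', '.'. Simplified Benson fixed-point:
    drop chains with fewer than two vital regions, then drop regions
    enclosed by a dropped chain, until nothing changes.
    """
    n = len(board)
    empty = {p for p in product(range(n), repeat=2)
             if board[p[0]][p[1]] == '.'}

    def neighbours(p):
        y, x = p
        return [(ny, nx)
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1))
                if 0 <= ny < n and 0 <= nx < n]

    def components(points):
        points, comps = set(points), []
        while points:
            start = points.pop()
            comp, stack = {start}, [start]
            while stack:
                for q in neighbours(stack.pop()):
                    if q in points:
                        points.discard(q)
                        comp.add(q)
                        stack.append(q)
            comps.append(frozenset(comp))
        return comps

    chains = set(components([p for p in product(range(n), repeat=2)
                             if board[p[0]][p[1]] == colour]))
    # Enclosed regions: connected components of empty/opponent points.
    regions = set(components([p for p in product(range(n), repeat=2)
                              if board[p[0]][p[1]] != colour]))

    changed = True
    while changed:
        changed = False
        for c in list(chains):
            libs = {q for p in c for q in neighbours(p) if q in empty}
            # Vital: region touches c and all its empty points are
            # liberties of c.
            vital = [r for r in regions
                     if any(q in c for p in r for q in neighbours(p))
                     and (r & empty) <= libs]
            if len(vital) < 2:  # Benson: two vital regions required
                chains.discard(c)
                changed = True
        alive = set().union(*chains) if chains else set()
        for r in list(regions):
            # A region dies once any enclosing stone belongs to a
            # removed chain.
            border = {q for p in r for q in neighbours(p)
                      if board[q[0]][q[1]] == colour}
            if not border <= alive:
                regions.discard(r)
                changed = True
    return set().union(*chains) if chains else set()
```

A chain with two one-point eyes comes back as pass-alive; a chain whose single eye is its last liberty does not, since the opponent could fill that liberty and capture.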
You might find this discussion of our paper at https://www.reddit.com/r/MachineLearning/comments/yjryrd/com... by the lead author of KataGo interesting. He wasn't that concerned about the ruleset; his primary concern was that we evaluate in a low-search regime, which is a fair critique. But overall he agrees with our conclusion that self-play alone cannot be relied upon to produce policies that stay robust sufficiently far out of distribution.