Right, but that ignores the key question: whether such systems can infer the equivalence relation denoted by "is". A is A. A is B implies B is A. A is B and B is C implies A is C. When a system sees that A is B yet cannot infer that B is A, it exhibits an asymmetry, which is interesting. Whether this asymmetry exists in the underlying language is unclear.
There's an asymmetry here, too. "The sky is ${color}", with no extra context, has one obvious answer for us living here, today. Whereas for "Blue is the color of ${thing}", with no extra context, there's an insane number of equally sensible substitutions for ${thing}. Without extra context, the model has no reason to privilege "sky" over any other equally valid answer.
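To make that concrete, here's a rough sketch (not from the thread; the model name and prompts are just placeholders) of how you could compare the two directions by scoring completions with an off-the-shelf causal LM:

```python
# Sketch of the asymmetry: score how strongly a causal LM prefers a given
# continuation in each direction. Assumes a HuggingFace causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder; any causal LM works the same way
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` after `prompt`."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    cont_ids = tok(continuation, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, cont_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probs for each continuation token, predicted from the previous position.
    log_probs = torch.log_softmax(logits[0, prompt_ids.shape[1] - 1 : -1], dim=-1)
    return log_probs.gather(1, cont_ids[0].unsqueeze(1)).sum().item()

# Forward direction: one continuation clearly dominates.
print(continuation_logprob("The sky is", " blue"))
# Reverse direction: " sky" competes with many equally valid completions.
print(continuation_logprob("Blue is the color of the", " sky"))
print(continuation_logprob("Blue is the color of the", " ocean"))
```

The point isn't the absolute numbers, just that the forward prompt concentrates probability mass on one answer while the reversed prompt spreads it across many.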
> When a system sees that A is B, yet cannot infer that B is A
So the issue is that they cannot infer that general rule due to a fundamental limitation of the transformer LLM architecture, not just a training data issue? I skimmed the paper, and that seems to be the case.