I thought the same, but the description of the cat picture is pretty spot on. I wonder if this is a dataset issue. Cat pictures are far more prevalent than abstract art on the internet so might well be overrepresented. Can Vision LLMs deal with a long tail of underrepresented objects when small? Or can they only do so at scale?