https://huggingface.co/datasets/jonathan-roberts1/zerobench/... is a good way to review them. Some of the questions seem pretty poorly designed: many rely on some arbitrary multi-step arithmetic where any single mistake compounds into a wrong answer in an unclear way, others have multiple defensible answers, and others still suggest the authors failed to understand their own task.
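If you'd rather page through the questions locally than in the web viewer, here's a minimal sketch using the Hugging Face datasets library; the split and column names are guesses and may not match the actual schema on the Hub:

    # Browse a few ZeroBench questions locally.
    # Split name "zerobench" and the column names below are assumptions.
    from datasets import load_dataset

    ds = load_dataset("jonathan-roberts1/zerobench", split="zerobench")

    for row in ds.select(range(5)):
        # "question_id" and "question_text" are assumed field names
        print(row["question_id"], row["question_text"])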
#4 sticks out as pretty poorly designed because as a human I can't get their answer. Like, how many cats? One is just the bowtie, does that count as a cat? I could construe it as one, but it's not actually in the picture. How many leaves? If it's hard to tell them apart, are they actually "distinct"? Same with the window panes: based on the answer, they think they're being tricky by counting the window in the background, but semantically you would probably not want a model to pick up on that, and even then I don't think it's actually possible to tell whether that's 4 panes (2 double-paned sides with an inner lattice) or more like 12.
Another random sample, #64: I see 5 pens, 2 of which are "clicky" and one that I can't tell is clicky or twisty. 3 of the 5 do not have lids, so the answer should be 60.00%, but they got 21.43%, which means they counted the markers even though they only asked for PENS; they failed to semantically parse their own question.
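The mismatch is easy to reproduce with basic arithmetic. The counts here are my own reading of the image, and the 14 is just the denominator that happens to recover their number:

    # Sanity-checking the percentages for #64.
    pens_total, pens_without_lids = 5, 3
    print(f"{pens_without_lids / pens_total:.2%}")        # 60.00% -- pens only

    # Their 21.43% only falls out if the denominator also includes the
    # markers: 3 / 14 = 0.2143. The 14 is an assumption chosen to match
    # their answer, not a count from the dataset.
    implements_total = 14
    print(f"{pens_without_lids / implements_total:.2%}")  # 21.43%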
For reference, the paper's abstract: "Large Multimodal Models (LMMs) exhibit major shortfalls when interpreting images and, by some measures, have poorer spatial cognition than small children or animals. Despite this, they attain high scores on many popular visual benchmarks, with headroom rapidly eroded by an ongoing surge of model progress. To address this, there is a pressing need for difficult benchmarks that remain relevant for longer. We take this idea to its limit by introducing ZeroBench, a lightweight visual reasoning benchmark that is entirely impossible for contemporary frontier LMMs. Our benchmark consists of 100 manually curated questions and 334 less difficult subquestions. We evaluate 20 LMMs on ZeroBench, all of which score 0.0%, and rigorously analyse the errors. To encourage progress in visual understanding, we publicly release ZeroBench."