Love seeing this benchmark become more iconic with each new model release. Still in disbelief at the GPT-5 variants' performance in comparison, but it's cool to see the new open source models get more ambitious with their attempts.
Dataset contamination alone won't get them good-looking SVG pelicans on bicycles though; they'll have to either cheat on this particular question specifically or train their models to make vector illustrations in general. At which point the prompt can easily be swapped for another problem that wasn't in the data.
they can have some cheap workers make about 10 pelicans by hand in SVG, fuzz them to generate thousands of variations, and throw those into their training pool. No need to 'get good at SVGs' at all.
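For what it's worth, the fuzzing step is nearly trivial to script. Here's a minimal sketch using only the Python standard library (the pelican_03.svg filename is made up, and a real pipeline would presumably also vary stroke widths, rotations, path coordinates, and so on):

    import random
    import xml.etree.ElementTree as ET

    def fuzz_svg(path: str, seed: int) -> str:
        """Return a jittered copy of a hand-drawn SVG (toy augmentation)."""
        rng = random.Random(seed)
        tree = ET.parse(path)
        for el in tree.iter():
            # Perturb hex fill colors slightly so each copy looks distinct.
            fill = el.get("fill")
            if fill and fill.startswith("#") and len(fill) == 7:
                r, g, b = (int(fill[i:i + 2], 16) for i in (1, 3, 5))
                jitter = lambda c: max(0, min(255, c + rng.randint(-16, 16)))
                el.set("fill", f"#{jitter(r):02x}{jitter(g):02x}{jitter(b):02x}")
            # Nudge positions of basic shapes by a few pixels.
            for attr in ("cx", "cy", "x", "y"):
                v = el.get(attr)
                if v is not None:
                    try:
                        el.set(attr, f"{float(v) + rng.uniform(-3, 3):.1f}")
                    except ValueError:
                        pass
        return ET.tostring(tree.getroot(), encoding="unicode")

    # Ten hand-made pelicans -> thousands of near-duplicates for the training pool.
    variants = [fuzz_svg("pelican_03.svg", seed=s) for s in range(1000)]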
It started as a joke, but over time performance on this one weirdly appears to correlate with how good the models are generally. I'm not entirely sure why!
I'm not saying it's objective or quantitative, but I do think it's an interesting task, because it would be challenging for most humans to come up with a good design for a pelican riding a bicycle.
I think it's cool and useful precisely because it's not trying to correlate with intelligence. It's a weird kind of niche thing that at least intuitively feels useful for judging LLMs in particular.
I'd much prefer a test which measures my cholesterol than one that would tell me whether I am an elf or not!
There are many reports of CLI AI tools displaying the words humans use when they are frustrated and about to give up. That's just what they have been trained on; it does not mean they have emotions. And "deleting the whole codebase" sounds more interesting, but I assume it's the same thing: "frustrated" words lead to frustrated actions. That does not mean the LLM was frustrated, just that in its training data those things happened together, so it copied them in that situation.
The difference is that people and animals have a body, a nervous system, and in general those mushy things we think are responsible for emotions.
Computers don't have any of that, and LLMs in particular don't either. They were trained to simulate human text responses, that's all. How do you get from there to emotions? Where is the connection?
Don't confuse the medium with the picture it represents.
Porn is pornographic, whether it is a photo or an oil painting.
Feelings are feelings, whether they're felt by a squishy meat brain or a perfect atom-by-atom simulation of one in a computer. Or a less-than-perfect simulation of one. Or just a vaguely similar system that is largely indistinguishable from it, as observed from the outside.
Individual nerve cells don't have emotions! Ten wired together don't either. Or one hundred, or a thousand... by extension you don't have any feelings either.
I actually prefer ASCII art diagrams as a benchmark for visual thinking, since, like SVG, it requires two stages, and it also tests imaginative repurposing of text elements.
If accuracy is a major concern, then it's almost certainly better to go with the HTML documents. Otherwise, I've heard from a few co-workers that Docling is pretty good.
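For context, Docling's quickstart is only a few lines, if I'm remembering its README right (the report.pdf path below is a placeholder; URLs are accepted too):

    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert("report.pdf")  # placeholder path
    print(result.document.export_to_markdown())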