Alright, results are in! I've re-run all of my editing-based adherence prompts through Nano Banana Pro. NB Pro managed to successfully pass SHRDLU, the M&M Van Halen test (as independently verified by Simon), and the Scorpio street test - all of which the original NB failed.
Please consider changing pass/fail to an integer score, maybe out of 5. The binary test is becoming more and more misleading as your apparent desire to give due credit conflicts with real quality improvements over already okay-ish models. For example, on the Great Wave, Gemini 3’s excellent rendition gets no additional credit over Qwen, which only passes if one is generous; and on the cards test there’s no score distinction at all between results one could or could not actually use.
I thought so too at first, but zoom in to where the neck joins the head. What looks like the head’s shadow from a distance is actually a hard seam between the thick neck and the thin neck, with much of the apparent shadow actually being a cutout that shows the background.
Looks like the Seedream result here has been changed to a fail, which I’d also agree with. Pose-change complaints aside, I think that neck is actually the same length were it held straight.
I agree - it seems like Seedream kept the neck the same length as Nano Banana’s but also made the giraffe crouch down, which is a major modification to the overall picture.
I think Nano Banana Pro should have passed your giraffe test. It's not a great result but it is exactly what you asked for. It's no worse than Seedream's result imo.
Yeah I think that's a fair critique. It kind of looks like a bad cut-and-replace job (if you zoom in you can even see part of the neck is missing). I might give it some more attempts to see if it can do a better job.
I agree that Seedream could definitely be called out as a fail since it might just be a trick of perspective.
That's not a bad suggestion. I thought about adding a numerical score, but it felt a bit overwhelming at the time. Maybe I should revisit it, though, in the form of an integer score out of 5.
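To make the idea concrete, here's a minimal sketch of how a 0-5 rubric would aggregate differently from pass/fail. The model names and scores are made up purely for illustration; nothing here reflects the site's actual data or internals:

```python
# Hypothetical sketch: a 0-5 rubric separates results that a binary
# pass/fail collapses together. Scores below are invented examples.

PASS_THRESHOLD = 3  # treat a rubric score of 3+ as the old "pass"

scores = {
    "great wave": {"gemini3": 5, "qwen": 3},  # both "pass", very different quality
    "cards":      {"gemini3": 4, "qwen": 2},  # usable vs. not usable
}

for model in ("gemini3", "qwen"):
    passes = sum(1 for test in scores.values() if test[model] >= PASS_THRESHOLD)
    total = sum(test[model] for test in scores.values())
    print(f"{model}: {passes}/{len(scores)} passes, rubric {total}/{len(scores) * 5}")
```

Under pass/fail both models get credit for the Great Wave; the rubric keeps the 5-vs-3 distinction visible while still letting you derive the old pass count from a threshold.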
I agree with this; some of those are "passing" and others are really passing. Especially with how much better some of the new models are compared to the old ones.
The paws test is a good example: I think the new model got 100% while the other was more like 75%.
Alright, I think it's time to concede defeat! Seedream has been summarily demoted to a failure, and I've added the following minimum passing criteria to that particular test:
- The giraffe's neck should be noticeably shorter than in the original image, while still maintaining a natural appearance.
- The final image cannot be achieved by simply cropping out part of the neck or by using perspective tricks.
The Pisa tower test is really interesting. Many of these prompts have stricter criteria that depend on implicit knowledge, and some models impressively pass them. Yet something as obvious as straightening a slanted object is hard even for the latest models.
I suspect there'd be no problem rotating a different object. But this particular tower is EXTREMELY well represented in the training data. It's almost an immutable law of physics that Towers in Pisa are Leaning.
"A comparison of various SOTA generative image models on specific prompts and challenges with a strong emphasis placed on adherence."
Adherence is the more interesting problem, in my opinion, because quality issues can be ameliorated through the use of upscalers, refiner models, LoRAs, and similar tools. Furthermore, there are already a thousand existing benchmarks obsessed with visual fidelity.
I mean there’s a huge difference between a model that throws a black spot on someone’s head and another one that fills it with hair indistinguishable from the real thing. Which is why I’m saying this methodology is only marginally useful.
"Remove all the trash from the street and sidewalk. Replace the sleeping person on the ground with a green street bench. Change the parking meter into a planted tree."
Three sentences that do a great job summing up modern big tech. The new model even manages to [digitally] remove all trash.
Yep, no need for actual urbanism or to worry about the homeless; now governments and realtors can lie to you more conveniently and at industrial scale! Yay future
Would you consider leaving one of the originals in each test visible at all times (as a control), so that I can see the final image(s) I'm evaluating and the original image at the same time?
I guess if you do that then maybe you don't need the cool sliders anymore?
Anyway - thanks so much for all your hard work on this. A very interesting study!
It's an admittedly obscure reference to a cheating technique used in the Star Wars card game sabacc, which allows a player to surreptitiously switch out a card. I’m pretty sure I picked it up from one of Timothy Zahn's Thrawn books when I was a kid.
But I didn't know it had a meaning in Norwegian, so I guess TIL!
Awesome test suite. For the maze though, not sure it’s fair to knock it for extra dashed lines as the prompt didn’t specify that only the correct path should have one…
Definitely! Even though NB's predominant use case seems to be editing, it's still producing surprisingly decent text-to-image results. Imagen4 currently still comes out ahead in terms of image fidelity, but I think NB Pro will close the gap even further.
I'll try to have the generative comparisons for NB Pro up later this afternoon once I catch my breath.
If you just want to see how NB and NB Pro compare against each other:
https://genai-showdown.specr.net/image-editing?models=nb,nbp