Alright, results are in! I've re-run all of my editing-based adherence prompts through Nano Banana Pro. NB Pro managed to successfully pass SHRDLU, the M&M Van Halen test (as independently verified by Simon), and the Scorpio street test - all of which the original NB failed.
Please consider changing pass/fail to an integer score, maybe out of 5. The binary test is becoming more and more misleading as your apparent desire to give due credit conflicts with real quality improvements over already okay-ish models. For example, on the Great Wave, Gemini 3’s excellent rendition gets no additional credit over Qwen, which only passes if one is generous; and on the cards test there’s no score distinction at all between results one could or could not actually use.
I thought so too at first, but zoom in to where the neck joins the head. What looks like the head’s shadow from a distance is actually a hard seam between the thick neck and the thin neck, with much of the apparent shadow actually being a cutout that shows the background.
Looks like the Seedream result here has been changed to a fail, which I’d also agree with. Pose-change complaints aside, I think that neck is actually the same length were it held straight.
I agree - it seems like Seedream kept the neck the same length as Nano Banana’s but also made the giraffe crouch down, which is a major modification to the overall picture.
I think Nano Banana Pro should have passed your giraffe test. It's not a great result but it is exactly what you asked for. It's no worse than Seedream's result imo.
Yeah I think that's a fair critique. It kind of looks like a bad cut-and-replace job (if you zoom in you can even see part of the neck is missing). I might give it some more attempts to see if it can do a better job.
I agree that Seedream could definitely be called out as a fail since it might just be a trick of perspective.
That's not a bad suggestion. I thought about adding a numerical score, but it felt a bit overwhelming at the time. Maybe I should revisit it, though, in the form of an integer score out of 5.
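To make the idea concrete, here's a minimal sketch of how a 0-5 rubric would aggregate differently from pass/fail. The model names and scores are made up purely for illustration; nothing here reflects the site's actual data or internals:

```python
# Hypothetical sketch: a 0-5 rubric separates results that a binary
# pass/fail collapses together. Scores below are invented examples.

PASS_THRESHOLD = 3  # treat a rubric score of 3+ as the old "pass"

scores = {
    "great wave": {"gemini3": 5, "qwen": 3},  # both "pass", very different quality
    "cards":      {"gemini3": 4, "qwen": 2},  # usable vs. not usable
}

for model in ("gemini3", "qwen"):
    passes = sum(1 for test in scores.values() if test[model] >= PASS_THRESHOLD)
    total = sum(test[model] for test in scores.values())
    print(f"{model}: {passes}/{len(scores)} passes, rubric {total}/{len(scores) * 5}")
```

Under pass/fail both models get credit for the Great Wave; the rubric keeps the 5-vs-3 distinction visible while still letting you derive the old pass count from a threshold.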
I agree with this; some of those are "passing" and others are really passing. Especially with how much better some of the new models are compared to the old ones.
The paws test is a good example: I think the new model got 100% while the other was more like 75%.
Alright, I think it's time to concede defeat! Seedream has been summarily demoted to a failure, and I've added the following minimum passing criteria to that particular test:
- The giraffe's neck should be noticeably shorter than in the original image, while still maintaining a natural appearance.
- The final image cannot be achieved by simply cropping out part of the neck or by using perspective tricks.
The Pisa tower test is really interesting. Many of these prompts have stricter criteria that depend on implicit knowledge, and some models impressively pass them. Yet something as obvious as straightening a slanted object is hard even for the latest models.
I suspect there'd be no problem rotating a different object. But this particular tower is EXTREMELY well represented in the training data. It's almost an immutable law of physics that Towers in Pisa are Leaning.
"A comparison of various SOTA generative image models on specific prompts and challenges with a strong emphasis placed on adherence."
Adherence is the more interesting problem, in my opinion, because quality issues can be ameliorated through the use of upscalers, refiner models, LoRAs, and similar tools. Furthermore, there are already a thousand existing benchmarks obsessed with visual fidelity.
I mean there’s a huge difference between a model that throws a black spot on someone’s head and another one that fills it with hair indistinguishable from the real thing. Which is why I’m saying this methodology is only marginally useful.
"Remove all the trash from the street and sidewalk. Replace the sleeping person on the ground with a green street bench. Change the parking meter into a planted tree."
Three sentences that do a great job summing up modern big tech. The new model even manages to [digitally] remove all trash.
Yep, no need for actual urbanism or to worry about the homeless; now governments and realtors can lie to you more conveniently and at industrial scale! Yay future
Would you consider leaving one of the originals in each test visible at all times (as a control), so that I can see the final image(s) I'm evaluating and the original image at the same time?
I guess if you do that then maybe you don't need the cool sliders anymore?
Anyway - thanks so much for all your hard work on this. A very interesting study!
It's an admittedly obscure reference to a cheating technique used in the Star Wars card game sabacc, which allows a player to surreptitiously switch out a card. I’m pretty sure I picked it up from one of Timothy Zahn's Thrawn books when I was a kid.
But I didn't know it had a meaning in Norwegian, so I guess TIL!
Awesome test suite. For the maze though, not sure it’s fair to knock it for extra dashed lines as the prompt didn’t specify that only the correct path should have one…
Definitely! Even though NB's predominant use case seems to be editing, it's still producing surprisingly decent text-to-image results. Imagen4 currently still comes out ahead in terms of image fidelity, but I think NB Pro will close the gap even further.
I'll try to have the generative comparisons for NB Pro up later this afternoon once I catch my breath.
If you just want to see how NB and NB Pro compare against each other:
https://genai-showdown.specr.net/image-editing?models=nb,nbp