If you generate 25x more images, you can afford to cherry-pick.

That trades compute time for user time. It's great when you want variations, less so when you want precision and consistency. Picking the best image tires the brain quite quickly: you have to weigh the at-a-glance quality without letting it override the detail quality.

I'd be curious to see how a vision model would do if it were finetuned to select the best image match for a given criterion.

It's possible that you could do o1-style training to build a final-stage auto-cherry-picker.


It would be interesting to have benchmarks that take this into account (maybe they already do, or I'm misunderstanding how those benchmarks work). I.e., when comparing quality between two models with vastly different performance, you could be doing best-of-n with the faster model.

That sounds like it could be an interesting metric. Worth noting there is a difference between algorithmic "best of n" selection (via, e.g., an FID score) and manual cherry-picking, which is what GP was suggesting: the latter takes more factors into account, such as user preference, and also takes time to evaluate.

This is a bit pedantic, but an FID score wouldn't really be viable for best-of-n selection, since it's only computable over a distribution of samples. FID is also quite high-variance for small sample sizes, so you need a lot of samples to get a meaningful score.

Better metrics (assuming the goal is text-to-image) would be some sort of inception score or a CLIP-based text-matching score. These metrics are computable on single samples.
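As a rough illustration of what CLIP-based best-of-n selection could look like, here's a minimal Python sketch using the Hugging Face transformers CLIP model. The checkpoint name, file paths, and the `best_of_n` helper are my own illustrative assumptions, not something from the thread:

  import torch
  from PIL import Image
  from transformers import CLIPModel, CLIPProcessor

  # Assumed checkpoint; any CLIP variant with image-text similarity would do.
  model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
  processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

  def best_of_n(prompt: str, image_paths: list[str]) -> str:
      """Return the candidate image whose CLIP embedding best matches the prompt."""
      images = [Image.open(p) for p in image_paths]
      inputs = processor(text=[prompt], images=images, return_tensors="pt", padding=True)
      with torch.no_grad():
          outputs = model(**inputs)
      # logits_per_image has shape (n_images, 1): image-text similarity scores.
      scores = outputs.logits_per_image.squeeze(1)
      return image_paths[scores.argmax().item()]

  # Hypothetical usage: pick the best of 25 generations for one prompt.
  # winner = best_of_n("a red bicycle in the snow", [f"gen_{i}.png" for i in range(25)])

The point is only that the scorer runs per sample, so it can rank each of the n candidates individually, unlike FID.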


Yeah, I'd likely just pick the best-scoring one (that is, the pick is made by the evaluation tool, not the model) to simulate "whatever the receiver deemed best for what they wanted".


