I’ve seen the standard evals and benchmarks for new LLMs, but they don’t really capture how I actually use them. My own test is pretty specific: whenever a new LLM drops, I ask it to “Write an advanced three.js music visualizer.” Then I compare its output to what older models produced by checking:
1. Does it use a recent version of three.js?
2. Does the generated code run out of the box?
3. How complex/innovative is the visualizer?
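For what it's worth, check 1 could probably be semi-automated. Here is a rough sketch in Python, assuming the generated code pulls three.js from a CDN URL that embeds the version. The function name `three_release` and the URL patterns are my own guesses at common shapes, not an exhaustive list, and it relies on three.js npm versions of the form `0.N.x` corresponding to release `rN`.

```python
import re
from typing import Optional

# Assumed (hypothetical, non-exhaustive) URL shapes that embed a three.js version.
_PATTERNS = [
    re.compile(r"three@0\.(\d+)\.\d+"),      # e.g. unpkg / jsDelivr: three@0.160.0
    re.compile(r"three\.js/r(\d+)"),         # e.g. cdnjs: three.js/r128
    re.compile(r"three\.js/0\.(\d+)\.\d+"),  # e.g. cdnjs: three.js/0.160.0
]


def three_release(generated_code: str) -> Optional[int]:
    """Return the highest three.js release number referenced in the code, or None."""
    found = [
        int(match.group(1))
        for pattern in _PATTERNS
        for match in pattern.finditer(generated_code)
    ]
    return max(found) if found else None
```

Comparing the returned release number across models gives a quick signal for check 1. Checks 2 and 3 still need a browser and a human eye.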
I’m really curious to hear about other people’s “real-world” benchmarks. What’s your personal test prompt or scenario that reveals whether a new LLM is actually useful for you? How do you decide if it’s truly better than the last version?