Ask HN: How do you personally evaluate LLMs?
2 points by cloudking 26 days ago
I’ve seen the standard evals and benchmarks for new LLMs, but they don’t really capture how I actually use them. My own test is pretty specific: whenever a new LLM drops, I ask it to “Write an advanced three.js music visualizer.” Then I compare it to older models by checking:

1. Does it use a recent version of three.js?

2. Does the generated code run out of the box?

3. How complex/innovative is the visualizer?
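
The first check is the easiest one to automate. Here is a rough sketch in Python, assuming the generated code loads three.js from a CDN URL or import map that embeds the version (npm-style `three@0.160.0` or a release-tag path like `three.js/r128`). The function names and the `min_revision` threshold are hypothetical, and code that bundles three.js locally would not be detected at all. Checks 2 and 3 still need a browser and a human.

```python
import re
from typing import List, Optional

# npm-style version, e.g. "three@0.160.0" (unpkg, jsDelivr, import maps).
# three.js publishes revision N to npm as 0.N.x, so the middle number is the revision.
_NPM_STYLE = re.compile(r"three@[\^~]?0\.(\d+)\.\d+")
# Release-tag style, e.g. ".../three.js/r128/three.min.js".
_TAG_STYLE = re.compile(r"three(?:\.js)?/r(\d+)\b")


def threejs_revisions(source: str) -> List[int]:
    """Return the sorted, de-duplicated three.js revisions referenced in `source`."""
    found = [int(m) for m in _NPM_STYLE.findall(source)]
    found += [int(m) for m in _TAG_STYLE.findall(source)]
    return sorted(set(found))


def uses_recent_threejs(source: str, min_revision: int) -> Optional[bool]:
    """True/False if a version was found, None if no version could be detected.

    `min_revision` is whatever the reader considers "recent"; it is an
    assumption of this sketch, not something defined by three.js.
    """
    revisions = threejs_revisions(source)
    if not revisions:
        return None
    # Use the oldest referenced revision so mixed imports count as outdated.
    return min(revisions) >= min_revision
```

Running this over the saved output of each model gives a quick way to compare the first criterion across releases, while the "runs out of the box" and "how innovative" checks remain manual.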

I’m really curious to hear about other people’s “real-world” benchmarks. What’s your personal test prompt or scenario that reveals whether a new LLM is actually useful for you? How do you decide if it’s truly better than the last version?



