Ask HN: What are you using for LLM response testing and benchmarking?
4 points by tin7in on June 14, 2023 | 5 comments
What are you using to test your LLM responses, benchmark them, maybe compare different versions?

I've seen a few YC startups focusing on this, but I haven't yet decided whether we should build this internally or use an external tool.

What features are critical for you in your response testing and benchmarking? I think it'll be easier to answer with specifics.


We are using chat/completion APIs, and we have a set of prompts that users can apply to selected text.

I would like to see how changing the system prompt or any of the logic in the pre-defined prompts affects the output.
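
To make it concrete, here is roughly the comparison I want to be able to run. Just a rough sketch assuming the openai Python package (pre-1.0 ChatCompletion API); the model, prompt variants, and sample text are placeholders rather than our real prompts:

    import os
    import openai  # pre-1.0 openai package

    openai.api_key = os.environ["OPENAI_API_KEY"]

    # Hypothetical system-prompt variants; the real prompts would go here.
    SYSTEM_PROMPT_VARIANTS = {
        "v1-concise": "Rewrite the selected text as a one-sentence summary.",
        "v2-detailed": "Rewrite the selected text as three bullet points.",
    }

    # Sample "selected text" inputs to run every variant against.
    SELECTED_TEXTS = [
        "LLM response testing compares model outputs across prompt and model versions.",
    ]

    def run_variant(system_prompt, text):
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            temperature=0,  # keep runs comparable across variants
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": text},
            ],
        )
        return resp["choices"][0]["message"]["content"]

    for text in SELECTED_TEXTS:
        for name, prompt in SYSTEM_PROMPT_VARIANTS.items():
            print(f"--- {name} ---")
            print(run_variant(prompt, text))

Pinning temperature to 0 keeps runs roughly comparable, so diffing the outputs before and after a system-prompt change shows the effect of the change itself.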


Could you let me know if one of these is what you're looking for?

I put this list together, and I'm pretty sure one of these should solve what you're after: https://llm-utils.org/List+of+tools+for+prompt+engineering


This is great, thank you! Promptclub and ChainForge look like exactly what I need.


This leaderboard has been useful: https://lmsys.org/blog/2023-05-25-leaderboard/

And you can participate in the arena, which pits them against each other. I'm surprised that I actually voted for GPT-3.5 over GPT-4 for a lot of my use cases.
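
If you want to do the same thing locally, the underlying idea is just a blind pairwise comparison: show two anonymized responses to the same prompt and record which one you preferred. A rough sketch (not the arena's actual code); the model labels and the out_35 / out_4 variables in the usage comment are placeholders:

    import random

    def blind_vote(prompt, response_a, response_b,
                   label_a="gpt-3.5-turbo", label_b="gpt-4"):
        # Shuffle so you can't tell which model produced which answer.
        pair = [(label_a, response_a), (label_b, response_b)]
        random.shuffle(pair)
        print(f"Prompt: {prompt}\n")
        for i, (_, resp) in enumerate(pair, start=1):
            print(f"Response {i}:\n{resp}\n")
        choice = int(input("Which response is better? (1/2): ")) - 1
        return pair[choice][0]  # the model you actually preferred

    # Usage: winner = blind_vote("Summarize this paragraph...", out_35, out_4)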



