Hey HN!
In my experiments with prompt engineering, I ran into a problem: most of the prompts I design can't be quantitatively tested because they don't have a right/wrong answer (e.g. providing essay feedback, deciphering corporate-speak in meeting minutes). That means I can't run evals, super powerful tools like ChainForge[1] are too high-overhead for my use case, and testing one prompt at a time in ChatGPT... sucks.
I built Prompt Octopus[2] to evaluate as many prompts as I want, side by side, and it's sped up my workflow dramatically. You can plug in an API key online or self-host (I added Python and Node.js boilerplates in the repo). Click the Octopus icon in the top right to switch models, view your history, and adjust the number of prompt-response boxes you're working with. I'm open-sourcing it here and would love your feedback, on both the UX and the self-hosting experience!
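To give a sense of the loop it replaces, here's a minimal sketch of comparing prompt variants side by side with the raw OpenAI SDK. This is a hypothetical example (placeholder model name and prompts), not Prompt Octopus's actual code:

    # Minimal sketch: run several prompt variants against the same input
    # and print the responses side by side. Assumes OPENAI_API_KEY is set.
    from openai import OpenAI

    client = OpenAI()

    variants = [
        "Give feedback on this essay: {text}",
        "You are a strict editor. Critique this essay: {text}",
    ]
    essay = "..."  # the input you want feedback on

    for prompt in variants:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt.format(text=essay)}],
        )
        print(prompt, "->", resp.choices[0].message.content, sep="\n")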
This week I hope to add diff checking, batched API calls to speed things up, and support for more LLMs.
[1] https://chainforge.ai/
[2] https://promptoctopus.com