Hey HN! We recently launched our tool for testing AI systems. The goal is to make it really easy for teams to maintain benchmarks for things like "factual accuracy in QA". So if you're building a customer support bot, you can test (during development) how often it lies about your products. It's all automatically graded, and it only shows you the interesting results.
It's essentially a scaled-up version of manually entering a bunch of test cases and seeing how the system performs.
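
To make that concrete, here's a rough sketch in plain Python of the kind of workflow I mean. This isn't our actual API; ask_support_bot and grade are placeholders you'd swap for your own bot and an LLM-as-judge grader:

    from dataclasses import dataclass

    @dataclass
    class TestCase:
        question: str          # what the user asks the bot
        must_mention: str      # fact the answer should contain
        must_not_claim: str    # hallucination we want to catch

    def ask_support_bot(question: str) -> str:
        """Placeholder: call your customer support bot here."""
        return "Our Pro plan includes a lifetime warranty."  # canned example answer

    def grade(case: TestCase, answer: str) -> bool:
        """Placeholder grader: naive string checks standing in for an LLM judge."""
        return (case.must_mention.lower() in answer.lower()
                and case.must_not_claim.lower() not in answer.lower())

    cases = [
        TestCase(
            question="Does the Pro plan come with a warranty?",
            must_mention="1-year warranty",
            must_not_claim="lifetime warranty",
        ),
    ]

    # Run every case and report only the interesting (failing) results.
    for case in cases:
        answer = ask_support_bot(case.question)
        if not grade(case, answer):
            print(f"FAIL: {case.question!r} -> {answer!r}")

The product does the tedious parts of this for you at scale: storing the cases, running the grading automatically, and surfacing only the failures worth looking at.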
If you're interested in LLM testing, evals, or benchmarking, let's chat!