Nir and team have built an amazing OSS package and have been fantastic to collaborate with (despite being competitors)! As an industry, I think more of us need to work together to standardize telemetry protocols, schemas, naming conventions, etc. since it’s currently all over the place and leads to a ton of confusion and headaches for developers (which ultimately goes against the whole point of using devtools in the first place).
We recently integrated OpenLLMetry into our SDKs with the sole purpose of offering standardization and interoperability with customers’ existing DevSecOps stacks. Customers have been loving it so far!
No idea how honest this is (I might have gotten a bit cynical), but reading this, it sounds like you guys have a really healthy, constructive competition with elements of cooperation! Love to see that.
I replied to you in a different thread; I don't think calling our companies "deceptive" will help you or me get anywhere. While I agree with you that detection will never be hermetic, I don't think that's the goal. By design you'll have hallucinations, and the question should be how you can monitor the rate and look for changes and anomalies.
Our stance here is that most model-graded evaluators from packages like RAGAS and similar don't work well out of the box and require tons of tuning and alignment (i.e. you need to change the evaluator prompt/criteria, run it against your own traces, check whether you agree with the results, and keep repeating that process in batches over time). We make that process easy in HoneyHive by letting you change the underlying evaluator prompt and test it against your recent traces to validate performance: https://docs.honeyhive.ai/evaluators/llm.
The main takeaway is that you can't think of model-graded evaluators as static tests you set up and forget about; you need to constantly tune, align, and validate them against your own human judgement (in other words, treat each evaluator as an LLM application in its own right!). They can't reliably detect fine-grained errors, but they've proven good at catching extreme outliers and, at best, serve as a fuzzy signal. For anyone interested, here's a great paper on how to align and validate evaluators: https://arxiv.org/pdf/2404.12272
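To make that tuning/alignment loop concrete, here's a rough sketch of the process (not our SDK, just an illustration that assumes an OpenAI-style chat API and made-up trace/label field names): run the evaluator prompt over recent traces, compare its verdicts to your own labels, and keep editing the prompt until agreement looks acceptable.

```python
# Hypothetical sketch of aligning a model-graded evaluator against human labels.
# Assumes an OpenAI-style chat API; field names ("context", "answer",
# "human_label") are made up for illustration.
from openai import OpenAI

client = OpenAI()

EVALUATOR_PROMPT = """You are grading an AI assistant's answer for faithfulness.
Context: {context}
Answer: {answer}
Reply with exactly one word: PASS or FAIL."""

def evaluate(trace: dict) -> str:
    """Run the evaluator prompt over a single trace and return PASS/FAIL."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": EVALUATOR_PROMPT.format(**trace)}],
        temperature=0,
    )
    return "PASS" if "PASS" in resp.choices[0].message.content.upper() else "FAIL"

def agreement_rate(traces: list[dict]) -> float:
    """Compare evaluator verdicts against human labels on recent traces.
    If agreement is low, edit EVALUATOR_PROMPT and re-run this loop."""
    hits = sum(evaluate(t) == t["human_label"] for t in traces)
    return hits / len(traces)

# traces = [{"context": "...", "answer": "...", "human_label": "PASS"}, ...]
# print(agreement_rate(traces))  # e.g. 0.7 -> keep iterating on the prompt/criteria
```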
The real solution here is still relying on human judgement as much as possible and using tools that make reading your data easier and more scalable. Think of evaluators as a sampling function that reduces the number of traces humans need to manually review. Another point worth noting: deterministic metrics (e.g. keyword assertions) often cover ~60-80% of real-world failure modes and should be used liberally.
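A keyword assertion is just a cheap, deterministic check over the output string; here's a rough illustration (the function names are made up, not any particular library's API):

```python
# Illustrative deterministic checks (keyword assertions); function names are
# hypothetical and not tied to any specific eval library.
def must_contain(output: str, keywords: list[str]) -> bool:
    """Pass only if every required keyword appears in the model output."""
    return all(kw.lower() in output.lower() for kw in keywords)

def must_not_contain(output: str, banned: list[str]) -> bool:
    """Catch common failure modes like refusal or apology boilerplate."""
    return not any(b.lower() in output.lower() for b in banned)

# Example: only escalate a trace for human review when the cheap checks fail.
# ok = must_contain(answer, ["refund policy"]) and must_not_contain(answer, ["as an AI"])
```

Checks like these are trivially fast and fully reproducible, which is exactly why they pair well with the fuzzier model-graded signals described above.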