Nir and team have built an amazing OSS package and have been fantastic to collaborate with (despite being competitors)! As an industry, I think more of us need to work together to standardize telemetry protocols, schemas, naming conventions, etc. since it’s currently all over the place and leads to a ton of confusion and headaches for developers (which ultimately goes against the whole point of using devtools in the first place).
We recently integrated OpenLLMetry into our SDKs with the sole purpose of offering standardization and interoperability with customers’ existing DevSecOps stacks. Customers have been loving it so far!
No idea how honest this is (I might have gotten a bit cynical), but reading this, it sounds like you guys have a really healthy, constructive competition with elements of cooperation! Love to see that.
I replied to you in a different thread; I don't think calling our companies "deceptive" will help you or me get anywhere. While I agree with you that detection will never be hermetic, I don't think that's the goal. By design you'll have hallucinations, and the question should be how you can monitor the rate and look for changes and anomalies.
Our stance here is that most model-graded evaluators from packages like RAGAS and similar don't work well out of the box and require tons of tuning and alignment (i.e. you need to change the evaluator prompt/criteria, run it against your own traces, check whether you agree with the results, and keep repeating that process in batches over time). We make that process easy in HoneyHive by letting you change the underlying evaluator prompt and test it against your recent traces to validate performance: https://docs.honeyhive.ai/evaluators/llm.
The main takeaway is that you can't think of model-graded evaluators as static tests you set up and forget about; you need to constantly tune, align, and validate them against your own human judgement (in other words, treat each evaluator as an LLM application in its own right!). They can't reliably detect fine-grained errors, but they've proven good at catching extreme outliers and, at best, serve as a fuzzy signal. For anyone interested, here's a great paper on how to align and validate evaluators: https://arxiv.org/pdf/2404.12272
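To make that tuning/alignment loop concrete, here's a rough sketch of the process (not our SDK, just an illustration that assumes an OpenAI-style chat API and made-up trace/label field names): run the evaluator prompt over recent traces, compare its verdicts to your own labels, and keep editing the prompt until agreement looks acceptable.

```python
# Hypothetical sketch of aligning a model-graded evaluator against human labels.
# Assumes an OpenAI-style chat API; field names ("context", "answer",
# "human_label") are made up for illustration.
from openai import OpenAI

client = OpenAI()

EVALUATOR_PROMPT = """You are grading an AI assistant's answer for faithfulness.
Context: {context}
Answer: {answer}
Reply with exactly one word: PASS or FAIL."""

def evaluate(trace: dict) -> str:
    """Run the evaluator prompt over a single trace and return PASS/FAIL."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": EVALUATOR_PROMPT.format(**trace)}],
        temperature=0,
    )
    return "PASS" if "PASS" in resp.choices[0].message.content.upper() else "FAIL"

def agreement_rate(traces: list[dict]) -> float:
    """Compare evaluator verdicts against human labels on recent traces.
    If agreement is low, edit EVALUATOR_PROMPT and re-run this loop."""
    hits = sum(evaluate(t) == t["human_label"] for t in traces)
    return hits / len(traces)

# traces = [{"context": "...", "answer": "...", "human_label": "PASS"}, ...]
# print(agreement_rate(traces))  # e.g. 0.7 -> keep iterating on the prompt/criteria
```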
The real solution here is still relying on human judgement as much as possible and using tools that make reading your data easier and more scalable. Think of evaluators as a sampling function that reduces the number of traces humans need to manually review. Another point worth noting: deterministic metrics (e.g. keyword assertions) often cover ~60-80% of real-world failure modes and should be used liberally.
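A keyword assertion is just a cheap, deterministic check over the output string; here's a rough illustration (the function names are made up, not any particular library's API):

```python
# Illustrative deterministic checks (keyword assertions); function names are
# hypothetical and not tied to any specific eval library.
def must_contain(output: str, keywords: list[str]) -> bool:
    """Pass only if every required keyword appears in the model output."""
    return all(kw.lower() in output.lower() for kw in keywords)

def must_not_contain(output: str, banned: list[str]) -> bool:
    """Catch common failure modes like refusal or apology boilerplate."""
    return not any(b.lower() in output.lower() for b in banned)

# Example: only escalate a trace for human review when the cheap checks fail.
# ok = must_contain(answer, ["refund policy"]) and must_not_contain(answer, ["as an AI"])
```

Checks like these are trivially fast and fully reproducible, which is exactly why they pair well with the fuzzier model-graded signals described above.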