Thanks! We’ve broken our evals down into three primary categories — integrity, consistency and performance.
Integrity tests tackle data quality issues (e.g. no PII in input data, no duplicate rows, schema checks on specific fields).
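To make that concrete, here's a minimal sketch of the kinds of checks an integrity test covers, written as plain pandas assertions (the file path and column names are made up for illustration, not Openlayer's API):

    import re
    import pandas as pd

    df = pd.read_csv("training_data.csv")  # hypothetical dataset

    # No duplicate rows
    assert not df.duplicated().any(), "duplicate rows found"

    # Schema check on a specific field
    assert df["label"].isin(["positive", "negative", "neutral"]).all(), "unexpected label value"

    # Naive PII check: no email addresses in the input text column
    email_re = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    assert not df["text"].astype(str).str.contains(email_re).any(), "possible email address in input data"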
Consistency tests help ensure your fine-tuning and validation datasets are well constructed in relation to one another (e.g. they don't overlap and are sized correctly), and that your production data doesn't drift from your reference data.
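A rough standalone version of these checks could look like the following sketch (paths, column names, and thresholds are illustrative; the drift check here is just a KS test on one numeric field):

    import pandas as pd
    from scipy.stats import ks_2samp

    train = pd.read_csv("train.csv")
    val = pd.read_csv("validation.csv")
    prod = pd.read_csv("production.csv")

    # Training and validation splits should not overlap
    overlap = pd.merge(train, val, how="inner")
    assert overlap.empty, f"{len(overlap)} rows appear in both splits"

    # Rough relative size check
    assert len(val) >= 0.1 * len(train), "validation set looks too small"

    # Simple drift check: compare a numeric field between reference and production data
    _, p_value = ks_2samp(train["token_count"], prod["token_count"])
    assert p_value > 0.01, f"possible drift in token_count (p={p_value:.4f})"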
Performance tests focus on your model outputs and measure common metrics for each task (e.g. accuracy, F1, and precision/recall for classification), as well as custom metrics evaluated by an LLM (e.g. "make sure these outputs don't contain profanity"). You can apply these metrics to specific subpopulations of your data by setting filters on your input fields.
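As a rough illustration of the subpopulation idea, outside of Openlayer you could compute the same metrics with scikit-learn after filtering on an input field (the column names here are hypothetical):

    import pandas as pd
    from sklearn.metrics import accuracy_score, f1_score

    # Ground truth in "label", model output in "prediction"
    df = pd.read_csv("production_outputs.csv")

    # Filter to a specific subpopulation via an input field
    subpop = df[df["age_group"] == "18-25"]

    print("accuracy:", accuracy_score(subpop["label"], subpop["prediction"]))
    print("macro F1:", f1_score(subpop["label"], subpop["prediction"], average="macro"))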
Re: adding your own evals — yes, you can! The evals are not statically defined — they are flexible structures that allow you to customize them to your needs.
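To give a flavor of the LLM-judged custom metrics mentioned above, here's a minimal standalone sketch using the OpenAI client; the prompt, model name, and pass/fail convention are illustrative, not Openlayer's actual API:

    from openai import OpenAI

    client = OpenAI()

    def passes_no_profanity_check(output_text: str) -> bool:
        """Ask an LLM judge whether a model output contains profanity."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # example model; any capable model works
            messages=[
                {"role": "system", "content": "Answer with exactly PASS or FAIL."},
                {"role": "user", "content": f"Reply FAIL if the following text contains profanity, PASS otherwise:\n\n{output_text}"},
            ],
        )
        return response.choices[0].message.content.strip().upper().startswith("PASS")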
Re: importing evaluations from other libraries — this is something we’re adding more support for. We’ve just added an integration with Great Expectations, and can add an integration with OpenAI’s evals if that is something the community is interested in.
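For reference, the kind of Great Expectations suite you might want to carry over looks roughly like this (using the classic pandas-dataset API; newer releases have a different entry point, and the column names are made up):

    import great_expectations as ge
    import pandas as pd

    gdf = ge.from_pandas(pd.read_csv("train.csv"))

    gdf.expect_column_values_to_be_unique("id")
    gdf.expect_column_values_to_not_be_null("label")
    gdf.expect_column_values_to_be_in_set("label", ["positive", "negative", "neutral"])

    results = gdf.validate()
    print(results.success)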
We’ve actually been building a testing and evaluation platform from the start, but we began with discriminative ML tasks like classification and regression. We waited to do a Launch HN because we were mostly focused on enterprise / mid-market.
These past few months, however, we’ve prioritized building out features for testing and monitoring LLMs.
LLMs certainly have their unique challenges, but the evaluation problem in general is not new, and much of what we’ve built historically is very much applicable to this new crop of ML use cases!
We realize the lack of pricing information isn’t ideal, and that it will turn some people away. In the meantime, we do have a free plan with generous limits that lets you get started self-serve. The plan isn’t time-limited, so there’s no pressure to upgrade unless you need higher data limits.
On open-core — we’ve been considering open-sourcing the engine that evaluates your models. Will have more on this soon!
We’re definitely prioritizing increasing transparency, and we appreciate your feedback about it!
Broadly, on the monitoring side, we’re more focused on evaluating the quality of the model’s outputs (is it violating your rules, is it handling specific subpopulations / edge cases correctly, etc.). OpenLLMetry is more focused on telemetry and tracing, whereas for us ‘monitoring’ is a means of running your tests on production data.
Openlayer is also intended for non-LLM use cases. Here are a few other ways we’re different:
1. Support for other ML task types
2. Includes a development mode for versioning and experimentation
3. Native Slack and email alerts (OpenLLMetry might integrate with other platforms that do that, but we’re not sure)
4. Collaboration is deeply embedded into the product
Traceloop's landing page is all about model quality, not metrics. Their open-source OpenLLMetry is the metrics part and hooks into the OpenTelemetry ecosystem. There should be no issue with getting alerts via that ecosystem; it's prominent on their pages.