Launch HN: Openlayer (YC S21) – Testing and Evaluation for AI
94 points by rishramanathan on Dec 5, 2023 | 31 comments
Hey HN, Rish, Vikas and Gabe here. We're building Openlayer (https://www.openlayer.com/), an observability platform for AI. We've developed comprehensive testing tools to check both the quality of your input data and the performance of your model outputs.

The complexity and black-box nature of AI/ML have made rigorous testing a lot harder than it is in most software development. Consequently, AI development involves a lot of head-scratching and often feels like walking in the dark. Developers need reliable insights into how and why their models fail. We're here to simplify this for both common and long-tail failure scenarios.

Consider a scenario in which your model is working smoothly. What happens when there's a sudden shift in user behavior? This unexpected change can disrupt the model's performance, leading to unreliable outputs. Our platform offers a solution: by continuously monitoring for sudden data variations, we can detect these shifts promptly. That's not all though – we’ve created a broad set of rigorous tests that your model, or agent, must pass. These tests are designed to challenge and verify the model's resilience against such unforeseen changes, ensuring its reliability under diverse conditions.

We support seamlessly switching between (1) development mode, which lets you test, version, and compare your models before you deploy them to production, and (2) monitoring mode, which lets you run tests live in production and receive alerts when things go sideways.

Say you're using an LLM for RAG and want to make sure the output is always relevant to the question. You can set up hallucination tests, and we'll buzz you when the average score dips below your comfort zone.
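
To give a rough sense of what such a check boils down to, here's a minimal hand-rolled sketch using an LLM judge. The prompt, judge model, and threshold below are illustrative choices, not our actual API or test definitions:

    # Illustrative only: a hand-rolled relevance check of the kind a
    # hallucination test automates. The judge prompt, model, and threshold
    # are arbitrary choices, not Openlayer's actual API.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def relevance_score(question: str, answer: str) -> float:
        """Ask a judge model to rate the answer's relevance from 0 to 1."""
        prompt = (
            "On a scale from 0 to 1, how relevant is this answer to the "
            "question? Reply with only the number.\n"
            f"Question: {question}\nAnswer: {answer}"
        )
        reply = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return float(reply.choices[0].message.content.strip())

    def average_relevance_ok(pairs: list[tuple[str, str]], threshold: float = 0.7) -> bool:
        """The condition a test watches: alert when the average dips below the threshold."""
        scores = [relevance_score(q, a) for q, a in pairs]
        return sum(scores) / len(scores) >= threshold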

Or imagine you're managing a fraud prediction model and are losing sleep over false negatives. Openlayer offers a two-step solution. First, it helps pinpoint why the model misses certain fraudulent data points using debugging tools such as explainability. Second, it enables converting these identified cases into targeted tests. This allows you to deep dive into tackling specific incidents, like fraud within a segment of US merchants. By following this process, you can understand your model's behavior and refine it to capture future fraudulent cases more effectively.
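
As a rough illustration of what such a targeted test checks, here's the idea in plain Python. The column names, the "US" segment filter, and the 0.9 floor are hypothetical, not how tests are actually defined on the platform:

    # Illustrative only: recall on a single merchant segment, since missed
    # fraud shows up as false negatives. Column names, the "US" filter, and
    # the 0.9 floor are hypothetical.
    import pandas as pd
    from sklearn.metrics import recall_score

    def us_merchant_recall(df: pd.DataFrame) -> float:
        """Recall on the US-merchant slice; low recall means missed fraud."""
        segment = df[df["merchant_country"] == "US"]
        return recall_score(segment["is_fraud"], segment["prediction"])

    def test_us_merchant_recall(df: pd.DataFrame) -> bool:
        """Registered once, then re-run on every model version and on production data."""
        return us_merchant_recall(df) >= 0.9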

The MLOps landscape is currently fragmented. We’ve seen countless data and ML teams glue together a ton of bespoke and third-party tools to meet basic needs: one for experiment tracking, another for monitoring, and another for CI automation and version control. With LLMOps now thrown into the mix, it can feel like you need yet another set of entirely new tools.

We don’t think you should, so we're building Openlayer to condense and simplify AI evaluation. It’s a collaborative platform that solves long-standing ML problems like the ones above, while tackling the new crop of challenges presented by Generative AI and foundation models (e.g. prompt versioning, quality control). We address these problems in a single, consistent way that doesn't require you to learn a new approach. We’ve spent a lot of time ensuring our evaluation methodology remains robust even as the boundaries of AI continue to be redrawn.

We're stoked to bring Openlayer to the HN community and are keen to hear your thoughts, experiences, and insights on building trust into AI systems.




How is it different from Traceloop and OpenLLMetry (https://github.com/traceloop/openllmetry)?


Broadly, on the monitoring side, we're more focused on evaluating the quality of the model's outputs (is it violating your rules, handling specific subpopulations / edge cases correctly, etc.). OpenLLMetry is more focused on telemetry and tracing, whereas for us 'monitoring' is a means to running your tests on production data.

Openlayer’s also intended to be used on non-LLM use cases. Here are a few other ways we’re different:

1. Support for other ML task types

2. Includes a development mode for versioning and experimentation

3. Native Slack and email alerts (OpenLLMetry might integrate with other platforms that do that, but not sure)

4. Collaboration is deeply embedded into the product


Traceloop's landing page is all about model quality, not metrics. Their open-source OpenLLMetry is the metrics part and hooks into the OpenTelemetry ecosystem. There should be no issue with getting alerts via the ecosystem; it's prominent on their pages.

https://www.traceloop.com/


I think the target personas are different. While they might have the same capabilities, the job-to-be-done is different.

OpenLLMetry is focused on engineers who want to use it as more of a piping solution, and it sits on top of OpenTelemetry. While OpenTelemetry is a popular solution, this is just applying an existing solution to a new problem.

Openlayer, to me, is thinking about the ML/AI problems from the ground up, while serving data scientists and probably prompt engineers.



For one thing, the name is certainly better than the latter's lol


I find OpenLLMetry to be better.

1. OpenLayer does not say metrics or monitoring to me

2. OpenLLMetry builds on OpenTelemetry, which it very much reminds me of as a name. It's also a much easier add-on to our existing stack. I don't want to have to log into some company's website to view metrics for a single part of my stack when trying to understand why things are not working as expected.

3. OpenLLMetry is open core, which is what devs desire. Who is really using closed-source software in this space now (the logmon space, not AI, though both are largely chasing after open dreams)?


How does it compare to other platforms like https://rungalileo.io or https://lilacml.com?


Compared to Galileo, we offer a more comprehensive suite of evals that supports tasks beyond LLMs and NLP.

We offer more features around error and subpopulation analysis, versioning, running evals during development, and collaboration, all through what (I believe) is a cleaner and simpler DevEx and UI!

Re: Lilac, there's some overlap w/r/t dataset exploration, but we have more evals than the ones they offer. Beyond data quality, we give insights into data drift and model performance, and let you set up expectations and get alerts when they fail during development and production. We're also distinct in some of the ways described above.

We're really happy to see more tools and platforms in this space. There's definitely been a big uptick since we started 3 years ago; with the advent of gen AI, this is all top of mind (and deservedly so).


Hmm, YC S21, so they pivoted into this after 2 years of doing something different?


We’ve actually been building a testing and evaluation platform from the start, but started with discriminative ML tasks like classification and regression. We waited to do a Launch HN because we were mostly focused on enterprise / mid-market.

These past few months, however, we’ve prioritized building out features for testing and monitoring LLMs.

LLMs certainly have their unique challenges, but the evaluation problem in general is not new, and much of what we’ve built historically is very much applicable to this new crop of ML use cases!


There is nothing wrong with pivoting


Another YC pivot to AI from yesterday: https://news.ycombinator.com/item?id=38516795


Nothing wrong with pivoting. Or maybe I'm misreading you and the parent's "tone".


Agreed, nothing wrong with pivoting; it was mostly an anecdotal point, wondering if this is going to be a trend for YC companies.

I do wonder if non-AI talent pivoting to AI is wise or defensible. Do you want to use a service where the creators don't really understand the tech and are not much more than a wrapper around an API? Is that really a defensible posture in a competitive market?

We will see. I'm sure some will hire talent and have the data to do something special.


Awesome idea. I'm curious how comprehensive your set of evaluations is. For example, how does it compare to OpenAI Evals? Could I import evaluations from there? Add my own?


Thanks! We’ve broken our evals down into three primary categories — integrity, consistency and performance.

Integrity tests tackle data quality issues (e.g. no PII in input data, no duplicate rows, schema checks on specific fields).

Consistency tests help ensure your fine-tuning & validation datasets are well constructed in relation to one another (e.g. don’t have overlap, are sized correctly), and your production data doesn’t drift from your reference data.

Performance tests are focused on your model outputs, and measure common metrics for each task (e.g. accuracy, F1, PR for classification) as well as custom metrics designed to be evaluated by an LLM (e.g. “make sure these outputs don’t contain profanity”). You can apply these metrics to specific subpopulations of your data by setting filters on your input fields.
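
To make the first two categories a bit more concrete, here's roughly what they check, written as plain Python. This is illustrative only; the column name and the 0.05 significance level are arbitrary, and it is not how tests are actually defined on the platform:

    # Illustrative only: plain-Python versions of an integrity check and two
    # consistency checks. Thresholds and data shapes are arbitrary choices.
    import pandas as pd
    from scipy.stats import ks_2samp

    def no_duplicate_rows(df: pd.DataFrame) -> bool:
        """Integrity: fail if the dataset contains exact duplicate rows."""
        return not df.duplicated().any()

    def no_train_val_overlap(train: pd.DataFrame, val: pd.DataFrame) -> bool:
        """Consistency: fail if identical rows appear in both splits."""
        return train.merge(val, how="inner").empty

    def no_feature_drift(reference: pd.Series, production: pd.Series) -> bool:
        """Consistency: two-sample KS test; fail when the distributions differ."""
        _, p_value = ks_2samp(reference, production)
        return p_value >= 0.05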

Re: adding your own evals — yes, you can! The evals are not statically defined — they are flexible structures that allow you to customize them to your needs.

Re: importing evaluations from other libraries — this is something we’re adding more support for. We’ve just added an integration with Great Expectations, and can add an integration with OpenAI’s evals if that is something the community is interested in.


Nice to see this launch! I was waiting until they had a native JS library, but we've been using it since and it covers everything we need.


Thanks! Glad Openlayer is working well for you :)


Just FYI, "openlayers" is the name of a widely used open source web mapping frontend library. There's a possibility for some confusion there.

https://openlayers.org/


Congrats on the product, looks great. What model formats are supported?


You can upload just the predictions of the model (and whatever metadata you want to track), so in that sense any format is supported.

If you want to unlock explainability for your tabular classification, tabular regression, or text classification models, you can upload the actual model binary. We support a bunch of frameworks out of the box, but you can use any architecture through our custom upload.
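
For the first option, what you send is essentially a table of inputs, predictions, and whatever metadata you want to track, along the lines of the sketch below. The upload call shown is hypothetical; the docs linked below describe the real interface:

    # Illustrative only: the shape of a predictions-only upload. The upload
    # call at the end is hypothetical; see the linked docs for the real interface.
    import pandas as pd

    predictions = pd.DataFrame({
        "text":          ["great product", "never arrived"],
        "prediction":    ["positive", "negative"],
        "label":         ["positive", "negative"],  # optional ground truth
        "model_version": ["v3", "v3"],              # arbitrary metadata to track
    })

    # client.upload_predictions(project="sentiment", data=predictions)  # hypothetical call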

More info:

https://docs.openlayer.com/documentation/how-to-guides/uploa...

https://docs.openlayer.com/documentation/how-to-guides/write...


Curious how well this works / how it would work if users are not directly interacting with the LLM!


Not sure I follow — could you elaborate on what you mean by “directly interacting”?


Using "Open" in a name has become quite the hype.


Congrats! FYI your link rendering seems funky and doesn't seem to be clickable?


Oops, thanks for the heads up! Fixed.


No GitHub*, no pricing; both likely to be issues on HN.

*OK, there is a gallery project, but I would expect something like this to be the open-source variety of startup. I very much expect it to be open core.


We realize the lack of information about pricing isn’t ideal, and that people will be turned away by this. In the meantime, we do have a free plan with generous limits that allows you to get started self-serve. This plan isn’t time bounded, so there won’t be pressure to upgrade unless you need increased data limits.

On open-core — we’ve been considering open-sourcing the engine that evaluates your models. Will have more on this soon!

We’re definitely prioritizing increasing transparency, and we appreciate your feedback about it!


Big fan of Openlayer since rdv!


This is really going to confuse people searching for OpenLayers, a major web mapping package :(

https://openlayers.org/

It has an API with class names like "Observable", and there are frequent discussions on inputs and performance. It's gonna make searching for one or the other really hard...



