Launch HN: Traceloop (YC W23) – Detecting LLM Hallucinations with OpenTelemetry
101 points by GalKlm 51 days ago | 72 comments
Hey everyone, we are Nir and Gal from Traceloop (https://www.traceloop.com). We help teams understand when their LLM apps are failing or hallucinating at scale. See a demo: https://www.traceloop.com/video or try it yourself at https://www.traceloop.com/docs/demo.

When you move your LLM app to production, sheer scale makes it harder for engineers and data scientists alike to understand when the LLM is hallucinating or returning malformed responses. Once you get to millions of calls to OpenAI a month, methods like “LLM as a judge” can’t work at a reasonable cost or latency. So what most people we talked to usually do is sample some generations by hand, maybe for some specific important customers, and manually look for errors or hallucinations.

Traceloop is a monitoring platform that detects when your LLM app fails. Under the hood, we built real-time versions of known metrics like faithfulness, relevancy, redundancy, and many others. These are loosely based on some well-known NLP metrics that work well for LLM-generated texts. We correlate them with changes we detect in your system - like updates to prompts or to the model you’re using - to detect regressions automatically.

Here are some cool examples we’ve seen with our customers -

1. Applying our QA relevancy metric to an entity extraction task, we managed to discover cases where the model was not extracting the right entities (returning an address instead of a person’s name, for example), or was returning random answers like “I’m here! What can I help you with today?”.

2. Our soft-faithfulness metric was able to detect cases in summarization tasks where a model was completely making up stuff that never appeared in the original text.

One of the challenges we faced was figuring out how to collect the data that we need from our customers' LLM apps. That’s where OpenTelemetry came in handy. We built OpenLLMetry (https://github.com/traceloop/openllmetry), and announced it here almost a year ago. It standardized the use of OpenTelemetry to observe LLM apps. We realized that the concepts of traces, spans, metrics, and logs that were standardized with OpenTelemetry can easily extend to gen AI. We partnered with 20+ observability platforms to make sure that OpenLLMetry becomes the standard for GenAI observability and that the data that we collect can be sent to other platforms as well.
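
For anyone curious what that instrumentation looks like in practice, here's a minimal sketch of wiring OpenLLMetry into a Python app with the traceloop-sdk package (the app and function names are made up; check the OpenLLMetry docs for the exact current API):

```python
# pip install traceloop-sdk openai
from openai import OpenAI
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow

# Initialize OpenLLMetry; traces are exported over standard OTLP, so they
# can go to Traceloop or to any other OpenTelemetry-compatible backend.
Traceloop.init(app_name="joke_app")

client = OpenAI()

@workflow(name="tell_joke")
def tell_joke(topic: str) -> str:
    # The OpenAI call below is auto-instrumented: model, token usage, and
    # (optionally) prompts/completions are recorded as span attributes.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Tell me a joke about {topic}"}],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(tell_joke("OpenTelemetry"))
```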

We plan to extend the metrics we provide to support agents that use tools, vision models, and other amazing developments in our fast-paced industry.

We invite you to give Traceloop a spin and are eager for your feedback! How do you track and debug hallucinations? How much has that been an issue for you? What types of hallucinations have you encountered?




Just wanted to say great work on standardizing otel for LLM applications (https://github.com/open-telemetry/semantic-conventions/tree/...) and open-sourcing OpenLLMetry. We're also building in this space, focusing more on eval (agenta). I think using otel would make the whole space move much faster.


Thanks so much! I always say I'm a strong believer in open protocols, so I'd love to help if you want to use OpenLLMetry as your SDK. We've onboarded other startups/competitors like Helicone and HoneyHive and it's been tremendously successful (hopefully that's what they'll tell you as well).


HoneyHive founder here.

Nir and team have built an amazing OSS package and have been fantastic to collaborate with (despite being competitors)! As an industry, I think more of us need to work together to standardize telemetry protocols, schemas, naming conventions, etc. since it’s currently all over the place and leads to a ton of confusion and headache for developers (which ultimately goes against the whole point of using devtools in the first place).

We recently integrated OpenLLMetry into our SDKs with the sole purpose of offering standardization and interoperability with customers’ existing DevSecOps stacks. Customers have been loving it so far!


No idea how honest this is (I might have gotten a bit cynical), but reading this, it sounds like you guys have a really healthy, constructive competition with elements of cooperation! Love to see that.


Your startup is as deceptive as Traceloop.

You make claims like "detect LLM errors like hallucination" even though you have no guaranteed ability to do this.

At best you can assist in detection.

As someone who works at a large enterprise deploying LLMs I can tell you many people are getting pretty tired of the false claims.


I replied to you in a different thread; I don't think calling our companies "deceptive" will help either of us get anywhere. While I agree with you that detection will never be perfect, I don't think that's the goal. By design you'll have hallucinations, and the question should be how you monitor the rate and look for changes and anomalies.


Our stance here is most model-graded evaluators from packages like RAGAS and similar don't work well out-of-the-box, and require tons of tuning and alignment (i.e. you need to change the evaluator prompt/criteria, run it against your own traces, and see if you agree with the results, and continue this process in batches over time). We make that process easy in HoneyHive by allowing you to change the underlying evaluator prompt and test it against your recent traces to validate performance: https://docs.honeyhive.ai/evaluators/llm.

The main takeaway is you can't think of model-graded evaluators as static tests that you can set up and forget about; you need to constantly tune, align them, and validate them against your own human judgement (aka treat them like an LLM application in-and-of-itself!). They cannot detect fine-grained errors reliably, but they've been proven to detect extreme outliers well and can serve as a fuzzy signal at best. For anyone interested, here's a great paper on how to align and validate evaluators: https://arxiv.org/pdf/2404.12272

The real solution here is still relying on human judgement as much as possible and using tools that make reading your data easier/scalable. Think of evaluators as a sampling function to reduce the number of traces humans need to manually review. Another point worth noting is that deterministic metrics (e.g. keyword assertions) often cover ~60-80% of failure modes in the real world and should be used liberally.
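
To make the "deterministic metrics" point concrete, here's a minimal sketch of a keyword-assertion check (a hypothetical helper, not from any particular package) that is cheap enough to run on every trace:

```python
def keyword_assertions(output: str,
                       required: list[str] | None = None,
                       forbidden: list[str] | None = None) -> dict:
    """Cheap deterministic checks: flag outputs missing required phrases
    or containing forbidden ones (refusals, apologies, placeholders)."""
    text = output.lower()
    missing = [kw for kw in (required or []) if kw.lower() not in text]
    found = [kw for kw in (forbidden or []) if kw.lower() in text]
    return {"passed": not missing and not found,
            "missing_required": missing,
            "forbidden_found": found}

# Example: a support-bot answer about refunds should mention the policy
# and should never contain a canned refusal.
print(keyword_assertions(
    "I'm sorry, I can't help with that.",
    required=["refund policy"],
    forbidden=["i can't help", "as an ai"],
))
# {'passed': False, 'missing_required': ['refund policy'], 'forbidden_found': ["i can't help"]}
```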


Where can I learn more detail about the metrics you support and how they work?

I tried multiple other solutions but kept running into the problem that occasionally the framework would give me some score/evaluation of an LLM response that didn't make any sense, and there was minimal information about how it came up with the score. Often, I'd end up digging into the implementation of the framework to find the underlying evaluation prompt or classifier only to realize that the metric name is confusing or results are low confidence. I'm more cautious about using these tools now and look more deeply at how they work so that I can assess grading quality before relying on them to identify problematic outputs (e.g. hallucinations).


I think the issue is that many of these metrics (e.g. RAGAS) are LLM-as-a-judge metrics. These are very far from reliable; making them reliable is still a research problem. I've seen a couple of startups training their own LLM judge models to solve this. There is also some work that attempts to improve reliability through sampling, such as G-Eval (https://github.com/nlpyang/geval).

One needs to think of these metrics as a way to filter all the data to find potential issues, not as a final evaluation criterion. The gold standard should be human evaluators.


Are there any approaches today that you've found are at least mostly reliable? Bonus points if it is somewhat clear/easy/predictable to know when it isn't or won't be.

We use human evaluation but that is naturally far from scalable, which has especially been a problem when working on more complicated workflows/chains where changes can have a cascading effect. I've been encouraging a lot of dev experimentation on my team but would like to get a more consistent eval approach so we can evaluate and discuss changes with more grounded results. If all of these metrics are low confidence, they become counterproductive since people easily fall into the trap of optimizing the metric.


I tend to find classic NLP metrics more predictable and stable than "LLM as a judge" metrics, so I'd try relying on them more.

We've written a couple of blog posts about some of them: https://www.traceloop.com/blog


for your blog can i offer a big downvote for the massive ai generated cover image thing? it's a trend for normies but for developers it's absolutely meaningless. give us info density pls


roger that! I like them though (am I a normie then?)


We trained our own models for some of them, and we combined some well-known NLP metrics (like GRUEN [1]) to make this work.

You're right that it's hard to figure out how to "trust" these metrics. But you shouldn't look at them as a way to get an objective number about your app's performance. They're more of a way to detect deltas - regressions or changes in performance. When you get more alerts or more negative results, you know you've regressed; when you get fewer, you know you're improving. And this works for tools like RAGAS as well as for our own metrics, in my view.

[1] https://www.traceloop.com/blog/gruens-outstanding-performanc...
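
(For what it's worth, the "detect deltas, not absolute scores" idea is easy to operationalize. Here's a rough sketch with made-up numbers that flags a regression when the recent mean of a metric drops well below its historical baseline; a real metrics pipeline is of course more involved than this.)

```python
from statistics import mean, stdev

def regressed(scores: list[float], window: int = 50, z_threshold: float = 2.0) -> bool:
    """Compare the mean of the most recent `window` scores against the
    historical baseline; flag a drop of more than `z_threshold` standard
    errors below the baseline mean."""
    if len(scores) < 2 * window:
        return False  # not enough history yet
    baseline, recent = scores[:-window], scores[-window:]
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return mean(recent) < mu
    z = (mean(recent) - mu) / (sigma / window ** 0.5)
    return z < -z_threshold

# Made-up faithfulness scores: stable around 0.9, then a drop after a prompt change.
history = [0.9] * 200 + [0.7] * 50
print(regressed(history))  # True
```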


This is poorly worded. Detecting "hallucinations" as the term is commonly used, as in a model making up answers not actually in its source text, or answers that are generally untrue, is fundamentally impossible. Verifying the truth of a statement requires empirical investigation. It isn't a feature of language itself. This is just the basic analytic/synthetic distinction identified by Kant centuries ago. It's why we have science in the first place and don't generate new knowledge by reading and learning to make convincing sounding arguments.

Your far more scaled-down claim, however, that you can detect answers that don't address the prompt at all, or summaries that make claims not actually present in the original text, is definitely doable, but it raises a maybe naive or stupid question. If you can do this, why not sell an LLM that simply doesn't do these stupid things in the first place? Or why don't the people currently selling LLMs just automatically detect obvious errors and avoid making them? Doesn't your business as constituted depend upon LLM vendors never figuring out how to do this themselves?


And what's the false positive rate? It's all well and good that you find most answers that are hallucinations, but do you also flag a significant % of answers that are not really hallucinations? For instance, if a summarization doesn't use any sentences or even words from the original text, that doesn't necessarily mean it's a hallucination. It could simply be a fully paraphrased summary.


Could you not detect likely hallucinations by running the same prompt multiple times across different models and looking at the vector divergence between the outputs? Kind of like a consensus check between, say, GPT, Llama, and other models that together signal: yes, this is likely a hallucination.

It's not 100% but enough to basically say to the human: "hey, look at this".
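
For the curious, here's a rough sketch of that idea using a cheap lexical-agreement proxy (Jaccard similarity over tokens) rather than embedding distance; the `outputs` list stands in for re-running the same prompt against several models:

```python
import re
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    ta = set(re.findall(r"[a-z0-9]+", a.lower()))
    tb = set(re.findall(r"[a-z0-9]+", b.lower()))
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def flag_low_agreement(outputs: list[str], threshold: float = 0.3) -> bool:
    """Flag for human review when the average pairwise similarity between
    outputs (same prompt, different models/runs) is low."""
    pairs = list(combinations(outputs, 2))
    avg = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    return avg < threshold

# Made-up outputs for "What year was the company founded?"
outputs = [
    "The company was founded in 1998 in Palo Alto.",
    "It was founded in 1998.",
    "The founding date is unclear, possibly 2003 or 2005.",
]
print(flag_low_agreement(outputs))  # True -> "hey, look at this"
```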


You can, and it's a good way of doing that - from our experiments it can catch most errors. You don't even need to use different models - even the same model (I don't mean asking "are you sure?", just re-running the same workflow) will give you nice results. The only problem is that it's super expensive to run on all your traces, so I wouldn't recommend it as a monitoring tool.


I don't know if they are just casually using logos on their homepage, but why the heck would Google (and even IBM) be using a product like this? Like… your entire future depends on getting this right and you're using a startup with 2-5 people to do this for you?!!

Make it make sense..


If that’s true, they should want to evaluate all the options out there to ensure they’re not missing out.

Though I think it’s more likely there’s some Googler who happens to use this service; note how the wording is “Engineers […] use our products[…]” rather than “Companies”.


Someone with a gmail address most likely.


Congratulations on launch.

This is a crowded market, and there are many tools doing the same thing.

How are you differentiating yourself from other tools like:

Langfuse, Portkey, Keywords AI, Promptfoo


Thanks!

We differentiate in 2 ways:

1. We focus on real-time monitoring. This is where we see the biggest pain with our customers, so we spent a lot of time researching and building the right metrics that can run at scale, fast and at low cost (and you can try them all in our platform).

2. OpenTelemetry - we think this is the best way to observe LLM apps. It gives you a better understanding of how other parts of the system are interacting with your LLM. Say you're calling a vector DB, or making an HTTP call - you get them all on the same trace. It's also better for the customers - they're not vendor-locked to us and can easily switch to another platform (or even use them in parallel).
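
To illustrate what "everything on the same trace" means, here's a minimal hand-rolled sketch using the OpenTelemetry Python API (OpenLLMetry's auto-instrumentation produces this for you; the retriever and LLM calls below are just placeholders):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("rag-app")

def answer(question: str) -> str:
    # Everything below shares one trace, so a slow vector-DB query and a
    # slow LLM call show up side by side instead of in separate tools.
    with tracer.start_as_current_span("rag.answer") as root:
        root.set_attribute("app.question", question)
        with tracer.start_as_current_span("vectordb.query") as db_span:
            db_span.set_attribute("db.system", "my-vector-db")  # placeholder
            docs = ["a retrieved document"]                      # pretend retrieval
        with tracer.start_as_current_span("llm.completion") as llm_span:
            llm_span.set_attribute("gen_ai.request.model", "gpt-4o-mini")
            return f"Answer based on {len(docs)} docs"           # pretend LLM call

print(answer("What is OpenTelemetry?"))
```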


not to mention langsmith? braintrust? humanloop? does that count? not sure what else - let's crowdsource a list here so that people can find them in the future


I'm not sure which ones are OTel compliant. I'm only aware of 3 that are:

1. Traceloop Otel 2. Langtrace.ai Otel 3. OpenLIT Otel 4. Portkey 5. Langfuse 6. Arize LLM 7. Phoenix SDK 8. Truera LLM 9. Truelens 10. Context 11. Braintrust 12. Parea 13. Context AI 14. openlayer.com 15. Deepchecks 16. langsmith 17. Confident AI 18. Helicone 19. Langwatch.ai 20. Arthur 21. Aporia 22. scale.com 23. Whylabs 24. gentrace.ai 25. humanloop.com 26. fixpoint.co 27. W&B Traces 28. Langtail 29. Fiddler 30. Evidently AI 31. Superwise 32. Exxa 33. Honeyhive 34. Flowstack 35. Log10 36. Giskard 37. Raga AI 38. AgentOps 39. Patronus AI 40. Mona 41. Bricks AI 42. Sentify 43. LogSpend 44. Nebuly 45. Autoblocks 46. Radar / Langcheck 47. Dokulabs 48. Missing studio 49. Lunary.ai 50. Censius.ai 51. MLflow 52. Galileo 53. trubrics 54. PromptLayer 55. Athina 56. getnomos.com 57. c3.ai 58. baselime.io 59. Honeycomb LLM


This is a great list, I'm planning on writing some sample apps and blogs about OpenTelemetry for LLM's and this will be helpful. Which are the most popular open source ones amongst these?


dear god, where is this list from? surely not hand curated?


60. Radiant.AI 61. Weights & Biases (Weave) 62. Quotient AI (some observability there)


* 6. Arize LLM is also OTel compliant (via OpenInference)


thanks for the list, super handy!


I have it internally, I can share it if you want!

But to the point of comparison between these and tools like Traceloop - it's interesting to see this space and how each platform takes its own path and finds its own use cases.

LangSmith works well within the LangChain ecosystem, together with LangGraph and LangServe. But if you're using LlamaIndex, or even just vanilla OpenAI, you'll spend hours setting up your observability systems.

Braintrust and Humanloop (and to some extent other tools I saw in this area) take the path of "full development platform for LLMs".

We try to look at it the way developers look at tools like Sentry. Continue working in your own IDE with your own tools (wanna manage your prompts in a DB or in git? Wanna use LLMs your own way with no frameworks? No problem). We install in your app with one line, work around your existing code base, and make monitoring, evaluation, and tracing work.


I'd love to see that list!


Ping me over slack (traceloop.com/slack) or email nir at traceloop dot com


I started an open list (on github) of awesome open source repos for AI Engineers. It covers repos that help with building RAG apps, Agents, Dataset preparation, Fine tuning, Evaluation, Observability etc. Good to crowdsource these repos and products. https://github.com/sydverma123/awesome-ai-repositories


As users of otel, we are looking at reusing otel for our LLM stack, and since it is easy to instrument, we don't need a new framework for that part.

However, the more interesting part is the storage. Imagine ingesting 100-page PDFs or 1M tweets, and doing many/big LLM map/reduce jobs with big (128K+) contexts. In observability land, we generally have small payloads, sample data, and retire data... and backends + pricing assume that. With LLMs, we instead might want some data hot, the rest in the DWH, and to store everything.

How have folks been dealing with these kinds of mismatches? E.g., ClickHouse backends for otel? Something else? Small stuff in otel and big stuff manually in a doc store / S3 JSON / Parquet?


You're right. We faced those same issues. So we plan to send prompts and completions as log events with a reference to the trace/span, rather than putting them on the span itself.

The span then contains only the most important data, like the prompt template, the model that was used, token usage, etc. You can then split the metadata (spans and traces) and the large payloads (prompts + completions) into different data stores.
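
A rough sketch of that split, with a plain dict standing in for the blob store (S3/MinIO/etc.) and assuming a TracerProvider is already configured; the llm.* attribute names here are illustrative, not official semantic conventions:

```python
import json
from opentelemetry import trace

tracer = trace.get_tracer("llm-app")
blob_store = {}  # stand-in for S3/MinIO: key -> large payload

def record_llm_call(prompt_template: str, model: str, prompt: str,
                    completion: str, total_tokens: int) -> None:
    with tracer.start_as_current_span("llm.completion") as span:
        ctx = span.get_span_context()
        key = f"{ctx.trace_id:032x}/{ctx.span_id:016x}"
        # Large payloads go to cheap object storage, keyed by trace/span id...
        blob_store[key] = json.dumps({"prompt": prompt, "completion": completion})
        # ...while the span keeps only small, queryable metadata plus a pointer.
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("llm.prompt_template", prompt_template)
        span.set_attribute("llm.usage.total_tokens", total_tokens)
        span.set_attribute("llm.payload_ref", key)
```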


At Portkey, this is a problem we deal with quite a bit. It's also the reason Datadog and the traditional observability vendors didn't work for LLM use cases: they're not built to handle such large volumes of data.

We've done this through a careful combination of ClickHouse + MinIO: ClickHouse for fast retrieval of log items, plus selective retrieval from the MinIO buckets.

Cost becomes a very big factor when managing, filtering and searching through TBs of data even for fairly small use cases.

One thing we lost in the process is full-text search over the request/response pairs, and while we try to intelligently add metadata to requests to make searching easier, it isn't the complete experience yet. It's still a WIP problem statement, and maybe the last missing piece here. Any suggestions?


ClickHouse has text + vector indexes, so that may be native, though we have never used them and I find vector indexes tricky to scale with other DBs. Text... or neither... may be enough in practice though, as we mostly only care about searching on metadata dimensions like task.

We are thinking about sampled hot data for ops staff in otel DBs+UIs, and long-term full data in S3/ClickHouse for custom tooling. It'd be cool if we could send historical otel sessions from ClickHouse to Grafana etc. on demand, but that's likely a bridge too far...


I think you can (pretty) easily set this up with an otel collector and something that replays data from S3 - there's a native implementation that converts otel to clickhouse


Our scenario would be more like using Clickhouse / a dwh for session cohort/workflow filtering and then populating otel tools for viz goodies. Interestingly, to your point, the otel python exporter libs are pretty simple, so SQL results -> otel spans -> Grafana temp storage should be simple!
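
In case it's useful, a rough sketch of the "SQL results -> otel spans" step: the OpenTelemetry Python API lets you create spans with explicit start/end timestamps (in nanoseconds), so replaying historical rows is mostly a loop. The row/column names are made up, and a TracerProvider with the right exporter is assumed to be configured:

```python
from opentelemetry import trace

tracer = trace.get_tracer("replay")

# Pretend these rows came back from ClickHouse / the DWH (made-up columns).
rows = [
    {"name": "llm.completion", "start_ms": 1700000000000, "duration_ms": 850, "model": "gpt-4o-mini"},
    {"name": "vectordb.query", "start_ms": 1700000000100, "duration_ms": 40, "model": None},
]

for row in rows:
    start_ns = row["start_ms"] * 1_000_000
    # Spans can be created with explicit historical timestamps...
    span = tracer.start_span(row["name"], start_time=start_ns)
    if row["model"]:
        span.set_attribute("gen_ai.request.model", row["model"])
    # ...and ended at an explicit time, then exported to Tempo/Grafana etc.
    span.end(end_time=start_ns + row["duration_ms"] * 1_000_000)
```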


> Know when your LLM app is hallucinating or malfunctioning

It astonishes me that you are willing to make so many deceptive claims on your website like this.

You have no ability to detect with any certainty hallucinations. No one in the industry does.


I think it depends on the use case and how you define hallucinations. We've seen our metrics perform well (i.e., correlate with human feedback) for use cases like summarization, RAG question-answering pipelines, and entity extraction.

At the end of the day, things like "answer relevancy" are pretty binary, in the sense that for a human evaluator it will be pretty clear whether an answer is addressing the question or not.

I wonder if you can elaborate on why you claim there's no ability to detect hallucinations with any certainty.


clearly LLM app has added such logic to their app:

``` if (query.IsHallucinated()) { notifyHumanOfHallucination(); } ```

this one line will get them that unicorn eval


I think that LLMs hallucinate by design. I'm not sure we'll ever get to 0% hallucinations, and we should be OK with that (at least for the coming years?). So getting an alert on a single hallucination becomes less interesting. What is more interesting, perhaps, is knowing the rate at which this happens, and keeping track of whether that rate increases or decreases over time or with changes to models.


Big congrats on the official launch!

Slightly tooting my own horn here, but at OpenPipe we've got a collaboration set up with Traceloop. That means you can record your production traces in Traceloop then export them to OpenPipe where you can filter/enrich them and use them to fine-tune a super strong model. :)


congrats on launch!

the thing about OTel is that it is by nature vendor agnostic. so if i use OpenLLMetry, i should be able to pipe my otel traces to whatever existing o11y tool I use right? what is the benefit of a dedicated monitoring platform?

(not cynical, just inviting you to explain more)


Great question, and I see you already got a similar answer, but I'll add some of my thoughts. We are actively promoting OpenLLMetry as a vendor-agnostic way of observing LLMs (see some examples [1], [2]). We believe people may start with whatever vendor they work with today and gradually shift to, or add, something like Traceloop because of specific features we have - for example, the ability to take the raw data that OpenLLMetry outputs and add another layer of "smart metrics" (like QA relevancy, faithfulness, etc.) that we calculate on our backend/pipelines, or better tooling around observability of LLM calls, agents, etc.

[1] https://docs.newrelic.com/docs/opentelemetry/get-started/tra...

[2] https://docs.dynatrace.com/docs/observe-and-explore/dynatrac...


Not OP here (but building in the same space). The reason you instrument LLM data is usually to improve the quality/speed of your applications. The tools to extract the insights that enable that, and the integration with your LLM experimentation workflow, are the differentiators between a general observability solution and an LLM-specific one.


oh cool. do you also consume OTel? or something else?


Right now we have our own instrumentation but we're working towards Otel compatibility.


Thank you for spending your time on something that is a barrier to AI adoption.

Can you talk about your detection rates? False positives and false negatives. Perhaps you are still figuring this out

I’m not sure why so many folks are being so derisive on this post.


Thanks! It can vary greatly between use cases, but we've seen extremely high detection rates for tagged texts (>95%). When switching to production this gets trickier, since you don't know what you don't know (so it's hard to tell how many "bad examples" we're missing). Our false positive rate (the number of examples that were tagged as bad but weren't) has been around 2-3% of the overall examples tagged as bad (positive), and we're always working on decreasing it.


Check out these Wikipedia articles:

Confabulation https://en.m.wikipedia.org/wiki/Confabulation

Hallucination https://en.m.wikipedia.org/wiki/Hallucination

What drove the AI industry to blow off accepted naming from psychopathology and use the word for PERCEPTUAL errors to refer to LANGUAGE OUTPUT errors?

When AI hallucinates, and AI people already use the preferred term “hallucination” to label confabulations, then what’s the new word for “hallucinations?”

How will we avoid serious errors in understanding if hallucination in AI means confabulation in humans and $NEW_TERM in AI means hallucination in humans?

Just seems harmful to gloss over this humongous vocabulary error.

How can we claim to respect the difficulty of naming things if we all select the wrong answer to a basic undergrad psychology multiple choice question with only two options?

It feels like painting ourselves into a corner which will inevitably make computer scientists look dumb. Who here wants to look dumb for no reason?

I don’t want to be negative, but is using the blatantly wrong word for confabulation a good idea in the long term?


if i may theorize: one of these two terms is generally recognised by the broader english speaking community


Acknowledging that AI is unreliable, the solution is to layer another AI to hopefully let you know about it. Of course, brilliant, why did I expect anything different from the AI industry.


?? but who is monitoring the AI layer monitoring the AI who produced the original output ??

openai audited by claudeai which is then audited by gemini ai...

then to close the loop, gemini ai is then audited by openai


I had read the OP's comment as sarcastic, but you never know these days lol

Your concern would be exactly mine as well, and why I assumed "brilliant" was sarcasm, cause it feels like handing over the problem to the same solution that got you the problem in the first place?


It has the same logic as saying you don't want to use a computer to monitor or test your code, since that would mean a computer monitoring a computer. AI is a broad term; I agree you can use GPT (or any LLM) to grade an LLM in an accurate way, but that's not the only way you can monitor.


> computer to monitor or test your code since it will mean that a computer will monitor a computer

I mean... you don't trust the computer in that case, you trust the person who wrote the test code. Computers do what they're told to do, so there's no trust required of the computer itself. If you swap out the person (that you're trusting) writing that code with an AI writing that test code, then it's closer to your analogy - and in that case, I (and the guy above me, it seems) wouldn't trust for anything impactful.

Even if you're not using an LLM specifically (which no one in this chain even said you were), an AI built off some training set to eliminate hallucinations is still just an AI. So you're still using an AI to keep an AI in check, which raises the question (posed above): what keeps your AI in check?

Poking fun at a chain of AI's all keeping each other in check isn't really a dig at you or your company. It's more of a comment on the current industry moment.

Best of luck to you in your endeavor anyway, by the way!


Thanks! I wasn’t offended or anything, don’t get the wrong impression.

What strikes me as odd is the notion that an AI checking an AI is inherently an issue. AI can mean a lot of things - an encoder architecture, a neural network, or a simple regression function. And at the end of the day, similar to what you said, there was a human building and fine-tuning that AI.

Anyway, this feels more of a philosophical question than an engineering one.


(it was sarcastic. Too late to edit in a /s)


people are lazy, we're more than happy to not be in the loop


I'm sorry but this is not what we do. We don't use LLMs to grade your LLM calls.


Congrats on the official launch, Nir and Gal! Deeply appreciate your contributions to OTel as well.


"accross" is misspelled on your front page


"Support Respons" is also misspelled in your product screenshot


Thanks for spotting those! We'll fix it asap


there's a well known artist named traceloops who has a prolific/longstanding body of work. why did you choose this name?


I know! When we started, every time I googled "traceloop" that was the first result.

2 reasons why we chose it (in this order):

1. traceloop.com was available

2. we work with traces


an available .com is basically the only reason you should use https://paulgraham.com/name.html


I doubt anyone would be confused with Traceloops the artist vs Traceloop the LLM Observability Platform


[flagged]


I think that's the key benefit of using OpenTelemetry - it's pretty efficient and the performance footprint is negligible.



