OpenTelemetry (github.com/open-telemetry)
195 points by privacyonsec 33 days ago | 65 comments

I don't quite get it, what does this do?

The readme says absolutely nothing (e.g. "The OpenTelemetry specification describes the cross-language requirements and expectations for all OpenTelemetry implementations.", then goes on to describe how to submit changes and which proprietary platform hosts the meeting minutes), and the Overview document goes into depth about terminology (what a trace is, what a span within a trace is, how to link spans, etc.).

Nowhere does it say if this is supposed to, for example, replace proprietary crash reporting in apps so that we can know what is being reported back to the mothership, or if this is something completely different.

You can check the first link that states the overview - https://github.com/open-telemetry/opentelemetry-specificatio...

It's a distributed request tracing specification, plus a set of libraries implementing that spec.

we also explain it a bit on our website, https://opentelemetry.io

The fact that this is the top comment and most people are confused about what Open Telemetry is about should indicate that it is not clear.

This is an all-too-common mistake, and not a knowing one. It's like proofreading: I always think my writing is correct until someone else points out the mistakes.

Thanks for the feedback, I just opened a PR to add a link from the spec readme to our website to hopefully guide people to better information.

With so many people finding things via GitHub, I'd highly recommend also including a single paragraph at the top of the repo readme to explain what it is, rather than expect people to be curious enough to go digging for more info. E.g. you could use the explanation from the website: "OpenTelemetry provides a single set of APIs, libraries, agents, and collector services to capture distributed traces and metrics from your application. You can analyze them using Prometheus, Jaeger, and other observability tools."

Yeah, particularly for a developer-oriented product. A GitHub readme is essentially a landing page and should be treated the same way, not assuming any foreknowledge on the reader's part.

I want to clarify my comment and what I meant - when you've been working on your own product for months and years, it's hard to know what others don't know about it. Similar to proofreading, your eyes just "gloss" over :) I didn't mean to patronize or say you made a mistake.

Before I stopped reading, there was an example of a button being pushed... but there was no actual trace of a button being pushed... just lots of words and descriptions, which killed it for me.

I just want to see what I'm getting into, not a vocabulary or jargon lesson, at least at first.

That would then lead us to wonder why it was done that way, etc., and drive interest.

Why not just put a couple of paragraphs at the top of the readme explaining what it is and what it does, and add this link, instead of posting in a forum where nobody will see it?

A specific question I had: tracing seems to be all about distributed context (from the way everyone talks about/markets it), where each distributed call is captured in a span.

But I didn't see anything (perhaps I missed it) about tracing inside a single instance, where I want individual method calls traced so I know which call was expensive within that instance.

You can check out Mastering Distributed Tracing by Yuri Shkuro (the author of Jaeger). Lesson 2 talks about tracing monoliths - https://github.com/yurishkuro/opentracing-tutorial/tree/mast...

Spans are intervals with a start and stop and a set of tags for querying.

These can be local or cross-service.

I guess the documentation focuses on the "distributed" part of tracing, but it works perfectly well for local spans too.

Just a quick summary.

Trace - One unique id for an "action", say a customer purchasing an item. A set of spans exists under a trace.

Span - An interval with a start time and duration. Can add tags for querying. Can add logs for richer information but not indexed. Each span belongs to exactly one trace.

Baggage - key-value pairs you can pass between services. They aren't tagged as part of the span or trace, but the receiving service can use these values to make decisions or add them to its own local spans as tags or logs.

Under the hood, it's just a set of HTTP headers passed between services. The first span creates a trace id and passes that to all downstream services as a header.
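For the curious, the header format most implementations have converged on is W3C's `traceparent` (version, trace id, parent span id, flags). A minimal sketch in plain Python of minting and propagating it - the helper names are mine, not an OTel API:

```python
import secrets

def make_traceparent(trace_id=None, span_id=None, sampled=True):
    """Build a W3C traceparent header: version-traceid-parentid-flags.
    The first service mints a trace id; downstream calls reuse it."""
    trace_id = trace_id or secrets.token_hex(16)  # 128-bit trace id
    span_id = span_id or secrets.token_hex(8)     # 64-bit span id
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(header):
    version, trace_id, parent_id, flags = header.split("-")
    return {"trace_id": trace_id, "parent_span_id": parent_id,
            "sampled": flags == "01"}

# A downstream hop keeps the trace id but becomes the new parent span:
incoming = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
outgoing = make_traceparent(trace_id=incoming["trace_id"])
```

Every service in the call chain does roughly this: parse the incoming header, keep the trace id, and send a fresh span id downstream.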

Each service independently pushes its span data to a centralised collector, which indexes the data and exposes it over a UI/API.

OpenTelemetry provides a specification and some reference implementation across languages.

At least, this is my understanding of it. I've dabbled with Jaeger but I'm not affiliated with any of these.

To all the OpenTelemetry people reading this page - please include the above description somewhere in your project. I actually googled the definition of telemetry because I had no idea what you were talking about, I thought it had something to do with space observations given all the telescopes? My second guess was that it had something to do with the way data from accelerometers was handled? Standardising tooling for sensors?

This comment was the first thing to explain that it's better logging. That's actually something I'd be interested in: running several different data science services/analyses, logging was something I implemented then gave up on because it sucks so much. You're losing a lot of potential users because you've put zero effort into explaining in simple terms what you are. A "From logging to telemetry" primer would probably garner you a lot more interest.

it's probably not on the front page, but they actually explain what telemetry and observability are here:


But I agree this is a better explanation :)

My first thought was "TV viewership metrics", but in this context the adjective "open" seemed a little contradictory.

Thank you, these definitions are great. I muddled my way through the example and had to infer the difference between trace and span from context.

I still don't get how traces/spans are shared across RPC boundaries, but I know that's a thing one can do.

As I understand it, you pass your current span ID to the remote RPC endpoint, it continues adding to it from there, creating their own nested spans etc.

When it gets back to you and all spans are merged, you can see the whole picture, but don't need to provide all the details all the way down.
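To make the "merged spans" idea concrete, here's a sketch of how a backend might rebuild the tree from the flat list of spans each service reported - the record layout is illustrative, not a real collector format:

```python
from collections import defaultdict

# Flat span records as a collector might receive them (illustrative).
spans = [
    {"span_id": "a1", "parent_id": None, "name": "checkout"},
    {"span_id": "b2", "parent_id": "a1", "name": "payment-svc"},
    {"span_id": "c3", "parent_id": "b2", "name": "db.query"},
    {"span_id": "d4", "parent_id": "a1", "name": "email-svc"},
]

def build_tree(spans):
    """Group spans by parent id, then walk down from the root span."""
    children = defaultdict(list)
    for s in spans:
        children[s["parent_id"]].append(s)
    def node(span):
        return {"name": span["name"],
                "children": [node(c) for c in children[span["span_id"]]]}
    return node(children[None][0])  # the root has no parent

tree = build_tree(spans)
# "checkout" has two children: "payment-svc" (with "db.query" under it)
# and "email-svc"
```

Since each service only ever knows its own parent's id, no single service needs the whole picture; the collector reassembles it after the fact.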

Here be dragons. I really want to use this project, but the spec and libraries keep changing. I would not adopt this yet, too much is in flux.

As an aside, I think this project is suffering from analysis paralysis. Ship a stable 1.0, iterate towards a 2.0 eventually based on user feedback. You aren't going to release a perfect 1.0.

I agree. They were very quick to update their website to say "we're making a new thing", but the new thing isn't ready.

I used the metric implementation in Go (as someone with extensive Prometheus experience), and was shocked at how they managed to write so much code to do so little. You will be amazed at how many levels of indirection there are, with a little assumption hard-coded at every single point of indirection. (The net result being, in my case, that I couldn't actually create a gauge in Prometheus.)

I just use Jaeger and Prometheus together and they've never treated me wrong. I'm watching otel closely, but I wouldn't suggest that you use it quite yet, at least not for Go.

Disclaimer: I work on OpenTelemetry specification and Collector.

I don't contribute to Go implementation but I encourage you to submit the feedback to https://github.com/open-telemetry/opentelemetry-go or https://gitter.im/open-telemetry/opentelemetry-go

You can also attend Go SIG meetings and provide the feedback directly. SIG meetings are open for attendance: https://github.com/open-telemetry/community#special-interest...

I am sure the Maintainers of Go implementation will appreciate your feedback.

Which Jaeger go library are you using? https://github.com/jaegertracing/jaeger-client-go?

For Prometheus, I was using the official Go client, but I found the VictoriaMetrics one a lot simpler and lighter. My go.mod was a lot smaller after swapping.

I use the official ones for both, but haven't even considered looking for others.

Jaeger is a little annoying in that what configuration the library accepts depends on the language. For example, you have to modify your code to emit B3 (zipkin-compatible) traces, which is annoying. Leads to PRs like this: https://github.com/grafana/grafana/pull/17009

Disclaimer: I work on OpenTelemetry specification and Collector.

It is true that the spec and libraries have been changing. This is to be expected since we haven't yet made a stable release. We are aiming for 1.0 GA release this year. The current plan is to freeze trace spec this week, followed by metric spec freeze in 3 weeks, followed by the release of several language implementations in 4 weeks after the spec freeze.

Like everyone else I wish we could release earlier, but the reality is that making dozens of companies agree on a standard takes time. I am very excited that we are approaching the finish line.

But it's been around for 5 years. First as opentracing (actually thanks Ben and LightStep!), then google decided to compete with opencensus (no thanks google), and now we have opentelemetry.

It's really, really, really annoying. After 5 years you might expect things to have calmed down. It was great at the start - like, actually. Even when baggage was rusty it had a direction and was usable. It probably still is, but the churn is crazy.

You say making dozens of companies agree is difficult, and yeah - it is - but that's because of other companies throwing out specs trying to compete with LightStep.

This is not in competition to lightstep or any one vendor in particular. It is to converge all open source & proprietary solutions to follow a common spec and allow them to build on top of a common base.

OpenTracing dealt only with tracing; it didn't care about logs or metrics. OpenCensus did all 3. This combines the best of OpenTracing (which is tracing) with the idea OpenCensus brought - just one library to collect and ship all the observability data your application is supposed to emit.

And as always when vendors build standards: it takes a while. Not surprised here.

On the fundamentals you agree with me: the goalposts shifted from telemetry to logging.

IMO they're different things. One deals with call graphs, timings, and associated baggage. The other is logs. Should we hold off on the spec for another 5 years and add datadog/prometheus metrics too?

I don't think one spec should be all-encompassing. OpenTracing is great for tracing. Logs can be baggage. Memory use can be baggage. The temperature of the CPU and the current disk space is another platform altogether.

It's like saying everything in the world should run webassembly, and no other ISA matters. It's an idea! But, certainly a distraction from the benefits. Why not let tracing deal with tracing before messing with the 1.0?

> the goalposts shifted from telemetry to logging

Telemetry has always included logging as one of its components.

> Should we hold off on the spec for another 5 years and add datadog/prometheus metrics too?

I think you should look at OTel once again. OTel already includes metrics and supports exporting to Prometheus.

And I don't want to reply to the rest of your comments because I don't feel you understand this problem space enough before making such remarks on it.

Fair. I worked with Ben to integrate opentracing into docker before Kube won, and at that point it was tracing only. Tracing as in spans, annotations, and baggage spanning services. So, it has changed from call graphs/timing etc to telemetry in the broad sense.

We're talking about churn here, not my knowledge, and I think considering how you say I don't understand it we can both agree it has churned into something else. As OP said, there is a lot of churn. Too much for your common system to keep up with in the (pun intended) span of a couple years. 3 rebrands in fewer years isn't promising.

Unfortunately this is what happens when you market promises instead of libraries. If CNCF hadn't been so eager to market OTel over OpenCensus, this spec would have evolved naturally, but you know, this is business at work.

I don't use any of their libraries, but follow the spec closely.

At the very least, it's forcing some standardization among APM vendors (DataDog 64b trace_id, I'm looking at you).

Also, related: https://www.w3.org/TR/trace-context/

Even 32 bytes is completely over the top for trace ID. Even the 16 bytes in this spec is pretty much just wasted space.

I think Datadog's trace_id is 64 bits which I thought was a good balance between efficiency and uniqueness. Pretty sure Dapper used 64 bits as well.

I also don't understand the purpose of the 16 bytes for the trace_id in the spec. That's a huge range of numbers. Anyone know the rationale?

Disclaimer: I work on OpenTelemetry spec.

Many tracing solutions settled on 128bits/16 bytes trace ids. Here is Jaeger's rationale: https://github.com/jaegertracing/jaeger/issues/858

It is also recommended by W3C: https://www.w3.org/TR/trace-context/#trace-id
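A back-of-the-envelope birthday bound also shows what's at stake when ids are generated randomly with no coordination (my own arithmetic, not from the spec or the Jaeger issue):

```python
import math

def ids_for_collision_prob(bits, p):
    """Approximate number of independently generated random ids before
    the probability of any collision reaches p (birthday bound):
    n ~= sqrt(2 * 2^bits * ln(1 / (1 - p)))."""
    return math.sqrt(2 * (2 ** bits) * math.log(1 / (1 - p)))

# Ids needed before a ~1% chance of at least one collision:
n64 = ids_for_collision_prob(64, 0.01)    # ~6.1e8 (hundreds of millions)
n128 = ids_for_collision_prob(128, 0.01)  # ~2.6e18 (effectively never)
```

So a 64-bit id space starts flirting with collisions at high-volume deployments (hundreds of millions of traces), while 128 bits is comfortably out of reach, which is one plausible reason so many systems settled there.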

BigBrotherBird (now OpenZipkin... thanks legal, sigh) used 128b trace_ids when we first built it at Twitter. I don’t recall the reasoning, but that’s the first system I know of which chose that size.

Dapper used 64b IDs for span and trace, but being locked inside the Googleplex probably limited its influence on compatibility issues.

My point is that 128b is the common standard now, and that’s all that I really care about - that the standard exists and APM systems conform to it. To that end, I am very pro-otel.

Thanks for your work.

Neither Jaeger nor W3C seem to present any justification for 16 byte trace identifiers, just FUD.

So how does this differ from OpenCensus, OpenMetrics, and OpenTracing? It would be great if one open-something combined metrics and other observability tools in one spec/toolset, like Application Insights does. Otherwise, developers waste a lot of time evaluating various open-whatevers and having to use more than one.

I think OpenCensus and OpenTracing are being combined into OpenTelemetry. I'm not sure about OpenMetrics.

That's correct. And OpenMetrics is a standardization effort on metrics (basically, standardize and improve Prometheus format).

It would be great if they could all come under a common OpenMonitoring umbrella, the way ApplicationInsights does it.

Not sure what OpenMonitoring is but OpenTelemetry = Tracing + Metrics + Logging. So we don't really need any more Open-X standards in the observability space IMO

You are right, according to https://opentelemetry.io/about/

Logging support is incubating.

Yeah, we're trying to solve for this over at https://vector.dev. We intend to decouple the pipeline from the method. As demonstrated by OT, this stuff changes, and you shouldn't have to rip out your plumbing every time it does. We're aiming to support all data (logs, metrics, and traces) as well as popular standards.

How does it differ from ApplicationInsights in Azure?

disclaimer: otel maintainer, etc.

one thing I want to point out is that, eventually, we'd like for a lot of the complaints people have to be... well, things that you don't have to complain about, because it's not important. three or four years from now, it'd be nice to see a world where most people don't actually have to interact with otel at all because it's either built-in to the libraries/frameworks they're using, or because they're using some kind of wrapper that helps with best practices.

you can see an example of this with something we're doing at lightstep - we're introducing launchers that wrap upstream otel and standardize config format/defaults between multiple languages (https://github.com/lightstep/?q=launcher), and trying to provide "missing link" documentation (https://otel.lightstep.com).

i suspect that eventually the question of "how do i use opentelemetry" becomes a moot point because it's already there.

The java implementation seems to lean on trying to pass context objects around implicitly as a ThreadLocal.

This will cause pain and suffering in the presence of async, multithreaded code: your trace context will be present in the thread where your request handler began, but won't be present in threads running callbacks from non-blocking IO libraries (netty, akka-http, async-http-client, redisson, etc).

Async style is not incompatible with storing the trace context in thread-local storage. You just capture the trace context when you create the callback. That's how Dapper works, in C++ and Java. See section 2.2 of the Dapper paper.
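That capture-at-creation pattern can be sketched in a few lines of Python (a thread-local stands in for Java's ThreadLocal; all names here are illustrative):

```python
import threading

# A hypothetical thread-local holding the active trace context.
_local = threading.local()

def current_context():
    return getattr(_local, "ctx", None)

def wrap_callback(fn):
    """Capture the caller's trace context now, and reinstall it for the
    duration of the callback, whatever thread it later runs on."""
    captured = current_context()
    def wrapped(*args, **kwargs):
        previous = current_context()
        _local.ctx = captured
        try:
            return fn(*args, **kwargs)
        finally:
            _local.ctx = previous
    return wrapped

# The request thread has a context; the callback runs on another thread.
_local.ctx = {"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"}
seen = []
cb = wrap_callback(lambda: seen.append(current_context()))

worker = threading.Thread(target=cb)  # fresh thread: no context of its own
worker.start()
worker.join()
# seen[0] carries the captured trace context despite the thread hop
```

The catch, as the next comments point out, is that every place a callback is created has to go through a wrapper like this, which only works if everyone uses a common control-flow library.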


From that section: "most Google developers use a common control flow library to construct callbacks"

OpenTelemetry can't guarantee that their users will use a specific library to create callbacks and can't then ensure that this library wraps callbacks with the appropriate threadlocal setup.

Users would by default run into problems unless they take a number of precautionary steps.

Java implementation maintainer here.

Our context propagation story is still evolving and under active development. We are hyper-aware of the issues with managing propagation across asynchronous contexts, and are working on building a solution that instrumentation authors can use to manage the propagation of context both synchronously and asynchronously. If you have expertise in this area, we would love help and feedback on what we're building!

> building a solution that instrumentation authors can use to manage the propagation of context both synchronously and asynchronously

The bigger problem I’ve had - with OpenCensus - is managing the context within my application using async code, where I want to add interior spans and also call libraries which are creating spans themselves.

Do your plans include these scenarios? Am I an “instrumentation author” here?

There is really no way to make anything related to ThreadLocals work with “async” code, and the simplest most reliable solution we have found is to treat the Context as a method parameter. Looking at https://github.com/open-telemetry/opentelemetry-java/issues/... I worry that more layers of indirection will be added, not less.

When we integrated async jvm (scala) services with another tracing provider, we took two approaches. One was to pass the trace context down from the request handler through anything that would declare a span or need to send trace headers down to another service. The other was to instantiate service clients per request, and pass the trace context into the service client constructor.
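The context-as-parameter approach described in these comments can be sketched like this (illustrative types, not an OpenTelemetry API) - since the context is an ordinary argument, there is no thread-local state for a pool thread to lose:

```python
from dataclasses import dataclass, replace
import secrets

@dataclass(frozen=True)
class TraceContext:
    """An immutable trace context passed explicitly through calls."""
    trace_id: str
    span_id: str

def child_context(ctx):
    """Start a child span: same trace, fresh span id."""
    return replace(ctx, span_id=secrets.token_hex(8))

def fetch_user(ctx):
    # ctx arrives as an argument, regardless of which thread or
    # event loop runs this; nothing implicit to propagate.
    return ctx.trace_id

def handle_request():
    root = TraceContext(trace_id=secrets.token_hex(16),
                        span_id=secrets.token_hex(8))
    return fetch_user(child_context(root)), root

result, root = handle_request()
# the child call sees the same trace id as the root context
```

The cost, of course, is that every function in the call chain grows an extra parameter, which is exactly the plumbing the implicit ThreadLocal approach tries to avoid.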

For Java, what's the advantage of using this library, versus directly using JMX? Are the abstractions better? I took a look at the documentation and it wasn't clear to me.

Maintainer of the java implementation here.

OpenTelemetry is really orthogonal to JMX. We're trying to provide a standard way to capture spans (traces) and metrics and send them to observability systems, both open source and vendor-provided. Those metrics might originate from JMX, or any other source of metrics you might have. Our APIs do provide a way to directly capture metrics, but also to hook into existing metric providers (like JMX or JFR).

I've been using the opentelemetry package for C# to push tracing data to Honeycomb and it's really good. It takes a bit of learning but it's extremely powerful. Once you're over the hump it's very easy to quickly add new telemetry and its night and day difference to typical logs/metrics.

The reference Java implementation uses too many hard-coded static variables. Looks sexy in a demonstration, but it's a configuration nightmare.

I had a lot more success and ease of use with OpenZipkin; highly recommend.

I'm not affiliated with otel, but I'd say it's just that otel is still moving forward quickly and there's little to no good documentation on how to use it.

I did some small setup with otel, Java and Quarkus, and I will publish a blog post soon on how I did it.

The lack of documentation is a problem, and I’m not sure that moving fast is a great excuse.

OpenCensus has the same problem - some pages have said “coming soon” for years https://opencensus.io/advanced-concepts/context-propagation/

One of the maintainers of the java implementation here.

If you have issues with the configuration story for the java SDK, please let us know! We're trying hard to make it very configurable and extensible, and user feedback would be very valuable to have.


The tracing package is pretty solid. The metrics package is still changing.

I 100% support this; I think there's great work behind it. Eager to see how they tackle logs.

Here is the draft plan for logs: https://github.com/open-telemetry/opentelemetry-specificatio...

Logs are not going to be part of OpenTelemetry 1.0 release (only traces and metrics will). Logs are coming later (no specific timeline yet).

Disclaimer: I work on OpenTelemetry spec and wrote most of the linked doc. Comments/issues/PRs welcome in the repo.

I'm really looking forward to using this project when it's stable. I briefly experimented with their Python SDK to try and record metrics but their documentation seemed either out of date or non-existent for some features. I'll wait for them to reach 1.0 in the hope that the documentation gets better and existing issues are resolved.

disclaimer: i'm an otel maintainer (specifically on community/web stuff!)

to echo some points from comments, I'm painfully aware of the state of the official documentation. I'm actually in the process of revamping the site to provide a framework for docs in each language to try and alleviate this. I expect it to be ready by 1.0.

First thing that came to my mind when I saw the project title was the opentracing.io project. Glad to see that this is actually opentracing and opencensus merging their approaches.
