The README says absolutely nothing (e.g. "The OpenTelemetry specification describes the cross-language requirements and expectations for all OpenTelemetry implementations.", then goes on to describe how to submit changes and which proprietary platform hosts the meeting minutes), and the Overview document goes into depth about terminology (what a trace is, what a span within a trace is, how to link spans, etc.).
Nowhere does it say if this is supposed to, for example, replace proprietary crash reporting in apps so that we can know what is being reported back to the mothership, or if this is something completely different.
It's a distributed request tracing specification and a set of libraries implementing that spec.
This is a mistake people make too often, and not knowingly. It's like proofreading: I always think my writing is correct until someone else points out the mistakes.
I just want to see what I'm getting into, not a vocabulary or jargon lesson, at least at first.
That would then lead us to wonder why it was done that way, etc., and drive interest.
But I didn't see anything (perhaps I missed it) about tracing inside a single instance, where I want individual method calls traced so I know which call was expensive within that instance.
These can be local or cross-service.
I guess the documentation focuses on the "distributed" part of tracing, but it works perfectly well for local spans too.
Trace - one unique id for an "action", say a customer purchasing an item. A set of spans exists under a trace.
Span - an interval with a start time and duration. You can add tags for querying, and logs for richer information (though those aren't indexed). Each span belongs to exactly one trace.
Baggage - key-value pairs you can pass between services. They aren't recorded as part of the span or trace, but the receiving service can use these values to make decisions, or add them to its own spans as tags or logs.
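The glossary above can be sketched as a data shape. This is a hypothetical, simplified model for illustration, not the actual OpenTelemetry API:

```go
package main

import (
	"fmt"
	"time"
)

// Span is a simplified version of the data model described above: an
// interval belonging to exactly one trace, with indexed tags and
// unindexed logs.
type Span struct {
	TraceID  string // shared by every span in the same trace
	SpanID   string
	Name     string
	Start    time.Time
	Duration time.Duration
	Tags     map[string]string // indexed, queryable
	Logs     []string          // richer detail, not indexed
}

func main() {
	s := Span{
		TraceID:  "4bf92f3577b34da6a3ce929d0e0e4736",
		SpanID:   "00f067aa0ba902b7",
		Name:     "purchase-item",
		Start:    time.Now(),
		Duration: 125 * time.Millisecond,
		Tags:     map[string]string{"customer.tier": "gold"},
		Logs:     []string{"payment authorized", "inventory reserved"},
	}
	fmt.Println(s.Name, "in trace", s.TraceID)
}
```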
Under the hood, it's just a set of HTTP headers passed between services. The first span creates a trace id and passes that to all downstream services as a header.
Each service independently pushes its span data into a centralised collector, which indexes the data and exposes it over a UI/API.
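That header flow can be made concrete with the W3C trace-context `traceparent` header, which encodes a version, a 16-byte trace id, an 8-byte parent span id, and a flags byte. The header format is from the W3C spec; the helper function names here are made up for the sketch:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"strings"
)

// buildTraceparent formats the W3C trace-context header a service passes
// downstream: version "00", the shared trace id, this hop's span id, and
// a sampled flag.
func buildTraceparent(traceID [16]byte, spanID [8]byte, sampled bool) string {
	flags := "00"
	if sampled {
		flags = "01"
	}
	return fmt.Sprintf("00-%s-%s-%s",
		hex.EncodeToString(traceID[:]),
		hex.EncodeToString(spanID[:]),
		flags)
}

// parseTraceID extracts the trace id from an incoming header so a
// downstream service can attach its own spans to the same trace.
func parseTraceID(traceparent string) (string, error) {
	parts := strings.Split(traceparent, "-")
	if len(parts) != 4 || len(parts[1]) != 32 {
		return "", fmt.Errorf("malformed traceparent: %q", traceparent)
	}
	return parts[1], nil
}

func main() {
	var traceID [16]byte
	var spanID [8]byte
	rand.Read(traceID[:])
	rand.Read(spanID[:])

	header := buildTraceparent(traceID, spanID, true)
	fmt.Println("traceparent:", header)

	id, _ := parseTraceID(header)
	fmt.Println("trace id seen downstream:", id)
}
```

In practice each service repeats this: parse the incoming header, create a new span id for its own work, and forward a fresh `traceparent` carrying the same trace id.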
OpenTelemetry provides a specification and reference implementations across languages.
At least, this is my understanding of it. I've dabbled with Jaeger but I'm not affiliated with any of these.
This comment was the first thing to explain that it's better logging. That's actually something I'd be interested in: I run several different data-science services/analyses, and logging was something I implemented and then gave up on because it sucks so much. You're losing a lot of potential players because you have given zero effort to explaining, in simple terms, what you are. A "From logging to telemetry" primer would probably garner you a lot more interest.
But I agree this is a better explanation :)
I still don't get how traces/spans are shared across RPC boundaries, but I know that's a thing one can do.
When it gets back to you and all spans are merged, you can see the whole picture, but don't need to provide all the details all the way down.
As an aside, I think this project is suffering from analysis paralysis. Ship a stable 1.0, iterate towards a 2.0 eventually based on user feedback. You aren't going to release a perfect 1.0.
I used the metric implementation in Go (as someone with extensive Prometheus experience), and was shocked at how they managed to write so much code to do so little. You will be amazed at how many levels of indirection there are, with a little assumption hard-coded at every single point of indirection. (The net result being, in my case, that I couldn't actually create a gauge in Prometheus.)
I just use Jaeger and Prometheus together and they've never treated me wrong. I'm watching otel closely, but I wouldn't suggest that you use it quite yet, at least not for Go.
I don't contribute to the Go implementation, but I encourage you to submit the feedback to https://github.com/open-telemetry/opentelemetry-go or https://gitter.im/open-telemetry/opentelemetry-go
You can also attend Go SIG meetings and provide the feedback directly. SIG meetings are open for attendance: https://github.com/open-telemetry/community#special-interest...
I am sure the maintainers of the Go implementation will appreciate your feedback.
For Prometheus, I was using the official Go client, but I found the VictoriaMetrics one to be a lot simpler and lighter. My go.mod was a lot smaller after swapping.
Jaeger is a little annoying in that what configuration the library accepts depends on the language. For example, you have to modify your code to emit B3 (Zipkin-compatible) traces, which is annoying and leads to PRs like this: https://github.com/grafana/grafana/pull/17009
It is true that the spec and libraries have been changing. This is to be expected since we haven't yet made a stable release. We are aiming for a 1.0 GA release this year. The current plan is to freeze the trace spec this week, followed by a metrics spec freeze in 3 weeks, followed by the release of several language implementations 4 weeks after the spec freeze.
Like everyone else I wish we could release earlier, but the reality is that making dozens of companies agree on a standard takes time. I am very excited that we are approaching the finish line.
It's really, really, really annoying. After 5 years you might expect things to have calmed down. It was great at the start - like, actually great. Even when baggage was rusty, it had a direction and was usable. It probably still is, but the churn is crazy.
You say making dozens of companies agree is difficult, and yeah, it is - but that's because of other companies throwing out specs trying to compete with Lightstep.
OpenTracing dealt only with tracing; it didn't care about logs or metrics. OpenCensus did all three. This combines the best of OpenTracing (which is tracing) with the idea that OpenCensus brought: just one library to collect and ship all the observability data that your application is supposed to emit.
IMO they're different things. One deals with call graphs, timings, and associated baggage. The other is logs. Should we hold off on the spec for another 5 years and add datadog/prometheus metrics too?
I don't think that one spec should be all-encompassing. OpenTracing is great for tracing. Logs can be baggage. Memory use can be baggage. The temperature of the CPU and the current disk space is another platform altogether.
It's like saying everything in the world should run WebAssembly, and no other ISA matters. It's an idea! But it's certainly a distraction from the benefits. Why not let tracing deal with tracing before messing with the 1.0?
Telemetry has always included logging as one of its components.
> Should we hold off on the spec for another 5 years and add datadog/prometheus metrics too?
I think you should look at OTel once again. OTel already includes metrics and supports exporting to Prometheus.
And I don't want to reply to the rest of your comments because I don't feel you understand this problem space enough before making such remarks on it.
We're talking about churn here, not my knowledge, and I think considering how you say I don't understand it we can both agree it has churned into something else. As OP said, there is a lot of churn. Too much for your common system to keep up with in the (pun intended) span of a couple years. 3 rebrands in fewer years isn't promising.
At the very least, it's forcing some standardization among APM vendors (DataDog 64b trace_id, I'm looking at you).
Also, related: https://www.w3.org/TR/trace-context/
I also don't understand the purpose of the 16 bytes for the trace_id in the spec. That's a huge range of numbers. Anyone know the rationale?
Many tracing solutions settled on 128-bit (16-byte) trace ids. Here is Jaeger's rationale: https://github.com/jaegertracing/jaeger/issues/858
It is also recommended by W3C: https://www.w3.org/TR/trace-context/#trace-id
Dapper used 64b IDs for span and trace, but being locked inside the Googleplex probably limited its influence on compatibility issues.
My point is that 128b is the common standard now, and that’s all that I really care about - that the standard exists and APM systems conform to it. To that end, I am very pro-otel.
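To make the sizes in this thread concrete, here is a sketch of generating a W3C-style 128-bit trace id and taking the 64-bit slice that older systems work with. The helper names are made up; this is not any vendor's actual code:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newTraceID returns a random 128-bit (16-byte) trace id as 32 hex
// characters, the size recommended by the W3C trace-context spec.
func newTraceID() string {
	var b [16]byte
	rand.Read(b[:])
	return hex.EncodeToString(b[:])
}

// lower64 keeps only the low 64 bits - roughly what systems that still
// use 64-bit ids (as Dapper originally did) end up working with.
func lower64(traceID string) string {
	return traceID[16:] // last 16 hex chars = low 8 bytes
}

func main() {
	id := newTraceID()
	fmt.Println("128-bit:", id)
	fmt.Println(" 64-bit:", lower64(id))
}
```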
Thanks for your work.
Logging support is incubating.
one thing I want to point out is that, eventually, we'd like for a lot of the complaints people have to be... well, things that you don't have to complain about, because it's not important. three or four years from now, it'd be nice to see a world where most people don't actually have to interact with otel at all because it's either built-in to the libraries/frameworks they're using, or because they're using some kind of wrapper that helps with best practices.
you can see an example of this with something we're doing at lightstep - we're introducing launchers that wrap upstream otel and standardize config format/defaults between multiple languages (https://github.com/lightstep/?q=launcher), and trying to provide "missing link" documentation (https://otel.lightstep.com).
i suspect that eventually the question of "how do i use opentelemetry" becomes a moot point because it's already there.
This will cause pain and suffering in the presence of async, multithreaded code: your trace context will be present in the thread where your request handler began, but won't be present in threads running callbacks from non-blocking IO libraries (netty, akka-http, async-http-client, redisson, etc).
OpenTelemetry can't guarantee that their users will use a specific library to create callbacks and can't then ensure that this library wraps callbacks with the appropriate threadlocal setup.
Users would by default run into problems unless they take a number of precautionary steps.
Our context propagation story is still evolving and under active development. We are hyper-aware of the issues with managing propagation in asynchronous contexts, and are working on building a solution that instrumentation authors can use to manage the propagation of context in both synchronous and asynchronous code. If you have expertise in this area, we would love help and feedback on what we're building!
The bigger problem I’ve had - with OpenCensus - is managing the context within my application using async code, where I want to add interior spans and also call libraries which are creating spans themselves.
Do your plans include these scenarios? Am I an “instrumentation author” here?
There is really no way to make anything related to ThreadLocals work with “async” code, and the simplest most reliable solution we have found is to treat the Context as a method parameter. Looking at https://github.com/open-telemetry/opentelemetry-java/issues/... I worry that more layers of indirection will be added, not less.
OpenTelemetry is really orthogonal to JMX. We're trying to provide a standard way to capture spans (traces) and metrics and send them to observability systems, both open source and vendor-provided. Those metrics might originate from JMX, or any other source of metrics you might have. Our APIs do provide a way to directly capture metrics, but also to hook into existing metric providers (like JMX or JFR).
I had a lot more success and ease of use with OpenZipkin; highly recommend it.
I did some small setup with otel, Java and Quarkus, and I will publish a blog post soon on how I did it.
OpenCensus has the same problem - some pages have said “coming soon” for years https://opencensus.io/advanced-concepts/context-propagation/
If you have issues with the configuration story for the java SDK, please let us know! We're trying hard to make it very configurable and extensible, and user feedback would be very valuable to have.
I 100% support this; I think there's great work behind it. Eager to see how they tackle logs.
Logs are not going to be part of OpenTelemetry 1.0 release (only traces and metrics will). Logs are coming later (no specific timeline yet).
Disclaimer: I work on OpenTelemetry spec and wrote most of the linked doc. Comments/issues/PRs welcome in the repo.
To echo some points from the comments: I'm painfully aware of the state of the official documentation. I'm actually in the process of revamping the site to provide a framework for docs in each language to try to alleviate this. I expect it to be ready by 1.0.