OpenTelemetry (opentelemetry.io)
168 points by 9woc 41 days ago | 69 comments

I love OpenTelemetry, but for Java, I wish adoption would slow down until the implementation catches up.

For example, if you visit the OpenCensus site, it tells you on every page that OpenTelemetry is the way to go now. Yet OpenTelemetry's Java metrics implementation is still listed as alpha, and functionality as basic as tags didn't work when I last tried it.

Or, I've been working with Datadog lately, and I want to add dynamic span metadata. My options appear to be either to wire up the (deprecated) OpenTracing client, or set up a full collector/agent suite of OpenTelemetry processes (whose default tutorial configurations appear to be invalid in places).

Worth it, I guess, but bumpier than I'd like.

Yeah, OpenTelemetry was OpenCensus, and before that it was OpenTracing. The Go client faces some of the same churn; metrics are in alpha and the API isn't documented, though spans and traces seem to be on more solid ground than in the Java client.

It's been a lot of change and I can't wait for things to settle down and for companies to stop trying to get their piece. I feel like everyone (Datadog, NewRelic, Google, etc.) saw Lightstep as a competitor and wanted to get involved with a competing spec, which is how we ended up in this consensus mess.

Still, it's good progress and it's so much better than raw logs. Would recommend.

Lightstep was one of the originators of the combined spec -- but yes, the churn has been really painful, and we're sorry for the pre-1.0 pain. Now that the spec is 1.0 and most of the language SIGs have put out 1.0 API/SDK releases for tracing, hopefully that pain will recede.

Do you have any insight into the timeline for stabilizing the metrics and logging specs?

A good place to look is the milestones on GitHub: https://github.com/open-telemetry/opentelemetry-specificatio...

Logging is still experimental in the spec. The Metrics API is in feature freeze and the protocol is stable, so it's more on the language SDKs to stabilize their implementations. This is a focus for several of them right now.

Echoing this comment. I and my team are big fans of OpenTelemetry! It's made distributed tracing so plug and play.

Only complaint is the churn over the past year, especially in the metrics world. We use OTel Python and have had to pin our libraries to an older version so we can continue to record metrics stably until the python SDK catches up to the latest metrics specification.

I believe OpenTelemetry will make distributed tracing mainstream. Traces are the best tool for debugging problems, yet because they are expensive to implement (in both dev work and licensing), they are far less popular than logs.

Historically, APM (application performance monitoring) vendors also owned collection, locked customers in, and didn't improve price per value.

An open-source collection standard allows users to choose, or switch to, the best vendor.

Personally, I recommend Sumo Logic (disclaimer: I work there). We have rich full-text query capabilities on spans, with parsing and aggregations at query time…

I don't think it will. The pricing is still too high for it to become mainstream. I've worked at places that have those systems and it really does make debugging an incredible amount easier, but Sumo Logic is $93/GB for data. That's listed with a 30-day retention period, so I'm assuming that's the charge to store 1 GB of data for a month.

It's just too hard of a sell for management. I don't even want to calculate how many times the cost of S3 storage that is. And then add on top of it that you can't gauge how much data you'll generate until you've actually started implementing it.

I don't think distributed tracing will become mainstream until there's something like ELK for it that lets you start out in-house. Then people start to see the value, and you have a solid case when maintaining it becomes a full-time job.

I haven't seen anything that seems production-ready enough to use at work, but also doesn't involve running an entire Cassandra cluster or something. Is there an implementation that can run using Postgres? I have no doubt that it wouldn't scale well, but I need a foot in the door to prove the value before I can justify paying $93/GB to my director.

I'm curious, who's your target customer? Should small teams be looking early on or is this more of a mid-sized company thing?

In my experience Sumo is a mid to large solution. A startup can get by with their own stack to start. Then they might use something managed as they get big, and once they’re big enough to have dedicated teams, they can run something themselves again. Minus storage and data transfer (the bulk of expense) the hardware to run your own stack probably runs $20 a day or less on a cloud provider, and with a small team where communication about standards is easier it takes less engineering time. My professional opinion is that they should then switch to a managed solution like Sumo when the number of services they are supporting grows. Everyone, myself included, underestimates how many full-time people you’re going to need to dedicate to running a serious OpenTel/Elk/Prometheus/Jaeger stack yourself. The pain of migration and upgrading, managing indexes and proper whitelists, even just working with teams to get onboarded and get their dashboards/alerts set up right, using spot instances to desperately claw back every dollar… At my current workplace our custom stack was supposed to be cheaper and ended up costing orders of magnitude more (under control now) during this learning curve. When I start at a new company I’m probably going to suggest we use Sumo, and this is coming from someone who used to say “we don’t buy solutions we build solutions”.

Lately I am leaning into managed more and more - I want to spend my time building great dashboards, or analyzing my data, or writing some code for something else entirely. If you run your own stack you are always keeping up, not doing the really useful stuff. Getting a working baseline is hard enough, let alone shiny extras. You’ll watch the talks at whatever conference, you’ll read the single line “quick start docs”, it will sound amazing. Then you’ll find yourself deep in it months later with a giant data transfer or storage bill and hours of painstaking tuning ahead of you.

If you’re really large, then you can go back to running it yourself.

Though Sumo Logic started by targeting mid-size to enterprise customers, it has recently been investing more in self-serve:

1. The free plan includes Tracing.

2. There is a monthly self-serve subscription available.

At almost $100 per GB of logs ingested, I think it’s targeted at SV startups.

It is ~$93 / month for 1 "GB per day" not per GB.
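To make the difference concrete, here is a rough back-of-the-envelope sketch. It assumes a 30-day month and that the plan's daily allowance is fully used; the ~$93 figure is the one quoted in this thread, and the function name is made up for illustration.

```python
def effective_price_per_gb(monthly_price: float, gb_per_day: float,
                           days_in_month: int = 30) -> float:
    """Price per ingested GB for a plan billed monthly per 'GB per day' of capacity."""
    total_gb = gb_per_day * days_in_month  # data ingested over the month at full use
    return monthly_price / total_gb


per_gb = effective_price_per_gb(93.0, 1.0)
print(f"${per_gb:.2f} per GB ingested")  # $3.10 per GB ingested
```

So under those assumptions the effective rate is roughly 30x lower than a straight "$93 per GB" reading suggests.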

Thanks! It was not clear from https://www.sumologic.com/pricing/

Hello, Hacker News! https://share.getcloudapp.com/4guP9P6p (stats gathered from opentelemetry-js)

for extra fairness and showcasing interoperability, here's the same data rendered in more than one backend! https://twitter.com/lizthegrey/status/1453525797243797510

I hate that this specification and most of the other ones use spans that have a beginning and end rather than events that start and end the span. What if it crashes before it sends out the span? What if it is taking a very long time to complete?

The span is simply the way the data is modeled. The way the tracing works is that it calls a component called a Span Processor on both span start and span end. It would be possible to implement a Span Processor that sends span_start and span_end events to your backend without waiting for the span to end.
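A minimal sketch of that idea, in plain Python rather than the real OpenTelemetry SDK. The `Span`, `Tracer`, and `StartEndEventProcessor` classes and the event dictionary layout are hypothetical stand-ins; only the on-start / on-end hook shape mirrors what is described here.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Span:
    name: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start_ns: int = 0
    end_ns: Optional[int] = None


class StartEndEventProcessor:
    """Hypothetical processor: emits one event when a span starts, one when it ends."""

    def __init__(self, send: Callable[[dict], None]) -> None:
        self._send = send  # stand-in for an exporter that ships events to a backend

    def on_start(self, span: Span) -> None:
        self._send({"event": "span_start", "span_id": span.span_id,
                    "name": span.name, "time_ns": span.start_ns})

    def on_end(self, span: Span) -> None:
        self._send({"event": "span_end", "span_id": span.span_id,
                    "name": span.name, "time_ns": span.end_ns})


class Tracer:
    """Toy tracer that calls the processor on both span start and span end."""

    def __init__(self, processor: StartEndEventProcessor) -> None:
        self._processor = processor

    def start_span(self, name: str) -> Span:
        span = Span(name=name, start_ns=time.time_ns())
        self._processor.on_start(span)  # the backend learns about the span immediately
        return span

    def end_span(self, span: Span) -> None:
        span.end_ns = time.time_ns()
        self._processor.on_end(span)


events: list = []
tracer = Tracer(StartEndEventProcessor(events.append))
span = tracer.start_span("checkout")
# If the process crashed here, the backend would already hold the span_start event.
tracer.end_span(span)
print([e["event"] for e in events])
```

The trade-off is roughly double the event volume, plus a backend that has to pair the two halves up again.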

In terms of crashing and taking a long time to complete, these are problems that are very difficult to solve on the backend as well. Simply having the start event without the end event is not enough info to say for sure that there will _never_ be an end event. For low data volumes and simple use-cases this may not seem like a big deal, but it gets complex extremely quickly.

Query-wise, it's significantly more complicated to stitch together spans stored as two DB rows (start and end) than to query spans stored as one row each. The library that creates and manages those spans before uploading can handle those cases internally if it wants; a lot of these systems support single-point events too.

I think you are right and the data model should have been event based rather than span based. https://medium.com/opentracing/open-for-event-based-tracing-... digs into that topic and basically says that distributed tracing is about causality & partial ordering, which brings us back to the '70s with Lamport's logical clocks and all the work that followed.

However, this design "flaw" is well known and seems to be accepted. I'm not able to find the relevant GitHub issues right now, but I remember this topic being discussed on the OpenTracing or OpenTelemetry bug tracker and the outcome was something like "Spans might not be the best data model, but people are now used to it and we have to ship the spec within a reasonable time, so let's stick to it".

Edit: https://github.com/open-telemetry/opentelemetry-specificatio... might be relevant to your concerns.

For the long running scenario, why not use Links? The specification describes[1] a similar scenario and how links can be leveraged to causally relate spans even across trace boundaries.

As for the crashing scenario, this seems like an application-level concern. Ideally it is not waiting until the crash to send traces. Depending on the environment, the application could handle the crash and deliver the telemetry leading up to it before shutting down.

[1]: https://github.com/open-telemetry/opentelemetry-specificatio...

Yes, logging the beginning and ending of a span to your observability platform is a common and valuable approach, since you can figure out from the open spans of a dead process what requests might have been running when it died, making it easier to diagnose requests of death.
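Roughly, the backend side of that could look like the sketch below. The event layout (`event`, `span_id`, `name`, `time_ns`) and the `open_spans` helper are assumptions for illustration, not any particular platform's schema.

```python
from typing import Iterable


def open_spans(events: Iterable[dict]) -> dict:
    """Return spans that have a start event but no matching end event."""
    pending = {}
    for event in events:
        if event["event"] == "span_start":
            pending[event["span_id"]] = event
        elif event["event"] == "span_end":
            pending.pop(event["span_id"], None)  # span completed normally
    return pending


received = [
    {"event": "span_start", "span_id": "a1", "name": "GET /cart", "time_ns": 100},
    {"event": "span_start", "span_id": "b2", "name": "POST /checkout", "time_ns": 150},
    {"event": "span_end", "span_id": "a1", "name": "GET /cart", "time_ns": 180},
    # the process died here; b2 never reported an end
]

suspects = open_spans(received)
print(sorted(s["name"] for s in suspects.values()))  # ['POST /checkout']
```

As noted upthread, an open span on its own does not prove the end will never arrive, so this only makes sense once you know the process is dead or some timeout has passed.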

You can emit multiple spans yourself, no?

Open Telemetry seems super complex compared to our current setup of Micrometer with Prometheus and Log4j2 with Json output.

I wonder if it's worth it to replace both metrics and logging just to take advantage of OT's tracing capability.

A key benefit you might not be getting that OpenTelemetry (and Fluentd for logs in the interim) provides is routing. The ability to trial multiple downstream storage solutions is useful!

Both Micrometer and using log4j2 with a log collector (we're using fluent-bit) already support writing to multiple downstream endpoints though, so not a clear win for OTel.

you could also choose to feed json logs with span and trace ids through, but you'd then have to manually track the context, parent ids, propagation, etc. which OTel takes care of for you.

We are trying to get a better understanding of how customers use our desktop application. We are urged by management to use Qualtrics, but it feels like it is not the right tool for the job.

How would you say OpenTelemetry compares to Qualtrics?

Is Qualtrics a tool where users need to actively fill out surveys, while OpenTelemetry passively collects user interactions?

There are also other names floating around in this space, like DataDog or OpsTrace, and I am not clear on where they fit.

Though it's maybe not the typical case, you could start creating spans for user workflows and compare the time it took to do workflow A vs. alternative workflow B (as in A/B testing). Or you can simply use metrics to collect usage information, like how many times a certain action is performed in your system, which may tell you how popular it is.

I didn't think too much about this, but I played with it once in a hackathon and made a tool that showed how one customer site performed versus another by comparing performance on workflow spans and augmenting this information with metrics on the most used tools, to get an indication of whether the other site was maybe faster because it used different parts of the product to perform certain actions.
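A toy version of both ideas (timing workflow variants, counting actions), just to show the kind of aggregation involved. The records are made-up sample data and nothing here uses a telemetry library; in practice the durations would come from spans and the counts from a counter metric.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical records: (workflow variant, duration in seconds, option the user picked)
workflow_runs = [
    ("A", 12.0, "export_pdf"),
    ("A", 14.0, "export_pdf"),
    ("B", 8.0, "export_pdf"),
    ("B", 10.0, "export_csv"),
]

durations = defaultdict(list)
option_counts = Counter()
for variant, seconds, option in workflow_runs:
    durations[variant].append(seconds)  # what a per-workflow span would measure
    option_counts[option] += 1          # what a usage counter metric would record

avg_by_variant = {variant: mean(values) for variant, values in durations.items()}
total = sum(option_counts.values())
share = {option: count / total for option, count in option_counts.items()}

print(avg_by_variant)  # {'A': 13.0, 'B': 9.0}
print(share)           # {'export_pdf': 0.75, 'export_csv': 0.25}
```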

OpenTelemetry allows you to measure performance of your frontend and backend, but it's not a survey tool. DataDog is one of many compatible backends that works with OpenTelemetry, but you can also use Jaeger OSS, Honeycomb, Lightstep, and others.

Strictly only the performance or can it also answer usage questions like:

When customers go through this workflow step, they use this option/button 90% of the time and the other options are used rarely?

It may be possible, but it will not be pleasant to use tracing to answer those kinds of questions. Sounds like something you can do with off the shelf product analytics tools like Amplitude or Mixpanel. Do you already have something like that?

We're using New Relic to track performance (API response times, DB load, etc.) as well as to track actions that users take on our front ends and APIs.

BI/funnel analytics is a use case that you might have more luck with https://www.tinybird.co/ instead.

Does anybody have a recommendation for literature/blogs addressing the subject of observability? I’m contemplating an SRE job offer and although I have my share of knowledge, I am still out of my league whenever I glance at the CNCF map in the observability category. I feel that the field has exploded with new components and new ways, and I’m sometimes afraid of taking the wrong path.

This is a good place to start if you're interested in OTEL https://www.aspecto.io/opentelemetry-bootcamp/

* Observability Engineering (Charity Majors, Liz Fong-Jones, George Miranda)

* Software Telemetry (Jamie Riedesel)

Forgive me for having no idea what this project is, but is this a telemetry library for web infrastructure, or is this an open source telemetry library for monitoring high frequency signals from hardware devices, or...?

the former (web+backend), we're part of the Cloud Native Computing Foundation.

It is nice that the proto files are unambiguous that trace time is represented as unix epoch nanos, and it is declared `fixed64` the way Jesus intended, but it's too bad the specification made it to 1.0 without simply mandating that timestamps are always in unix epoch nanos. Having uncertainty about the precision of timestamps has caused and will continue to cause loads of bugs. If there is some language platform out there that can't figure out the time in nanos, that language platform should simply be ignored.
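A small illustration of why integer nanos in a fixed-width field are pleasant to work with. This is a stdlib-only sketch: it packs the value the way protobuf encodes `fixed64` on the wire (8 bytes, little-endian) but does not use the actual OTLP proto classes, and `sample_ns` is an arbitrary example value.

```python
import struct
import time

now_ns = time.time_ns()  # integer nanoseconds since the Unix epoch

# protobuf's fixed64 wire encoding is 8 bytes, little-endian, unsigned
wire = struct.pack("<Q", now_ns)
assert len(wire) == 8
assert struct.unpack("<Q", wire)[0] == now_ns  # exact round trip

# A 64-bit float cannot hold every nanosecond value at today's magnitudes.
sample_ns = 1_635_000_000_123_456_789  # arbitrary example timestamp
lossy = int(float(sample_ns))
print(sample_ns - lossy)  # nonzero: precision was lost in the float
```

Anything that passes timestamps around as floating-point seconds or as "milliseconds, probably" hits the second case sooner or later.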

From what I’ve read, you have to accommodate ±1 ms NTP clock skew unless you have an atomic clock or GPS source in each datacenter, so the extra precision isn’t reliable.

That's fine. Many clocks are off by much more than 1ms. But the timestamps should be represented as nanos everywhere.

As someone not familiar with this project, and looking at the landing page, I have no idea what this does. I understand it has something to do with telemetry, but for what? And what would it help me with? How would I use it?

I see a telescope. Is it space related? Telemetry of satellites?

Just my two cents as someone who is not familiar with the project and has no idea how to use it or how it would solve a problem I have.

Does this section here help? It’s below the fold on Mobile, but it is on the front page of the site.

> OpenTelemetry is a collection of tools, APIs, and SDKs. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior.

It's a collection of open standards and libraries for exporting/ingesting traces and metrics from your applications (and logging is in the works).

How does this relate to OpenMetrics, if at all?

OpenMetrics relates to the metrics part of OpenTelemetry, which is currently in an alpha/experimental phase. OpenTelemetry has a statement on being interoperable with OpenMetrics for accepting, forwarding, and generating data in the OpenMetrics format [0]. So they are two separate CNCF projects, where OpenMetrics overlaps with one part of OpenTelemetry.

[0] https://github.com/open-telemetry/opentelemetry-specificatio...

I really hope we’ll have an open standard for tracing. Right now the common solutions are vendor-locked and can’t be integrated into OSS projects.

That's definitely a goal of OTel! And it seems to be working. Anecdotally, OTel support has been very important for my workplace (we sell observability tooling https://www.honeycomb.io/) in terms of prospects. I think a lot of decision makers at companies are realizing that they can't afford not to have vendor-neutral instrumentation.

And tracing is GA in most SDKs as well, so you should feel ready to adopt it. Several vendors and OSS tools support it (or you can export your data to another format with the OTel Collector).

Took me a while to figure out OTel is OpenTelemetry here and you were not proposing another spec. Comment for others to realize this.

Phillip also wrote https://leaddev.com/monitoring-observability/rise-openteleme... today, but is too shy to plug it :)

Open tracing works just fine.

OpenTracing and OpenCensus have been merged into the OpenTelemetry project. I think OT isn't sunset officially yet, but the future is OTel.

Not a mention of "opt-in only, because we respect privacy"...

Please stop helping spy on users! Be the change you want to see in the world!

Ugh, not this again. Every time opentelemetry gets on the front page, people grouse about spying.

OT is for distributed tracing across services. It's a glorified profiler (not to downplay how awesome OT is). It has nothing to do with spying on end user behavior any more than URLs, JavaScript or HTTP headers have to do with spying.

Could you use it to spy? Yes, you can use anything to spy, if you choose to.

That's because this tool has nothing to do with "spying on users".

It's a set of tools and a specification for instrumenting applications and collecting logs/metrics/traces from them. It's used to ask questions like "why is this page slow to load?" and get answers like "because it talks to the billing microservice which has a really slow SQL query".
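To show what that kind of answer looks like in data terms, here is a toy trace and a naive "where did the time go" query. The `Span` type, the service names, and the durations are all invented for illustration, and the self-time calculation assumes child spans run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: float


# Hypothetical trace for one slow page load.
trace = [
    Span("1", None, "GET /account", 950.0),
    Span("2", "1", "auth-service: verify token", 40.0),
    Span("3", "1", "billing-service: list invoices", 880.0),
    Span("4", "3", "SQL: SELECT ... FROM invoices", 860.0),
]


def self_time(span: Span, spans: list) -> float:
    """Time spent in a span itself, excluding its direct children."""
    children = sum(s.duration_ms for s in spans if s.parent_id == span.span_id)
    return span.duration_ms - children


slowest = max(trace, key=lambda s: self_time(s, trace))
print(slowest.name, self_time(slowest, trace))
```

Note that nothing in the trace identifies a user; it describes requests moving between services.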

This isn't the same sort of telemetry that desktop software vendors collect.

Sadly, it seems you are wrong: https://opentelemetry.io/docs/js/

"the browser"

That means they are running code on my machine and exfiltrating data out.

This response really strikes me as throwing the baby out with the bath water.

The OpenTelemetry project is focused on Observability use cases and not user analytics/tracking/etc. The fact that of all the libraries, a JS browser library exists doesn't automatically mean the parent comment is "wrong".

I suggest anyone making snap judgements about OpenTelemetry spend a few minutes reading more about the project. This general pattern of comments surfaces every time this project is posted here.

Perhaps you will be surprised to learn that your web browser sends requests all over the place.

Define “they”, please.

The developer of the web app... who is already running all sorts of other code in your browser.

Sure. Like any JavaScript. As long as the provider of the service is transparent about what is collected, what’s the problem? There’s nothing preventing a system provider from making this collection opt-in.

What’s with the assumption that the service provider cannot observe how their system performs?

If you are invited to a dinner, you aren't entitled to eating all the food yourself because the hosts put it on the table, you aren't allowed to do anything you feel like in your host's home, use their toothbrush, drill holes in their walls.

In a similar way, your software is borrowing time and computing resources from your users. You need to use them sparingly and respect that you do not own their hardware, but are a guest they have invited in. It's not your computer to do whatever you want with.

They as in "they" who are receiving the data being collected about the user. It isn't hard to see who the bad guys are.

If this is implemented in a totally transparent, user controlled, consensual way, and the collected data made ephemeral and subject to user request for deletion and viewing on demand while being stored, then there's nothing bad going on.

Anything less is unacceptable - a line is being crossed. Right now a majority of the encounters users have with telemetry is totally opaque and behind the scenes, nonconsensual except for the barest shreds of a EULA or button click fig leaf.

I feel like this is sort of a "you reposted in the wrong neighborhood" moment.

OpenTelemetry is a vendor-neutral framework and standard for generating performance data (spans, metrics, logs) about systems, particularly distributed ones where it's impossible to just load up a debugger to figure out why something is failing or slow. It's not a framework for collecting user data. User data and performance data are completely separate things.

Your concerns about privacy are important, but not really relevant to the topic.

> "they" who are receiving the data being collected about the user

I'm curious why you automatically assume data is being collected about the user? OpenTelemetry is about Observability, not user tracking.

If you're collecting data from software running on someone's device, that's someone else's user data. I think the term of art is metadata.

In some ways, such information is just as important as biometrics and passwords and pii.

Deanonymizing becomes possible when metadata is cross-referenced. Metadata should be subject to as strict protection and consent rules as documents, health info, or any other "obvious" private data.

If it's running on your server and not logging third party activity or metadata, the information belongs to you. Else, you need the third party's consent, etc.

"It's just for benign development purposes" doesn't cut it anymore regardless of intentions. We need a cultural and legal sea change with regards to privacy and primacy of private data protections.

NOFI, just to understand if we're talking about the same thing here: you mean that a trace recording the time it took the product you use to run a query through three different backend services is private data that is owned by you and can only be collected with your consent. Not so much because you are against collecting that information (you may very well accept it being collected), but because technically the data is collected from your premises and under your usage, and thus requires consent, as it would otherwise not be generated if not for you. To protect your rights in situations where the data collected would be less harmless, the rules must be strictly enforced. Correct?

Protecting privacy and preventing the weaponization of private data is critically important. Recognizing what constitutes private data seems to be the difficult thing, so I'm trying to clarify it.

I understand it seems extreme, but it's really not.

Any data generated by or on a user's device is private data. Recording that data is surveillance. No matter how harmless it might seem, data collection should be consensual, transparent, and ephemeral. You should specify, explain, and obtain consent for any and every variable. Anything you do with the data should be reportable to the user. Anything less creates opportunities for abuse.

Logging IP address connections is a great example. It's simple and trivial, but how many RIAA piracy lawsuits do we have to see inflicted on innocents before we decide keeping those logs might not be a great idea?

The misuse of metadata by law enforcement through plausible narrative crafting is ubiquitous. It's not about the developer's intentions, it's about collection of surveillance records that can be seized or stolen and weaponized.
