Migrating to OpenTelemetry (airplane.dev)
257 points by kkoppenhaver on Nov 16, 2023 | 75 comments



> The data collected from these streams is sent to several vendors including Datadog (for application logs and metrics), Honeycomb (for traces), and Google Cloud Logging (for infrastructure logs).

It sounds like they were in a place that a lot of companies are in, where they don't have a single pane of glass for observability. One of, if not the, main benefits I've gotten out of Datadog is having everything in Datadog so that it's all connected and I can easily jump from a trace to logs, for instance.

One of the terrible mistakes I see companies make with this tooling is fragmenting like this. Everyone has their own personal tool preference, and ultimately the collective experience ends up significantly worse than the sum of its parts.


I feel we hold up a single observability solution as the Holy Grail, and I can see the argument for it: one place to understand the health of your services.

But I've also been in terrible vendor lock-in situations, being bent over the barrel because switching to a better solution is so damn expensive.

At least now with OTel you have an open standard that lets you switch more easily, but even then I'd rather have two solutions that meet my exact observability requirements than a single solution that does everything OK-ish.


Biased as a founder in the space [1] but I think with OpenTelemetry + OSS extensible observability tooling, the holy grail of one tool is more realizable than ever.

Vendor lock-in with OTel is hopefully a thing of the past. And now that more observability solutions are going open source, it's hopefully no longer true that a single tool has to be mediocre across all use cases: DD and the like are inherently limited by their own engineering teams, whereas OSS products can take community and customer contributions that grow the surface area over time on top of the core maintainers' work.

[1] https://github.com/hyperdxio/hyperdx


I think that OpenTelemetry will solve this problem of vendor lock-in. I am a founder building in this space [1], and we see many of our users switching to OpenTelemetry because it provides an easy way to switch later if needed.

At SigNoz, we have metrics, traces, and logs in a single application, which helps you correlate across signals much more easily - and being natively based on OpenTelemetry makes this correlation easier still, since it leverages the standard data format.

Though this might take some time, as many teams have proprietary SDKs in their code, which are not easy to rip out. OpenTelemetry auto-instrumentation [2] makes it much easier, and I think that's the path people will follow to get started.

[1] https://github.com/SigNoz/signoz [2] https://opentelemetry.io/docs/instrumentation/java/automatic...


You can switch the backend destination of metrics/traces/logs, but all your dashboards, alerts, and potentially legacy data still need to be migrated. Drastically better than before, when instrumentation and agents were custom for each backend, but there are still hurdles.


Depending on your usage, it can be prohibitively expensive to use Datadog for everything like that. We have it for just our prod env because putting all of our logs into it just isn't worth what it brings to the table.


I once worked out what it would cost to send our company's prod logs to Datadog. It was 1.5x our total AWS cost. The company ran entirely on AWS.


Is prod not 99% of your logs?


Not even close


I've spent a small amount of time in Datadog, lots in Grafana, and somewhere in between in Honeycomb. Our applications are designed to emit traces, and comparing Honeycomb with tracing to a traditional app with metrics and logs, I would choose tracing every time.

It annoys me that logs are overlooked in Honeycomb (and metrics are... fine). But given the choice between a single pane of glass in Grafana, or doing logs (and sometimes metrics) in CloudWatch while spending 95% of my time in Honeycomb, I'd pick Honeycomb every time.


Agreed - Honeycomb has been a boon, though some improvements to metric displays and the ability to set the default "board" used on the home page would be very welcome. I'd also be pretty happy if there were a way to drop events on the Honeycomb side to filter dynamically - e.g. "don't even bother storing this trace if it has an http.status_code < 400". This is surprisingly painful to implement on the application side (at least in Rust).

Hopefully someone that works there is reading this.
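
For what it's worth, here's a rough Go sketch of the application-side idea (the Rust story should be similar in spirit): a span processor that refuses to forward spans whose http.status_code is below 400. Note this drops individual spans rather than whole traces, which is part of why it's painful and why tail sampling is usually the better answer. The names and the int-typed status attribute are assumptions for illustration.

    package tracing

    import (
        "context"

        sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )

    // errorOnlyProcessor wraps another SpanProcessor (e.g. a BatchSpanProcessor
    // around your exporter) and only forwards spans that look like errors.
    type errorOnlyProcessor struct {
        next sdktrace.SpanProcessor
    }

    func (p errorOnlyProcessor) OnStart(ctx context.Context, s sdktrace.ReadWriteSpan) {
        p.next.OnStart(ctx, s)
    }

    func (p errorOnlyProcessor) OnEnd(s sdktrace.ReadOnlySpan) {
        for _, kv := range s.Attributes() {
            // Assumes http.status_code was recorded as an integer attribute.
            if kv.Key == "http.status_code" && kv.Value.AsInt64() < 400 {
                return // drop: the span never reaches the exporter
            }
        }
        p.next.OnEnd(s)
    }

    func (p errorOnlyProcessor) Shutdown(ctx context.Context) error   { return p.next.Shutdown(ctx) }
    func (p errorOnlyProcessor) ForceFlush(ctx context.Context) error { return p.next.ForceFlush(ctx) }

You'd register it via sdktrace.NewTracerProvider(sdktrace.WithSpanProcessor(...)) in place of the plain batch processor.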


It sounds like you should look into their tail-sampling Refinery tool https://docs.honeycomb.io/manage-data-volume/refinery/


Yep, this is the one to use. Refinery handles exactly this scenario (and more).


I think Honeycomb is perfect for one kind of user, who's entirely concerned with traces and very long retention. For a more general OpenTelemetry-native solution, check out Signoz.


Have you tried the traces in grafana/tempo yet? https://grafana.com/docs/grafana/latest/panels-visualization...

It seems to miss some aggregation stuff, but it's also improving every time I check. I wonder if anyone's used it in anger yet, and how far it is from replacing Datadog or Honeycomb.


Tempo still very much feels like: look at a trace that you found from somewhere else (like logs).

With so much information in traces, and the sheer volume of them, aggregation really is the key to getting actionable info out of a tracing setup if it's going to be the primary entry point.


I've not. Honestly, I'm not in the market for tool shopping at the moment, I need another honeycomb-style moment of "this is incredible" to start looking again. I think it would take "Honeycomb, but we handle metric rollups and do logs" right now.


You can also check out SigNoz - https://github.com/SigNoz/signoz. It has logs, metrics, and traces under a single pane. If you're using the OTel libraries and the OTel collector, you can do a lot of correlation between your logs and traces. I am a maintainer, and we have seen a lot of our users adopt SigNoz for the ease of having all three signals in a single pane.


Eh, personally I view honeycomb and datadog as different enough offerings that I can see why you'd choose to have both.


> It sounds like they were in a place that a lot of companies are in where they don't have a single pane of glass for observability.

One of the biggest features of AWS, which is very easy to take for granted and let go unnoticed, is Amazon CloudWatch. It supports metrics, logging, alarms, metrics from alarms, alarms from alarms, querying historical logs, triggering actions, and so on, and it covers every single service provided by AWS, including metaservices like AWS Config and CloudTrail.

And you barely notice it. It's just there, and you can see everything.

> One of the terrible mistakes I see companies make with this tooling is fragmenting like this.

So much this. It's not fun at all to have to go through logs and metrics on any application, and much less so if for some reason their maintainers scattered their metrics emission to the four winds. However, with AWS all roads lead to CloudWatch, and everything is so much better.


> ...with AWS all roads lead to Cloudwatch, and everything is so much better.

Most of my clients are not in the product-market fit for AWS CloudWatch, because most of their developers don't have the development, testing, and operational maturity/discipline to use CloudWatch cost-effectively (this is at root an organizational problem, but let's not go off on that giant tangent). So the only realistic tracing strategy we converged upon to recommend for them is "grab everything, and retain it up to the point in time where we won't be blamed for not knowing the root cause" (which in some specific cases can be years!), while we undertake the long journey with them to upskill their teams.

This would make using CloudWatch everywhere rapidly climb into the top three largest line items in the AWS bill, easily justifying spinning up that tracing functionality in-house. So we wind up opting into self-managed tooling like Elastic Observability or Honeycomb, where the pricing is friendlier to teams in unfortunate situations that need to start with everything for CYA, much as I would like to stay within CloudWatch.

Has anyone found a better solution for these use cases where the development maturity level is more prosaic, or is this really the best local maximum at the industry's current SOTA?


In addition, one of the largest limitations of CloudWatch is it doesn't work well with a many-aws-account strategy.

Some part of the value of Datadog etc is having a single pane of glass over many aws accounts.


I made this switch very recently. For our Java apps it was as simple as loading the otel agent in place of the Datadog SDK, basically "-javaagent:/opt/otel/opentelemetry-javaagent.jar" in our args.

The collector (which processes and ships metrics) can be installed in K8S through Helm or an operator, and we just added a variable to our charts so the agent can be pointed at the collector. The collector speaks OTLP, which is the fancy combined metrics/traces/logs protocol the OTel SDKs/agents use, but it also speaks Prometheus, Zipkin, etc. to give you an easy migration path. We currently ship to Datadog as well as an internal service, with the end goal being to migrate off of Datadog gradually.
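
For comparison, the same move sketched in Go rather than Java: point the app's OTLP exporter at the collector instead of a vendor, and let the collector fan out to Datadog, Prometheus, etc. This is an illustrative sketch; the endpoint value is made up and usually comes from a chart value or the standard OTEL_EXPORTER_OTLP_ENDPOINT environment variable.

    package main

    import (
        "context"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )

    func initTracing(ctx context.Context) (*sdktrace.TracerProvider, error) {
        // Ship spans to the in-cluster collector over OTLP/gRPC.
        exp, err := otlptracegrpc.New(ctx,
            otlptracegrpc.WithEndpoint("otel-collector.observability.svc:4317"),
            otlptracegrpc.WithInsecure(), // plain in-cluster traffic in this sketch
        )
        if err != nil {
            return nil, err
        }

        tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
        otel.SetTracerProvider(tp)
        return tp, nil
    }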


We tried this about a year and a half ago and ended up going somewhat backwards into DD entrenchment, because they've decided that anything that isn't an official DD metric (that is, typically one collected by their agent) is custom, and therefore substantially more expensive. We wanted a nice migration path from any vendor to any other vendor, but they have a fairly effective strategy for making gradual migrations more expensive for heavy telemetry users. At least our instrumentation these days is OTel, but for the metrics we expected to just scrape from Prometheus, we had to dial back and use more official DD agent metrics and configs instead, lest our bill balloon by 10x. It's a frustrating place to be, especially since it's still not remotely cheap - it just could be way worse.

I know this isn't a DataDog post, and I'm a bit off topic, but I try to do my best to warn against DD these days.


This has been a concern for me too. But the agent is just a statsd receiver with some extra magic, so this seems like a thing that could be solved with the collector sending traffic to an agent rather than the HTTP APIs?

I looked at the OTel DD stuff and didn't see any support for this, FWIW. Maybe it doesn't work because the agent expects more context from the pod (e.g. app and labels?).


Yeah, the DD agent and the otel-collector DD exporter actually use the same code paths for the most part. The relevant difference tends to be in metrics, where the official path involves the DD agent doing collection directly, for example, collecting redis metrics by giving the agent your redis database hostname and creds. It can then pack those into the specific shape that DD knows about and they get sent with the right name, values, etc so that DD calls them regular metrics.

If you instead go the more flexible route of using the de-facto standard Prometheus exporters (like the one for Redis), or built-in Prometheus metrics from something like Istio, and forward those to your agent or configure your agent to poll those Prometheus metrics, it won't do any reshaping (which I can see the arguments for, kinda, knowing a bit about their backend). They just end up in the DD backend as custom metrics, charged at $0.10/mo per 100 time series. If you've used Prometheus before for any realistic deployment with enrichment etc., you can probably see this gets expensive ridiculously fast.

What I wish they'd do instead is have some form of adapter from those de facto standards, so I can still collect metrics 99% my own way, in a portable fashion, and then add DD as my backend without ending up as custom everything, costing significantly more.


> somewhat backwards into DD entrenchment, because they've decided that anything not an official DD metric (that is, collected by their agent typically) is custom and then becomes substantially more expensive.

If a vendor pulled shit like this on me, that's when I would cancel them. Of course most big orgs would rather not do the legwork to actually become portable and migrate off the vendor, so of course they will just pay the bill.

Vendors love the custom shit they build because they know that once it's infiltrated the stack, it's basically like gangrene (you have to cut off the appendage to save the host).


It's interesting that you're using both Honeycomb and Datadog. With everything migrated to OTel, would there be advantages to consolidating on just Honeycomb (or Datadog)? Have you found they're useful for different things, or is there enough overlap that you could use just one or the other?


Author here, thanks for the question! The current split developed from the personal preferences of the engineers who initially set up our observability systems, based on what they had used (and liked) at previous jobs.

We're definitely open to doing more consolidation in the future, especially if we can save money by doing that, but from a usability standpoint we've been pretty happy with Honeycomb for traces and Datadog for everything else so far. And, that seems to be aligned with what each vendor is best at at the moment.


> from the personal preferences of the engineers

https://www.honeycomb.io/pricing

https://www.datadoghq.com/pricing/

Am I wrong to say... having 2 is "expensive"? Maybe not if 50% of your stuff is going to Honeycomb and 50% is going to Datadog. Could you save money/complexity (fewer places to look for things) by having just Datadog or just Honeycomb?


Right now, there isn't much duplication of what we're sending to each vendor, so I don't think we'd save a ton by consolidating, at least based on list prices. We could maybe negotiate better prices based on higher volumes, but I'm not sure if Airplane is spending enough at this point to get massive discounts there.

Another potential benefit would definitely be reduced complexity and better integration for the engineering team. So, for instance, you could look at a log and then more easily navigate to the UI for the associated trace. Currently, we do this by putting Honeycomb URLs in our Datadog log events, which works but isn't quite as seamless. But, given that our team is pretty small at this point and that we're not spending a ton of our time on performance optimizations, we don't feel an urgent need to consolidate (yet).


When you say DataDog for everything else (as in not traces), besides logs, what else do you mean?


Metrics, probably? The article calls out logs, metrics, and traces as the 3 pillars of observability.


Yeah, metrics and logs, plus a few other things that depend on these (alerts, SLOs, metric-based dashboards, etc.).


The killer feature of OpenTelemetry for us is brokering (with ETL).

Partly this lets us easily re-route & duplicate telemetry, partly it means changes to backend products in the future won't be a big disruption.

For metrics we're mostly a telegraf -> Prometheus -> Grafana Mimir shop - telegraf because it's rock solid and feature-rich, Prometheus because there's no real competition in that tier, and Mimir because of scale and self-hosting options.

Our scale problem means most online pricing calculators generate overflow errors.

Our non-security log destination preference is Loki - for similar reasons to Mimir - though a SIEM it definitely is not.

Tracing goes to a vendor, but we're looking to bring that back to Grafana Tempo. Product maturity is a long way off from commercial APM offerings, but it feels like the feature set is about 70% there and converging rapidly. Off-the-shelf tracing products have an appealingly low cost of entry, which only briefly defers lock-in and pricing shocks.


Yeah, the ability to send to multiple destinations is quite powerful, and most of this comes from the configurability of the OTel Collector [1].

If you are looking for an open source backend for OpenTelemetry, then you can explore SigNoz [2] (I am one of the founders). We have quite a decent product for APM/tracing, leveraging the OpenTelemetry-native data format and semantic conventions.

[1] https://opentelemetry.io/docs/collector/ [2] https://github.com/SigNoz/signoz


Hi Pranay - actually I've had a signoz tab open for about 5 weeks - once I find time I'm meaning to run it up in my lab.


Awesome! Do reach out to us in our slack community[1] if you have any questions or need any help on setting things up

[1] https://signoz.io/slack


> mimir because of scale & self-host options

Have you looked at VictoriaMetrics [0] before opting for Mimir?

[0] https://victoriametrics.com/blog/mimir-benchmark/


I would love to save a few hundred thousand a year by running the OTel collector instead of Datadog agents, just on the cost-per-host alone. Unfortunately that would also mean giving up Datadog APM and NPM, as far as I can tell, which have been really valuable. Going back to just metrics and traces would feel like quite the step backwards and be a hard sell.


You can submit OpenTelemetry traces to Datadog, which should be the equivalent of APM/NPM, though maybe with a less polished integration.


Just traces are a long way off from APM and NPM. APM gives me the ability to debug memory leaks from continuous heap snapshots, or performance issues through CPU profiling. NPM is almost like having tcpdump running constantly, showing me where there's packet loss or other forms of connectivity issues.


Thank you for sharing this, I've had "look at tracing" on my to do list for months and assumed it was identical to APM. It seems it won't be a direct substitute, which helps explain the cost difference.


One thing that's slightly off-putting about OpenTelemetry is how resource attributes don't get included as Prometheus labels on metrics; instead they end up on an info metric, which requires a join to enrich the metric you're interested in.

Luckily the Prometheus exporters have a switch to enable this behaviour, but there's talk of removing this functionality because it breaks the spec. If you were to use the OpenTelemetry protocol into something like Mimir, you don't have the option of enabling that behaviour unless you use Prometheus remote write.

Our developers aren't a fan of that.

https://opentelemetry.io/docs/specs/otel/compatibility/prome...


FYI, VictoriaMetrics converts resource attributes to ordinary labels before storing metrics received via the OpenTelemetry protocol - https://docs.victoriametrics.com/#sending-data-via-opentelem... . This simplifies filtering and grouping of such metrics during querying. For example, you need to write `my_metric{resource_name="foo"}` instead of `my_metric * on(resource_id) group_left() resource_info{resource_name="foo"}` when filtering by `resource_name`.


Thanks, that's nice to know VM can accommodate that. A migration is something we will have to consider if Mimir and OpenTelemetry force us to use joins for all our queries.

They may be trying to address label cardinality, but their approach seems like throwing the baby out with the bathwater. The developer experience suffers as a result, because from a dev POV the resource attributes are added to the metric, yet this relationship isn't carried over when the metric is translated to Prometheus.


If you are using the prometheus exporter, you can use the transform processor to get specific resource attributes into metric labels.

With the advantage that you get only the specific attributes you want, thus avoiding a cardinality explosion.

https://github.com/open-telemetry/opentelemetry-collector-co...


We've migrated away from the Prometheus exporter to the Prometheus remote write exporter, as I'd like a completely "push"-based architecture. Ideally I would have liked to be completely OTLP but can't, for the reasons already explained, so I use Prometheus remote write into Mimir instead.

The transform processor could be useful if they ever deprecate the resource_to_telemetry_conversion flag, but it's still a pain point because it hinders developers' autonomy and requires a whitelist of labels to be maintained on the collectors by another team.


> Moreover, we encountered some rough edges in the metrics-related functionality of the Go SDK referenced above. Ultimately, we had to write a conversion layer on top of the OTel metrics API that allowed for simple, Prometheus-like counters, gauges, and histograms.

Have encountered this a lot from teams attempting to use the metrics SDK.

Are you open to commenting on the specifics here, and on what kind of shim you had to put in front of the SDK? It would be great to keep gathering feedback so that we, as a community, have a good idea of what remains before it's possible to use the SDK in anger for real-world production use cases. Just wiring up the setup in your app used to be fairly painful, though that has gotten somewhat better over the last 12-24 months. I'd also love to hear what is currently causing compatibility issues with the metric types themselves that requires a shim, and what the shim is doing to achieve compatibility.


Sure, happy to provide more specifics!

Our main issue was the lack of a synchronous gauge. The officially supported asynchronous API of registering a callback function to report a gauge metric is very different from how we were doing things before, and would have required lots of refactoring of our code. Instead, we wrote a wrapper that exposes a synchronous-like API: https://gist.github.com/yolken-airplane/027867b753840f7d15d6....
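
The shape of the wrapper is roughly like this (an illustrative sketch on the standard go.opentelemetry.io/otel/metric API, not the exact code in the gist; names here are made up): Set() stores the latest value, and a registered callback reports it at each collection.

    package telemetry

    import (
        "context"
        "math"
        "sync/atomic"

        "go.opentelemetry.io/otel/metric"
    )

    // SyncGauge exposes a synchronous Set() on top of the async observable gauge.
    type SyncGauge struct {
        bits atomic.Uint64 // latest value, stored as float64 bits
    }

    func NewSyncGauge(meter metric.Meter, name string) (*SyncGauge, error) {
        g := &SyncGauge{}
        _, err := meter.Float64ObservableGauge(name,
            metric.WithFloat64Callback(func(_ context.Context, o metric.Float64Observer) error {
                o.Observe(math.Float64frombits(g.bits.Load()))
                return nil
            }),
        )
        return g, err
    }

    // Set is the synchronous-style call application code makes, like a Prometheus gauge.
    func (g *SyncGauge) Set(v float64) {
        g.bits.Store(math.Float64bits(v))
    }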

It seems like this is a common feature request across many of the SDKs, and it's in the process of being fixed in some of them (https://github.com/open-telemetry/opentelemetry-specificatio...)? I'm not sure what the plans are for the golang SDK specifically.

Another, more minor issue, is the lack of support for "constant" attributes that are applied to all observations of a metric. We use these to identify the app, among other use cases, so we added wrappers around the various "Add", "Record", "Observe", etc. calls that automatically add these. (It's totally possible that this is supported and I missed it, in which case please let me know.)

Overall, the SDK was generally well-written and well-documented, we just needed some extra work to make the interfaces more similar to the ones we were using before.


Thanks for the detailed response.

I am surprised there is no gauge update API yet (instead of callback-only); this is a common use case and I don't think folks should be expected to implement their own. Especially since it can lead to allocation-heavy bespoke implementations, depending on the use case, given the mutex + callback + other structures that likely need to be heap allocated (vs. a simple int64 wrapper with atomic update/load APIs).

Also, the fact that the APIs differ a lot from the more popular Prometheus client libraries does beg the question of whether we need more complicated APIs that folks have a harder time using. Now is the time to modernize these, before everyone is instrumented with some generation of a client library that would need to change/evolve. The whole idea of an OTel SDK is to instrument once and then avoid re-instrumenting when you make changes to your observability pipeline and where it's pointed. That becomes a hard sell if the OTel SDK needs to shift significantly to support more popular and common use cases with more typical APIs, leaving behind a whole bunch of OTel-instrumented code that needs to be modernized to a different-looking API.


The official SDKs will only support an API once there's a spec that allows it.

For const attributes, these should generally be defined at the resource/provider level: https://pkg.go.dev/go.opentelemetry.io/otel/sdk/metric#WithR...
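
A minimal sketch of that resource/provider-level approach in Go (attribute names and values here are illustrative):

    package main

    import (
        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
        sdkmetric "go.opentelemetry.io/otel/sdk/metric"
        "go.opentelemetry.io/otel/sdk/resource"
    )

    func initMetrics() *sdkmetric.MeterProvider {
        // "Constant" attributes live on the resource, so every instrument created
        // from this provider carries them without per-call WithAttributes.
        res := resource.NewSchemaless(
            attribute.String("service.name", "my-app"),
            attribute.String("deployment.environment", "prod"),
        )

        mp := sdkmetric.NewMeterProvider(sdkmetric.WithResource(res))
        otel.SetMeterProvider(mp)
        return mp
    }

Whether those resource attributes then show up as labels on each exported metric is exporter-dependent, which ties back to the Prometheus-labels discussion elsewhere in the thread.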


Curious about the code implemented for logs! Hopefully that's something that can be shared at some point. Also curious if it integrates with `log/slog` :-)

Congrats too! As I understand it from stories I've heard from others, migrating to OTel is no easy undertaking.


Thanks! For logs, we actually use github.com/segmentio/events and just implemented a handler for that library that batches logs and periodically flushes them out to our collector using the underlying protocol buffer interface. We plan on migrating to log/slog soon, and once we do that we'll adapt our handler and can share the code.


Awesome! Great work and thanks for sharing your experience!


What is the "first principles" argument that observability decomposes into logs, metrics, and tracing? I see this dogma accepted everywhere, but I'm curious about it.


First you had logs. Everyone uses logs because it's easy. Logs are great, but suddenly you're spending a crapton of time or money maintaining terabytes or petabytes of storage and ingest of logs. And even worse, in some cases for these logs, you don't actually care about 99% of the log line and simply want a single number, such as CPU utilization or the value of the shopping cart or latency.

So, someone says, "let's make something smaller and more portable than logs. We need to track numerical data over time more easily, so that we can see pretty charts of when these values are outside of where they should be." This ends up being metrics and a time-series database (TSDB), built to handle not arbitrary lines of text but instead meant to parse out metadata and append numerical data to existing time-series based on that metadata.

Between metrics and logs, you end up with a good idea of what's going on with your infrastructure, but logs are still too verbose to understand what's happening with your applications past a certain point. If you have an application crashing repeatedly, or if you've got applications running slowly, metrics and logs can't really help you there. So companies built out Application Performance Monitoring, meant to tap directly into the processes running on the box and spit out all sorts of interesting runtime metrics and events about not just the applications, but the specific methods and calls those applications are utilizing within their stack/code.

Initially, this works great if you're running these APM tools on a single box within monolithic stacks, but as the world moved toward Cloud Service Providers and containerized/ephemeral infrastructure, APM stopped being as effective. When a transaction starts to go through multiple machines and microservices, APM deployed on those boxes individually can't give you the context of how these disparate calls relate to a holistic transaction.

So someone says, "hey, what if we include transaction IDs in these service calls, so that we can post-hoc stitch together these individual transaction lines into a whole transaction, end-to-end?" Which is how you end up with the concept of spans and traces, taking what worked well with Application Performance Monitoring and generalizing that out into the modern microservices architectures that are more common today.


Interesting read. What did you find easier about using GCP's log tooling for your internal system logs, rather than the OTel collector?


Author here. This decision was more about ease of implementation than anything else. Our internal application logs were already being scooped up by GCP because we run our services in GKE, and we already had a GCP->Datadog log syncer [1] for some other GCP infra logs, so re-using the GCP-based pipeline was the easiest way to handle our application logs once we removed the Datadog agent.

In the future, we'll probably switch these logs to also go through our collector, and it shouldn't be super hard (because we already implemented a golang OTel log handler for the external case), but we just haven't gotten around to it yet.

[1] https://docs.datadoghq.com/integrations/google_cloud_platfor...


Their collector is used to send infrastructure logs to GCP (instead of Datadog).

My guess is this is to save on costs. GCP logging is probably cheaper than Datadog, and infrastructure logs may not be needed as frequently as application logs.


I really really want to use OTel for a small project but have always had a really tough time finding a path that is cheap or free for a personal project.

In theory you can send telemetry data with OTel to CloudWatch, but I've struggled to connect the dots with the frontend application (e.g. React/Next.js).


Have you checked out Jaeger [1]? It is lightweight enough for a personal project, open source, and featureful enough to really help "turn on the lightbulb" with other engineers to show them the difference between logging/monitoring and tracing.

[1] https://www.jaegertracing.io/


Grafana Cloud, Honeycomb, etc. have free tiers, though you'll have to watch how much data you send them. Or you can self-host something like SigNoz or the Elastic stack. The frontend will typically send to an instance of the OpenTelemetry Collector, which filters/converts to the protocol of the storage backend.


At the risk of being downvoted (probably justly) for having a moan, can we please have a moratorium on every blog post needing to have a generally irrelevant picture attached to it? On opening this page I can see 28 words that are actually relevant because almost the entire view is consumed by a huge picture of a graph and the padding around it.

This is endemic now. Doesn't matter what someone is writing about there'll be some pointless stock photo taking up half the page. There'll probably be some more throughout the page. Stop it please.


I had the impression that logs and metrics are a pre-observability thing.


I've never heard the term "pre-observability", what does that mean?


The era when "debugging in production" wasn't standard.


Observability is about logs and metrics, and pre-observability (I guess you mean the high-level-only records simpler environments keep) is also about logs and metrics.

Anything you register to keep track of your environment has the form of either logs or metrics. The difference is about the contents of such logs and metrics.


When I read Observability Engineering, I got the impression it was about wide events and tracing, and that metrics and logs were a thing of the past that people gave up on since the rise of microservices.


> Authors Charity Majors, Liz Fong-Jones, and George Miranda from Honeycomb explain what constitutes good observability, show you how to improve upon what you're doing today, and provide practical dos and don'ts for migrating from legacy tooling, such as metrics, monitoring, and log management. You'll also learn the impact observability has on organizational culture (and vice versa).

No wonder - it's either strong bias from people working at a tracing vendor, or outright a sales pitch.

It's totally false though. Each pillar - metrics, logs, and traces - has its place and serves a different purpose. You won't use traces to measure the number of requests hitting your load balancer, or the number of objects in the async queue, or CPU utilisation, or network latency, or any number of things. Logs can be richer than traces, and a nice pattern I've used with Grafana is linking the two, so you have the option to jump from a trace to the corresponding log lines, which can describe the different actions performed during that span.


While I was at Google, circa 2015-2016, I was working on some Ads project and happened to be on call when our system started doing something wonky. I think I called the SREs for the sub-system we were using (Spanner? something else - I don't remember) to check what was up (as prescribed by our playbook).

They asked me to enable tracing for 30s (we had a Chrome extension that sends a common URL parameter which enables full (100%) tracing in your web server for a short amount of time), and then I ran some of the operations that our internal customers were complaining about.

This produced quite a hefty amount of trace data, but only for 30 seconds - enough for them to trace back where the issue might be coming from. It was basically end-to-end: from me doing something in the browser, down to our server/backend, down to their systems, etc.

That's how I learned how important it is for cases like this - but you can't have it 100% on all the time, not even 1%, I think...


Oh yeah, tracing can be extremely useful, precisely because it should be end to end.

As for the numbers, that's why all tracing collectors and receivers support downsampling out of the box. Recording only 1% or 10% of all traces, or 10% of all successful ones and 100% of failures is a good way of making use of tracing without overburdening storage.


You can sorta measure some of this with traces. For example, sampled traces that contain the sampling rate in their metadata let you re-weight counts, thus allowing you to accurately measure "number of requests to x". Similarly, a good sampling of network latency can absolutely be measured from trace data. Metrics will always have their place, though, for the reasons you mention - measuring CPU utilization, the number of objects in something, etc. Logs vs. traces is more nuanced, I think. A trace is nothing more than a collection of structured logs. I would wager that nearly all use cases for structured logging could be wholesale replaced by tracing. Security logging and big-object logging are exceptions, although that's also dependent on your vendor or backend.


> metrics and logs were a thing of the past people gave up on since the rise of Microservices

Definitely not the case, and, in fact, probably the opposite is true. In the era of microservices, metrics are absolutely critical to understand the health of your system. Distributed tracing is also only beneficial if you have the associated logs - so that you can understand what each piece of the system was doing for a single unit of work.


> Distributed tracing is also only beneficial if you have the associated logs - so that you can understand what each piece of the system was doing for a single unit of work.

Ehhh, that's only if you view tracing as "the thing that tells me that service A talks to service B". Spans in a trace are just structured logs. They are your application logging vehicle, especially if you don't have a legacy of good in-app instrumentation via logs.

But even then the worlds are blurring a bit. OTel logs burn in a span and trace ID, and depending on the backend that correlated log may well just be treated as if it's a part of the trace.



