Tracing at Slack: Thinking in Causal Graphs

Game_Ender · on Aug 31, 2020

The real takeaway here: Honeycomb is a great tool for log and trace analysis, and by using Honeycomb's minimal data model [1] for traces you can expand tracing throughout much more of your stack. Now Honeycomb itself makes you pay by the event so Slack's internal buffer and collection system really lets them use it effectively for just the data they need. The low adoption of tracing is also due to the high cost of application level integration, which they are deferring by using the existing tracing systems service layer integration.

So if you come at this as a Honeycomb user Slack has not really re-invented any wheel so much built their own backend to funnel data to Honeycomb and use it's excellent trace viewing and light weight analytics UI [0]. In fact API and model which this article goes to great lengths to explain and justify is basically the Honeycomb tracing API and data model [1]. Which itself is an evolution of the system Facebook uses [2]. It is very cool that they have plugged the gap in Honeycomb which is the lack of historical analysis capability by also funneling all data into Presto.

I would love to see the system they use for keeping field/tag names under control. Honeycomb really strains under large numbers of field names.

0 - https://www.honeycomb.io/trace/

1 - https://docs.honeycomb.io/getting-data-in/tracing/send-trace...

2 - https://www.techrepublic.com/article/ex-facebook-engineers-l...

ryanworl · on Aug 31, 2020

What would you consider a "large" number of field/tag names?

setheron · on Aug 31, 2020

I didn't really get this. Spans in other frameworks are already DAGs; and easy to reason about them as such.

Sounds like they just made the API "more generic" by removing request oriented nomenclature ?

solumos · on Aug 31, 2020

Agreed - this definitely has some "Not Invented Here" baked into it. Seems like they didn't like some of the nomenclature associated with Jaeger/Zipkin, even though both are pretty extensible . . .

> In the future, we plan to build on this success by adding a more powerful query language for querying trace data.

Now they just need a custom DSL!

mansu1 · on Sept 3, 2020

Post author here.

Yes, existing span formats also represent DAGs. But there are additional details in the spans, that trip up developers. Also, most developers don't think about using spans to represent DAGs in their applications

Regular Span - Annotations(nested structures) = SpanEvent.

Yes, we made the API much more simpler. As a result you can consume the spans in the format you produce them.

rektide · on Sept 1, 2020

I'm sorry but the problem statements really did not resonate with me. I felt like there was a very strict limited consideration for the existing tools, justifying breaking away, but the SpanEvent they came up with felt practically indistinguishable from the preexisting technologies they spent so long poo-pooing.

mansu1 · on Sept 3, 2020

Author of the post here. I was the tech lead for Zipkin at Twitter, implemented tracing system at Pinterest[1] and a contributor to the Open tracing spec[2]. So, limited consideration of existing tools may not be too accurate of a description :).

A SpanEvent is a simpler span.

Regular Span - Annotations(nested structures) = SpanEvent.

Most developers don't think about using spans to represent DAGs in their applications. The goal of a SpanEvent is to aid developers to think in terms of DAGs.

In my experience, with some care and clever code, you can use existing Span formats. But, most developers give up and do something else instead.

[1] https://medium.com/pinterest-engineering/distributed-tracing... [2] https://opentracing.io/specification/organization/

jeffbee · on Aug 31, 2020

For answering the question "Am I missing my SLA because of slow DB queries on a specific http endpoint?" you might want to look at the trace analysis features offered by Lightstep in addition to or instead of Honeycomb. Lightstep can classify your trace population to find span data disproportionately associated with high latency. It's pretty slick.

chasers · on Aug 31, 2020

If anyone else was curious...

"Our tracing pipeline has been in production for over a year now and we trace 1% of all requests from our clients. For some low volume services, we trace 100%. Our current pipeline processes ~310M traces/day and about ~8.5B spans per day, producing around 2Tb of trace data every day."

ablekh · on Sept 1, 2020

The more I read posts like this, the more I like the modular monolith architecture. :-)

rad_gruchalski · on Aug 31, 2020

> Zipkin and Jaegar are two of the most popular open source tracing projects that follow the above model

Jaeger. This error happens multiple times in the article.

> While the current APIs work very well for their intended use cases, using those APIs in contexts where there is no clear start or end for an operation can be confusing or not possible.

I don’t understand this part. At Klarrio, we are using Jaeger to trace a multi-tenant microservice system. We ingest data over mqtt and forward to kafka. Contexts are created on ingest and forwarded via kafka to further services. Each service adds its own spans to the trace. A message can be sent from one tenant to another, we don’t know when the trace is completed as the tenant can add spans at any time in the future. There isn’t clear end to the process. Traces can span over days. We use it to report latencies in a real time environment to a governmental body.

> whose event loops call into application code (inversion of control), we need complex workarounds in existing APIs that often break the abstractions provided by these libraries

I’d like to know more how a custom solution solves this problem. At the end of the day, the trace id and parent span id still has to be somehow forwarded to the next service.

* edit: there is an example of a curl query towards the end of the article. Jaeger has a built in reporter and it’s already possible to send a context over http to another service. It’s just data?

So it also has to somehow plug into the mentioned frameworks/IoC? What’s different?

In worst case, this is a perfect example of where a contribution for opentracing support would be probably appreciated.

> Further, we also found that in those use cases there may be multiple request flows (producer-consumer patterns via kafka, goroutines, streaming systems) happening at the same time or we may want to track a single event across multiple request flows. In either case, having a single tracer for the execution of the entire application can be a limiting factor.

It’s possible with Jaeger. We do this at Klarrio. Span context can be sent over the wire. Furthermore, one is not limited to a single tracer per application.

> Raw spans are internal implementation details in tracing systems.

Yes, implementation detail.

> These internal implementation details are hidden from users, who interact with these spans via an instrumentation API at creation time and via the trace UI at consumption time.

Consumption, how? In the UI? Yes, they are visible. Through an API? They’re in tags and logs fields. They’re definitely not hidden.

> For example, I’d like to be able to ask a question like “Am I missing my SLA because of slow DB queries on a specific http endpoint?” Even though traces contain this information, users often have to write complex Java programs, a multi-day effort, to answer their questions. Without an easy way to ask powerful analytical questions, users have to settle for the insights that can be gleaned from the trace view. This drastically limits the ability to utilize trace data for triage.

Jaeger has pluggable storage back ends. One of them is kafka. With kafka, it’s one step to such a solution. Either store them in a database allowing for such query (presto, druid... hello?) or write a kafka app managing alerts. Why reinvent the wheel completely?

I’m not convinced but if it works for Slack, great. So... can I try it?

mansu1 · on Sept 3, 2020

Author of the post here.

Thanks for reading the post and your detailed questions.

> I don’t understand this part. At Klarrio, we are using Jaeger to trace a multi-tenant microservice system. We ingest data over mqtt and forward to kafka. Contexts are created on ingest and forwarded via kafka to further services. Each service adds its own spans to the trace. A message can be sent from one tenant to another, we don’t know when the trace is completed as the tenant can add spans at any time in the future. There isn’t clear end to the process. Traces can span over days. We use it to report latencies in a real time environment to a governmental body.

If your trace data is perfect, everything works. The issue comes when your traces are imperfect.

It is my experience that the existing tracing tools break in subtle ways when the parent span doesn't enclose the complete time for the child span (the UI and the trace analysis tools). Things may have improved a bit since I last looked though.

When we treat traces as raw data, we side step these issues since we leave the real interpretation of the data in the causal graph to the reader instead of forcing a specific view of the data.

> I’d like to know more how a custom solution solves this problem. At the end of the day, the trace id and parent span id still has to be somehow forwarded to the next service.

When you use a tracer in application frameworks today, it comes with built in trace context propagation. However, in some contexts this may be not be ideal since your application or framework already has implicit context or some other mechanism built in. In those cases, having a lower level API just to produce spans directly is more useful. It also allows for gradual addition of tracing to your application instead of adding tracing all at once. For example, if you use a jenkins_job_id as your trace_id, you don't have to explicitly propagate any context across all those tasks in that job.

OpenTracing is a higher level API in these cases and adding these lower level APIs may not be ideal.

> one is not limited to a single tracer per application

Yes, you can have multiple tracers per application. But, you are limited to one tracer per request execution in the current libs.

> Consumption, how? In the UI? Yes, they are visible. Through an API? They’re in tags and logs fields. They’re definitely not hidden.

You are right, none of the data we put on the span is hidden from the user. In the existing systems, a user is exposed to an abstraction of a span. They are not exposed to raw spans. It is in this sense, that the true spans are hidden from the user.

> Jaeger has pluggable storage back ends. One of them is kafka. With kafka, it’s one step to such a solution. Either store them in a database allowing for such query (presto, druid... hello?) or write a kafka app managing alerts. Why reinvent the wheel completely?

Yes, we already put this data in Presto and query that data using SQL.

We tried doing it with zipkin data in the past. However, such efforts have failed because beyond a few power users regular users had a hard time querying the raw span data. Further, the current span formats are not database friendly since they have nested structure.

In practice, the simplest system we found is one where the user produces the data in a specific format and then consumes the data in the same format. SpanEvent was created to provide the same view during span creation and span consumption. The short answer to why a new system is it's simpler.

dvt · on Aug 31, 2020

To me, it seems pretty crazy that (what's basically) a chat app needs such high-levels of traceability -- 2Tb of trace data every day? Why? I mean, people have been running IRC servers in their basements for decades just fine, but I guess the millions in engineering budgets need to go somewhere.

m1keil · on Sept 1, 2020

You are comparing two different types of services. While they share commonalities (passing messages between users), the details are quite different. For example, you don’t get the history of IRC server for times you were offline. It’s like comparing between an airbus and a glider, both fly, right?