
Tracing at Slack: Thinking in Causal Graphs - oedmarap
https://slack.engineering/tracing-at-slack-thinking-in-causal-graphs/
======
Game_Ender
The real takeaway here: Honeycomb is a great tool for log and trace analysis,
and by using Honeycomb's minimal data model [1] for traces you can expand
tracing throughout much more of your stack. Honeycomb charges per event, so
Slack's internal buffering and collection system lets them send it only the
data they need. The low adoption of tracing is also due to the high cost of
application-level integration, which they defer by reusing the service-layer
integration of the existing tracing systems.

So if you come at this as a Honeycomb user, Slack has not really reinvented
any wheel so much as built their own backend to funnel data into Honeycomb and
use its excellent trace-viewing and lightweight analytics UI [0]. In fact, the
API and model this article goes to great lengths to explain and justify are
basically the Honeycomb tracing API and data model [1], which is itself an
evolution of the system Facebook uses [2]. It is very cool that they have
plugged the main gap in Honeycomb (the lack of historical analysis capability)
by also funneling all the data into Presto.

I would love to see the system they use for keeping field/tag names under
control. Honeycomb really strains under large numbers of field names.

0 - [https://www.honeycomb.io/trace/](https://www.honeycomb.io/trace/)

1 - [https://docs.honeycomb.io/getting-data-in/tracing/send-trace-data/#manual-tracing](https://docs.honeycomb.io/getting-data-in/tracing/send-trace-data/#manual-tracing)

2 - [https://www.techrepublic.com/article/ex-facebook-engineers-launch-honeycomb-a-new-tool-for-your-debugging-nightmares/](https://www.techrepublic.com/article/ex-facebook-engineers-launch-honeycomb-a-new-tool-for-your-debugging-nightmares/)

~~~
ryanworl
What would you consider a "large" number of field/tag names?

------
setheron
I didn't really get this. Spans in other frameworks already form DAGs, and
it's easy to reason about them as such.

Sounds like they just made the API "more generic" by removing request-oriented
nomenclature?

~~~
solumos
Agreed - this definitely has some "Not Invented Here" baked into it. Seems
like they didn't like some of the nomenclature associated with Jaeger/Zipkin,
even though both are pretty extensible...

> In the future, we plan to build on this success by adding a more powerful
> query language for querying trace data.

Now they just need a custom DSL!

------
rektide
I'm sorry, but the problem statements really did not resonate with me. The
consideration given to the existing tools felt very narrow, as if to justify
breaking away, yet the SpanEvent they came up with felt practically
indistinguishable from the preexisting technologies they spent so long pooh-
poohing.

~~~
mansu1
Author of the post here. I was the tech lead for Zipkin at Twitter,
implemented the tracing system at Pinterest [1], and am a contributor to the
OpenTracing spec [2]. So "limited consideration of existing tools" may not be
too accurate a description :).

A SpanEvent is a simpler span.

Regular Span - Annotations(nested structures) = SpanEvent.

Most developers don't think about using spans to represent DAGs in their
applications. The goal of a SpanEvent is to help developers think in terms of
DAGs.

In my experience, with some care and clever code, you can use the existing
span formats. But most developers give up and do something else instead.

[1] [https://medium.com/pinterest-engineering/distributed-tracing-at-pinterest-with-new-open-source-tools-a4f8a5562f6b](https://medium.com/pinterest-engineering/distributed-tracing-at-pinterest-with-new-open-source-tools-a4f8a5562f6b)

[2] [https://opentracing.io/specification/organization/](https://opentracing.io/specification/organization/)
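The "Regular Span - Annotations = SpanEvent" identity can be sketched in code. This is a hypothetical shape for illustration only; the field names are my assumptions, not Slack's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """A conventional span: identity/causality fields plus nested annotations."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    start_us: int
    duration_us: int
    annotations: list = field(default_factory=list)  # nested structures

@dataclass
class SpanEvent:
    """The same span minus the nested annotations: flat key-value tags only,
    which keeps each record database-friendly (e.g. for Presto)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    start_us: int
    duration_us: int
    tags: dict = field(default_factory=dict)  # flat key -> value only
```

The causal graph is still fully recoverable from (trace_id, span_id, parent_id); only the nested payload is gone.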

------
jeffbee
For answering the question "Am I missing my SLA because of slow DB queries on
a specific http endpoint?" you might want to look at the trace analysis
features offered by Lightstep in addition to or instead of Honeycomb.
Lightstep can classify your trace population to find span data
disproportionately associated with high latency. It's pretty slick.

------
chasers
If anyone else was curious...

"Our tracing pipeline has been in production for over a year now and we trace
1% of all requests from our clients. For some low volume services, we trace
100%. Our current pipeline processes ~310M traces/day and about ~8.5B spans
per day, producing around 2Tb of trace data every day."

------
ablekh
The more I read posts like this, the more I like the _modular monolith_
architecture. :-)

------
rad_gruchalski
> Zipkin and Jaegar are two of the most popular open source tracing projects
> that follow the above model

Jaeger. This error happens multiple times in the article.

> While the current APIs work very well for their intended use cases, using
> those APIs in contexts where there is no clear start or end for an operation
> can be confusing or not possible.

I don’t understand this part. At Klarrio, we are using Jaeger to trace a
multi-tenant microservice system. We ingest data over MQTT and forward it to
Kafka. Contexts are created on ingest and forwarded via Kafka to further
services. Each service adds its own spans to the trace. A message can be sent
from one tenant to another, and we don’t know when the trace is complete, as a
tenant can add spans at any time in the future. There isn’t a clear end to the
process. Traces can span days. We use it to report latencies in a real-time
environment to a governmental body.

> whose event loops call into application code (inversion of control), we need
> complex workarounds in existing APIs that often break the abstractions
> provided by these libraries

I’d like to know more about how a custom solution solves this problem. At the
end of the day, the trace id and parent span id still have to be somehow
forwarded to the next service.

* edit: there is an example of a curl query towards the end of the article. Jaeger has a built-in reporter, and it’s already possible to send a context over HTTP to another service. It’s just data?

So it also has to somehow plug into the mentioned frameworks/IoC? What’s
different?

In the worst case, this is a perfect example of a place where a contribution
adding OpenTracing support would probably be appreciated.

> Further, we also found that in those use cases there may be multiple request
> flows (producer-consumer patterns via kafka, goroutines, streaming systems)
> happening at the same time or we may want to track a single event across
> multiple request flows. In either case, having a single tracer for the
> execution of the entire application can be a limiting factor.

It’s possible with Jaeger. We do this at Klarrio. Span context can be sent
over the wire. Furthermore, one is not limited to a single tracer per
application.

> Raw spans are internal implementation details in tracing systems.

Yes, implementation detail.

> These internal implementation details are hidden from users, who interact
> with these spans via an instrumentation API at creation time and via the
> trace UI at consumption time.

Consumption, how? In the UI? Yes, they are visible. Through an API? They’re in
tags and logs fields. They’re definitely not hidden.

> For example, I’d like to be able to ask a question like “Am I missing my SLA
> because of slow DB queries on a specific http endpoint?” Even though traces
> contain this information, users often have to write complex Java programs, a
> multi-day effort, to answer their questions. Without an easy way to ask
> powerful analytical questions, users have to settle for the insights that
> can be gleaned from the trace view. This drastically limits the ability to
> utilize trace data for triage.

Jaeger has pluggable storage backends. One of them is Kafka. With Kafka, it’s
one step to such a solution: either store the spans in a database that allows
such queries (Presto, Druid... hello?) or write a Kafka app that manages
alerts. Why reinvent the wheel completely?

I’m not convinced but if it works for Slack, great. So... can I try it?

~~~
mansu1
Author of the post here.

Thanks for reading the post and your detailed questions.

> I don’t understand this part. At Klarrio, we are using Jaeger to trace a
> multi-tenant microservice system. We ingest data over MQTT and forward it to
> Kafka. Contexts are created on ingest and forwarded via Kafka to further
> services. Each service adds its own spans to the trace. A message can be
> sent from one tenant to another, and we don’t know when the trace is
> complete, as a tenant can add spans at any time in the future. There isn’t a
> clear end to the process. Traces can span days. We use it to report
> latencies in a real-time environment to a governmental body.

If your trace data is perfect, everything works. The issue comes when your
traces are imperfect.

It is my experience that the existing tracing tools (both the UI and the trace
analysis tools) break in subtle ways when the parent span doesn't enclose the
complete time of the child span. Things may have improved a bit since I last
looked, though.

When we treat traces as raw data, we sidestep these issues, since we leave the
real interpretation of the causal graph to the reader instead of forcing a
specific view of the data.

> I’d like to know more how a custom solution solves this problem. At the end
> of the day, the trace id and parent span id still has to be somehow
> forwarded to the next service.

When you use a tracer in today's application frameworks, it comes with built-
in trace context propagation. However, in some contexts this may not be ideal,
since your application or framework already has implicit context or some other
propagation mechanism built in. In those cases, having a lower-level API that
just produces spans directly is more useful. It also allows for gradual
addition of tracing to your application instead of adding tracing all at once.
For example, if you use a jenkins_job_id as your trace_id, you don't have to
explicitly propagate any context across all the tasks in that job.

OpenTracing is a higher-level API, and adding these lower-level APIs to it may
not be ideal.
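As a rough sketch of that idea (the function name and field shapes are my assumptions, not Slack's actual API), producing span events directly with an externally meaningful trace_id needs no tracer object and no in-process context propagation:

```python
import time
import uuid

def span_event(trace_id, name, parent_id=None, **tags):
    """Build a flat span event dict directly; no tracer or context
    propagation machinery required. The shape is hypothetical."""
    return {
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent_id,
        "name": name,
        "start_us": int(time.time() * 1e6),
        "tags": tags,
    }

# Any stable external identifier can serve as the trace_id, so every task
# in a Jenkins job lands in one causal graph without explicit context
# passing between tasks.
jenkins_job_id = "jenkins-build-1234"   # hypothetical external id
build = span_event(jenkins_job_id, "build")
unit_tests = span_event(jenkins_job_id, "unit-tests",
                        parent_id=build["span_id"])
```

Both events share the same trace_id, and the parent/child edge is expressed purely in the data.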

> one is not limited to a single tracer per application

Yes, you can have multiple tracers per application. But, you are limited to
one tracer per request execution in the current libs.

> Consumption, how? In the UI? Yes, they are visible. Through an API? They’re
> in tags and logs fields. They’re definitely not hidden.

You are right, none of the data we put on the span is hidden from the user.
But in the existing systems, a user is exposed to an abstraction of a span,
not to the raw spans themselves. It is in this sense that the true spans are
hidden from the user.

> Jaeger has pluggable storage back ends. One of them is kafka. With kafka,
> it’s one step to such a solution. Either store them in a database allowing
> for such query (presto, druid... hello?) or write a kafka app managing
> alerts. Why reinvent the wheel completely?

Yes, we already put this data in Presto and query that data using SQL.

We tried doing this with Zipkin data in the past. However, such efforts failed
because, beyond a few power users, regular users had a hard time querying the
raw span data. Further, the current span formats are not database-friendly,
since they have nested structure.

In practice, the simplest system we found is one where the user produces data
in a specific format and then consumes it in the same format. SpanEvent was
created to provide the same view during span creation and span consumption.
The short answer to "why a new system" is: it's simpler.
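A rough illustration of why the flat format matters for ad-hoc questions like the SLA one above; the field names and the 200 ms threshold are made up for the example. Over flat events this is a simple filter, and in Presto it would be a single SELECT with a WHERE clause rather than a custom Java program:

```python
# Hypothetical flat span events, one dict per event.
events = [
    {"trace_id": "t1", "name": "db.query", "endpoint": "/api/send", "duration_ms": 340},
    {"trace_id": "t2", "name": "db.query", "endpoint": "/api/send", "duration_ms": 12},
    {"trace_id": "t3", "name": "db.query", "endpoint": "/api/list", "duration_ms": 500},
]

SLA_MS = 200  # assumed SLA threshold

# "Am I missing my SLA because of slow DB queries on a specific endpoint?"
slow = [e for e in events
        if e["name"] == "db.query"
        and e["endpoint"] == "/api/send"
        and e["duration_ms"] > SLA_MS]

print([e["trace_id"] for e in slow])  # -> ['t1']
```

With nested span formats, the same question requires first un-nesting annotations before any of these predicates can be applied.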

------
dvt
To me, it seems pretty crazy that (what's basically) a chat app needs such
high levels of traceability -- 2Tb of trace data every day? _Why?_ I mean,
people have been running IRC servers in their basements for decades just fine,
but I guess the millions in engineering budgets need to go _somewhere_.

~~~
m1keil
You are comparing two different types of services. While they share
commonalities (passing messages between users), the details are quite
different. For example, you don’t get the history of IRC server for times you
were offline. It’s like comparing between an airbus and a glider, both fly,
right?

