
Jaeger – A Distributed Tracing System - kiyanwang
https://github.com/jaegertracing/jaeger
======
jhinds
Neat! I was looking into tracing solutions for our k8s cluster the other day
and was going to look into setting up Zipkin. Now I'll add this to my list of
tools to evaluate. I found this blog post by Uber informative:
[https://eng.uber.com/distributed-tracing/](https://eng.uber.com/distributed-tracing/)
so maybe there is no need to even set up Zipkin, and I can just start
with Jaeger?

~~~
dankohn1
Most folks will choose either Zipkin or Jaeger, but both are OpenTracing-
compatible distributed tracing systems. You might find the Cloud Native
Landscape useful for thinking about the options:
[https://github.com/cncf/landscape/blob/master/README.md](https://github.com/cncf/landscape/blob/master/README.md)

Disclosure: I’m the executive director of CNCF, which just adopted Jaeger 2
weeks ago, and I’m an author of the landscape.

~~~
yoshuaw
Last I checked "OpenTracing-compatible" only went as far as using common
terminology. Tbh I was a bit disappointed by this; has more been defined
since? E.g. are there now shared schemas, APIs of sorts?

~~~
dankohn1
Yes, OpenTracing is an API, with bindings currently available in 9 languages.
Please take a look.

[http://opentracing.io/](http://opentracing.io/)

~~~
AYBABTME
Only the semantics and libraries implementing them are there. The wire format
is not specified, which is a pretty annoying problem to deal with.

~~~
paulddraper
It's weird for most people. We're used to cross-language wire protocols.
OpenTracing is different.

An analogy is SLF4J for Java logging. All libraries, etc use the same
interface and the final user determines the backend: java.util, Logback. This
makes sense if you have many authors of libraries with a cross-cutting
concern.

This really makes OpenTracing half a dozen different standards, one per
language, with common semantics.
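
The SLF4J-style split between a shared interface and a user-chosen backend can be illustrated with a minimal sketch. This is not the OpenTracing API; `Tracer`, `Span`, `NoopTracer`, `InMemoryTracer` and `library_function` are hypothetical names made up for the example.

```python
from abc import ABC, abstractmethod


class Span:
    """A unit of work; reports itself to its backend's sink when finished."""

    def __init__(self, operation_name: str, sink: list):
        self.operation_name = operation_name
        self._sink = sink

    def finish(self) -> None:
        # Hand the finished span to whatever backend created it.
        self._sink.append(self.operation_name)


class Tracer(ABC):
    """Hypothetical vendor-neutral interface that library authors code against."""

    @abstractmethod
    def start_span(self, operation_name: str) -> Span:
        ...


class NoopTracer(Tracer):
    """Backend that discards everything."""

    def start_span(self, operation_name: str) -> Span:
        return Span(operation_name, sink=[])


class InMemoryTracer(Tracer):
    """Backend that keeps finished spans in a list (a stand-in for a real system)."""

    def __init__(self) -> None:
        self.finished: list = []

    def start_span(self, operation_name: str) -> Span:
        return Span(operation_name, sink=self.finished)


def library_function(tracer: Tracer) -> str:
    # Library code depends only on the interface, never on a concrete backend.
    span = tracer.start_span("library_function")
    result = "done"
    span.finish()
    return result
```

The application, not the library, decides which `Tracer` to pass in, which is why the interface can be standardized per language without fixing a wire format.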

Should it be about a wire protocol instead? Discussion at
[https://github.com/opentracing/specification/issues/34](https://github.com/opentracing/specification/issues/34)

------
secure
I recently used Jaeger to visualize what my distributed system is doing. Got
it done in about 2 hours; pretty pleased with that. Jaeger is visually
pleasing, available via docker and just worked for me.

~~~
jpkrohling_jpk
Would you be interested in writing a guest blog post about it?

------
pat2man
I've been using Jaeger for a few months now and it performs really well. I
even wrote a custom tracer for an unsupported language and it was relatively
straightforward.

There are a few things I wish it did, but they are all on the roadmap:
[http://jaeger.readthedocs.io/en/latest/roadmap/](http://jaeger.readthedocs.io/en/latest/roadmap/)

~~~
jpkrohling_jpk
If you are interested in contributing it to the jaegertracing GitHub
repository, I'm sure the Jaeger community would welcome it :)

------
trampi
Shameless plug:

A friend of mine, Felix Barnsteiner, wrote a profiler for Java-based
applications, called stagemonitor [1]. He started working on it in 2013 for
his master's thesis. Since then, he has steadily worked on it at his company as
well as in his spare time. Some months ago, he implemented support for
distributed tracing. Stagemonitor implements OpenTracing. It also collects
frontend performance data, called end user monitoring. The two get correlated
automatically. And the best thing about it: stagemonitor is free and open
source (APL). Get in touch if you have any questions.

[1]: [http://www.stagemonitor.org/](http://www.stagemonitor.org/)

------
Dowwie
This seems like a horror story about a large company that decided to move
from a monolithic architecture to one of microservices and found that one set
of problems was exchanged for another. Was there a net gain?

------
necubi
I'm disappointed that all of the open-source tracing systems have adopted the
Dapper [0] model. It's understandable why: it's extremely simple to implement,
as it handles scaling challenges by doing client-side sampling.

A bit of background about how Dapper-style distributed tracing works. Things
typically start with an RPC call of some kind (typically from an external
source like a public load balancer). At that point, you must decide whether to
trace this request or not, which is typically done as a random sample (say, 1%
of requests). At that point, the request gets assigned a _trace id_, a random
identifier for that request.

The trace id is stored in some request context and propagated to each
subsequent service. Each service, meanwhile, divides up its request processing
flow into a series of "spans" which represent some piece of computation. For
example, a span might cover an RPC call or a DB query. Spans are identified by a
random _span id_. Once a request has been sampled, all spans for that request
are sent to a central span collector where they're stored for later querying.
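
A minimal sketch of the flow just described, assuming an in-process `Collector` in place of a real network service. All names here (`TraceContext`, `Span`, `Collector`, `start_trace`, `record_span`, `SAMPLE_RATE`) are hypothetical and not taken from Dapper, Zipkin, or Jaeger.

```python
import random
import uuid
from dataclasses import dataclass, field
from typing import List, Optional

SAMPLE_RATE = 0.01  # assumed 1% sampling, as in the example above


@dataclass
class TraceContext:
    trace_id: str   # random identifier for the whole request
    sampled: bool   # decision made once, at the entry point


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str


@dataclass
class Collector:
    """Stand-in for the central span collector."""
    spans: List[Span] = field(default_factory=list)

    def report(self, span: Span) -> None:
        self.spans.append(span)


def start_trace(rng: random.Random, sample_rate: float = SAMPLE_RATE) -> TraceContext:
    # The sampling decision happens here, before anything is known about the request.
    return TraceContext(trace_id=uuid.uuid4().hex, sampled=rng.random() < sample_rate)


def record_span(ctx: TraceContext, name: str, collector: Collector,
                parent_id: Optional[str] = None) -> Optional[str]:
    # Unsampled requests do no reporting work at all.
    if not ctx.sampled:
        return None
    span_id = uuid.uuid4().hex[:16]
    collector.report(Span(ctx.trace_id, span_id, parent_id, name))
    return span_id
```

In a real deployment the `TraceContext` would travel in request headers or RPC metadata so that each downstream service reuses the same trace id and sampling decision.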

This model is simple but very limited. It's often hard to know whether a trace
is interesting at the outset, hence the reliance on random sampling. For
example, you might want to understand why your 99p latency is high, but if
you're just sampling 1%, the requests beyond the 99th percentile that you
actually capture will be only 0.01% of all requests.

More generally, interesting events (like errors or slow requests) tend to be
rare, and sampling a random, small percent of requests is unlikely to turn up
the interesting cases.
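
The arithmetic behind the 1% sampling example can be checked with a short calculation. This assumes the sampling decision is independent of latency; `expected_tail_traces` is a hypothetical helper name.

```python
from fractions import Fraction


def expected_tail_traces(total_requests: int,
                         sample_rate: Fraction,
                         tail_fraction: Fraction) -> Fraction:
    """Expected number of sampled requests that also fall in the slow tail.

    Assumes sampling is random and independent of how slow a request is,
    so the two fractions simply multiply.
    """
    return total_requests * sample_rate * tail_fraction


# 1% sampling, and the slowest 1% of requests (those beyond the 99th percentile).
captured = expected_tail_traces(1_000_000, Fraction(1, 100), Fraction(1, 100))
```

Under these assumptions, out of a million requests only about 100 traces would cover the slow tail.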

A better model, as implemented by LightStep [1] (and an in-progress
distributed tracer I've been working on) is to collect all spans. Even with
very high request volume it's reasonable to store all spans for at least a few
minutes. Doing so opens up all sorts of interesting possibilities, because you
can start tracing a request at any point during that window. For example, you
might want to trace all requests that have errors in them. Or all requests
that take longer than a certain time. Or get a good sample of requests
across different latency buckets. Or requests that violate some application
invariants you've defined.
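
A rough sketch of the "collect everything, decide later" idea, assuming spans are buffered in memory and selected after the fact. It is not LightStep's implementation; `SpanBuffer`, `Span` and `slow_threshold_ms` are hypothetical, and eviction of spans older than the retention window is left out for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Span:
    trace_id: str
    name: str
    duration_ms: float
    error: bool = False


class SpanBuffer:
    """Holds every span for a while; which traces to keep is decided afterwards."""

    def __init__(self, slow_threshold_ms: float):
        self.slow_threshold_ms = slow_threshold_ms
        self._by_trace: Dict[str, List[Span]] = defaultdict(list)

    def add(self, span: Span) -> None:
        # No sampling decision here: every span is buffered.
        self._by_trace[span.trace_id].append(span)

    def interesting_traces(self) -> Dict[str, List[Span]]:
        # Keep whole traces that contain an error or at least one slow span.
        return {
            trace_id: spans
            for trace_id, spans in self._by_trace.items()
            if any(s.error or s.duration_ms > self.slow_threshold_ms for s in spans)
        }
```

Because the selection runs after the spans arrive, the predicate can be anything computable from the buffered data, such as latency buckets or application-defined invariants.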

Ultimately, though, distributed tracing is so helpful for understanding
complex distributed systems and webs of microservices, and it's exciting to
see more open-source competition for Zipkin.

[0] Dapper is Google's distributed tracing system. The paper
([https://research.google.com/pubs/pub36356.html](https://research.google.com/pubs/pub36356.html))
kicked off a lot of interest in distributed tracing in the broader community.
[1] [http://lightstep.com/](http://lightstep.com/)

~~~
the_evacuator
That sounds pretty expensive. With Dapper, the trace span annotations do
nothing if the request isn't being traced. If it is being traced, you might
have significant costs, along the lines of sprintf("%.03f", ...) or other very
CPU-intensive activity. This is OK when you trace one per million, but when you
trace everything you have to think about the cost. This could lead either to
using more CPU than you really wanted or to discouraging trace
annotations. Either would be bad.
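
One way to picture the cost argument is an annotation call that defers formatting until it is known that the request is traced. This is only an illustrative sketch; the `Span` class and its `annotate` method are hypothetical and do not describe Dapper's actual API.

```python
from typing import Callable, List


class Span:
    def __init__(self, sampled: bool):
        self.sampled = sampled
        self.annotations: List[str] = []

    def annotate(self, make_message: Callable[[], str]) -> None:
        # make_message is a zero-argument callable, so the (possibly expensive)
        # string formatting only runs when the request is actually traced.
        if self.sampled:
            self.annotations.append(make_message())
```

With a call such as `span.annotate(lambda: "%.03f" % value)`, an untraced request pays only for the `if` check, whereas tracing everything means paying the formatting cost on every request.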

~~~
necubi
Yeah, that's definitely an upside of the Dapper approach--very, very minimal
overhead for non-traced requests. However, for the vast majority of use cases a
bit of overhead (microseconds per span) tends to be unnoticeable, and the
benefits in terms of introspectability are huge. In general, the overhead is
mitigated by the fact that spans tend to be pretty large (on the ms-scale).

~~~
ehsankia
I guess it depends on your use case. If you are indeed interested in the 99p
latency, then yes, the only way would be to trace everything. But in that
case, couldn't you temporarily set the Dapper probability to 100%, record the
data you need, and then turn it back down? That seems a lot more malleable
for different use cases.

------
laichzeit0
How is this different from e.g. AppDynamics or Dynatrace? I mean
technically speaking, leaving aside that they are commercial and not free.

From the documentation it doesn't look much different, except that probably
the biggest downside is that you have to add your instrumentation points
manually, i.e. a code change is required?

~~~
jpkrohling_jpk
Jaeger is just the concrete implementation. OpenTracing is what you should
think about using when writing your application. That said, you could benefit
from automatic instrumentation via OpenTracing framework integration libraries
(JAX-RS, Servlets, ...). Or use the Java Agent. But in the end, I think there
_is_ value in manually instrumenting your code, especially around your business
transactions. This way, you can monitor your business metrics, instead of
"http requests" or "cpu load".

------
trimbo
Is there a pros/cons comparison to Zipkin? Browsed the docs and couldn't find
anything.

~~~
pritianka
100x better UI among other things. Jaeger has a dedicated team and resources
and that shows. They are taking the time to write good docs, build demos,
create a thoughtful UI, etc. I definitely recommend it. Being OpenTracing
native also helps. Full disclosure, I work on the OpenTracing project.

------
marknadal
Does this run in production or is it used for testing? We wrote a distributed
testing system ( [https://github.com/gundb/panic-server](https://github.com/gundb/panic-server) ), so I'm trying to understand
if integrating Jaeger would be helpful. If Jaeger is run on a production
stack, I'd be curious to understand how that works (are there any tech talks
on it yet?). If it is designed to run for tests, that makes sense, but then
does it depend upon another distributed testing tool? If so, I'd love to see
that tool. Glad to see it be Open Sourced!!!

~~~
javajumbo
Yes, Jaeger is used in production at Uber and many other companies. Our
original blog post on the history of Jaeger:
[https://eng.uber.com/distributed-tracing/](https://eng.uber.com/distributed-tracing/)

------
saadazzz
What's the performance hit for using Jaeger in production? Are there
benchmarks?

------
bronz
what is tracing?

~~~
manigandham
In this context, it's the network-wide version of a stack trace.

Instead of just method calls in a process, you take a top-level (HTTP) request
and see all the various upstream service calls and their internal logic
involved in completing that request. Useful for micro/multi-service based
architectures but you can trace anything because it's a standard output
format.
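
The "network-wide stack trace" picture can be made concrete by rendering a set of spans as an indented call tree. This is a toy sketch; the `(span_id, parent_id, name)` tuple layout and the `render_trace` function are assumptions for illustration, not any tracer's real data format.

```python
from typing import Dict, List, Optional, Tuple

# Each span is (span_id, parent_id, name); the root span has parent_id None.
SpanRecord = Tuple[str, Optional[str], str]


def render_trace(spans: List[SpanRecord]) -> List[str]:
    """Return one line per span, indented by its depth in the call tree."""
    children: Dict[Optional[str], List[Tuple[str, str]]] = {}
    for span_id, parent_id, name in spans:
        children.setdefault(parent_id, []).append((span_id, name))

    lines: List[str] = []

    def walk(parent: Optional[str], depth: int) -> None:
        for span_id, name in children.get(parent, []):
            lines.append("  " * depth + name)
            walk(span_id, depth + 1)

    walk(None, 0)
    return lines


example = [
    ("1", None, "GET /checkout"),
    ("2", "1", "auth-service"),
    ("3", "1", "cart-service"),
    ("4", "3", "db query"),
]
```

Calling `render_trace(example)` yields the top-level request followed by the service calls nested beneath it, much like frames in a stack trace.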

~~~
pritianka
I have to say, you described tracing really well! I work on the OpenTracing
project [1] (and at LightStep) and have been describing tracing to people for
a while and your explanation is way better haha :-) [1] opentracing.io

------
kuwze
Why would you use this over a logging framework?

------
khc
Is it something similar to sysdig?

~~~
pritianka
Sysdig is focused on the networking layer. This is for the application layer
(L7).

~~~
apurvadave
Actually sysdig can do application, container, host, and network. Most sysdig
use cases tend to be more focused on the software running on a node vs the
network itself.

------
konschubert
Slightly off topic: It is really interesting how German names have become en
vogue in recent years.

------
shevy
I don't like the name.

For those who don't know German, Jaeger is "Jäger", which means hunter/ranger. A
somewhat neutral term in itself, but there is also a slight, somewhat remote
connection to some part of history ("Jagdstaffel" and whatnot). I
have absolutely no idea if this has anything to do with it, mind you - but
since the main authors appear to be in the USA, I find that very awkward. Why
not just stick to some English name? That would seem a much better fit. Or
perhaps they think German names are awesome ... it's also weird when you see
all the people write Jaeger rather than Jäger...

~~~
jjjensen90
This is a pretty silly objection.

There are plenty of software projects named from a variety of languages not
native to the creator of the package. And Jäger has so many neutral meanings
that, even for a German, it wouldn't first bring to mind the Jagdstaffel, I
don't think...

Also writing the ä as ae is very common for those who have keyboards without
umlaut characters... I've seen it lots of times and it doesn't look too weird.

~~~
dsnuh
Absolutely. "Hunter" seems like a perfectly acceptable name for something that
collects data, and I can almost guarantee that the vast majority of Americans
have never even heard of the term. Most Americans in my experience have
little to no knowledge of WWI, let alone the Jagdstaffel. To object to the name
on such grounds seems pedantic in the extreme to me.

Edit: typo

~~~
KGIII
They may know it from Jägermeister and Jäger bombs.

We may not know our history, but we do like drinking.

