
A simple way to get more value from tracing - zdw
https://danluu.com/tracing-analytics/
======
TheColorYellow
At a recent start-up-like situation, we paid licensing fees to a monitoring
and metrics provider that gave us out-of-the-box tracing infrastructure.

It turned out that the licensing was quite expensive for what we were
getting (we couldn't even get outlier traces), we didn't have much control
over our infrastructure (we couldn't set the sampling rate in a way that
kept all the traces we needed), and ultimately the learning curve for making
the infrastructure scalable was a challenge as well.

Even after setting up tracing in our entire system successfully, I ultimately
found it of minimal value and we ended up relying on ELK and log aggregation
for more insight.

Tbh, it was a disappointing experience, and reading in this post how much
custom dev was required to make tracing work for you guys only reinforces
that. Given the kind of work I do (consulting), I doubt tracing will ever
prove more valuable than spending money on log aggregation and other
monitoring efforts and tools.

~~~
matsemann
A quick win I've seen, and have myself implemented as a consultant, is the
use of a correlationId in some form. If an incoming request has a
correlation-id header, use it; if not, generate a unique one. When calling
other services, make sure to include the current correlationId, so it
spreads through the services for a given request.

Most logging frameworks allow you to add something to every log message, so
add the correlationId there. Then it's easy to track between multiple
services. It also makes it easy to distinguish which log messages belong to a
given request for a single service, so even without lots of services it can be
a nice little feature.
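
A minimal sketch of the pattern in Python (Flask and requests are purely
illustrative here; any framework with request hooks and a logging filter
works the same way):

    import logging
    import uuid

    import flask
    import requests

    HEADER = "X-Correlation-Id"
    app = flask.Flask(__name__)

    class CorrelationFilter(logging.Filter):
        # Stamp every log record with the current request's correlation id.
        def filter(self, record):
            record.correlation_id = (
                getattr(flask.g, "correlation_id", "-")
                if flask.has_request_context() else "-")
            return True

    handler = logging.StreamHandler()
    handler.addFilter(CorrelationFilter())
    handler.setFormatter(
        logging.Formatter("%(asctime)s cid=%(correlation_id)s %(message)s"))
    logging.getLogger().addHandler(handler)

    @app.before_request
    def read_or_create_correlation_id():
        # Reuse the incoming id if present, otherwise generate a unique one.
        flask.g.correlation_id = (
            flask.request.headers.get(HEADER) or str(uuid.uuid4()))

    def call_downstream(url):
        # Propagate the id on outgoing calls so it spreads through services.
        return requests.get(url, headers={HEADER: flask.g.correlation_id})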

~~~
coredog64
It’s easy for this to fall over. We were doing this where the entry point
would make hundreds or even thousands of calls. Unfortunately, I couldn’t
convince anyone that you needed spans and not just traces.
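
The difference, roughly: with a correlation id alone, a fan-out of thousands
of calls is just thousands of log lines sharing one id, while spans give
each call its own id plus a parent pointer, which is what lets you rebuild
the call tree. A toy sketch (field names illustrative):

    import uuid

    def new_span(trace_id, parent_span_id=None):
        return {
            "trace_id": trace_id,         # shared across the whole request
            "span_id": uuid.uuid4().hex,  # unique to this one call
            "parent_id": parent_span_id,  # links the call tree together
        }

    root = new_span(trace_id=uuid.uuid4().hex)
    # The entry point's thousands of calls each get their own span pointing
    # back at the root, instead of collapsing into one flat id.
    calls = [new_span(root["trace_id"], root["span_id"]) for _ in range(1000)]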

------
amelius
What kind of "tracing" is this about?

~~~
ponker
I think... the kind where you assemble a beginning-to-end view of a request
and all the microservices it hits by crawling through log files.

~~~
buboard
first web world problems

~~~
PudgePacket
?

~~~
twic
I interpret this as meaning that this is only a challenge for the kind of
over-scale systems that only a handful of companies have.

You could be a pretty significant online retailer (Asos, say), run everything
on a few instances of a monolith, and have all your information in easily
interpretable log files, without needing to reach for distributed tracing
tools.

~~~
disgruntledphd2
Yeah, this tends to be more of an issue for consumer-grade internet services
(where the majority of users are not customers).

Again, these tools are _incredibly_ useful when you have a distributed system,
and kinda pointless otherwise.

------
omeze
> Taken together, the issues were problematic enough that tracing was
> underowned and arguably unowned for years. Some individuals did work in
> their spare time to keep the lights on or improve things, but the lack of
> obvious value from tracing led to a vicious cycle where the high barrier to
> getting value out of tracing made it hard to fund organizationally, which
> made it hard to make tracing more usable.

Omg, this is 100% the situation at $DAYJOB. I actually think this is generally
true for any new infrastructure that devs are expected to interface with. E.g.
for metrics, most engineers I work with also don't really understand
Prometheus, despite it being incredibly useful with a robust query language,
but looking at Grafana graphs gives them a good enough picture. The big
"wtf" moment with tracing is always the sampling; it's really hard to build
a good sampling system that collects meaningful traces ahead of time, much
less to explain it. The article's "problems" list rings really true.
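
For a sense of why it's hard: even the simplest scheme, head-based
consistent sampling, has to decide whether to keep a trace before knowing
whether it will be interesting. A toy sketch (rate and hashing scheme are
illustrative):

    import hashlib

    SAMPLE_RATE = 0.01  # keep 1% of traces

    def should_sample(trace_id: str) -> bool:
        # Hash the trace id so every service makes the same keep/drop
        # decision. The call happens at the entry point, before the trace
        # exists, which is exactly why rare-but-interesting traces get lost.
        digest = hashlib.sha256(trace_id.encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64 < SAMPLE_RATE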

For my story: we tried to roll our own naive system backed by ELK that
piggybacked off our logging stack, which largely failed due to time/resource
constraints (kibana and elasticsearch actually worked fairly well for some
types of aggregations, but some really useful visualizations like a service
graph aren't possible). I agree with the post that building a good
distributed tracing system from scratch isn't trivial, though it's also not
very difficult given the tools that exist today. Also, the article mentions
things like clock skew not mattering in practice, which, if you were to
approach this from a "blank slate", would definitely be a bit
counterintuitive.

Anyway, due to resourcing constraints (i.e. we had no observability team, and
even if we did they wouldn't have worked on this particular problem 1+ years
ago), we use a vendor (Lightstep), which is generally solid & requires very
little maintenance, but there are still issues where simple conceptual
queries like "show me traces that had this tag" aren't possible. E.g. they
don't support things like querying `anno_index` from the article for
historical data; you can only query in-memory data (so the last ~10-20 mins,
helpful for immediate oncall scenarios but not much else).

The really interesting thing about distributed tracing is that it's still
in its infancy in terms of potential applications. The article focuses on
performance/oncall scenarios, but we:

\- Built a post-CI flow that captures integration test failures and links
developers to a snapshot, so they can see the exact service where their test
failed - no digging through logs.

\- Can see traces during local development.

\- In OpenTracing, there's the pattern of injecting untyped "baggage items"
that make it through to downstream systems. We integrated this into our
_logging_ clients, so you get free metadata from upstream services in logs
without having to redeploy N microservices (this lets us get around some of
the query limitations in Lightstep; see the sketch after this list).

\- I'm also shooting around ideas to leverage baggage items to inject
sandbox data for writing tests against polyglot, manyrepo services. This
lets test data live next to test driver code (e.g. a Go service calls into a
Python service & we want to mock the Python service's test data but don't
want to have to update + merge a change to that repo).
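
A rough sketch of the baggage idea with the opentracing-python API (the
logging hook is our own convention, not anything standard):

    import opentracing

    def tag_request(owner):
        # An upstream service attaches metadata once; baggage items
        # propagate to every downstream span automatically.
        span = opentracing.tracer.active_span
        if span is not None:
            span.set_baggage_item("request_owner", owner)

    def log_with_baggage(logger, msg):
        # The logging client mixes the active span's baggage into each
        # record, so downstream services see the metadata with no redeploy.
        span = opentracing.tracer.active_span
        baggage = dict(span.context.baggage) if span else {}
        logger.info(msg, extra={"baggage": baggage})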

Kudos to Twitter for a great tech + organizational initiative.

------
tlarkworthy
Tracing is a cross-cutting concern, and therefore does not belong in
application binaries but in sidecars or reverse-proxy intermediaries, where
it can be written once and applied to everything. If you don't have the need
for these complex deployments, then don't bother with tracing either.

~~~
pluies
While I 100% agree that doing most of the tracing at the sidecar level is
the right call, you'll still need "some" application-level awareness of
tracing (e.g. passing along b3 headers:
[https://github.com/openzipkin/b3-propagation](https://github.com/openzipkin/b3-propagation)
), or you won't be able to match the various spans to a common trace and
you'll lose a _lot_ of value from your distributed tracing.
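
Concretely, the application-level part can be as small as copying the b3
headers from the incoming request onto outgoing calls (Flask/requests and
the downstream URL are purely illustrative):

    import flask
    import requests

    # Header names from openzipkin/b3-propagation.
    B3_HEADERS = ("X-B3-TraceId", "X-B3-SpanId", "X-B3-ParentSpanId",
                  "X-B3-Sampled", "X-B3-Flags")

    app = flask.Flask(__name__)

    @app.route("/thing")
    def thing():
        # Copy the incoming trace context onto the outgoing call so the
        # sidecars can stitch both hops into a single trace.
        fwd = {h: flask.request.headers[h]
               for h in B3_HEADERS if h in flask.request.headers}
        return requests.get("http://downstream.example/api",
                            headers=fwd).text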

~~~
tlarkworthy
Good point, you do need to pass the correlation key, but the sidecar is the
one talking to the tracing infra.

------
foreigner
Based on the headline and current events I assumed this article was going to
be about "Contact Tracing" for the pandemic.

------
user5994461
Centralized logging is much more fundamental. If you're dealing with HTTP
services, centralized logging will show you the status code, path, response
time, etc. for every request, with ample support for aggregations.

Teaching developers to effectively use Kibana/Splunk/Graylog would provide
much more benefits than investigating distributed tracing solutions.

If you're dealing with non-HTTP services and want to investigate
performance, what you most likely need is a profiler. Java has a fantastic
one out of the box (jvisualvm) that can even attach remotely to a live
production process; C and C++ have some CPU profilers, but they're expensive
(e.g. the one in Visual Studio Pro/Ultimate); Python has some tools too, but
I don't recall which one was good.

Of course, none of these tools are trivial to understand for the
uninitiated. Show a developer stack traces and timing information and they
have no idea what they mean, let alone how they're supposed to improve
performance from that. What could improve adoption greatly is to give
training and record a handful of video tutorials.
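
For Python, the stdlib cProfile module is one reasonable default (toy
example):

    import cProfile
    import pstats

    def hot_function():
        return sum(i * i for i in range(10**6))

    # Profile the call, dump stats to a file, and print the ten functions
    # with the highest cumulative time.
    cProfile.run("hot_function()", "profile.out")
    pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)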

~~~
whatshisface
Linux has "perf."
[https://perf.wiki.kernel.org/index.php/Main_Page](https://perf.wiki.kernel.org/index.php/Main_Page)

It's easy to use and very powerful. I've used it with C and Rust and had great
success.

