
Instrumentation: The First Four Things You Measure - cyen
https://honeycomb.io/blog/2017/01/instrumentation-the-first-four-things-you-measure/
======
ackerman80
Came across this, which gives good insight into the 4 golden signals for top-
level health tracking: [https://blog.netsil.com/the-4-golden-signals-of-api-
health-a...](https://blog.netsil.com/the-4-golden-signals-of-api-health-and-
performance-in-cloud-native-applications-a6e87526e74#.uzu89hl16)

One thing of note in the graph is the tracking of response size. This would be
very useful for catching 200 responses with "Error" in the body, since the
response size would drop drastically below that of a normal successful
payload.

In addition to Latency, Error Rates, Throughput, and Saturation, folks like
Brendan Gregg @ Netflix have recommended tracking capacity.
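
A minimal sketch of what tracking response size could look like, using the
Python prometheus_client (the metric name and buckets here are illustrative,
not from the article):

    # pip install prometheus_client
    from prometheus_client import Histogram

    # Buckets in bytes; tune these to your typical payload sizes.
    RESPONSE_SIZE = Histogram(
        'http_response_size_bytes',
        'Size of HTTP response bodies in bytes',
        ['status'],
        buckets=(64, 256, 1024, 4096, 16384, 65536, 262144))

    def record_response(status_code, body):
        # A 200 whose body is a tiny error message lands in the smallest
        # buckets, well below normal success payloads.
        RESPONSE_SIZE.labels(status=str(status_code)).observe(len(body))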

~~~
saravana87
TCP retransmission rate looks like a useful metric that can help in
monitoring the health of a service. One way to obtain it is by analyzing
service interactions, as mentioned in the blog. Tracing could be another way
to find that info. I am curious how code-instrumented monitoring solutions
get that information. (PS: I work for Netsil)

~~~
bbrazil
By default you can only get that per-kernel, from the RetransSegs counter in
/proc/net/snmp. BPF may allow something more granular.

The other way of approaching it is to look for the additional latency it
causes, which you can spot on a per-service basis.
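
For reference, a rough sketch of reading the kernel-wide counter (this
assumes Linux and the RetransSegs field in /proc/net/snmp):

    # /proc/net/snmp has a "Tcp:" header line followed by a value line.
    def tcp_retrans_segs(path='/proc/net/snmp'):
        with open(path) as f:
            tcp = [line.split() for line in f if line.startswith('Tcp:')]
        header, values = tcp
        return int(values[header.index('RetransSegs')])

    # Sample twice and diff to turn the counter into a rate, e.g.:
    #   before = tcp_retrans_segs(); time.sleep(10)
    #   per_second = (tcp_retrans_segs() - before) / 10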

~~~
saravana87
Additional latency could be an indicator, but there's no guarantee that it's
due to retransmissions?

~~~
bbrazil
If you look at your latency histogram and are seeing a bump at around 200ms
above normal (which was the Linux default minimum retransmission timeout a
few years back, anyway), it's probably retransmits.

~~~
saravana87
Got it.

------
bbrazil
> A histogram of the duration it took to serve a response to a request, also
> labelled by successes or errors.

I recommend against this; rather, have one overall duration metric and a
separate metric tracking a count of failures.

The reason for this is that very often just the success latency will end up
being graphed, and high overall latency due to timing-out failed requests will
be missed.
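
To make that concrete, a minimal sketch with the Python prometheus_client
(the metric names and the do_work handler are illustrative):

    from prometheus_client import Counter, Histogram

    REQUEST_DURATION = Histogram(
        'http_request_duration_seconds',
        'Time spent serving requests, successes and failures alike')
    REQUEST_FAILURES = Counter(
        'http_request_failures_total',
        'Number of failed requests')

    def handle(request):
        with REQUEST_DURATION.time():    # one overall latency metric
            try:
                return do_work(request)
            except Exception:
                REQUEST_FAILURES.inc()   # failures counted separately
                raise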

The more information you put on a dashboard, the more chance someone will miss
a subtlety like this in the interpretation. Particularly if debugging
distributed systems isn't their forte, or they've been woken up in the middle
of the night by a page.

This guide only covers what I'd consider online serving systems, I'd suggest a
look at the Prometheus instrumentation guidelines on what sort of things to
monitor for other types of systems:
[https://prometheus.io/docs/practices/instrumentation/](https://prometheus.io/docs/practices/instrumentation/)

~~~
maplebed
By creating events that contain both the duration of the request and whether
it succeeded, you can create graphs that show you the detail you need. Unless
you include those data together at the beginning, it will be impossible to
tease them apart later on. Combining them into one graph will likely conceal
the difference between the two cases, as you describe, unless you feed them
into a system that can natively tease them apart as easily as show them
together (such as [http://honeycomb.io](http://honeycomb.io)). So it seems
like the disagreement is more about visualization than collection (the
section of the blog in which that quote appears).

The originally quoted advice, to show "the duration it took to serve a
response to a request, also labelled by successes or errors" remains good
advice, so long as the visualization of that data makes clear the separation.
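
As an illustration (a generic structured event, not any particular SDK's
API), capturing both fields on every request keeps them separable later:

    import json, time

    def handle_with_event(sink, handler, request):
        start = time.time()
        ok = True
        try:
            return handler(request)
        except Exception:
            ok = False
            raise
        finally:
            # One wide event per request: duration and outcome recorded
            # together, so either can be grouped or filtered by the other.
            sink.write(json.dumps({
                'duration_ms': (time.time() - start) * 1000,
                'success': ok,
            }) + '\n')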

I absolutely agree that careful consideration is required when choosing what
to put on dashboards to avoid confusion. That seems to be a separate issue.

(bias alert - I work on Honeycomb, and care deeply about collecting data in a
way that lets you pull away the irrelevant data to illuminate the real
problems.)

~~~
hibikir
In practice, if your system is complicated and you have to look at the
visualization, you are already in trouble. For anything complicated, you need
exactly the inputs you describe, but everything has to first be processed by
another layer that can give you higher-level ideas.

This is a place where I think you guys could beat what other third-party
monitoring tools are doing. I work with some of your guest bloggers, and I
work on a subsystem with its own dashboard: about 50 charts. To make
onboarding new teammates a sensible experience, we need both a layer of
alerts on top of the charts, and then a set of rules of thumb (which would be
programmed in, if the alerting system were good enough) that put the alerts
together into realistic failure cases: if X and Y triggered but Z didn't,
then chances are this piece is the culprit.
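
A toy sketch of the kind of rule I mean (the alert names are made up):

    # Encode "if X and Y fired but Z didn't, suspect this piece" as data.
    RULES = [
        ({'db_latency_high', 'queue_depth_high'},   # must be firing
         {'host_cpu_high'},                         # must not be firing
         'likely a slow query, not host saturation'),
    ]

    def diagnose(firing):
        for must_fire, must_not, verdict in RULES:
            if must_fire <= firing and not (must_not & firing):
                return verdict
        return 'no matching failure case'

    print(diagnose({'db_latency_high', 'queue_depth_high'}))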

There are also opportunities in visualizations that aren't chart-based. We
used to have something like that for another complex system at a previous
employer, but that's expensive, custom work, unless you join forces with
something that understands where all your services are, knows all ingress and
egress rules, and thus could automatically generate a picture of your system
along with understanding the instrumentation. So leave that until you merge
with SkylinerHQ or something.

That said, I think you guys are heading towards a good, marketable product as
it is. Fixing the annoying statsd/Splunk divide of older monitoring would
probably make us buy it already.

------
anw
While this is good advice, I feel it is a bit oversimplified.

Counting incoming and outgoing requests misses a lot of potential data points
when determining "is this my fault?"

I work mainly in system integrations. If I only check the input:output
ratio, then I may miss that some service providers return a 200 with a body
of "<message>Error</message>".

A better approach is to make sure your systems know how data comes back from
downstream services, and to have a universal way of translating that feedback
into a format your own service understands.

HTTP codes are (pretty much) universal. But let's say you forgot to include a
header, forgot to base64-encode login details, or are simply using the wrong
value for an API key. If your system knows that "this XML element means Y for
provider X, and means Z in our own system", then you can better gauge issues
as they come up, instead of waiting for customers to complain. This is also
where tools like Splunk are handy, so you can be alerted to these kinds of
errors as they come up.
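
A sketch of that translation layer, using the "<message>Error</message>"
example from above (the provider name and rule are hypothetical):

    import xml.etree.ElementTree as ET

    # Per-provider rules for recognizing an error hidden inside a 200.
    PROVIDER_ERROR_RULES = {
        'provider_x': lambda body: ET.fromstring(body).text == 'Error',
    }

    def normalize(provider, status_code, body):
        # Translate provider-specific feedback into our own vocabulary.
        if status_code != 200:
            return 'error'
        check = PROVIDER_ERROR_RULES.get(provider)
        if check and check(body):
            return 'error'   # a 200 that is really a failure
        return 'ok'

    print(normalize('provider_x', 200, '<message>Error</message>'))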

~~~
hamandcheese
The article never defines what an error is, so I think it is very reasonable
to take your approach. I think you're mistaking very abstract advice for very
simple advice :)

------
kasey_junk
> A histogram of the duration it took to serve a response to a request, also
> labelled by successes or errors.

This is so much easier said than done. Most time-series DBs that people use
for instrumentation simply cannot handle histogram data correctly. They make
incorrect assumptions about how roll-ups can happen, or they require you to
be specific about resolution requirements before you can know them well.

Histogram data also tends to be very expensive to query, so it bogs down,
preventing you from making the kinds of queries that are really valuable for
diagnosing performance regressions.

Finally, visualization of histograms is really difficult because you need a
third dimension to see them over time. Heat maps accomplish this but can be
hard to read, and most dashboard systems don't have great visualization
options for "show this time period next to this time period", which is an
incredibly common requirement when comparing latency histograms.
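
One way to see the roll-up problem: fixed-boundary bucket counts merge by
simple addition, whereas pre-computed percentiles from two windows cannot be
combined meaningfully afterwards:

    # Two 1-minute windows of latencies as fixed-boundary bucket counts
    # (upper bound in seconds -> count). Merging is element-wise addition.
    a = {0.1: 90, 0.5: 8, 2.0: 2}
    b = {0.1: 10, 0.5: 30, 2.0: 60}
    merged = {le: a[le] + b[le] for le in a}

    # By contrast, averaging each window's p99 (a roll-up some TSDBs do)
    # yields a number with no statistical meaning for the combined data.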

~~~
maplebed
Yup! It's hard! All the things you point out are right on.

We don't have the visualizations for histograms yet (though you can chart
specific percentiles), but for the reasons you mention, Honeycomb is
perfectly suited to give you that kind of data. I can't say we'll get that
out the door soon, but it's one of my most-wanted pet features, so as soon as
I can convince myself it's actually more important than the mountain of other
things that need to get done, you'll get your histograms and your
time-over-time comparisons.

I've been advocating for a heat map style presentation of histograms for a
long time, but I hadn't considered the difficulty that creates when trying to
show time over time. That's an interesting one to noodle on.

Thanks for articulating well the value and reasons for difficulty in
implementing histograms!

(bias alert - I work on Honeycomb)

------
sambe
The whole series:

[https://honeycomb.io/blog/categories/instrumentation/](https://honeycomb.io/blog/categories/instrumentation/)

------
techbio
Author appears to use "downstream" and "upstream" to refer to "further down
the stack" and "further up the stack".

Is this normal usage? Seems reversed to me.

~~~
dmichulke
It depends on whether you look at control flow (who calls whom) or data flow.

~~~
jholman
This is the issue, yeah.

To rephrase that, "upstream" means "where events come from".

------
jdormit
Is the last paragraph a joke? If so, could someone explain it?

~~~
LeifCarrotson
It trips my sarcasm detector, but it's also exactly what I'd expect business-
speak to say if they did cut a series short due to lack of money.

------
greenleafjacob
Saturation and utilization are different things. For CPU time, utilization
would be how many cycles were spent running user tasks over total cycles;
saturation would be how much time (cycles?) was spent in the run queue. For
disks, utilization could be IOPS; saturation is time spent in I/O wait, or
queue sizes. For a network interface, utilization could be Gbps; saturation
is total time spent waiting to write to the sendq.
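
A rough sketch of the CPU side of that distinction on Linux (field layout
assumes the usual /proc/stat format; counters are cumulative since boot, so
sample twice and diff for a current reading):

    import os

    def cpu_signals():
        utilization = saturation = None
        with open('/proc/stat') as f:
            for line in f:
                if line.startswith('cpu '):
                    # cpu  user nice system idle iowait irq softirq ...
                    v = [int(x) for x in line.split()[1:]]
                    utilization = 1 - v[3] / sum(v)  # non-idle share
                elif line.startswith('procs_running'):
                    # Runnable tasks beyond available CPUs are queued.
                    saturation = max(0, int(line.split()[1]) - os.cpu_count())
        return utilization, saturation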

------
siliconc0w
Every request to and from the app should be instrumented. Paying attention to
the requests to the app is a good start, but you really need detailed
instrumentation of all downstream dependencies your service uses to process
its requests in order to understand where the issue is. It's often likely
you're slow or throwing errors because a dependency you use is slow or
throwing errors. Or maybe the upstream service complaining has changed its
request pattern and is making more expensive queries. A small minority of
requests is often responsible for most of the performance issues, so even if
the overall volume hasn't changed, the composition and type of requests
matter as well.
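
A sketch of per-dependency instrumentation (again using the Python
prometheus_client; metric and dependency names are illustrative):

    import time
    from prometheus_client import Counter, Histogram

    DEP_DURATION = Histogram('dependency_duration_seconds',
                             'Latency of downstream calls', ['dependency'])
    DEP_ERRORS = Counter('dependency_errors_total',
                         'Failures of downstream calls', ['dependency'])

    def call_dependency(name, fn, *args, **kwargs):
        # Wrap every outbound call so slowness or errors can be attributed
        # to a specific dependency rather than just "the app".
        start = time.time()
        try:
            return fn(*args, **kwargs)
        except Exception:
            DEP_ERRORS.labels(dependency=name).inc()
            raise
        finally:
            DEP_DURATION.labels(dependency=name).observe(time.time() - start)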

------
vaishaksuresh
Off Topic: Does anyone know what tools the author uses to make the diagrams?

~~~
copyconstruct
Paper by 53, Sketches, Sketchbook, or any of the several sketching apps
available for the iPad would be my guess.

~~~
ChristianGeek
I assumed it was whiteboard photos with hand-drawn Photoshop masks.

------
grandalf
This is a very useful, common-sense post. I've created exactly that sort of
thing using Redis. The expiry mechanism, combined with formatting date
strings to create time buckets for keys, allows quite a bit of power and is
simple to write.
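
For instance, a minimal version of that pattern with redis-py (the key
format and TTL are illustrative):

    import time
    import redis

    r = redis.Redis()

    def record_request(status):
        # Format the timestamp into a per-minute bucket key; expiry cleans
        # up old buckets automatically after 24 hours.
        bucket = time.strftime('%Y-%m-%dT%H:%M', time.gmtime())
        key = f'requests:{status}:{bucket}'
        r.incr(key)
        r.expire(key, 24 * 3600)

    record_request(200)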

