
Monitoring demystified: A guide for logging, tracing, metrics - malechimp
https://techbeacon.com/enterprise-it/monitoring-demystified-guide-logging-tracing-metrics
======
buro9
A lot of excellent information in that blog post and linked from it... but if
you're wondering where to start:

1. Write good logs... not too noisy when everything is running well, meaningful enough to let you know the key state or branch of code when things deviate from the good path. Don't worry about structured vs. unstructured too much; just ensure you include a timestamp, file, log level, func name (or line number), and that the message will help you debug.

2. Instrument metrics using Prometheus; there are libraries that make this easy:
[https://prometheus.io/docs/instrumenting/clientlibs/](https://prometheus.io/docs/instrumenting/clientlibs/)
. Counts get you started, but you probably want to think in aggregations and to ask about the rate of things and about percentiles. Use histograms for this:
[https://prometheus.io/docs/practices/histograms/](https://prometheus.io/docs/practices/histograms/)
. Use labels to create a more complex picture, e.g. a histogram of HTTP request times with a label for the HTTP method means you can see all requests, just the POSTs, or perhaps HEAD and GET together, etc... and then create rates over time, percentiles, and so on. Do think about the cardinality of label values: HTTP method is good, but request identifiers are bad in high-traffic environments... labels should group, not identify. (A short sketch of both points follows this list.)
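
Not the only way to do it, but a rough sketch of both points in Python with the official prometheus_client library (the log format, metric name, and port are placeholders, not prescriptions):

```python
import logging
import time

from prometheus_client import Histogram, start_http_server

# 1. A log format carrying timestamp, level, file, func name and line number.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(filename)s:%(funcName)s:%(lineno)d %(message)s",
)

# 2. A histogram of HTTP request durations, labelled by method.
#    Method is a good label: a handful of values that group, not identify.
REQUEST_SECONDS = Histogram(
    "http_request_duration_seconds",
    "Time spent handling HTTP requests",
    ["method"],
)

def handle(method: str) -> None:
    start = time.monotonic()
    # ... actually handle the request here ...
    REQUEST_SECONDS.labels(method=method).observe(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    handle("GET")
```

From there, a PromQL query such as histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) gives you a rolling p95 across all methods.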

Start with those things. Tracing follows good logging and metrics, as it takes a little more effort to instrument an entire system, whereas logging and metrics are valuable even when only small parts of a system are instrumented.

Once you've instrumented... Grafana Cloud offers a hosted Grafana, Prometheus metrics scraping and storage, and log tailing and storage (via Loki)
[https://grafana.com/products/cloud/](https://grafana.com/products/cloud/) so you can see the results of your work immediately.

If it's a big project you have a lot of options, and I assume you know them already; this is when you start looking at Cortex and Thanos, Datadog and Loki, and tracing with Jaeger.

~~~
dkersten
I’d add to no. 1: include a correlation ID or request ID so you have a way to filter the logs into a single linear stream related to the same action.
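
A minimal sketch of the idea with Python's standard logging module, using contextvars to carry a per-request ID (all names here are illustrative):

```python
import contextvars
import logging
import uuid

# Holds the correlation ID for the current request/task.
request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Stamp every record with the current correlation ID."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s [%(request_id)s] %(message)s",
)
log = logging.getLogger(__name__)
log.addFilter(RequestIdFilter())  # records via this logger get the field

def handle_request() -> None:
    request_id_var.set(uuid.uuid4().hex)  # fresh ID at the request boundary
    log.info("started handling")
    log.info("finished handling")  # grep for the ID to get the linear stream

handle_request()
```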

~~~
jrott
Absolutely. Constantly being able to get a single linear stream is when tracing becomes super powerful.

------
KaiserPro
A few things I have learnt along the way:

Logs are great, but only once you've identified the problem. If you are searching through logs to _find_ a problem, it's far too late.

Processing/streaming logs to get metrics is a terrible waste of time, energy and money. Spend that effort producing high-quality metrics directly from the apps you are looking after/writing/decomming (example: don't use access logs to collect 4xx/5xx counts and make a graph; collate and push the metrics directly).
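
A sketch of what "push the metrics directly" might look like in Python with prometheus_client, counting responses by status class instead of grepping access logs (names are illustrative):

```python
from prometheus_client import Counter

# Count responses by status class (2xx/4xx/5xx): cheap, direct, no log parsing.
RESPONSES = Counter(
    "http_responses_total",
    "HTTP responses by status class",
    ["status_class"],
)

def record_response(status_code: int) -> None:
    RESPONSES.labels(status_class=f"{status_code // 100}xx").inc()

# e.g. in your request handler, after writing the response:
record_response(502)  # increments the "5xx" series
```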

Raw metrics are pretty useless. They need to be manipulated into business goals: "service x is producing 3% 5xx errors" vs. "% of visitors unable to perform action x".

Alerts must be actionable.

Alert rules must be based on sensible, clear-cut conditions: "service x's response time is breaching its SLA", not "service x's response time is double its average for this time in May".

~~~
twic
> Processing/streaming logs to get metrics is a terrible waste of time, energy
> and money. Spend that producing high quality metrics directly from the apps
> you are looking after/writing/decomming

Yeah nah, but, okay, nah yeah.

Generating metrics in the app is much more intrusive, and requires that you
figure out the metrics you need ahead of time. It adds dependencies, sockets,
and threads to your app.

Unless you're very careful, it's also easy to end up double-aggregating,
computing medians of medians and other meaningless pseudo-statistics - if
you're using the Dropwizard Metrics library, for example, you've already lost.

If you output structured log events, where everything is JSON or whatever and
there are common schema elements, you can easily pull out the metrics you
need, configure new ones on the fly, and retrospectively calculate them if you
keep a window of log history.
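
As an illustration of the idea (not any particular tool), computing a p95 retrospectively from a window of structured log lines in Python, with made-up field names:

```python
import json
import random
import statistics

random.seed(0)

# Pretend this is a retained window of log history: one JSON object per
# line, sharing a common schema element "duration_ms".
log_window = [
    json.dumps({"event": "request", "duration_ms": random.expovariate(1 / 50)})
    for _ in range(1000)
]

durations = [
    rec["duration_ms"]
    for rec in map(json.loads, log_window)
    if rec.get("event") == "request"
]

# A metric nobody thought to pre-compute, calculated after the fact.
p95 = statistics.quantiles(durations, n=100)[94]
print(f"retrospective p95 latency: {p95:.1f} ms")
```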

When I've worked on systems with both pre- and post-calculated metrics, the post-calculated metrics were vastly more useful.

The huge, virtually showstopping, caveat here is that there is lots of decent,
easy-to-use tooling for pre-calculated metrics, and next to nothing for post-
calculated metrics. You can drop in some libraries and stand up a couple of
servers and have traditional metrics going in a day, with time for a few games
of table tennis. You need to build and bodge a terrifying pile of stuff to get
post-calculated metrics going.

Anyway, if there's a VC reading this with twenty million quid burning a hole in their pocket who isn't fussy about investing in companies with absolutely no path to profitability, let me know, and I'll do a startup to fix all this. I'll even put the metrics on the blockchain for you, guaranteed street cred.

~~~
adamzegelin
> if you're using the Dropwizard Metrics library, for example, you've already
> lost.

Can you go into a bit more detail here? Curious to know where Dropwizard goes wrong.

I prefer to use the Prometheus client libraries where possible. Prometheus' data model is "richer" -- metric families and labels, rather than just named metrics. Adapting from Dropwizard to Prometheus is a pain, and never results in the data feeling "native" to Prometheus.

~~~
therealdrag0
I think they just mean the host is aggregating, so any further aggregation compounds the slant in the data. For example, StatsD's default is shipping metrics every 10s, so if you graph it and your graph rolls up those data points into 10-minute data points (because you're viewing a week at once), then you're averaging an average. Or averaging a p95. People often miss that this is happening, and it can drastically change the narrative.
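
A tiny worked example of the distortion, with made-up numbers: two shipping intervals with very different traffic.

```python
# Interval A: 1000 requests averaging 10 ms. Interval B: 10 requests at 500 ms.
totals = [1000 * 10.0, 10 * 500.0]  # sum of latencies per interval
counts = [1000, 10]

true_average = sum(totals) / sum(counts)    # ~14.9 ms
average_of_averages = (10.0 + 500.0) / 2    # 255.0 ms, wildly different

print(true_average, average_of_averages)
```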

~~~
twic
Yes, exactly this. It's the fact that you're doing aggregation in two places.
Since you're always going to be aggregating on the backend, aggregating in the
app is bad news.

It may be interesting to think about the class of aggregate metrics that you
can safely aggregate. Totals can be summed. Counts can be summed. Maxima can
be maxed. Minima can be minned. Histograms can be summed (but histograms are
lossy). A pair of aggregatable metrics can be aggregated pairwise; a pair of a
total and a count lets you find an average.

Medians and quantiles, though, can't be combined, and those are what we want
most of the time.
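
For illustration, a toy Python sketch of the difference: (total, count) pairs merge exactly, while medians don't compose.

```python
import statistics

# Per-window pre-aggregates: a (total, count) pair merges losslessly.
windows = [(10_000.0, 1000), (5_000.0, 10)]

merged_total = sum(t for t, _ in windows)
merged_count = sum(c for _, c in windows)
print(merged_total / merged_count)  # the exact overall average

# Medians don't compose: the median of per-window medians is
# generally not the median of the underlying data.
a, b = [1, 1, 100], [2, 3, 4]
print(statistics.median([statistics.median(a), statistics.median(b)]))  # 2.0
print(statistics.median(a + b))  # 2.5
```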

Someone who loves functional programming can tell us if metrics in this class
are monoids or what.

There is an unjustly obscure beast called a t-digest which is a bit like an
adaptive histogram; it provides a way to aggregate numbers such that you can
extract medians and quantiles, and the aggregates can be combined:

[https://github.com/tdunning/t-digest](https://github.com/tdunning/t-digest)

------
dig1
The Art of Monitoring [1] covers most of this stuff in a unified manner.

You are introduced to some basics (push vs. pull monitoring), then proceed to simple system metrics collection (CPU, memory) via collectd, then to log ingestion, and end up extracting application-specific metrics from JVM and Python applications.

I highly recommend it, even for seasoned professionals.

[1] [https://artofmonitoring.com/](https://artofmonitoring.com/)

------
gnufx
I never see an important system management principle brought up: If you get a
user complaint (for some value of "user") and not an alert, you should fix the
monitoring system so that you don't get another occurrence of it or related
problems. Obviously that's within reason, depending on the circumstances; the
effort might not be worth it.

------
secondcoming
We log extensively. Here are some of my thoughts on it:

- at least in C++, the requirement to be able to log from pretty much anywhere can lead to messy code that either passes a reference to your logger to all classes that might possibly need it, or you've got an extern global somewhere. Yuck.

- logging can enable laziness. Being able to log that something weird happened can be considered a sufficient substitute for proper testing.

- logs are only as useful as the info they contain. This can mean state needs to be passed around all over the place just so that it can all be eventually logged on one line (it saves your data team from having to do a 'join')

- if your logger doesn't support cycling log files it's useless. If something goes wrong you can easily fill a disk.

~~~
viraptor
I'd disagree with 2 and 4.

2. Given a large enough system you will encounter situations where the only action you can take is to log "this really shouldn't happen" and try to roll back as cleanly as possible. This may be due to either complexity or a bug manifesting in a layer completely different from where it occurred (I've seen a null reference crash on "if(foo) foo->bar();" in the past).

4. I believe loggers should ideally know as little as possible about your logs. Logs can be rotated externally, can be buffered and sent to other hosts without touching the disk, can be ignored. Ideally, the system should care, not the app.

~~~
mmkos
> I've seen a null reference crash on "if(foo) foo->bar();" in the past

References can't be null. Regardless, that's a valid check for a null pointer
and I don't think what you wrote is at all possible (unless maybe in some
multithreaded scenario?).

------
kasey_junk
It’s weird to see the stuff by Jay Kreps (of Kafka ~fame~) listed in the logs
section. His writing is specifically _not_ about logs the observability tool,
but logs the data structure such as you’d see at the heart of a database.

~~~
aloknnikhil
No. The original Kafka paper does talk about logs in the observability sense
as a premise to solve the aggregation problem.

[https://cs.uwaterloo.ca/~ssalihog/courses/papers/netdb11-final12.pdf](https://cs.uwaterloo.ca/~ssalihog/courses/papers/netdb11-final12.pdf)

> There is a large amount of “log” data generated at any sizable internet
> company. This data typically includes (1) user activity events corresponding
> to logins, pageviews, clicks, “likes”, sharing, comments, and search
> queries; (2) operational metrics such as service call stack, call latency,
> errors, and system metrics such as CPU, memory, network, or disk utilization
> on each machine. Log data has long been a component of analytics used to
> track user engagement, system utilization, and other metrics.

> We have built a novel messaging system for log processing called Kafka [18]
> that combines the benefits of traditional log aggregators and messaging
> systems....Kafka provides an API similar to a messaging system and allows
> applications to consume log events in real time.

~~~
kasey_junk
A quote from the LinkedIn blog post linked in the article:

“But before we get too far let me clarify something that is a bit confusing.
Every programmer is familiar with another definition of logging—the
unstructured error messages or trace info an application might write out to a
local file using syslog or log4j. For clarity I will call this "application
logging". The application log is a degenerative form of the log concept I am
describing”

~~~
aloknnikhil
Fair enough. But I don't think quoting this for logging in the tracing sense is wrong here. He does acknowledge that trace logs are a degenerative form of logs from the perspective of log processing; the only difference is in the semantics of human-readable text vs. binary logs.

------
say_it_as_it_is
Is there an open source solution for processing streams of structured and unstructured logs and routing them onward? I see solutions for moving logs to Elastic or Kafka, but nothing for evaluating the logs.

~~~
ekimekim
This is a problem that is solved again and again, and yet all the available solutions are bad.

In my experience what happens is:

1. you start with a "ship logs from X to Y" product

2. you add more sources and more destinations, making it more of a central router. you add config options for specifying your sources and dests.

3. since the way you checkpoint or consume or pull or push certain sources or dests doesn't generalize, you end up buffering internally to present a unified "I have received / sent this message successfully" concept to your inputs and outputs.

4. you want to do some basic transforms on the logs as you go. you implement "filters" or "transforms" or "steps" and make them configurable. your config now describes a graph of sources -> filters -> dests

5. your filters need to be more flexible. you add generic filters whose behaviour is mostly controlled by their config options. your configs grow more complicated as you use multiple layers of differently-configured filters

6. you have a bad Turing-complete programming language embedded in your config file. getting simple tasks done is possible, getting complex tasks done becomes an awful, inefficient and unreadable mess.

My solution to this cycle has been to just write simple hard-coded
applications that can only do the job I need them to do. If they need a
different configuration later I edit the source. I'm writing my transforms in
a real programming language and I avoid the additional complexity of
abstractions. Of course, that comes with its own costs but I consider it well
worth it.
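
For instance, a hard-coded filter in this spirit can be a dozen lines of Python reading JSON log lines on stdin (the field names here are hypothetical):

```python
import json
import sys

# Hard-coded: drop debug lines, normalize a field, write to stdout.
# New requirement? Edit this file, not a config DSL.
for line in sys.stdin:
    try:
        rec = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip unstructured noise
    if rec.get("level") == "debug":
        continue
    rec["service"] = rec.pop("svc", "unknown")
    sys.stdout.write(json.dumps(rec) + "\n")
```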

------
waihtis
> Logging is critical to detecting attacks and intrusions.

Yes, but not universally - and just collecting logs will not take you far.
Logging everything and trying to approach security via the ’collect all data’
is both expensive and inaccurate, and one of the major inefficiencies in
modern cyber.

~~~
onefuncman
This is done efficiently at scale by both Cylance and Crowdstrike, but is
certainly only one part of a defense in depth strategy.

There are viable products around human threat hunting which would be
impossible without a 'collect all the data' component.

~~~
waihtis
You are correct, and this is the key part - what % of organizations have the money, skills and people to build a robust enough capability around threat hunting, for example?

I’ve been super lucky to meet various orgs and their security teams across all geographies and many industries, and my gut feeling is 1 out of 10 teams.

------
FrontAid
Recently, I was searching for a service which offers these functionalities at a very basic level. I tried several options and was really disappointed with all of them. The only one that I found usable was
[https://logdna.com/](https://logdna.com/). I've now been using it for a couple of weeks and it works OK. It offers logging, alerts, metrics/dashboards, and some other things. And all that at a reasonable price.

------
xondono
Am I the only one that can’t reach the “save and exit” privacy button on
mobile?

It’s hard for me to think that this is not intentional when the “Accept all”
is usable but the alternative isn’t...

------
notmalc
Nice

------
anderspitman
If you don't need all the fancy metrics and just want something simple to keep an eye on your services, alert you if they fail, and automatically restart them, check out my stealthcheck service. It's all of 150 lines of free-range, zero-dependency Go:

[https://github.com/anderspitman/stealthcheck](https://github.com/anderspitman/stealthcheck)

