
Logs vs. Metrics: A False Dichotomy - kiyanwang
https://whiteink.com/2019/logs-vs-metrics-a-false-dichotomy/
======
SideburnsOfDoom
The article doesn't mention cardinality, and it's a key concept for
understanding how these systems differ.

When I joined a company that used StatsD for metrics and ELK for logging, I
was told the first rule of stats: you will not put an order ID, customer ID or
the like in the stat name. It is OK to have a value with a small set of
possibilities (e.g. beta: true|false), but an order ID varies too much and
will swamp the StatsD server.

Formally: an order ID is high cardinality and a boolean is low cardinality.
Cardinality is the "number of elements of the set": for a boolean it's 2, for
an "order state" enumeration it's a few, and for an order ID it's unbounded;
realistically, it's "how many orders might we have in any two-week window /
other stat retention period".

A stats system like StatsD that keeps continuous averages, max, min, p90 etc.
cannot cope with lots of high-cardinality factors. There is only so much
combined cardinality it can handle.
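
To make "combined cardinality" concrete, a toy back-of-the-envelope sketch in
Python (all numbers made up): the count of distinct time series a metrics
backend must track is roughly the product of the cardinalities of the labels
attached.

    # Toy sketch, made-up numbers: distinct time series ~= product of label cardinalities.
    endpoints = 50          # low cardinality: fine
    status_codes = 10       # low cardinality: fine
    datacenters = 3         # low cardinality: fine
    order_ids = 1_000_000   # high cardinality: not fine

    without_order_id = endpoints * status_codes * datacenters
    print(without_order_id)              # 1,500 series: easily handled
    print(without_order_id * order_ids)  # 1,500,000,000 series: swamps the server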

No, Prometheus will not save you here (1)

Whereas a logging system such as an ELK stack stores each individual log entry
in detail, and the more relevant unique IDs you attach to it, the more
informative it is. Cardinality is not a problem. But pre-calculated stats
across these fields are limited, and ad-hoc queries might be expensive to run.

You could, I suppose, start with a rich log entry and transform it into a
metric by discarding some parts. But not the reverse.
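
A minimal sketch of that one-way transformation, with hypothetical field
names: drop the unbounded fields from a rich log event and count what remains.

    # Minimal sketch (hypothetical field names): derive a low-cardinality metric
    # from a rich log event by discarding the unbounded fields.
    from collections import Counter

    log_event = {
        "order_id": "ord-8f3a2c",      # high cardinality: keep in the log only
        "customer_id": "cust-19284",   # high cardinality: keep in the log only
        "endpoint": "/checkout",       # low cardinality: safe in a metric
        "status": 500,                 # low cardinality: safe in a metric
    }

    counters = Counter()
    counters[(log_event["endpoint"], log_event["status"])] += 1
    # The reverse is impossible: nothing in ("/checkout", 500) says which order failed.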

1)
[https://prometheus.io/docs/practices/naming/](https://prometheus.io/docs/practices/naming/)

> CAUTION: Remember that every unique combination of key-value label pairs
> represents a new time series, which can dramatically increase the amount of
> data stored. Do not use labels to store dimensions with high cardinality
> (many different label values), such as user IDs, email addresses, or other
> unbounded sets of values.

~~~
mbar84
Indeed a very important caveat!

Working in the ad-tech industry, we're tracking creative id, placement id,
domain and more for each impression. We're lucky if each timeseries has 10
impressions at the hourly granularity.

~~~
valyala
I'd recommend taking a look at ClickHouse for storing and analyzing billions
of ad-tech raw events in real time. There is no need in grouping them
beforehand, since ClickHouse is able to scan billions of rows per second per
server. See [https://clickhouse.yandex](https://clickhouse.yandex)
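
For illustration, a minimal sketch of an ad-hoc query over raw rows using the
clickhouse-driver Python client, against a hypothetical `impressions` table
(the table and column names are made up):

    # Minimal sketch: ad-hoc aggregation over raw events, no pre-grouping into
    # time series. The `impressions` table and its columns are hypothetical.
    from clickhouse_driver import Client  # pip install clickhouse-driver

    client = Client(host="localhost")
    rows = client.execute(
        """
        SELECT placement_id, domain, count() AS impressions
        FROM impressions
        WHERE event_time >= now() - INTERVAL 1 HOUR
        GROUP BY placement_id, domain
        ORDER BY impressions DESC
        LIMIT 20
        """
    )
    for placement_id, domain, impressions in rows:
        print(placement_id, domain, impressions)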

------
nodefortytwo
Although I agree in principle, one other aspect that becomes critical in
high-volume systems is efficiency.

Logging is generally a text stream, and in production it should only be at
warning+ levels (imo), so the impact on performance is negligible and really
only matters when something is going wrong.

I want metrics all of the time, potentially 10 or 20 metrics per
request/action, so a high-performance, low-overhead way of sending those
metrics over the network with low latency is critical.
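
As a rough sketch of what "low overhead" looks like in practice (assuming a
plain StatsD-style setup; the metric names are made up): a fire-and-forget UDP
datagram per metric, which never blocks the request path.

    # Rough sketch: StatsD-style metrics are single fire-and-forget UDP datagrams,
    # so emitting 10-20 of them per request is cheap. Metric names are made up.
    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def emit(metric, value, metric_type="c", addr=("127.0.0.1", 8125)):
        # Wire format is e.g. "checkout.requests:1|c"
        sock.sendto(f"{metric}:{value}|{metric_type}".encode(), addr)

    emit("checkout.requests", 1, "c")       # counter
    emit("checkout.latency_ms", 42, "ms")   # timer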

We need both systems and both should be treated as tier 1 systems within an
organisation. I don't think pushing metrics into a log stream is a scalable
architecture.

~~~
delusional
Being a text stream, you'd also have to reparse that text stream and hope that
it never changes, which is at odds with how I usually use logs (as a place to
dump information for humans).
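
(For what it's worth, structured logging sidesteps most of the
reparse-and-hope problem, since consumers read named fields instead of a
human-oriented sentence. A minimal sketch using only the Python stdlib, with
made-up field names:)

    # Minimal sketch: emit the log line as JSON so downstream consumers read named
    # fields rather than re-parsing free text that may change. Fields are made up.
    import json, logging, time

    logging.basicConfig(level=logging.INFO, format="%(message)s")

    def log_event(**fields):
        fields.setdefault("ts", time.time())
        logging.info(json.dumps(fields))

    # Instead of "payment failed for order 123 after 3 retries":
    log_event(event="payment_failed", order_id=123, retries=3)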

~~~
yehosef
Nowadays using something like Elasticsearch for this is common, and then you
don't have to reparse it. It's already indexed and you can search for what you
want.

If you're going back to the raw text files, then you're right. But I'm not
sure why anyone would do that these days.

~~~
haddr
But then it might be expensive to encode every possible dimension in the log
stream.

------
thinkersilver
I'm not sure how much this discussion is happening now, but I do remember
having these conversations more often a few years ago when tools like
Prometheus were coming up.

This is my take and shouldn't be taken as gospel, but I've observed that
metrics are perfect for isolating and identifying an issue, while logs attempt
to explain the behaviour seen. A good place to start is a layered dashboard,
set up top to bottom, showing the throughput, error rates and latencies that
express the health of each layer, from the exposed service or API all the way
down to the subsystems and hardware supporting them. I'd write more but don't
want to make this post overly long. There are several useful articles online
on the different methodologies and approaches to achieve this.

There isn't much conversation, though, about the knowledge an organisation has
around incident analysis workflows: how past incident resolutions are captured
and integrated with dashboards in monitoring systems, and how they are shared
and reused as SREs and engineers work with log analytics. I've seen the same
lessons having to be relearned when new engineers join a team. Checklists are
a good example (there are other ways) of directing incident analysis in large
complex systems, but how many times do you see them supported by the
monitoring system? The focus is always on displaying pretty charts in a grid,
when more value could be gleaned from the same dashboard content presented as
a rich living document, with charts and integrated checklists, in something
resembling a guide to resolving an issue.

There certainly is enough data to achieve this.

~~~
ggregoire
> I'd write more but don't want to make this post overly long. There are
> several useful articles online on the different methodologies and approaches
> to achieve this.

Do you have some of these articles in mind?

------
ris
I've thought about this a lot too and agree to a good extent, however:

> 2. You can derive arbitrary metrics from log streams

Not _quite_. There isn't really a way (that I'd consider sensible) to get
periodic, system-level measurements such as system load, memory usage or I/O
stats from logs.
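
(For context, those measurements come from sampling the kernel's counters on a
timer rather than from any event stream; a rough sketch using the third-party
psutil package:)

    # Rough sketch: system-level metrics are sampled gauges read on a timer,
    # not events you could reconstruct from a log stream. Uses psutil (third party).
    import os
    import time
    import psutil

    while True:
        load1, _, _ = os.getloadavg()
        mem = psutil.virtual_memory()
        io = psutil.disk_io_counters()
        # In practice these samples would be shipped to a TSDB, not printed.
        print(f"load1={load1:.2f} mem_used_pct={mem.percent} "
              f"read_bytes={io.read_bytes} write_bytes={io.write_bytes}")
        time.sleep(10)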

I'm also far less anxious about the idea of throwing away metrics beyond a
certain age than I would be with logs, and to a degree that allows me to worry
a bit less about the volume of data I collect through metrics.

Doubling up on metrics/logs can also be key in detecting that one of those
systems isn't working properly for a particular set of hosts.

~~~
yehosef
> There isn't really a way (that I'd consider sensible) to get periodic,
> system-level measurements such as system load, memory usage, I/O stats from
> logs.

Why not Metricbeat
([https://www.elastic.co/products/beats/metricbeat](https://www.elastic.co/products/beats/metricbeat))?
The log structure is generally numbers, but it is still fundamentally
different from pure metrics, IMO.

My biggest problem with using ES for metrics is that, with the exception of
Timelion, there is no way to do cross-metric calculations (db queries/s over
requests/s => db queries/request).
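
(That kind of derived series is just a point-wise ratio once both metrics
exist as aligned samples; a toy sketch with made-up numbers:)

    # Toy sketch, made-up numbers: "db queries per request" is a point-wise ratio
    # of two aligned series.
    db_queries_per_sec = [120, 150, 300, 90]
    requests_per_sec   = [40,  50,  60,  30]

    queries_per_request = [q / r for q, r in zip(db_queries_per_sec, requests_per_sec)]
    print(queries_per_request)  # [3.0, 3.0, 5.0, 3.0] -- the spike stands out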

------
valyala
While logs and metrics have parts in common, they are quite different beasts.
It is quite expensive to produce and store logs for frequently changing
metrics such as request rate or response latency on a highly loaded system
handling millions of rps.

I'd recommend:

- Storing logs in a high-performance system such as cLoki [1], which gives
enough flexibility to generate arbitrary metrics from high log volumes in real
time.

- Storing metrics in a high-performance time series database optimized for
high volumes of data points (trillions) and a high number of time series (aka
high cardinality). An example of such a TSDB is VictoriaMetrics [2].

- Implementing dashboards with time-based correlation between metrics and
logs, based on shared attributes (labels) such as datacenter, cluster,
instance, job, subsystem, etc. [3]

[1] [https://github.com/lmangani/cLoki](https://github.com/lmangani/cLoki)

[2]
[https://github.com/VictoriaMetrics/VictoriaMetrics/](https://github.com/VictoriaMetrics/VictoriaMetrics/)

[3] [https://grafana.com/blog/2019/05/06/how-loki-correlates-metrics-and-logs----and-saves-you-money/](https://grafana.com/blog/2019/05/06/how-loki-correlates-metrics-and-logs----and-saves-you-money/)

------
peterwwillis
The dichotomy is real because nobody asks devs to output properly structured
logs, or to work with ops teams to associate particular logs with particular
states. Because of that, metrics are the de facto universal measure of the
health of a service. Getting lots of 500s? May be time to flip a circuit
breaker. Getting lots of logs of an unknown error? Time to... send an email to
the devs, ask what this means, and wait around for an investigation
(meanwhile, metrics have identified the likely culprit).
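
(A toy sketch of that first reflex, with hypothetical names and thresholds:
trip a breaker on the 5xx rate computed from metrics, no log reading
required.)

    # Toy sketch (hypothetical names and thresholds): decide to trip a circuit
    # breaker from metrics alone, based on the 5xx error rate.
    def should_trip(total_requests, errors_5xx,
                    error_rate_threshold=0.05, min_requests=100):
        if total_requests < min_requests:   # avoid tripping on tiny samples
            return False
        return errors_5xx / total_requests > error_rate_threshold

    print(should_trip(total_requests=2000, errors_5xx=180))  # True: ~9% errors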

For many of the services I run, I haven't looked at logs in months, because
the metrics tell the story. If a service is degraded, I can usually correlate
it to another downed service, a network failure, or a recent change. No logs
needed.

Correlating metrics with logs is great, and proper distributed tracing is a
revelation. But if you had to collect and measure just one thing, it's
metrics.

------
jillesvangurp
There's a wide variety of time series data that is interesting to track:
metrics, application telemetry, KPIs, auditing events, log messages, etc. This
is rich structured data, and the richer it is, the more useful it gets. Many
companies use different solutions for each of those, and that creates a lot of
complexity and devops overhead. I've seen a lot of projects where corners get
cut and there is just the bare minimum of data. With microservices and
serverless, the problem is even more acute, because logging into individual
servers and grepping through logs like a caveman is just not practical
anymore. Running blind is downright irresponsible and sadly all too common.

When something happens, it's actually interesting to cross-examine all of
these different things. When you do, having metadata to drill down and break
down on is essential, and it's also essential to have software that can handle
it and doesn't fall over just because you had a spike in usage. Having
everything in one place and annotated with the right metadata is key. Another
thing that is key is retention policies to keep the data volume in check. Done
properly, you will be generating many GB of data per day. Left unchecked, this
will kill whatever infrastructure you have in no time. And while historical
data is sometimes interesting, it's usually the very recent past that is most
interesting. So you should expect to throw away the vast majority of the data
you collect after a short while.

------
dzimine
The dichotomy is real and a reflection of Dev vs Ops dichotomy. DevOps made
Dev and Ops collaborate but didn't blend the roles & skills. Ops appreciate
logs but require consistent metrics to identify and root-cause the problem.
Dev appreciate metrics but require logs to debug and fix the problem. Opinions
on what is more important are informed by role and experience; the author
makes it clear that as a team, we need both.

> For many of the services I run, I haven't looked at logs in months, because
> the metrics tell the story. If service is degraded, I can usually correlate
> it to another downed service, a network failure, or a recent change. No logs
> needed.

Good point. It echoes Brendan Gregg, the author of the "USE" method, who
commented:

> The USE Method is based on three metric types and a strategy for approaching
> a complex system. I find it solves about 80% of server issues with 5% of the
> effort.
> ([http://www.brendangregg.com/usemethod.html](http://www.brendangregg.com/usemethod.html))

Solving 80% of issues with 5% of effort is commendable; the remaining 20% of
issues go to developers, where the other 95% of effort is spent debugging and
fixing the problem, primarily by reasoning about the logs.

So: \- "which of metrics or logs is more important?" is a relative and moot \-
"can metrics be extracted from logs?" \- yes; "is it practical?" \- it
depends: likely NO for DIY. The fact that ELK is not making it particularly
well doesn't mean that other products can't / don't do it. -

------
falsedan
Manipulating/aggregating metrics is a giant pain; you have to understand what
exactly the implementors thought `max` or whatever means and try to match it
up to what you need.

One thing the article doesn’t address is that logs have to be accurate and
metrics should be consistent, but not vice versa. If your performance metrics
are always 10% out, they’re still perfectly usable (as long as they’re always
10% high or low) to alert on or to compare to metrics from a known-good
historical period.

------
karmakaze
Can someone explain this?

> In practice, teams continue to ignore this problem and instead rely on
> aggregate time-series that are essentially impossible to interpret
> correctly, such as “maximum of 99th percentile latency by server.”

I get that we can't know the P99 across all servers. But let's say we're
talking about latency: the max of the per-server P99s will be an upper bound
on the P99 across all servers. Why isn't this true or useful?

~~~
viraptor
An example: your server is handling requests to two endpoints. Endpoint A
responds in ~100ms, B in ~200ms. You start caching some responses to A higher
up, so those requests never reach the server. Now your P99 across all servers
is higher even though you improved the performance, because the fast A samples
that used to dominate the measured distribution are gone (potentially setting
off alarms on previous thresholds).
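
(A toy numerical illustration of that shift, with made-up latencies:)

    # Toy illustration (made-up latencies, in ms): caching most of endpoint A's
    # responses "higher up" removes fast samples from what the server measures,
    # so the measured P99 rises even though every user got equal or better latency.
    def p99(samples):
        s = sorted(samples)
        return s[int(0.99 * len(s)) - 1]   # rough nearest-rank percentile

    before = [100] * 1000 + [200] * 5   # B is ~0.5% of traffic, hidden above the P99
    after  = [100] * 200 + [200] * 5    # 80% of A's traffic now served from the cache

    print(p99(before))  # 100
    print(p99(after))   # 200 -- the alarm fires despite the improvement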

While it doesn't solve the alerts, heatmaps are amazing for viewing data like
that. But you need the raw numbers (or predefined bands for aggregation,
though that limits flexibility).

~~~
falsedan
But your p99 for endpoint B hasn't changed, so why would you alert based on
service-wide stats? It's easy to tag/dimension metrics by URL.

~~~
viraptor
Change the level of caching and you get the same result. Let's say you have
the same endpoint, but start caching user X but not others, because reasons.

~~~
falsedan
Still struggling, if requests from X were slow, caching their responses would
only drop or maintain p99. If those requests were fast, p99 might increase &
alert... but if you were caring about performance at a user-level, you can
imagine you're dimensioning on users too, and you'd see the p99s for each user
hadn't changed, and would shrug and bump the alert threshold up.

If it were me, I'd probably start by caching the slow requests (that are above
the p99) first & adjusting the alert thresholds to match the new performance.

------
alexvaut
I will go a step further and state that metrics, logs and _traces_ are very
similar and should be treated as such, in a single platform. Leveraging these
three sources of data in a micro-services world is very much needed for
troubleshooting, documentation and monitoring.

Right now I'm using Prometheus (metrics) + Jaeger (traces) +
Fluentd&Clickhouse (logs) + Grafana to render all of that. It's not that easy
to correlate data but I'm getting there (with tricky queries in Grafana panels
and custom Grafana sources). A PoC about displaying traces in a nice way:
[https://github.com/alexvaut/OpenTracingDiagram](https://github.com/alexvaut/OpenTracingDiagram).

------
MayeulC
This is not about the logarithmic vs linear scales, as I first thought, but
about analytics and aggregation of events.

This is probably the gist of it:

> many interesting kinds of metric are very hard to aggregate and re-aggregate
> correctly [...] The best you can hope for is that your metric collection
> supports the recording of histograms [...]
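
(The reason histograms are the escape hatch: unlike the percentiles
themselves, bucket counts from many servers can simply be summed, and the
percentile recomputed over the merged distribution. A minimal sketch with
hypothetical buckets:)

    # Minimal sketch (hypothetical buckets): you can't average percentiles from two
    # servers, but you can sum their histogram bucket counts and recompute the
    # percentile over the merged distribution.
    from collections import Counter

    # bucket upper bound (ms) -> request count
    server_a = Counter({50: 990, 100: 9, 500: 1})     # almost all fast
    server_b = Counter({50: 10, 100: 10, 500: 980})   # almost all slow

    def approx_p(buckets, q):
        total, running = sum(buckets.values()), 0
        for upper_bound in sorted(buckets):
            running += buckets[upper_bound]
            if running / total >= q:
                return upper_bound   # bucket the q-quantile falls into

    avg_of_p99s = (approx_p(server_a, 0.99) + approx_p(server_b, 0.99)) / 2
    merged_p99 = approx_p(server_a + server_b, 0.99)   # sum buckets, then compute
    print(avg_of_p99s)  # 275.0 -- not a latency any request actually saw
    print(merged_p99)   # 500   -- correct re-aggregation from the merged histogram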

On the funny side, I had to whitelist ajax.googleapis in uMatrix to get rid of
the white page. Fitting for the website's name.

------
karmakaze
Metrics tell you if the system is behaving well overall and can usually
identify periods of trouble, or trends.

Logs give you more detail when you want it, often because of a report, either
from a user or from an indication in your metrics. Metrics usually won't tell
you why, unless you've specifically put in a metric that covers the exact
cause (sometimes to protect against regressions).

------
Thaxll
Wait until you discover tracing, logs will looks like very obsolete.

~~~
sethammons
Meh. We are implementing tracing between several core services in our pipeline
of work. It is valuable, but it will in no way replace our logging. We use
logging of event streams and can attach all kinds of information to a log
entry that can be hard to get onto a given span in a trace.

