
A simple way to get more value from metrics - waffle_ss
https://danluu.com/metrics-analytics/
======
bitcharmer
You'd be surprised how many serious tech shops have close to zero performance
metrics collected and utilised.

I've done this in fintech a few times already, and the best stack in my
experience was telegraf + influxdb + grafana. There are many things
you can get wrong with metrics collection (what you collect, how, units, how
the data is stored, aggregated and eventually presented) and I learned about
most of them the hard way.

However, when done right and covering all layers of your software execution
stack, this can be a game changer, both for capacity planning/picking
low-hanging perf fruit and for day-to-day operations.

Highly recommend giving it a try as the tools are free, mature and cover a
wide spectrum of platforms and services.

~~~
benraskin92
Might also want to give Prometheus a try, as it's extremely simple to set up
and very well supported in the open source community.

~~~
bitcharmer
Prometheus was another option but millisecond-precise timestamps are a deal
breaker in my field.

~~~
amenod
Curious, would microseconds suffice? Or are we talking about higher precision
still?

~~~
simcop2387
Very likely they need nanosecond precision,
[https://www.thetradenews.com/fintech-firms-reduce-trading-
ti...](https://www.thetradenews.com/fintech-firms-reduce-trading-time-
to-120-nanoseconds/)
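
To make the precision gap concrete: a millisecond timestamp collapses a
million distinct nanosecond instants into one value, which matters when trades
are separated by hundreds of nanoseconds. A quick sketch (Python's
`time.time_ns` is just a convenient stand-in for whatever clock source a
trading stack actually uses):

```python
import time

ns = time.time_ns()         # integer nanoseconds since the epoch
ms = ns // 1_000_000        # the same instant at millisecond precision
lost = ns - ms * 1_000_000  # up to 999,999 ns of ordering information dropped
print(ns, ms, lost)
```

Any two events inside the same millisecond become indistinguishable once
stored at millisecond precision.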

------
jamessun
"I don't have anything against hot new technologies, but a lot of useful work
comes from plugging boring technologies together and doing the obvious thing."

~~~
Simulacra
The most exciting phrase to hear in science, the one that heralds new
discoveries, is not “Eureka!” (I found it!) but “That’s funny …” — Isaac
Asimov

------
resu_nimda
Starting the article off with "I did this in one day" - complete with a
massive footnote disclaiming that it obviously took a lot more than one day -
kinda ruined it for me. Why even bother with that totally unnecessary claim?

~~~
caiobegotti
It's kind of a personal marketing thing these days to project this
maverick/hero aura of genius instead of the "unproductive" but real and hard
grinding work it takes to get something done and delivered. It worked for a
few, so thousands try the same, and here we are now, I guess?

~~~
derivativethrow
Given:

- the context that the author already has a very successful career as a well-
known developer

- the humility he evidences in most posts on his blog

- the fact that he explicitly highlights the work of others in this post
alongside his own

I really don't think Dan is doing this as any form of personal marketing. He
has no need of personal marketing, his blog already has several million views
per month and frequently shows up on HN as it is, and it isn't really his
style.

~~~
caiobegotti
I did not say he did that, I said that I believe it's common these days given
the points I mentioned. You just need to hang around and see a bunch of posts
on HN to notice that. QED, he's probably one of the "few" I talked about.

------
roskilli
There's a lot of interest in this space with respect to analytics on top of
monitoring and observability data.

Anyone interested in this topic might want to check out an issue thread on the
Thanos GitHub project. I would love to see M3, Thanos, Cortex and other
Prometheus long term storage solutions all be able to benefit from a project
in this space that could dynamically pull back data from any form of
Prometheus long term storage using the Prometheus Remote Read protocol:
[https://github.com/thanos-io/thanos/issues/2682](https://github.com/thanos-
io/thanos/issues/2682)

Spark and Presto both support predicate push down to a data layer, which can
be a Prometheus long term metrics store, and are able to perform queries on
arbitrary sets of data.

Spark is also super useful for ETLing data into a warehouse (HDFS or other
backends; e.g. see the BigQuery connector for Spark[1], which could take the
result of a query against, say, a Prometheus long term metrics store and
export it into BigQuery for further querying).

[1]: [https://cloud.google.com/dataproc/docs/tutorials/bigquery-
co...](https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-
spark-example)

~~~
tixocloud
Thanks for sharing. It’s interesting to see the space gaining steam. What sort
of things are people looking at?

------
gigatexal
This is a really awesome blog. The post about programmer salaries is
insightful: [https://danluu.com/bimodal-
compensation/](https://danluu.com/bimodal-compensation/)

~~~
julianeon
I was thinking the answer to 'why are programmers paid more than other white-
collar professions that are similarly profitable' is: because programmers
control the means of production.

I might be a great telecomm tech, a genius even, but once I'm out of a job, I
can't build my own telecomm system - that would cost billions. I have to go
back to some other telecomm system to start making money again.

But, at a startup, a kicked-out senior engineer can actually pretty much
exactly recreate the company; they can do the equivalent of a laid-off
telecomm employee starting a new, almost-as-good (except for branding)
telecomm company.

No billions in infrastructure required: within a month or two, the cloned
company could be near-indistinguishable from the original.

So companies have to pay employees more like partners, instead of employees,
because either they pay them as equals or they'll be forced to compete against
them, as equal rivals.

~~~
SpicyLemonZest
Interesting. I hadn't thought about it that way before, but it does seem to
predict the bimodality; the lower mode is (presumably) programmers who either
aren't skilled enough or don't work in the right areas to be able to take
their ball and found a startup with it.

------
renewiltord
Just so I understand, the simple way the headline talks about was "collect all
metrics, but store the small fraction you care about in an easily accessible
place; delete the raw data every week"?

Title didn't live up to article imho. But I get it. Thanks for sharing your
methods.

------
simonw
Love the section in this about using "boring technology" - and then writing
about how you used it, to help counter the much more common narrative of using
something exciting and new.

~~~
m463
But "exciting and new" is very often just lipstick on a pig.

and anyway:

[https://wondermark.com/c/2007-10-11-344ennui.gif](https://wondermark.com/c/2007-10-11-344ennui.gif)

------
heliodor
If we consider Graphite, InfluxDB, and Prometheus, at this point in the
monitoring industry's evolution we can easily convert metrics generated in the
format of one of these systems for storage in any of the others.

The missing piece is being able to query one system with the query language of
the others. For example, query Prometheus using Graphite's query

------
sa46
Speaking of high cardinality metrics, what are good options that aren’t as
custom as map reduce jobs and a bit more real time?

We killed our influx cluster numerous times with high cardinality metrics. We
migrated to Datadog which charges based on cardinality so we actively avoid
useful tags that have too much cardinality. I’m investigating Timescale since
our data isn’t that big and btrees are unaffected by cardinality.

~~~
RabbitmqGuy
TimescaleDB

------
staysaasy
The boring technology observation (here referring to the challenge of getting
publicity for "solving a 'boring' problem") is really true.

It extends very well to something that we constantly hammer home on my team:
using boring tools is often best because it's easier to manage around known
problems than forge into the unknown, especially for use-cases that don't have
to do with your core business. Extreme & contrived example: it's much better
to build your web backend in PHP over Rust because you're standing on the
shoulders of decades of prior work, although people will definitely make fun
of you at your next webdev meetup.

(Functionality that is core to your business is where you should differentiate
and push the boundaries on interesting technology e.g. Search for Google,
streaming infrastructure for Netflix. All bets are off here and this is where
to reinvent the wheel if you must – this is where you earn your paycheck!)

------
mv4
Thank you for sharing this. I recently started working on metrics at a FAANG
and saw some of the challenges you mentioned... the fact that you were able to
get good results so quickly is super inspiring!

~~~
tixocloud
Interesting. I thought this would already be a solved problem at FAANG.

------
chrchang523
Minor nit: long -> double -> long cannot introduce more rounding error than
long -> double, if the same long type is at both ends.
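
A worked illustration of the nit, with Python ints playing the role of the
long (this assumes the double stays within the long's range, so the
double -> long conversion cannot overflow):

```python
x = (1 << 53) + 1        # smallest positive integer a double cannot represent
d = float(x)             # long -> double: rounds to the nearest double
back = int(d)            # double -> long: exact, since d is an integer value
print(x, d, back)

assert back != x             # all the error came from the long -> double step
assert float(back) == d      # the round trip introduces nothing further
```

The second conversion is exact because every double produced from a long holds
an integer value, so only the first step can round.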

------
dirtydroog
What's the standard for metrics gathering, push or pull? I prefer pull, but
depending on the app it can mean you need to build in a micro HTTP server so
there's something to query. That can be a PITA, but pushing a stat on every
event seems wasteful, especially if there's a hot path in the code.

~~~
halbritt
The hot new technology for metrics is Prometheus and its ilk, which are pull-
based.

~~~
bostik
At this point Prometheus is pretty close to becoming the boring technology.
The latest versions have finally brought in the plumbing and tuning knobs to
protect against [most] overly expensive queries. So you can't easily take it
down anymore.

The single-binary approach is still a problem, though. In my mind any serious
telemetry collection stack should separate the query engine and ingestion path
from each other - Prometheus has both the query interface and the
ingestion/writing subsystem in the same process.[ß]

As for the parent poster: you certainly want to push telemetry out on every
event, but the mechanism has to be VERY lightweight. With prometheus the
solution is to have a telemetry collection/aggregation agent on the host, feed
it with the event data and have prometheus scrape the agent. Statsd with the
KV extension is a great protocol for shoveling the telemetry out from the
process and into the agent.

ß: you can get around this with Thanos + Trickster to take care of the read
path only, but it's quite a bit more complex than plain Prometheus.
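
A minimal sketch of that push path, assuming a statsd-compatible agent
listening on UDP 127.0.0.1:8125 (the metric name and tag here are made up, and
real statsd clients typically buffer and batch rather than send one datagram
per event):

```python
import socket

def emit(metric, value, tags=None, addr=("127.0.0.1", 8125)):
    """Fire-and-forget statsd gauge over UDP, with tags in the common KV style."""
    tag_part = ""
    if tags:
        tag_part = "|#" + ",".join("%s:%s" % kv for kv in sorted(tags.items()))
    payload = ("%s:%s|g%s" % (metric, value, tag_part)).encode()
    # Unconnected UDP send: no handshake, no blocking, cheap enough for a hot path
    socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(payload, addr)
    return payload

print(emit("orders.latency_us", 42.0, {"venue": "demo"}))
```

The process never waits on the agent; if the agent is down the datagram is
simply dropped, which is the usual trade-off for keeping the hot path cheap.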

~~~
roskilli
M3 separates query and ingestion if you're interested in clustered storage for
metrics, slide in question here:
[https://www.slideshare.net/RobSkillington/fosdem-2019-m3-pro...](https://www.slideshare.net/RobSkillington/fosdem-2019-m3-prometheus-
and-graphite-with-metrics-and-monitoring-in-an-increasingly-complex-world/31)

~~~
bostik
Oh, nice. Thank you. I had bookmarked M3 earlier on, but never got around to
really looking into it with proper thought.

------
chris_f
There have been a lot of articles posted recently about the 'old' web, and
while I like the concept I still have a hard time finding quality information
in many of the directories and webrings posted. The level of research and
density of information in this blog is very good.

------
neoplatonian
This is a great post! We should have more of these out there. Does anyone have
any recommendations for similar posts for Node.js (instead of JVM)?

Or any good resource which discusses possible optimizations in the infra stack
at a more theoretical, abstract, generalizable level?

------
dmos62
I'm not sure I understood the solution there. Storing only 0.1%-0.01% of the
interesting metrics data makes sense in the same way that you'd take a poll of
a small fraction of the population to make guesses about the whole?

~~~
yellowstuff
I believe he means they are storing all the data, but for a subset of types of
data, sorta like extracting just a few columns out of a big table. Presumably
someone on some other team gets use out of having access to the 99.9% of data
stored that is not relevant to "performance or capacity related queries."

------
wwarner
Would be a very natural AWS dashboard.

~~~
Aperocky
Cloudwatch is pretty awesome.

------
m0zg
Funny how stuff like this is "groundbreaking" outside of, e.g., Google, where
you have been able to collect and query metrics in real time for more than a
decade now.

------
dandare
> since i like boring, descriptive, names..

I feel like I have an inception. Should "boring, descriptive, names" be the
default in all IT?

~~~
ertian
The problem with that is that you end up with tons of confusing name
collisions.

------
willvarfar
Great article!

The bit about not being able to use columns for each metric because there were
too many ....

the classic solution is to have a column called “metric name” and another for
“metric value”.

Can’t spot why they didn’t just do that.
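
The narrow-table layout described above, sketched in SQLite (the schema and
metric names are illustrative, not what the article actually used):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE metrics (
        ts           INTEGER NOT NULL,  -- epoch timestamp
        metric_name  TEXT    NOT NULL,
        metric_value REAL    NOT NULL
    )
""")
conn.execute("CREATE INDEX idx_name_ts ON metrics (metric_name, ts)")

rows = [
    (1, "cpu.user", 0.42),
    (1, "mem.rss_mb", 812.0),
    (2, "cpu.user", 0.47),
]
conn.executemany("INSERT INTO metrics VALUES (?, ?, ?)", rows)

# One query shape serves any metric, and new metrics need no schema change
cpu = conn.execute(
    "SELECT ts, metric_value FROM metrics WHERE metric_name = ? ORDER BY ts",
    ("cpu.user",),
).fetchall()
print(cpu)   # [(1, 0.42), (2, 0.47)]
```

The flexibility comes at the cost of storing every value as one generic row,
which is where a columnar store's compression and scan advantages are lost.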

~~~
jsnell
Then you lose the benefits that columnar databases have for time series data.

~~~
kyllo
You lose a lot of the benefits, but you can still take advantage of time range
partition elimination just as long as your data is still physically
partitioned by the timestamp column. That's the most important thing when
processing time series data--never read from disk any of the data that's
outside the time range you actually need for your query.

