
M3: Uber's Open Source Large-Scale Metrics Platform for Prometheus - mkvorwerck
https://ubr.to/2ALH8Ak
======
roskilli
Author here and a few other M3ers around too, please do ask any questions.
Ultimately we'd just like to make the project useful to others, so any and all
feedback is appreciated, thank you!

~~~
pininja
Is it possible to use M3 without a retention downsampling policy? For example,
to retain time series metrics at 1-second resolution, with no downsampling,
for one or two days’ worth of data.

Edit: I was interested in M3 for this instead of just a time series DB because
I’d also like to aggregate the metrics at both lower and higher resolutions.

~~~
roskilli
Hey, it is yes - by default all metrics are unaggregated; it’s only once you
set aggregation and mapping rules that downsampling occurs.

Disk space is the primary bottleneck.

~~~
pininja
Oh ok, is it possible to also maintain an aggregated “view” at the same time
as an event view?

~~~
roskilli
Yup indeed. We need to make this more friendly, because right now you have to
curl an HTTP/JSON endpoint to set the configuration (we’ll release an embedded
UI for policy rule editing soon, called m3ctl).

Basically though, when you set retention mapping rules, metrics are
downsampled to whatever resolution you choose (10s, 1min, 10min, etc.) and
then stored for the retention of the namespace you have set up for that
policy.

And with respect to my comment about disk space being the bottleneck - unless
you have a huge dataset you actually might not be bottlenecked by disk; it’s
just that we’ve encountered it at times with our really high-cardinality data
sets.
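For a sense of what a mapping rule pairs together - a metric filter plus one or more resolution:retention storage policies - here is a purely illustrative fragment. The field names below are assumptions for illustration, not M3's actual schema:

```yaml
# Hypothetical mapping rule: field names are illustrative only.
mappingRules:
  - name: "http requests downsampled"
    filter: "name:http_requests* service:api"
    policies:
      - resolution: 10s
        retention: 48h    # keep 10s rollups for two days
      - resolution: 1m
        retention: 720h   # keep 1min rollups for 30 days
```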

------
ebikelaw
It’s kind of amusing that Prometheus has all this traction. When I was at
Google we joked about describing borgmon in an “industry-disabling paper”.
Such are the horrors of borgmon. But the whole industry has embraced
Prometheus, which is just borgmon!

~~~
justicezyx
Borgmon's model is pretty solid.

It was sort of crippled by the user-facing part, which was clearly (to me)
over-engineered. And years of neglect let the tech debt grow.

Over-engineered + lack of maintenance == tech bankruptcy ==> everyone started
to loathe it ==> everyone concluded that it was designed wrong

Disclaimer #1: I am part of the borg team.

Disclaimer #2: I have been deliberately high-level with no details attached;
please let me know if I need to revise this.

Edit: 'review' -> 'revise' (I am pretty good at writing typos unconsciously
...)

~~~
hspak
What part of borgmon is considered user-facing? The client library? The
/metrics endpoint?

I'm trying to think in terms of Prometheus and how it could be (is?) over-
engineered.

I have never seen/used borgmon, sorry for the ignorance.

~~~
ebikelaw
I guess it depends on your point of view. I think of the rules language as
being the "user-facing" part.

------
camel_gopher
"Similarly, M3 has rollup rules that can be applied at ingestion time,
allowing specific retention storage policies to be set for the resulting
rolled up metrics."

Are you applying these rollup rules synchronously with ingestion?

Great read; I work on TSDBs at IRONdb and there are a lot of similarities I see.

~~~
roskilli
Yeah, we apply the rollup rules at ingestion time (the coordinator has an
embedded m3aggregator in process).

Theo came and gave a talk at our office when he was in NY one time; it was
really great to dig into IRONdb and hear all about it.

Many similarities for sure. Unfortunately we are somewhat strict about the OS
we run (so it can be standardized across the fleet) and thus we can’t really
run ZFS, so we have to checksum all our data ourselves, etc.

------
henridf
Is m3ql still being used? And if so any thoughts on how m3ql and promql will
respectively be used over time?

~~~
roskilli
M3QL is still the primary query language internally; however, the query
service is still being rebuilt in open source M3, so it’s not available just
yet.

As to why we’ve kept it: we’ve added such a large number of functions over
time that don’t really have an alternative in, say, PromQL - so we aim to
offer both PromQL and M3QL in open source land and let end users use either
one. I don’t think either is better than the other; they’re just different
flavors - function expansion vs pipe-based, etc.

Here’s the list of functions, you can kind of get a sense of what I’m talking
about just by looking at the list:
[http://m3db.github.io/m3/query_engine/architecture/functions...](http://m3db.github.io/m3/query_engine/architecture/functions/)

------
wvl0
How does M3 compare to Thanos ([http://github.com/improbable-
eng/thanos](http://github.com/improbable-eng/thanos))?

~~~
roskilli
I was at a wedding when the question was posted, hence taking forever to reply
(apologies!).

I am of course biased, I want to say that up front. For Uber we wouldn’t
really want to use Thanos for a few reasons, but one primary one is that
Thanos is upfront that they don’t really optimize for latency on metrics that
don’t currently reside on disk, which can take significantly more time coming
from S3 rather than being mmap’d from a local disk. We have historical metrics
needed for anomaly detection (5 weeks of data) that are queried very
frequently, and the model of downloading the S3 data for each request or
caching it locally (which would fill up the disk of query nodes quite quickly,
since we have petabytes of metrics data) doesn’t really scale for us. Also
we’re conscious of having to pay all the AWS bandwidth costs, considering the
dataset is in the petabytes and we run things both on premise and in the
cloud.

Anyway I could definitely talk at length about this, perhaps we should write
up something and put it on our wiki.

~~~
wvl0
That's a great reply there already. We haven't yet reached a situation where
we'd like to continuously query 'old' metrics.

Would love to see a nice writeup with any further considerations. Thanks for
the reply!

------
nnx
On the surface, this looks rather similar to InfluxDB (built in Go, similar
HTTP-style API, etc.).

What are the pros/cons of M3 compared to InfluxDB?

~~~
farnulfo
InfluxDB’s high availability/scalability (clustering) is not open source
([https://www.influxdata.com/products/editions/](https://www.influxdata.com/products/editions/)).
M3 seems to include it in the open source version.

~~~
pm90
Interesting point... I wonder if all this tech being open sourced by teams in
bigger firms is actually killing off smaller firms that rely on providing
these services. Not sure how I feel about it: I like having free stuff but
also want to support the little folks.

~~~
Arqu
I'm currently in the business of doing exactly this. My 5c is that though we
keep an eye out for new tech on this scene, usually the impact on customers is
very minimal.

There is a lot more value in a managed offering than there is in just doing it
yourself. A few top points:

- No DevOps

- Reduced cost due to scale

- Minimal engineering time spent

- Long-term reliability

Usually there are two buckets of people: those who want to do it themselves
anyway, so the service is not really what they want, and those who just want
to make it simple for themselves and reduce the cost of development and
ownership, especially if it's not core to their business. There is large value
in anything even if it just provides convenience and nothing else.

------
precisionemma
Hi @roskilli, I'm wondering how you guys achieve low latency, anything
particularly different from existing systems?

~~~
roskilli
Hey @precisionemma, sorry for the late reply - that's a great question. I'd
say there are a few non-standard things that help:

- Transport compressed blocks over RPC from storage nodes to the query service
(i.e. raw byte buffers instead of timestamp and float64 values) to reduce
response payloads (and skip serialization/deserialization steps)

- Auto-batching of fetches, where each batch can be fetched in parallel (i.e.
on a 24-core box, a single query can become 24 subqueries run concurrently)

- Series block-level caching (fine-grained cache) and index caching; you can
also enable async inserts, which heavily reduces lock contention at the series
map layer because inserts get batched together (durable since written to the
WAL, but a series may take a few milliseconds to show up after a write
finishes, so you can't always read your own writes with this enabled, which is
OK for metrics)

- Ensuring bloom filters and index summaries are in mmap and not on heap, to
reduce GC pressure; they are also scoped to each time window, which helps for
sparse time series because far less disk is read when the in-memory bloom
filters tell you you only need to fetch part of one file volume (out of many
potential file volumes) before going to disk

- Object pooling of a lot of data structures and keeping things in mmaps as
much as possible, which reduces the heap size and consequently GC pressure
(even using bytes instead of strings for keys and IDs was a huge win, because
byte buffers can be reused whereas strings are immutable and cannot be)

I hope that helps; it's not an entirely comprehensive list but it definitely
covers some things.

------
glup
Cool platform/capabilities aside, leave it to Uber to figure out how to make
an owl into a really creepy logo.

~~~
roskilli
Can't take credit for that one - it was a 99designs competition:
[https://99designs.com.au/logo-design/contests/establish-
bran...](https://99designs.com.au/logo-design/contests/establish-brand-logo-
uber-open-source-release-free-824550/entries)

I really liked the rabbit, entry #12 TBH.

