
M3DB, a distributed timeseries database - Anon84
https://www.m3db.io/
======
roskilli
Thanks for the interest. I just did a talk at FOSDEM a few weeks ago on the
subject of querying over large datasets that M3DB can warehouse and query in
real time:

Slides
[https://fosdem.org/2020/schedule/event/m3db/attachments/audi...](https://fosdem.org/2020/schedule/event/m3db/attachments/audio/4032/export/events/attachments/m3db/audio/4032/FOSDEM_2020_Querying_millions_to_billions_of_metrics_with_M3DBs_inverted_index.pdf)

Video
[https://video.fosdem.org/2020/UD2.120/m3db.mp4](https://video.fosdem.org/2020/UD2.120/m3db.mp4)

------
missosoup
Uber has started many projects that ended up getting open sourced. And many of
them are now either abandoned or on life support. H3 comes to mind as
something we almost ended up using but luckily avoided.

These open-sourcings seem a bit like PR pieces with no guarantees of any
support or evolution after being published.

~~~
gtirloni
It's open source. Why should Uber give any guarantees? They are not in the
business of selling software.

Unless Uber is actively blocking contributions, it's not Uber's fault if no
community formed around something they open sourced.

As for this being a PR piece, they could have achieved the same with just a
detailed blog post and no code. It would be an expensive PR piece if they
had to open source work that probably took hundreds of development hours.

~~~
missosoup
I agree with everything you say.

But without any certainty around the roadmap, support, and long-term commitment
by Uber to maintain these projects, they're nothing more than interesting
repos amongst a sea of interesting repos.

The way Uber brands them suggests that they're suitable for use in production
environments, but so far that hasn't been the case with anything they open
sourced outside a narrow envelope that resembles their own operating model.
Maybe this project will set a new trend, but so far nothing they put out
gained any traction or became suitable for general purpose production use. In
that regard, H3 and their other projects have remained at the level of decent
'show HN' pieces rather than something you'd ever use professionally. In other
words, marketing.

Based on previous news coming out e.g.
[https://news.ycombinator.com/item?id=20931644](https://news.ycombinator.com/item?id=20931644)

It seems like Uber had too big of an engineering department with too little
work to do, so they started reinventing wheels. Which is cool if they're
willing to support them in the long term, but so far that hasn't proven to be
the case.

~~~
carlisle_
>It seems like Uber had too big of an engineering department with too little
work to do, so they started reinventing wheels. Which is cool if they're
willing to support them in the long term, but so far that hasn't proven to be
the case.

Former Uber engineer here. I can assure you that while our engineering team
was massive, there was anything but too little work. If anything, most
engineers were massively overtaxed. Whether or not the work we were
undertaking was meritorious and valuable is an entire branch of philosophy, I'm
pretty sure.

Part of the struggle at big companies is that a lot of existing solutions just
don't work. Let me use an example with chat. A few years ago Slack was
evaluated as a replacement for HipChat, since Atlassian's outages had finally
started affecting us during our own outages.

Everybody wanted to go to Slack, but the cost of Slack was tremendously
prohibitive and the state of the service then (as I was told) was such that it
could not support a company of Uber's size. Tremendous effort would have been
undertaken by Slack to support Uber and they didn't want to expend that effort
for a single customer. This was late 2015 early 2016.

There were tons of options, but ultimately an in-house chat software was
created. At the time it seemed required to make our own highly reliable chat,
considering how distributed engineering teams were. I think if you talk to
anybody without the background of how chat evolved at Uber they would think
the in-house chat project would have been a boondoggle.

Not all over-scoped engineering projects are actually so noble. There was
certainly a ton of "reinventing wheels" going on. There was significantly more
"these problems are really hard and I only have bad solutions."

Though if the result is ultimately "nothing more than interesting repos
amongst a sea of interesting repos", sign me up.

~~~
jjeaff
There seems to be a pervasive misconception among the very employees at Uber
that they built their own chat platform, when in reality (and someone please
correct me if I am wrong) uChat was a white-labeled Mattermost.

I have heard that the team that put it together actually tried to hide that
fact from the company (for the glory, I guess). But that could be apocryphal.

~~~
carlisle_
That's entirely not true. The team was always forthcoming about the fact it
was Mattermost, at least to other engineers.

Mattermost didn't work out of the box, and certainly not the way and at the
scale Uber needed it to. I'm not overly familiar with the technical details,
but one thing in particular stands out as an example. There was a Town Hall
channel that every user had to be a member of. This unfortunately did not
scale, and not enough ACLs were available to limit all the ways users could
use this universal room. Eventually they really fixed the problem, but it was
a tremendous pain point for a long time. There were a lot of fundamentally
"less than great" things about Mattermost that had to get updated to work for
Uber.

There was the amusing time employees found out anybody could change the topic
in the room, even if our chat permissions had been disabled. It was absolute
chaos for at least an hour; I can't remember if it actually negatively
impacted the deployment, though it sounds vaguely familiar.

It's pretty telling of employees that badmouth the uChat team. That team
ultimately was trying to do what they thought was best for the company, even
if at the time it seemed like they bit off more than they could chew. There
was no other engineering team so directly visible and exposed to the entire
company internally like they were. People dismissive of their efforts are
generally not used to the difficulty of making so many very vocal customers
happy all at the same time, and could be more sympathetic.

~~~
jjeaff
I dug up the comment I saw that mentioned this. They also claim to be an Uber
employee.

[https://news.ycombinator.com/item?id=19101617](https://news.ycombinator.com/item?id=19101617)

The fact that uChat is commonly mentioned by Uber employees as "the custom
chat solution we built in-house" leads me to believe that comment claiming
the team tried to hide that it was built on open source.

I don't doubt for a second that scaling Mattermost for a huge organization
like Uber was a big undertaking. But it seems disingenuous for people to
always mention that Uber built uChat when it should be more like "Uber put a
lot of work into Mattermost to scale it up."

------
monstrado
On a related note, one of their engineers wrote a POC that uses FoundationDB
instead of their custom storage engine.

[https://github.com/richardartoul/tsdb-layer](https://github.com/richardartoul/tsdb-layer)

The README does a really good job explaining the internals and motivation.

------
tnolet
I get that Uber is huge. But honestly, was there nothing out there that could
fulfill their use case? Cassandra, ElasticSearch, Influx, etc.? I might be
completely wrong, but I just highly doubt that.

~~~
buro9
It's a database for a metric platform.

Think of OpenTSDB and Prometheus. Or for a better comparison think of Thanos
[https://thanos.io/](https://thanos.io/)

As to whether they could fulfil Uber's needs, the thing about scale (real
massive scale; I work at Cloudflare) is that everything breaks in weird ways
according to your specific uses of a technology. The things listed above work
for companies, until they don't. There are few things that seem to truly work
at every scale; Kafka and ClickHouse come to mind, for wholly different use
cases than a time series database.

~~~
linuxhansl
We run OpenTSDB (which stores its data in Apache HBase) at scale.

150m-200m events/minute and about 20-30 trillion (10^12) events stored.
Doubling about every 12-18 months or so.

While it's true that things start to creak at scale, this has worked
remarkably well for us so far. I doubt M3DB is somehow magical in this regard.

~~~
roskilli
M3DB ingested 30 million datapoints per second (so 1.8 billion per minute),
with each node handling hundreds of thousands of writes per second. The dataset
was in the petabytes.

For us, the cost savings vs OpenTSDB (millions of dollars of hardware), the
faster query times and the reduction in on-call overhead were worthwhile.

~~~
shaklee3
Hundreds of thousands per second isn't very high when you compare that to
clickhouse or kdb+.

------
synack
I set up a lot of Uber's early metrics infrastructure, so I can speak to how
they got to the place where building a custom solution was the right answer.

In the beginning, we didn't really have metrics, we had logs. Lots of logs. We
tried to use Splunk to get some insight from those. It kinda worked and their
sales team initially quoted a high-but-reasonable price for licensing. When we
were ready to move forward, the price of the license doubled because they had
missed the deadline for their end of quarter sales quota. So we kicked Splunk
to the curb.

Having seen that the bulk of our log volume was noise and that we really only
cared about a few small numbers, I looked for a metrics solution at this
point, not a logs solution. I'd operated RRDtool based systems at previous
companies, and that worked okay, but I didn't love the idea of doing that
again. I had seen Etsy's blog about statsd and set up a statsd+carbon+graphite
instance on a single server just to try it out and get feedback from the rest of
the engineering team. The team very quickly took to Graphite and started
instrumenting various codebases and systems to feed metrics into statsd.

statsd hit capacity problems first, as it was a single-threaded Node.js process
and used UDP for ingest, so once it approached 100% CPU utilization, events
got dropped. We switched to statsite, which is pretty much a drop-in
replacement written in C.
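
For reference, statsd's wire format is tiny: a metric name, a value, and a type, separated by `:` and `|`, fired over UDP. A minimal sketch (the host and port are the conventional defaults, not anything Uber-specific):

```python
import socket

def format_statsd(name: str, value, metric_type: str = "c") -> bytes:
    # statsd line protocol: <name>:<value>|<type>, e.g. "page.views:1|c"
    return f"{name}:{value}|{metric_type}".encode("ascii")

def send_metric(name, value, metric_type="c", host="127.0.0.1", port=8125):
    # Fire-and-forget UDP: if the daemon can't keep up, packets are silently
    # dropped, which is exactly the failure mode described above.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_statsd(name, value, metric_type), (host, port))
```

The silent-drop behavior is the flip side of UDP's cheapness: the sender never blocks, but it also never learns that events were lost.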

The next issue was disk I/O. This was not a surprise. Carbon (Graphite's
storage daemon) stores each metric in a separate file in the whisper format,
which is similar to RRDtool's files, but implemented in pure Python and
generally a bit easier to interact with. We'd expected that a large volume of
random write ops on a spinning disk would eventually be a problem. We ordered
some SSDs. This worked okay for a while.

At this point, the dispatch system was instrumented to store metrics under
keys with a lot of dimensions, so that we could generate per-city, per-
process, per-handler charts for debugging and performance optimization. While
very useful for drilling down to the cause of an issue, this led to an almost
exponential growth in the number of unique metrics we were ingesting. I set up
carbon-relay to shard the storage across a few servers (I think there were
three, but it was a long time ago). We never really got carbon-relay working
well. It didn't handle backend outages and network interruptions very well,
and would sometimes start leaking memory and crash, seemingly without reason.
It limped along for a while, but wasn't going to be a long-term solution.

We started looking for alternatives to carbon, as we wanted to get away from
whisper files... SSDs were still fairly expensive, and we believed that we
should be able to store an append-only dataset on spinning disks and do batch
sequential writes. The infrastructure team was still fairly small and we
didn't have the resources to properly maintain an HBase cluster for OpenTSDB or
a Cassandra cluster, which would've required adapting carbon. I understand
that Cassandra is a supported backend these days, but it was just an idea on a
mailing list at that point.

InfluxDB looked like exactly what we wanted, but it was still in a very early
state, as the company had just been formed weeks earlier. I submitted some bug
reports but was eventually told by one of the maintainers that it wasn't ready
yet and I should quit bugging them so they could get to MVP.

Right around this time, we started having serious availability issues with
metrics, both on the storage side (I estimated we were dropping about 60% of
incoming statsd events) and on the query side (Graphite would take seconds to
minutes to render some charts and occasionally would just time out). We had
also built an ad-hoc system for generating Nagios checks that would poll
Graphite every minute to trigger threshold-based alerts, which would make
noise if Graphite was down and the monitored system was not. This led to on-
call fatigue, which made everybody unhappy.

We started running an instance of statsite on every server, which would
aggregate the individual events for that server into 10 second buckets with
the server's hostname as a key prefix, then push those to carbon-relay. This
solved the dropped-packets issue, but carbon-relay was still unreliable.
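
That per-host aggregation scheme can be sketched roughly like this (a toy model, not statsite's actual implementation; the class and method names are made up):

```python
import time
from collections import defaultdict
from socket import gethostname

class HostAggregator:
    """Sum counter events into fixed 10-second buckets, keyed by a
    hostname-prefixed metric name, before shipping them upstream."""

    def __init__(self, window=10, hostname=None):
        self.window = window
        self.hostname = hostname or gethostname()
        self.buckets = defaultdict(float)  # (bucket_start, key) -> summed value

    def record(self, metric, value, ts=None):
        ts = time.time() if ts is None else ts
        bucket = int(ts // self.window) * self.window  # floor to window start
        key = f"{self.hostname}.{metric}"              # hostname as key prefix
        self.buckets[(bucket, key)] += value

    def flush(self):
        # Emit aggregated (timestamp, key, value) tuples for the relay and
        # reset local state: one line per metric per window instead of one
        # UDP packet per event.
        out = [(b, k, v) for (b, k), v in self.buckets.items()]
        self.buckets.clear()
        return out
```

The win is in the flush: thousands of per-event packets collapse into one aggregated line per metric per 10-second window, which is what relieved the dropped-packet pressure.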

We were pretty entrenched in the statsd+graphite way of doing things at this
point, so switching to OpenTSDB wasn't really an option and we'd exhausted all
of the existing carbon alternatives, so we started thinking about modifying
carbon to use another datastore. The scope of this project was large enough
that it wasn't going to get built in a matter of days or weeks, so we needed a
stopgap solution to buy time and keep the metrics flowing while we engineered
a long term solution.

I hacked together statsrelay, which is basically a re-implementation of
carbon-relay in C, using libev. At this point, I was burned out and handed off
the metrics infrastructure to a few teammates that ran with statsrelay and
turned it into a production quality piece of code. Right around the same time,
we'd begun hiring for an engineering team in NYC that would take over
responsibility for metrics infrastructure. These are the people that
eventually designed and built M3DB.

~~~
4d617832
Really interesting read for me. I am currently not far from your SSD point,
but our setup still works fine most of the time. It's just 100k/m though. I am
trying to use more of the Go implementations of the Graphite stack, which did
improve load. I will probably consider M3DB to get some benchmarks. All the
other ones would require more people, as you said.

------
ksec
How does it compare to TimescaleDB [1] ?

[1] [https://www.timescale.com](https://www.timescale.com)

~~~
akulkarni
TimescaleDB co-founder here.

TimescaleDB is a more versatile time-series database. It supports a variety of
datatypes (text, ints, floats, arrays, json), allows for out-of-order writes
and backfilling of old data, supports full SQL, JOINs between tables (e.g. for
metadata), flexible continuous aggregates, native compression, and is backed
by the reliability of Postgres. [0]

M3DB seems much more limited in scope [1]:

"Current Limitations

Due to the nature of the requirements for the project, which are primarily to
reduce the cost of ingesting and storing billions of timeseries and providing
fast scalable reads, there are a few limitations currently that make M3DB not
suitable for use as a general purpose time series database.

The project has aimed to avoid compactions when at all possible, currently the
only compactions M3DB performs are in-memory for the mutable compressed time
series window (default configured at 2 hours). As such out of order writes are
limited to the size of a single compressed time series window. Consequently
backfilling large amounts of data is not currently possible.

The project has also optimized the storage and retrieval of float64 values, as
such there is no way to use it as a general time series database of arbitrary
data structures just yet."

[0] [https://www.timescale.com/](https://www.timescale.com/)

[1] [https://m3db.github.io/m3/m3db/#current-limitations](https://m3db.github.io/m3/m3db/#current-limitations)

------
jmakov
So how does this compare to e.g. Clickhouse?

~~~
bdcravens
Clickhouse is an analytic column-based RDBMS. It's not a timeseries database.
Each class of product is used to solve different problems.

~~~
aeyes
Clickhouse works exceptionally well as a TSDB.

~~~
roskilli
While this is true, for a metrics workload it does not work great, as I have
both seen myself and heard from others, mainly because it does not have an
inverted index: finding a small subset of metrics in a dataset of billions
ends up taking significant time, due to the scan required to find the
timeseries matching the arbitrary number of dimensions specified.

If you're building it for a specific application with a concrete schema you
can design for fast queries, and you don't have requirements for arbitrary
dimensions being specified for lookup, then yes, it's great as a TSDB.

Prometheus, M3DB, etc all use an inverted index alongside the column store
TSDB to help with metrics workloads.

~~~
mbell
Most practical applications using Clickhouse for metrics data store the metric
index separately. What index you want really depends on the metric system,
e.g. with graphite data you don't want an inverted index, you want a trie.

~~~
roskilli
Yes, I've seen that work as well. It's a lot of stitching things together
yourself, and we had to put a lot of caching in front of the inverted index we
were using, but it's definitely plausible. ClickHouse doesn't do any streaming
of data between nodes as you scale up and down, which was a big thing for us
since we had large datasets and needed to rebalance when the cluster
expanded or shrunk.

With regards to trie vs inverted index for Graphite data, I'd actually still
be inclined to say an inverted index is better, based on the amount of queries
I saw at Uber with Graphite where people did `servers.*.disk.bytes-used` type
queries. These are much faster with an inverted index, since you have a
postings list for each part of the dot-separated metric name, rather than
traversing a trie with thousands to tens of thousands of entries at index 1,
the host part of the Graphite name. This is what M3DB does[0].

[0]:
[https://github.com/m3db/m3/blob/b2f5b55e8313eb48f023e08f6d53...](https://github.com/m3db/m3/blob/b2f5b55e8313eb48f023e08f6d53350fabf09338/src/cmd/services/m3coordinator/ingest/carbon/ingest.go#L318-L330)
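
To illustrate the idea (a toy sketch, not M3DB's actual code): keeping one postings map per path position lets a query intersect small candidate sets for its literal segments instead of walking a trie of every host.

```python
from collections import defaultdict
from fnmatch import fnmatchcase

class MetricIndex:
    """Toy inverted index over dot-separated metric names: one postings map
    per path position, so queries narrow by their literal segments."""

    def __init__(self):
        self.metrics = []
        # postings[position][segment] -> ids of metrics with that segment there
        self.postings = defaultdict(lambda: defaultdict(set))

    def add(self, name):
        mid = len(self.metrics)
        self.metrics.append(name)
        for pos, seg in enumerate(name.split(".")):
            self.postings[pos][seg].add(mid)

    def query(self, pattern):
        # Intersect postings lists for the literal (non-wildcard) segments...
        segs = pattern.split(".")
        candidates = None
        for pos, seg in enumerate(segs):
            if "*" in seg or "?" in seg:
                continue  # wildcard segment: no narrowing possible here
            ids = self.postings[pos].get(seg, set())
            candidates = ids if candidates is None else candidates & ids
        if candidates is None:  # all-wildcard query: full scan
            candidates = range(len(self.metrics))
        # ...then verify the survivors segment by segment.
        out = []
        for i in candidates:
            parts = self.metrics[i].split(".")
            if len(parts) == len(segs) and all(
                    fnmatchcase(p, s) for p, s in zip(parts, segs)):
                out.append(self.metrics[i])
        return sorted(out)
```

For `servers.*.disk.bytes-used`, only the wildcard host position is left unfiltered; the `servers`, `disk`, and `bytes-used` postings lists shrink the candidate set before any name is fully matched.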

~~~
idjango
Just to point out that there is an inverted index implementation for Graphite
data running on ClickHouse.

Regarding the auto-rebalance feature, I couldn't agree with you more. It's
something that ClickHouse definitely needs to handle internally.

~~~
roskilli
That's interesting, I had not heard of ClickHouse as a backend for Graphite
with an inverted index. Let me know if you have any links to that.

I'm assuming this is an out of process inverted index used alongside
ClickHouse? Or is it more of a secondary table contained by ClickHouse which
can be searched to find the metrics, then the data is looked up?

The latter doesn't scale as well with billions of unique metrics, since it's
always a scan across the unique metrics stored in the time window your query
searches (since any arbitrary dimensions can be specified, all must be
evaluated). This is the drawback of PromHouse, an implementation of
Prometheus remote storage on top of ClickHouse, and the major reason why
PromHouse was only ever a proof of concept rather than a production offering.

------
MichaelRazum
OK, so everything open source was not good enough. Please publish a simple
benchmark; without one, it is hard to make decisions.

~~~
hagen1778
I'm aware of only one public benchmark including some competitors:
[https://promcon.io/2019-munich/talks/remote-write-storage-wa...](https://promcon.io/2019-munich/talks/remote-write-storage-wars/)
Would like to see more of this.

------
katzgrau
Nice... patiently waits for AWS to create a managed version of it...

~~~
staticassertion
[https://aws.amazon.com/timestream/](https://aws.amazon.com/timestream/)

~~~
katzgrau
Yep, I registered for the "preview" ages ago, so I figured it was vaporware at
this point

------
TheRealPomax
admins/mods: this needs an apostrophe to turn it into "Uber's M3DB".

For anyone who's never heard of M3DB, and lives in a place where Uber doesn't
operate or is even banned (and so isn't part of daily life or conversation),
"Ubers" might just as easily be some db researcher affiliated with the
university of who knows where showing off something they came up with last
summer and got a grant for.

~~~
tlb
Fixed, thanks

------
heliodor
When the Android app is broken in so many easy-to-fix ways that blatantly
interfere with usability, how does a company allow its developers to spend
time on making custom internal tools or even spend time open-sourcing them?
The company has so much money and yet seems so utterly mismanaged.

~~~
rossjudson
Sounds like off-the-shelf tooling just didn't work. What's your solution for
that?

~~~
heliodor
In the grander scheme that I'm discussing, the solution you're asking about is
to gather no more metrics than the off-the-shelf tools allow and to spend the
engineering effort on fixing the simple bugs that prevent customers from
getting cars. Why do they need a ton of metrics when they can't fix
simple things such as:

\- the car icon doesn't move as location updates come in (as evidenced by the
route line getting shorter)

\- the on-screen keyboard does not allow me to type anything after the first
message (no other app on my phone has this problem)

\- after I rate a driver, the app shows a map and none of the UI. I have to
kill the app and restart it in order for it to be usable again. Picture this:
I'm trying to get a ride, I open the app, I get nagged to rate the last
driver. I agree just to be nice to the driver. (I should skip instead and get
back to my task of getting a ride). After putting up with the nagware, the app
fails 100% and I cannot complete the original task!

Alternatively, if they demand the collection of so many metrics, why don't
they collect the metrics that would show them just how pathetically broken
their Android app is?

------
clircle
Does "Time Series Database" mean anything technical, or is this just some Uber
marketing? In statistics, time series has a technical meaning.

~~~
abvdasker
A time series database is specialized for use cases where the data and query
patterns are solely temporal in nature and must show the latest data in real-
time (performance metrics/monitoring and stock prices come to mind).
Relational and NoSQL databases tend to degrade rapidly with these query
patterns at scale (think of the complexity of SQL queries to bucket rows by
timestamp).

[https://en.m.wikipedia.org/wiki/Time_series_database](https://en.m.wikipedia.org/wiki/Time_series_database)
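
As a rough illustration of the kind of rollup a TSDB does natively (a toy sketch, not any particular database's implementation):

```python
from collections import defaultdict

def downsample(samples, bucket_seconds=60):
    # Average (timestamp, value) samples into fixed-width time buckets,
    # the rollup a TSDB performs natively on ingest or query, but which a
    # relational database must express as GROUP BY over computed timestamps.
    sums = defaultdict(lambda: [0.0, 0])
    for ts, value in samples:
        bucket = int(ts // bucket_seconds) * bucket_seconds
        sums[bucket][0] += value
        sums[bucket][1] += 1
    return {b: total / count for b, (total, count) in sorted(sums.items())}
```

A TSDB bakes this bucketing into its storage layout, which is why the same query that degrades a general-purpose database at scale stays cheap there.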

~~~
refset
Note that temporal databases are also a thing, so it's probably wise to avoid
using the word "temporal" when discussing time series databases. As far as I
know kdb+ is the only technology that has a foot in both camps.

[https://en.m.wikipedia.org/wiki/Temporal_database](https://en.m.wikipedia.org/wiki/Temporal_database)

~~~
CharlesW
> _As far as I know kdb+ is the only technology that has a foot in both
> camps._

Teradata Vantage also supports both. And you're absolutely right, it's
important not to conflate "temporal" and "time series" support.

