Hacker News new | past | comments | ask | show | jobs | submit login
Monarch: Google’s Planet-Scale In-Memory Time Series Database [pdf] (vldb.org)
175 points by ngaut 30 days ago | hide | past | favorite | 48 comments

As someone interested in PL theory, I've long found the query language exposed by Monarch surprisingly interesting (briefly discussed in section 5.1 but the description doesn't quite do it justice). It's a functional language, a breath of fresh air compared to "real programming languages" in use at Google like C++, Java, or Go.

The most interesting idea is that its native data types are time series of integers, doubles, distributions of doubles, booleans, and tuples of the above. This means that the data you operate on intrinsically consist of many timestamped data points. It's easy to apply an operation to each point of the data, and it's also easy to apply operations on a rolling window, or on successive points. This makes the language have the feel of an array-based language, but even better because the elements are timestamped and the array can be sparse.

Furthermore the presence of fields in each data point adds more dimensions to the level of aggregation (not just the inherent time-based). Now the language has the feel of a native multi-dimensional array language. It feels amazing to program in it. You can easily do sophisticated queries like figuring out how many standard deviations each task's RPC latency for a specific call is above or below all tasks' mean, for outlier detection.

The query language is the brainchild of John Banning, one of the authors of the paper, and has a long history behind it. In 2007 or so he started working on a replacement for Borgmon's rule language; the thinking at the time was that the main problem with Borgmon was that its language was surprising and difficult for casual users to grasp. (And with a monitoring language, there are only casual users.)

That work eventually resulted in a language called Optic, which was indeed (IMO) a very nice cleanup of Borgmon. Ultimately though that work got shelved in favor of Monarch, whose focus was less on the language problems of Borgmon and more on the points listed in the introduction of the paper, especially points 1, 3, and 4 (at least in my memory).

The underpinnings of the query data model and execution model got hashed out reasonably well as part of the first implementation of Monarch, which started in earnest in late 2008 or early 2009. But the textual form of the query language suffered for quite a long time after that. I wrote the first crappy version of an operators-joined-by-pipes language sometime in 2010. ("Language" is a generous term; John liked to refer to it in a kindly way as "an impoverished notation.") But it was clear even then that the basics of that syntax were appealing: they lined up nicely with how our users mentally constructed their queries. "You start with the raw data; then apply a rate; then aggregate by these fields; then take the maximum over the last five minutes" etc.

Through a couple of revisions over the subsequent few years, that "impoverished notation" eventually got embedded, through some awful operator overloading, as a kind of DSL inside of Python. But it was clear to everyone that it would be impossible to release that publicly to GCP users; it was much too clunky, and also by then tied inextricably to Python idiosyncrasies. So in about 2015, give or take, we came back to the question of what a better textual notation might look like.

The obvious first choice was to see if we could somehow twist SQL into being useful, possibly with some custom functions or very minor extensions. Around this time there was a large effort going on to standardize several different SQL dialects that were being used by internal systems (BigQuery/Dremel's SQL dialect was not the same as Spanner's dialect, etc). So it felt like there was a convenient opportunity to somehow fit time series data into the same model.

John did a bunch of due diligence to try to make that idea work, but it just wouldn't fly. I remember a list he had of about fifty of the most common kinds of queries, written with a SQL version next to (an early version of) Monarch's current query language. Nearly everyone he showed it to, across the spectrum of experience and seniority, both SWE and SRE, said "of course I'd rather read and write SQL, let me look at that list"... and then went through it careful and came out thinking, well, maybe not.

I don't know if there are any interesting conclusions to draw from the history of it, except that language design is really hard. I agree that it's a fun little language, and I'm very happy that John and the team managed to get it out publicly in Stackdriver.

Googlers still have to use the terrible python dsl (“mash”). Even worse: they have to use it wrapped in a different terrible python dsl (“gmon”). Sigh.

I was mostly using th new notation when I left in 2018.

Stream values can also be strings, and the language is terrible for dealing with string-valued streams when they come up (but they come up surprisingly often, you just don't normally worry about them too much).

If you ignore the issue of alignment, I actually think that a more conventional array based language, either something SQL-like or numpy-like, would be more accessible to most people. And things like windowing are incredibly unintuitive for most users.

Haven't read the paper yet, but could you expand on the tuple use? It seems like the odd person out in that list of primatives.

They are used when joining timeseries.

A join creates a tuple but that's not the only way to use them. You can also just produce a tuple with an expression in the query such as (val(), 5, "dog") if you like. The whole language is documented here:


Interesting that this paper contains hard numbers hinting at Google's absolute scale. They say Monarch has 144000 leaves. Even if each leaf is assigned only 1 CPU core -- which is probably a low estimate because who would do that? -- that makes Google's monitoring stack a Top 100 supercomputer.

The only other places I've seen Google give out hard numbers were a presentation by Jeff Dean mentioning map-reduce core-years consumed per day, and a footnote in a paper that mentions how much CPU time Google Exacycle gives away to scientific computing every day. All of these calibration points are eye-opening.

they also estimate close to a petabyte of RAM; if all the RAM is spent in leaves that is about 6-7GB per leaf. I don't know that we can say it would be unreasonable to have 1 core per leaf replica; presumably some leaves have low utilisation so it might make sense to share the core with another workload. From a capacity planning standpoint, I think they leave this open though they do indicate that they are sometimes CPU bound so don't try to compress beyond delta compression. That might suggest multiple cores per leaf.

A single core design allows a very simple concurrency model, without having to worry about cache pingponging, false sharing, or myriad other issues. The parallelism is applied at higher layers, as there are multiple replicas for each leaf and obviously they can use many cores effectively overall.

I don't see that the paper gives enough information to help us prune the design space here.

We don't run exacycle any more (I built and ran exacycle for several years). It's not a cost-effective way to do science, but yes, the scale was absolutely insane. These days, I'm more interesting in seeing if there are ways to use TPUs, rather than CPUs, for similar kinds of opportunistic computing.

Yeah the "supercomputer" ranking is a bit of a joke. Every mid-sized google dc would count as a top 10 supercomputer.

I work at one such "medium sized" Google DCs. Supercomputers are typically much more interconnected, whilst we have a much more traditional topology.

Supercomputers are more about the network topology than raw processing power.

Isn't the whole point of a supercomputer that it isn't just a datacentre with an LED display on the front?

supercomputers, by tradition, are tightly coupled in a way that Google datacenter servers aren't. The closest thing Google has to supercomputers are GPUs linked by high performance networks, and TPUs (which have their own custom toroidal mesh).

When I had done the presentation at Facebook's @Scale NYC conference last summer, I talked the powers that be into using more specific numbers, and they allowed it for the paper too.

It's not a supercomputer.

The reason supercomputers are so tiny compared to cloud data-centers is that cloud computing is highly and mostly-triviallly parallel, but supercomputers are mesh and serial.

That appears to be your own private definition of the term. There are lots of things in the top500 list that are just a pile of Xeon boxes with Ethernet.

*that's using monarch

Is the entire internet the world's largest supercomputer?

Question for googlers (and ex-googlers). Is there anything out there that's as convenient as /streamz, but in opensource form?

Prometheus is as convenient as /streamz in the sense that it has service discovery, so you don't have to do any work to have your endpoints discovered and scraped.

I haven't found anything as good as Monarch and the internal dashboard, though. Monarch always made sense to me, by not doing anything clever. For example, if you want to aggregate across different streams, you have to "align" them first, and the alignment is the operation that defines how to make up missing points. Other systems don't have that alignment step, and it causes me to struggle with everything else I want to do.

As far as I can tell, Prometheus and InfluxDB have implicit alignment, but are never clear on what the rules are. At my last job, I collected some of the exact same metrics as I did at Google, and while I knew exactly how to get the charts I wanted with Monarch, I could never figure out how to get them with InfluxDB (I later found out that it simply didn't support my use case). (The exact use case was collecting counters from network devices opportunistically, and then generating a bandwidth chart for a section of the network. At Google it was easy; I could do the differencing to turn the packet counters into "bytes sent over the last X seconds", then align the streams so that data was available at each time, then aggregate over topology. With InfluxDB... the query language lets you express that, but it doesn't yield correct results. I complained about it on HN and the author of did a lot of handwaving about how what I want to do is wrong, or something, and so I just wrote my own thing instead.)

The other thing I miss from Monarch is the default visualization for histograms. I have to manually reproduce that in Grafana with a series of manual queries like histogram_quantile(0.99, ...), histogram_quantile(0.90, ...), histogram_quantile(0.85, ...), ... where that was just the default visualization. Again, the open source world has some defaults, but I can't make heads or tails of what they are (try a default histogram visualization in Grafana)... Monarch just did the right thing by default.

I miss it.

6-7 years ago I used to use Statsd+Graphite. Statsd is kinda like streamz (although it works in a completely different way, but from the app perspective it's similar), and Graphite is like Monarch. It had a pretty ugly UI, and I think today people use Grafana as the front-end instead.

If you are looking for /varz, https://prometheus.io

If you really want /streamz with simple aggregation for distribution, description for metrics etc in an HTML page, I haven't encountered one yet.

No, looking for streamz specifically. I'd like to be able to see how each node is doing and have automatic, zero-config aggregation in k8s a-la Borg/Monarch. The service I built was one of the first large users of Monarch at Google many years ago. All because I couldn't be bothered to learn Borgmon. :-)

You should look into Riemann: http://riemann.io/

I wonder if Google will ever open source it.

Googler, opinions are my own.

Open sourcing anything at Google tends to be rather difficult because much of the technology we build is built on other technology that has not been released. It tends to take a great amount of effort to open source anything at Google because of all the internal dependencies you have.

It should be noted that Prometheus is an open source recreation of the precursor to monarch (developed by a former Google SRE that has since returned).


> Open sourcing anything at Google tends to be rather difficult because much of the technology we build is built on other technology that has not been released

Monarch might be an exception to that, though, because - as noted in the paper -, it has to be built without many dependencies, since almost everything else depends on it for monitoring and alerting.

That being said, I'm sure it would be extremely difficult to open source now since it was never built with open sourcing in mind.

It might not depend on many other services, but that doesn't preclude it depending on other code.

But when you publish a paper on Monarch, aren't you giving away the core idea?

In the case of PageRank, at least it was published after Google had already dominated the search space, i.e., many years after the conception, implementation, and utilization.

I wonder if Google papers are internally reviewed before publishing so as to make sure only partial information is revealed, and not the secret sauce.

They aren’t saying it is difficult without giving away trade secrets, they are saying it is difficult because the software has dependencies on internal services, which themselves have dependencies on other internal services. Basically, in order to run it you need to also run the entirety of Google’s tech stack. This doesn’t work for an open source project.

They aren’t worried about ‘giving away the idea’, they just don’t have an easy technical way to open source just the one component.

You could definitely build Monarch from this paper. It is very detailed. But keep in mind that Monarch is already ten years old.

> But when you publish a paper on Monarch, aren't you giving away the core idea?

The thing is Monarch is not the secret sauce that makes Google money. Every other company in the world could be running monarch, and Google revenue and market position would not be affected.

Publicizing this is a win-win situation. The idea gets exposure and potentially other devs can be benefit, and Google gets some good PR as a place where devs get to work on cool stuff.

[Also a Googler, opinons mine] In addition to what the other people are saying: there are some limitations to monarch (or really, the data upload path) that are quite annoying, so monarch isn't even necessarily the "best". It's just very good. There are ways to improve it.

The issue is, even if you give away the secret sauce that doesn't really help with making the secret sauce scale or whatnot, nor does anyone that isn't a large cloud provider need a custom solution like monarch. Prometheus or Datadog work fine for everyone else. This might be interesting reading for those companies, but also it might not be, because those can't be as centralized as monarch is (consider if prometheus had an API and ran a centralized cluster of data-ingestion servers, and you made time series to that global, Prometheus-owned, cluster).

> nor does anyone that isn't a large cloud provider need a custom solution like monarch

I'm not actually sure that's true. It's like many other things inside Google - people outside don't necessarily understand the value or know what's actually possible, because they've never experienced anything similar. It's sort of like trying to discuss the finer points of the taste of oysters with someone who has never tasted them.

I would very much like the feature set of Monarch (and streamz), without the maintenance overhead or even insane scale. Very, very few companies out there need to run anything at anywhere near "billion-user" scale, but literally all of them could benefit from painless and detailed monitoring that Monarch offers.

The one thing I really want (which apparently Monarch has) is histogram retention. I'm often called upon to summarize service latency as global p50 and p95, and at the sheer volume of data we have, we aggregate that metric. Thus I am left calculating an average of p95s, which isn't super useful.

To the best of my knowledge, nothing else in the market does that.

Stackdriver/Google Cloud monitoring is backed by Monarch, so if you want the flavor of a Monarch distribution-valued metric, see the docs:


Since the distribution is represented by a CDF of buckets, there's no guarantee that you'll get an accurate representation of the median or any other quantile. On the other hand you'll get an exact average.

Monarch has distributions with predetermined bucket boundaries. These are indeed very useful.

Pet peeve: it can calculate and graph something it calls a quantile. But if the value is in the middle of a large bucket, it will just interpolate or something and the result will be terribly misleading. It'd be much better if it gave lower/upper bounds.

I try to use distributions via questions of the form "what fraction of values are less than / greater than N [which I've verified is a bucket boundary]?". This gives you an answer you can trust.

Circonus (https://www.circonus.com/) supports both recording histogram directly, and merging histograms for analysis. IIUC, it also supports first-class timeseries data similarly to Monarch, where each data point has a high precision timestamp that does not have to align with other timeseries in the data set.

> aren't you giving away the core idea?

The core idea is a distributed time-series DBMS... not much to give away there. It "gives away" some architectural novelty, but it's not the solution to P=NP. These papers typically describe engineering feats more than they do a revoluntionary idea.

I doubt

As impressive a the Planet-Scale sounds like, monitoring system has been essential in a lot of companies, big or small, and there are tons of companies dedicated to this problem domain.

So it is not an untapped territory for many. Google solved a problem that is relevant to Google and its scale. The solution is cool, but it is only so because the problem exists in the first place.

Without Google's problem, the solution will be seen as incredibly over-engineered and unnecessary.

Trust me, it's not that impressive if you want to use it in your dozens of machines cluster...

Given the size of the beast, I wonder what monitors it?

Monarch is monitored by a second instance of itself known as meta-monarch. It’s a fraction of the size, of course, but still mind-bogglingly large. Meta is in turn monitored by the main Monarch instance.

A borgmon

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact