The query language is the brainchild of John Banning, one of the authors of the paper, and has a long history behind it. In 2007 or so he started working on a replacement for Borgmon's rule language; the thinking at the time was that the main problem with Borgmon was that its language was surprising and difficult for casual users to grasp. (And with a monitoring language, there are only casual users.)
That work eventually resulted in a language called Optic, which was indeed (IMO) a very nice cleanup of Borgmon. Ultimately though that work got shelved in favor of Monarch, whose focus was less on the language problems of Borgmon and more on the points listed in the introduction of the paper, especially points 1, 3, and 4 (at least in my memory).
The underpinnings of the query data model and execution model got hashed out reasonably well as part of the first implementation of Monarch, which started in earnest in late 2008 or early 2009. But the textual form of the query language suffered for quite a long time after that. I wrote the first crappy version of an operators-joined-by-pipes language sometime in 2010. ("Language" is a generous term; John liked to refer to it in a kindly way as "an impoverished notation.") But it was clear even then that the basics of that syntax were appealing: they lined up nicely with how our users mentally constructed their queries. "You start with the raw data; then apply a rate; then aggregate by these fields; then take the maximum over the last five minutes" etc.
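That mental model maps almost directly onto a pipe syntax. A query in the spirit of that description might look something like this (illustrative pseudocode only — I'm not reproducing the actual notation of any of these languages):

```
fetch raw_data          # start with the raw time series
| rate                  # then apply a rate
| group_by [job, zone]  # then aggregate by these fields
| max_over 5m           # then take the maximum over the last five minutes
```

Each stage consumes the output of the previous one, which is exactly the order in which users describe their queries out loud.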
Through a couple of revisions over the subsequent few years, that "impoverished notation" eventually got embedded, through some awful operator overloading, as a kind of DSL inside of Python. But it was clear to everyone that it would be impossible to release that publicly to GCP users; it was much too clunky, and also by then tied inextricably to Python idiosyncrasies. So in about 2015, give or take, we came back to the question of what a better textual notation might look like.
The obvious first choice was to see if we could somehow twist SQL into being useful, possibly with some custom functions or very minor extensions. Around this time there was a large effort going on to standardize several different SQL dialects that were being used by internal systems (BigQuery/Dremel's SQL dialect was not the same as Spanner's dialect, etc). So it felt like there was a convenient opportunity to somehow fit time series data into the same model.
John did a bunch of due diligence to try to make that idea work, but it just wouldn't fly. I remember a list he had of about fifty of the most common kinds of queries, written with a SQL version next to (an early version of) Monarch's current query language. Nearly everyone he showed it to, across the spectrum of experience and seniority, both SWE and SRE, said "of course I'd rather read and write SQL, let me look at that list"... and then went through it carefully and came out thinking, well, maybe not.
I don't know if there are any interesting conclusions to draw from the history of it, except that language design is really hard. I agree that it's a fun little language, and I'm very happy that John and the team managed to get it out publicly in Stackdriver.
Googlers still have to use the terrible python dsl (“mash”). Even worse: they have to use it wrapped in a different terrible python dsl (“gmon”). Sigh.
> The core point here is that all summary statistics are misleading. You need to be clear on what you care about
I couldn't agree more. A few months ago I gave a talk that tried in part to emphasize this point (https://www.youtube.com/watch?v=EG7Zhd6gLiw). mjb, I hadn't seen your post until just now but I wish I'd known about it earlier.
Another hard-earned lesson on many teams I've worked with is that humans just aren't very good at judging the variance that's intrinsic to many [summary] statistics. Even when your system is operating in what a human would consider a steady state, summary statistics are naturally going to bounce around a bit over time. The variance is often higher for tail percentiles just because the density of the PDF is lower in that region. When faced with a question like "did the behavior of my system get worse?" in response to an external change (such as a config change, a code deploy, a traffic increase, etc.), it can be difficult to come up with a reliable answer just by eyeballing a squiggly time series line.
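To make that concrete, here's a quick simulation of my own (a sketch, using a lognormal as a stand-in latency distribution): draw many independent "windows" of samples from the same unchanging distribution, and compare how much the p50 vs. p99 estimates wobble from window to window.

```python
import random
import statistics

def percentile(sorted_xs, p):
    # Nearest-rank percentile on an already-sorted sample.
    k = min(len(sorted_xs) - 1, int(p * len(sorted_xs)))
    return sorted_xs[k]

random.seed(0)

# Many "windows" of latency samples from the same steady-state
# distribution -- i.e., nothing about the system is changing.
p50_estimates, p99_estimates = [], []
for _ in range(200):
    window = sorted(random.lognormvariate(0, 1) for _ in range(1000))
    p50_estimates.append(percentile(window, 0.50))
    p99_estimates.append(percentile(window, 0.99))

# Even with zero underlying change, the p99 estimate bounces
# around far more than the median from window to window.
print("p50 stdev:", statistics.stdev(p50_estimates))
print("p99 stdev:", statistics.stdev(p99_estimates))
```

If you plot either series as a time series, the p99 line is visibly squigglier than the p50 line, which is why eyeballing a tail-latency graph after a deploy is such an unreliable way to answer "did it get worse?".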