My Philosophy on Alerting: Observations of a Site Reliability Engineer at Google (docs.google.com)
573 points by ismavis on Oct 13, 2014 | 119 comments


This reminds me of an excellent talk my friend Dan Slimmon gave called "Car Alarms and Smoke Alarms". He relates monitoring to the concepts of sensitivity and specificity in medical testing (http://en.wikipedia.org/wiki/Sensitivity_and_specificity). Sensitivity is about the likelihood that your monitor will detect the error condition. Specificity is about the likelihood that it will not create false alarms.

Think about how people react to smoke alarms versus car alarms. When the smoke alarm goes off, people mostly follow the official procedure. When car alarms go off, people ignore them. Why? Car alarms have very poor specificity.

I'd add another layer: car alarms are Not My Problem. But that's just me and not part of Dan's excellent original talk.


We just got new fire alarms installed in our building. They go off 2-4 times per day and everyone just ignores them. Toasting your bread to a light brown is enough to set them off. You can easily hear the neighbours' fire alarms. There is no way of removing the battery in the new alarms; the only thing you can do is take it down and stuff it in a drawer somewhere. Not even pushing the button helps. This makes them totally useless, even more so than a car alarm. Luckily we got ours exchanged and they go off much less now.


Somehow our rented house has three separate smoke detectors clustered around the kitchen, all of which have been renamed cooking detectors, since the slightest hint of heat seems to set them off, triggering every other detector in the house to alarm as well.

They do at least have a snooze button on them, but I was pretty close to buying a set of the Nest alarms out of my own pocket and replacing them until I saw the price tag.


They are probably positioned too close to cooking surfaces. Additionally, the method of detecting smoke will also determine what kind of false alarms they generate.


> Not even pushing the button helps.

If it's like every other smoke detector I've encountered, the button is to test the alarm, not to cancel it.


All the storebought smoke alarms I've seen in the past few years have two buttons (or one button with two functions) - one to test, one to shut it off for a few minutes when something besides a housefire is making it go off.

It may vary by country I suppose.


I've got a mains-attached one (ie not solely battery powered), and the button is "hush" which silences it for 5 mins (although it chirps once per minute to remind you that it's hushed). If you hold the button whilst it's not going off, it tests the alarm.


You're lucky. When someone in my building triggers the fire alarm for their flat, the really loud alarm sounds in all the flats in the building, with an audio note saying that someone has triggered the fire alarm and you can leave if you want, but it's not required. I guess it will tell you when they have confirmed it's a real fire and you really have to leave.


If you're in the US, watch out. I think there are laws against tampering with fire alarms.

(Though I think you're in a ridiculous situation)


Not necessarily recommended, but wrapping smoke alarms in plastic bags usually prevents them from going off. Of course...


Near a kitchen, avoid ionization-based smoke alarms like the plague. Get photoelectric instead.


Would add another dimension to car vs. smoke alarms. [1]

Smoke alarms are loud and typically near you (think about the one that goes off in the kitchen because of burning pizza in the oven). So you have to react or you are paralyzed. You fan it almost immediately. You have to.

With a car alarm you typically can't do anything (unless it's your car), and the noise, while annoying, is not as close or as loud as a smoke alarm.

[1] Edit: distance from the alarm, the decibels of the alarm, and what you can do about it. Of course there are also building alarms (follow procedure) and alarms in your own house ("get the noise to stop!").


So all we need to do is have car alarms be sent directly to you (to your phone or your house) and be loud.


That'd be nice. Make it a loud part of the key fob? Maybe then the owner would be less inclined to accept a car alarm with a hair-trigger.


And more likely to notice, and more able to find his keys... seems like a win all 'round.


And if someone parks like an asshole, you trigger their car alarm and embarrass them in front of everyone they're around. Bonus win.


Just put a cell radio in it. As a bonus it can transmit GPS coordinates to you so you can watch it get stolen in real time.


Yeah, I think this actually may explain things better than specificity. Even if the false alarm rate is low, the rate of actual, life-threatening fires is incredibly low in modern day, so chances are almost every time you hear a fire alarm go off it's alerting you essentially erroneously. I get the impression that in a home the "standard procedure" for dealing with fire alarms is to open windows, fan the smoke away from the thing and/or take the batteries out of the alarm.
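
A quick back-of-the-envelope Bayes calculation makes the base-rate point concrete (the rates here are made up for illustration, not measured):

    # Back-of-the-envelope Bayes: P(fire | alarm), with made-up numbers.
    # sensitivity = P(alarm | fire), specificity = P(no alarm | no fire)
    sensitivity = 0.99
    specificity = 0.97        # i.e. a 3% false-alarm rate per "alarm-worthy moment"
    p_fire = 1e-4             # prior probability of a real fire at any such moment

    p_alarm = sensitivity * p_fire + (1 - specificity) * (1 - p_fire)
    p_fire_given_alarm = sensitivity * p_fire / p_alarm
    print(f"P(fire | alarm) = {p_fire_given_alarm:.2%}")   # ~0.33%

Even with an excellent detector, a tiny prior means almost every alarm you actually hear is a false one.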


It depends. The rate of smoke alarms going off inadvertently while someone is cooking is high compared to the actual incidence of fires. The rate of smoke alarms going off when no one is doing anything heat-related is clearly lower. If you are a person who just burned some toast, that is almost certainly the reason the smoke alarm is going off, and it's entirely reasonable to address the symptoms. If you are a person who woke up to a smoke alarm, and no one else was cooking anything, things are a little different.


I don't know if I've ever experienced personal (home) fire alarms going off for no reason, and I've never had them go off because of an actual fire, but I have had them go off when someone else is cooking (alerting me that someone is cooking, whether or not I knew that), or due to steam from a shower, if they are placed wrong - again, whether or not I knew someone was taking a shower.

In corporate/organizational environments - often in buildings made entirely of concrete - I've had them go off very frequently, but almost never as a result of a dangerous condition - usually either a drill, a small fire that was immediately put out, someone pulling the fire alarm as a prank/protest, or something else like someone burning popcorn.

I still think the point stands.


Which point?

My point was that in a (nonadversarial) context where you have known triggers for false alarms, P(problem|alarm) falls dramatically once you have identified the presence of such a trigger, and it's entirely reasonable for our actions to reflect this.


A similar lesson most of us are probably familiar with. http://en.m.wikipedia.org/wiki/Boy_who_cried_wolf


Link I found to the lecture:

http://vimeo.com/95073903


Another thing about car alarms: how many people even know what their own car alarm sounds like?


Unless you have some fancy/aftermarket thing, it sounds just like your horn which you probably hear every time you use the key fob to lock it.

Whether you have the ability to differentiate your particular tone of horn from all the others is a whole different story.


On the same topic, I've heard that every supertanker has a proximity radar in order to avoid colliding with small boats.

However, in some heavy-traffic routes like the Strait of Dover or Gibraltar the proximity alarm goes off constantly. So the commanding officers usually disable them, leading to a good number of accidents with fishing boats.


> Err on the side of removing noisy alerts – over-monitoring is a harder problem to solve than under-monitoring.

Absolutely this. Our team is having more problems with this issue than anything else. However, there are two points which seem to contradict each other:

  - Pages should be [...] actionable
  - Symptoms should be monitored, not causes
The problem is that you can't act on symptoms, only research them and then act on the causes. If you get an alert that says the DB is down, that's an actionable page - start the DB back up. Whereas being paged that the connections to the DB are failing is far less actionable - you have to spend precious downtime researching the actual cause first. It could be the network, it could be an intermediary proxy, or it could be the DB itself.

Now granted, if you're only catching causes, there is the possibility you might miss something with your monitoring, and if you double up on your monitoring (that is, checking symptoms as well as causes), you could get noise. That said, most monitoring solutions (such as Nagios) include dependency chains, so you get alerted on the cause, and the symptom is silenced while the cause is in an error condition. And if you missed a cause, you still get the symptom alert and can fill your monitoring gaps from there.
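
To make the dependency-chain idea concrete, here's a toy sketch of that kind of suppression (the check names and structure are made up, not any particular tool's configuration):

    # Toy sketch of dependency-based alert suppression: a symptom check is
    # silenced while any of its declared cause checks is already failing.
    # Check names and results are hypothetical. True = OK, False = failing.
    results = {
        "db_daemon":      False,   # cause: the database process is down
        "db_connections": False,   # symptom: the app can't reach the DB
        "web_frontend":   True,
    }

    # Dependency tree: check -> the causes it depends on.
    depends_on = {
        "db_connections": ["db_daemon"],
        "web_frontend":   ["db_connections"],
    }

    def checks_to_page(results, depends_on):
        """Page only on failing checks whose declared causes are all healthy."""
        pages = []
        for check, ok in results.items():
            if ok:
                continue
            parents = depends_on.get(check, [])
            if all(results.get(p, True) for p in parents):
                pages.append(check)   # root-most failure: page on it
            # otherwise: suppressed, because the cause is already paging
        return pages

    print(checks_to_page(results, depends_on))   # ['db_daemon']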

Leave your research for the RCA and the follow-up development to prevent future downtime. When stuff is down, an SA's job is to get it back up.


"Every page should require intelligence to deal with: no robotic, scriptable responses."

This omits the implicit "because robotic, scriptable responses should be dealt with by robots and scripts".

You should have monitoring for "DB is down". But you should page on "product is not working". Having a monitoring system that lets you quickly find what's wrong is extremely important, but you shouldn't be woken up or distracted by unactionable alerts.
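
For example, the paging check can be a dumb blackbox probe of the user-visible behaviour, while "DB is down" stays on the dashboard (a sketch; the URL and threshold are placeholders):

    # Sketch of a symptom-level probe: page only when users would notice,
    # whatever internal component is actually at fault. URL is hypothetical.
    import time
    import urllib.request

    CHECK_URL = "https://www.example.com/checkout"   # user-facing endpoint
    LATENCY_BUDGET_S = 2.0

    def product_is_working():
        start = time.monotonic()
        try:
            with urllib.request.urlopen(CHECK_URL, timeout=LATENCY_BUDGET_S) as resp:
                ok = resp.status == 200
        except Exception:
            ok = False
        return ok and (time.monotonic() - start) <= LATENCY_BUDGET_S

    if not product_is_working():
        print("PAGE: product is not working")   # wake a human
    else:
        print("OK")                             # everything else feeds the dashboard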


Can't upvote enough. Not every monitoring check has to page. This is the key insight -- more visibility is always better, more alerting is not.


I don't know about that, frequently the "cause" that you're alerting on is also a symptom. "DB down" is causing your webapp to fail, but why is the DB down? Misconfiguration, hardware failure, power outage? Heck, maybe the DB is not actually down but there was a network failure that made it unreachable from the monitoring server.

My point is that alerting on a "cause" may not actually get you to the root cause, and maybe not even all that much closer than a symptom.


> Misconfiguration, hardware failure, power outage? Heck, maybe the DB is not actually down but there was a network failure that made it unreachable from the monitoring server.

Sorry, I should have been more specific. DB daemon being down is a cause, and should be monitored. Hardware down is a cause, and should be monitored. Network availability is a cause and should be monitored. Power outage... you get my drift.

I think the root cause of my disagreement with this document is the lack of a proper dependency tree in the alerting tool. Their tool appears to want to alert on any and every monitored problem, which necessitates limiting what you monitor for fear of a pager flood. A proper tool can address this problem correctly.


911: PING TO GOOGLE >100ms


Yep, I thought of that when I was reading it too, but he also talked about having a 'dashboard' system that lets engineers easily determine causes for symptoms. His argument is _heavily_ dependent on having such a system -- and his argument about symptoms-not-causes _only_ about _alerts_ (specifically pages that go directly and immediately to engineers).

In that context, I find it persuasive based on my own experience. (And I wish I had the kind of dashboard he described available; I'd love to see a presentation or blog post about _that_!)


On the topic of cause- vs. symptom-based alerting there is one aspect missing: decoupling.

The systems that Rob is responsible for are decoupled from individual pieces of hardware (redundancy/fault-tolerance) and degrade gracefully (if one part fails only a fraction of the working set is affected). Otherwise the huge scale would not be manageable as some piece is always breaking.

So yes: if you run systems that are not somewhat decoupled from their base, cause==symptom and cause-based alerting is indeed the way to go. And you will be woken up by every single piece breaking. The way to improve this is decoupling, and then you'll also switch to symptom-based alerting.


> The problem is that can't act on symptoms, only research them and then act on the causes.

Have a look at the example at the top of page 5.


I did. And I see the words "A human just has to parse through a few lines of text ...".

Why is a human doing this? Why isn't the tool identifying the cause through a pre-determined dependency tree and only alerting on the salient points instead of relying on a human intelligently parsing through this data when they've been woken up by a page at 3am?

My experience has shown me that this kind of parsing can be avoided by using the right tools.


Of course if the system can automatically detect the problem and fix it then it should (this is also mentioned somewhere in the document). But at some point it will have to bail and then a human has to have a look.


I don't think there's a contradiction there. "Research them and act on the causes" is _exactly_ the action you should take on a symptom-based alert. Yes, the ultimate action is always to try to address and eliminate the "rootest root cause", but if you start from the symptom you will potentially see more ways to mitigate it than if you start from "database server disk is bad."


Perhaps not the best choice of problem to use as an example - a "database server disk is bad" problem has one real solution, and will cause a whole flood of seemingly unrelated problems.

Knowing immediately that it's the server disk (more realistically a raid array going into recovery mode) will save you a lot of time and effort troubleshooting what would appear as a sporadic slow response issue. There's dozens of potential causes for poor response times, of which a raid array in recovery mode is just one.

And once you know it's a raid array in recovery mode, you can then take immediate action, something you can't do if you are still busy troubleshooting a sporadic response slowness issue.

I feel that ultimately, there's no problem with monitoring for high level symptoms, but they should not be the goal state of monitoring. The goal should be to monitor all possible causes of problems to limit the troubleshooting the SA has to do at 3am when woken by a page.

Plus, you should be using a tool which properly silences high level symptoms if there's a problem with a system which is clearly identified as a parent. That is to say, a server being down will silence "db is not responding" alerts.


I think this might be better understood in terms of "workarounds" rather than "solutions", because while "solutions" are important in the long term, "workarounds" are what get users interacting with your app again. Symptom-based alerting lets you apply symptom-based workarounds much sooner than cause-based alerting will let you solve the problem.

If your DB's disk is bad, your real problem isn't that the disk is bad; your real problem is that, for example, customers can't buy products from your site. Fixing the symptoms means making your customers able to buy products from your site, not replacing the DB's disk. If you had, say, a failover slave DB, the point of the alert is to tell you to activate the failover process. Replacing the disk is important, but not urgent in the same way activating the failover is.

(Note the interesting fact that all alerts will then end up being for things the system could do something in response to on its own. Failing over to a slave can be automatic. Alerts are, in effect, the system saying "I need a human to come help me stop this from happening, because I don't know how to stop it myself.")
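
A minimal sketch of that last point, assuming hypothetical promote_replica/page/open_ticket hooks rather than any real API:

    # The automated workaround runs first; a human is paged only if it fails.
    def promote_replica():
        """Attempt automatic failover to the standby DB; True on success."""
        return False   # pretend the automation didn't work this time

    def page(message):
        print("PAGE:", message)

    def open_ticket(message):
        print("TICKET:", message)

    def handle_primary_db_failure():
        if promote_replica():
            # Customers can buy things again; the bad disk becomes a
            # normal-hours ticket, not a 3am page.
            open_ticket("replace failed disk on old DB primary")
        else:
            # The system couldn't stop the symptom itself.
            page("DB primary down and automatic failover failed")

    handle_primary_db_failure()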


Having your application reviewed by SREs who are going to support it is a legendary experience. They have no motivation to be gentle.

It changes the mindset from "Failure? Just log an error, restore some 'good'-ish state and move on to the next cool feature." towards "New cool feature? What possible failures will it cause? How about improving logging and monitoring on our existing code instead?"


In a past life, we had good luck with obligating dev teams to be (co-)oncall for their own systems until they conformed to an agreed-upon SLA for N months, before we took on oncall for the system.

That aligned everybody's interests: devs came to SREs for design review and code review, and were incentivized to not just throw code over the wall.


I like making the devs who write the application the SREs who are going to be supporting it and making them carry a pager.

If it's worth it to the team to write code with edge conditions that will errantly wake them up in the middle of the night occasionally, they can make that decision to put that tax on their own lives, not someone else's.

Then the SREs are in every single code review.


This works only as long as you have a single application, or a bunch of independent applications, each with their own team. With any kind of "platform" or "framework" or shared service architecture, the incentive will be for the devs on the application teams to do as little logging as possible, because what will inevitably happen is that failures and errors with "unknown" causes will be marked as platform failures and will wake the platform team in the middle of the night. At that point, you're back to where you started. I've been in that scenario multiple times, and believe me, there are few things worse than trying to debug someone else's code at three in the morning to try to determine if the page that woke me up was a legitimate platform issue or if it's due to application code that's misbehaving or misusing the platform.

Making devs carry pagers certainly helps, but it's a mistake to think that it's a panacea, and that forcing devs to carry pagers will suddenly make them write code with perfect logging and perfect error handling.


Nobody's arguing that it's a panacea. Only that raising awareness of operational issues, reliability, and lost sleep does tend to put feet to the fire.


Great writeup. Should be in any operations handbook. One of the challenges I've found has been dynamic urgency, which is to say something is urgent when it first comes up, but now that it's known and being addressed it isn't urgent anymore, unless there is something else going on we don't know about.

Example: you get a server failure which affects a service, and you begin working on replacing that server with a backup, but a switch is also dropping packets, so you are getting alerts on degraded service (symptom) but believe you are fixing that cause (down server), when in fact you will still have a problem after the server is restored. So my challenge is figuring out how to alert on that additional input in a way that folks won't just say "oh yeah, this service, we're working on it already."


This is definitely the biggest problem in where I'm working now. We have a lot of monitoring via Wily Introscope, but the biggest thing is relating failures of different components together. E.g. one service layer fails so some queue gets backed up so some application server starts timing out.

The amount of noise that starts coming in when there is some major outage (say some mainframe system fails) is ridiculous.

Right now where I work they solve it by throwing manpower at the problem tbh.

It takes a lot of work by the application owners all working together to really get a coherent picture of how the services are interdependent, but the applications are so large, old code, etc - normal problems I guess a lot of companies face - that it's almost impossible to find people who have a complete end-to-end understanding of most transactions.

Side note: my only monitoring experience is with Wily - anyone have opinions to share on it?


That's a harder problem than I originally realized. It's easy to write noisy alerts, and super easy to not have them (or to not catch some issues).

It's hard to tune them so signal to noise ratio will be high.


Yep. Extracting meaningful information out of logs automatically is probably an AI-complete problem...

Correct me if I'm wrong, but AFAIK the current state of the art solution to the alerts/log-filtering problem is: "log everything & feed these logs into a real time search engine that produces dashboards/alerts". Like elasticsearch/kibana. No? Curious, is that the approach that is being used internally at Google right now? BTW, the article stated the problem and desired outcome, but not the solution. (?)


Logging is super fucking noisy and generally not structured for operations support. The state of the art at my not-google employers is basically Nagios scripts. Everyone has these scripts that check various components:

- is the webserver running
- is it responding on port 443
- does it return HTML

and maybe

- if I submit a search, do I get a result back?

Nagios scripts are responsible for everything: opening network connections, querying system internals, collecting metrics, interpreting results, and boiling it down to a number between 0 and 3 and an unstructured text output to stdout.

A few of us understand that what we need is a more structured, data-driven approach. Collect base metrics first, build a time series, apply a projection, and feed that projection into a system that understands the actual failure condition.

As an example, imagine you're monitoring /. Nagios runs NRPE, NRPE looks at df /, and if it's 85 percent full (by default), sends a warning page. At 90 percent it sends a critical page. A smarter system collects the df / results, delivers it to a central timeseries database. The new data point is used to create a new projection, and the new projection is used to determine the time to an actual failure. The system above might have an idea of how long it takes to respond, repair, and resolve and issue a page when the disk will fill up if not responded to within 4 hours. That's the ideal solution, IMO.

It doesn't exist, AFAIK. There's a massive backlog of scripts that were written in the monolithic Nagios model that need to be rewritten, and thus this newer better version is always imaginary.
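
The projection step itself isn't much code once the samples are in a time series; a rough sketch of the idea (made-up sample format, least-squares slope over recent points):

    # "Time until / is full" from recent usage samples; format is made up.
    # Each sample is (unix_timestamp, percent_used).
    samples = [
        (1413200000, 71.0),
        (1413203600, 72.5),
        (1413207200, 74.0),
        (1413210800, 75.5),
    ]

    RESPONSE_BUDGET_H = 4.0   # page if the disk fills within this window

    def hours_until_full(samples):
        n = len(samples)
        mean_t = sum(t for t, _ in samples) / n
        mean_u = sum(u for _, u in samples) / n
        num = sum((t - mean_t) * (u - mean_u) for t, u in samples)
        den = sum((t - mean_t) ** 2 for t, _ in samples)
        slope = num / den              # percent per second
        if slope <= 0:
            return float("inf")        # not filling up at all
        return (100.0 - samples[-1][1]) / slope / 3600

    ttf = hours_until_full(samples)
    if ttf < RESPONSE_BUDGET_H:
        print(f"PAGE: / will be full in {ttf:.1f}h")
    else:
        print(f"OK: / full in {ttf:.1f}h -- ticket it if the trend holds")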


> As an example, imagine you're monitoring /. Nagios runs NRPE, NRPE looks at df /, and if it's 85 percent full (by default), sends a warning page. At 90 percent it sends a critical page. A smarter system collects the df / results, delivers it to a central timeseries database. The new data point is used to create a new projection, and the new projection is used to determine the time to an actual failure. The system above might have an idea of how long it takes to respond, repair, and resolve and issue a page when the disk will fill up if not responded to within 4 hours. That's the ideal solution, IMO.

We've actually implemented exactly this at my current workplace. We have a nagios check that queries graphite and calculates a "days until full" value, and alert based on that. We have similar checks for monitoring other infrastructure. These checks take a lot of work to get the right calculation and threshold values, but once they work, it's pretty great.


I'd be interested in hearing how exactly you're making these calculations.


I too would be very interested to see your scripts!


We use Splunk at my organization to handle alerting and paging (and lots of other things).

Generally speaking, it works out pretty well for us. If you're a Splunk query guru you can also correlate and/or combine multiple disparate logs in elaborate ways to create more complex alert conditions.

The same can presumably be done with Elasticsearch/Logstash/Kibana.

We're actually security incident response, not reliability incident response, so our goals and methods differ a bit but the core concepts are all the same.


Alerting and monitoring is not about logs. Applications export interesting signals directly in a way understood by a monitoring service like Nagios. It stores the samples, draws nice graphs and supports flexible alert definition logic.


Well, to me "applications export interesting signals directly in a way understood by a monitoring service" feels like a legacy approach. It places the burden of deciding "what is an interesting alert signal" and the burden of structuring the log output on the software developer! And it places that burden at an inconvenient time, when the system is still in the making.

On the other hand, by logging everything as text, and then running an intelligent/structuring real time search engine over the logs, one can make/modify these decisions at a later time. And it can be done by both devs and ops, without touching the source code!


That seems silly though. I can't replace stats on a thing that normally takes 50 usecs with a log line, because it will take more than that long just to log the fact, and an insane amount of CPU to analyze such a thing. The large scale systems that I personally operate produce a few KB per minute in structured stats, a few MB per second in structured logs, and hundreds of MB per second in unstructured text logs. I know which of these I'd rather use for monitoring.


To thrownaway2424: what seems silly is that processing a few MB per second of unstructured text logs with a real time search engine seems impossible to you. Think web crawlers. Search engines are efficient...


What do you use to monitor the "real time search engine"?


Is that a joke question? The one that I've used is elasticsearch/kibana. And usually one would be using elasticsearch to monitor the elasticsearch :)

That's the good thing about this setup, you have all the logs from all your applications (think like custom text logs from your routers, your custom applications, temperature sensors, syslogs, windows servers) aggregated in one place. And when something happens (at a particular moment in time, or with a particular machine, or with a particular key) suddenly you are able to search/drill down and locate the actual cause. And maybe even configure a dashboard or make a plot that would show when this problem was showing up.

Scalable real time search engines with the ability to create trends/dashboards are one powerful toy ;) It is ridiculous and silly. But it is an immensely powerful approach.


You're thinking too small. Try hundreds of KB to a couple MB per second per host. And tens of thousands of hosts. Data streams at (tens of) gigabits per second are not trivial.


I don't know. In my experience, one big elasticsearch box can cope with a few months of 2-3 MB/sec log data. I guess that the entropy of log file information is quite low and the search engine is being able to take advantage of that and keep its indexes rather small. But gigabits per second... I just don't know.


You can do alerting and monitoring through logs. I've done it myself. You can reduce the complexity of your infrastructure by converging those functions into a single set of tools. I would absolutely agree that the state of the art is capturing the evolving state of the system as a stream of events, and deriving monitoring and alerting from that stream.


Logs are typically fairly unstructured and complex to parse. For whitebox monitoring (i.e. where you have access to the code and the code can report state) you are far better off exporting state in a very well defined format to minimise parsing overhead. It also tends to make you a bit more focused on defining the characteristics of the parameter you are monitoring.

You want blackbox monitoring (for close-to-user experience) AND whitebox monitoring (which provides diagnostics of internal state for debugging). True blackbox monitoring is often pretty unreliable, so you are usually better off alerting on whitebox-reported state of end-user-perceivable variables, e.g. HTTP error codes, latency and so on.

State of the art is to report a staggering amount of data about the internal state of a server. I mean a lot. 10s to 100s of times the number of parameters you are probably used to seeing.
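
As a trivial illustration of "exporting state in a well-defined format", something in the spirit of a /varz or /metrics page (a sketch; the metric names and format are made up, not any specific product's):

    # The server keeps counters/gauges in memory and serves them in a
    # trivially parseable text format for the monitoring system to scrape.
    from http.server import BaseHTTPRequestHandler, HTTPServer
    import threading

    METRICS = {
        "http_requests_total": 0,
        "http_errors_total": 0,
        "request_latency_ms_p99": 0.0,
    }
    _lock = threading.Lock()

    def inc(name, delta=1):
        with _lock:
            METRICS[name] += delta

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = "\n".join(f"{k} {v}" for k, v in METRICS.items()).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("", 8000), MetricsHandler).serve_forever()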


Typically yes. But I'm predicting a huge shift towards structured logging. Take a look at Serilog. It's a .Net logger I've been using recently that just has some fantastic concepts. Worth reading into even if you don't use .Net. I believe that's the direction "logging" will go... It's more eventing now, I guess.

Both metrics and "logs" can be expressed as events. Those are like points in space. An incident could be like a line; a 2d event with a start and duration.
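
Language aside, the core idea is just emitting events as named properties instead of pre-formatted strings; a minimal sketch (JSON-per-line, field names illustrative, not Serilog itself):

    import json, time

    def log_event(event, **fields):
        record = {"ts": time.time(), "event": event, **fields}
        print(json.dumps(record))

    # A point-like event (a metric sample) and a line-like event (a span):
    log_event("checkout_completed", user_id=42, latency_ms=183)
    start = time.time()
    # ... handle the incident ...
    log_event("db_failover", duration_s=time.time() - start, outcome="promoted_replica")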


The real trick, in my experience, is to treat every page as an actionable item, even if, no, especially if the action is to change alerting thresholds.

Doing this has taken us, in the past, from 400 pages a day to under 100 a day, over the course of a week's worth of effort.


If something not-so-bad-this-time happens while I'm eating dinner, I am not taking time right then and there to figure out exactly when we should and shouldn't alert for that event. That demands a level of analytical thinking that can take place during work hours.


Where I work, at a mobile ad network, they put everyone on call on a rotating basis even if they are not devops or server engineers. We use Pager Duty and it works well. Since there is always a primary and secondary on call person, and the company is pretty small and technical, everyone feels "responsible" during their shifts, and at least one person is capable of handling rare, catastrophic events. I often wonder which is more important: good docs on procedures for failure modes or a heightened sense of responsibility. A good analogy may be the use of commercial airline pilots. They can override autopilot, but I am told rarely do. The safest airlines are good at maintaining their heightened sense of vigilance despite the lack of the need for it 99.999% of the time.


"If you want a quiet oncall rotation, it's imperative to have a system for dealing with things that need timely response, but are not imminently critical."

This is an excellent point that is missed in most monitoring setups I've seen. A classic example is some request that kills your service process. You get paged for that, so you wrap the service in a supervisor-like daemon. The immediate issue is fixed and, typically, any future causes of the service process dying are hidden unless someone happens to be looking at the logs one day.

I would love to see smart ways to surface "this will be a problem soon" on alerting systems.


Re: this will be a problem soon? Metrics trending. Look for changes in your metrics to spot potential problems and plan for the future. This is done quite often for example in QA to look for issues between releases, and can be done both macro and micro in terms of continuous delivery services' metrics.


I think anything that requires "spotting potential problems" is only a partial solution. I've never seen a compelling system that can look at all the metrics and (with reasonable precision and recall) spot and summarize changes that are actually problematic and surprising to humans. It's definitely a necessary part of observing what's going on (and quickly eliminating hypotheses like "maybe we're out of CPU!"), for sure.

The subcritical alerts I think of are more things like "Well, the database is _getting_ full, but it's not full yet." Or to borrow someone else's example, "we put in this daemon restarter when it was dying once a week; now it's dying every few minutes and we're only surviving because our proxy is masking the problem but soon it's going to take the whole site down."


These subcritical alerts deserve better but different handling: they can almost always be delivered to a non-paging email address, either a relevant internal mailing list or a ticket queue, where they can be investigated during normal office hours.

The other useful tip I have is to put URLs to internal wikis and/or tickets in the alert body. We write documentation for these to a 3AM standard: if I can't understand it immediately after being woken up at 3AM, it's not clear or actionable enough.


I think we're talking about the same thing. (Why does this keep happening?) Trending of metrics tells you whether an alert is useful or not. Has the database been getting full for over a month, or did it just begin getting full and the current rate of disk consumption means in 90 minutes it will be full?

There is no reason anyone should ever run out of disk space if they alert on the trending rate of disk space [rather than the actual amount of disk space used]. But this applies to so, SO many things other than simple resource exhaustion. Seeing the trends is useful to alerts, but it's also useful to humans who can review them weekly and plan for the future.


I don't want to threadjack, but you don't have an email in your profile.

If you (or anyone else reading) ever want to talk about what "this will be a problem soon" might look like in the future drop me a line: dave@pagerduty.com


Most of this appears to be just end-to-end testing, and whether you're alerting on a failure of testing the entire application stack or just individual components. He probably got paged by too many individual alerts versus an actual failure to serve data, which I agree would be annoying.

In a previous position, we had a custom ticketing system that was designed to also be our monitoring dashboard. Alerts that were duplicates would become part of a thread, and each was either its own ticket or part of a parent ticket. Custom rules would highlight or reassign parts of the dashboard, so critical recurrent alerts were promoted higher than urgent recurrent alerts, and none would go away until they had been addressed and closed with a specific resolution log. The whole thing was designed so a single noc engineer at 3am could close thousands of alerts per minute while logging the reason why, and keep them from recurring if it was a known issue. The noc guys even created a realtime console version so they could use a keyboard to close tickets with predefined responses just seconds after they were opened.

The only paging we had was when our end-to-end tests showed downtime for a user, which were alerts generated by some paid service providers who test your site around the globe. We caught issues before they happened by having rigorous metric trending tools.


I don't think it's end-to-end testing because "testing" to me implies a synthetic environment. This is about instrumenting and monitoring the production system at scale, and learning about the right things at the right time.

It certainly shares some things with end-to-end testing, and blackbox monitoring is very useful for finding high level problems with any complex networked system.


So he's talking about system testing and not end-to-end testing. I suppose if your application is really simple, system testing is fine. But if your QA group ever starts automating tests, it's time to re-evaluate.

Blackbox monitoring is (imho) only appropriate for 3rd parties. If it's part of your company, it shouldn't be a black box; that means someone got lazy and didn't demand the devs provide an API.

Also, I'm sorry but this really gets to me: at what point are we talking about 'at scale'? I think it's whenever tons of money is riding on your site's availability and an unexpected failure causes customers to complain. Immediately VPs start screaming "WE NEED TO SCALE UP!!" and then they mandate some half-assed implementation of the comprehensive monitoring solution they claimed was unnecessary just a month before. But maybe I'm just jaded.


Thanks for posting this! I'm on the product team at PagerDuty, and this lines up with a lot of our thinking on how to effectively design alerting + incident response. I love the line "Pages should be urgent, important, actionable, and real."


I'm always happy to chat about this topic; feel free to drop me a line.


Here's another good writeup on effective alerting, by a former Google Staff Engineer: http://blog.scalyr.com/2014/08/99-99-uptime-9-5-schedule/


Why does a company the size of Google even have call rotations? Shouldn't they have 24/7 shifts of reliability engineers who can manually call in additional people as and when they're needed?

I can totally understand why SMBs have rotations. They have fewer staff. But a monster corporation? This seems like lame penny pinching. Heck, for the amount of effort they're clearly putting into automating these alerts, they could likely use the same wage-hours to just hire someone else for a shift. Heck, with an international company like Google they could have UK-based staff monitoring US-based sites overnight and vice versa. Keep everyone on 9-5 and still get 24 hour engineers at their desks.


Google does spread oncall rotations across multiple timezones. Most SREs are oncall only during the day, with the local nightshift being somebody else's dayshift.

For a more detailed look at Google's SRE operations, watch Ben Treynor's excellent talk "Keys to SRE": https://www.usenix.org/conference/srecon14/technical-session...


That was an insightful talk, thanks.


As someone who is on a regular rotation - being oncall sucks. But there are definite advantages.

Having the team feel direct pain is a great motivator for building robust applications - if you know that that quick hack could lead to you getting paged at 3 in the morning, you are far more likely to seek out additional solutions (anecdotally speaking) - it means you also have weight within the team to make the right engineering calls.

The team can bond over operations - it sucks being oncall, everyone knows that so everyone tries to make it less sucky, by clearing ops queues before handing off or making ticket messages a bit nicer.

When you start having ticket-free weeks or months it is an awesome feeling, the service works and is robust and your team can spend that time writing new stuff.

Additionally, when you have a small team building/maintaining a service it makes far more sense to rotate the oncall responsibility between them rather than an external engineer.


> Having the team feel direct pain is a great motivator for building robust applications

But the OP is a dedicated reliability engineer; they don't build the actual applications and couldn't, given the size of the company. Essentially you're being punished for other teams' issues.

You getting punished when your own stuff breaks is more a small business issue, not something corporations have.


> You getting punished when your own stuff breaks is more a small business issue, not something corporations have.

Why do you think that should be the case? I'd argue quite the opposite, having done this in a large corporation with pretty decent success.

To be successful, it requires that your monitoring isn't noisy and most issues are repaired in an automated way; then when you do get a "page", it is something that someone deeply familiar with the code/service is the best person to react to. It will drive down the MTTR, and it typically gets the root cause fixed sooner in the code. These folks need to be engineers, but they don't all need to be just the devs, since internet-scale services are just noisy and you need to have some randomization buffer. But those engineers need to be dedicated to the service and not a central org, where they aren't going to know how to look at logs, or have depth on the intra-service interactions, the latest changes, etc. To me, this is also the best definition of "devops".


There are definitely corporations where engineers who build the applications take responsibility for maintaining and running the applications - I happen to work for one.


This should be all corporations.


SRE has veto power over many of the design choices that go into the actual application.


The only people who can support or fix an application are often the team developing the application itself.


Yeah, when I interviewed at Google, this was pretty much what I was told - one of the prices of writing "new, cool stuff" is that you get to support it, because no one else is going to be able to fix it when things go wrong.


Sounds as if that requires retiring systems when the developer leaves the company.


I would imagine that if the new thing makes the transition to being a successful, widely used service then putting a support team in place is a part of that. At the beginning though, when no-one knows whether your new thing is going to be a success or not, the dev team is it.


I think it's simple: Developers who are on-call write better code.

As soon as you take the view of "This is crappy, but keeping it running is someone else's problem" then everything suffers (product quality, engineering quality, and reliability).


There's a lot to be said for having a domain expert be on call. Most issues can be fixed really quickly, but if it's your first time ever seeing an alert or working on a service, you're pretty much boned and end up paging the domain expert anyway.


That's exactly what Google does except some teams have triple rotations like Mountain View/Sydney/Zurich. I'm not sure why you think the word "rotation" means something else.


I've worked for companies in the past where a "rotation" was you were on call 24/7 for a week or longer. Worked all night? Still expected at the office no later than 9 a.m.

It was exhausting as the people on call didn't have the power to actually fix the system. Instead, they would have to walk someone else through the steps over the phone. If it took a code change to fix the system, too bad that was at least two weeks of red tape, and every single time the error occurred they had to page the person on call to walk the person through over the phone to verify it was the same error and nothing could be done.

Think "big business," "division of responsibilities," "accountability," inept management who never had to suffer under the policies they demanded, and a toxic culture which was proud to "give everything they have for the product."

Luckily, I don't work for those shitty companies anymore.


IMO the difference is ownership: that you own your code from inception through production and there is never a hand-off. It leads to better code and better morale.


What makes you think this isn't the case? I don't see anything to the contrary in the document.


You cannot hire top-quality engineers who are willing to do shift work.


Here's the link to it as a PDF for anyone else wanting a printable copy to pin to their wall: https://docs.google.com/document/export?format=pdf&id=199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q


Good article. Alerting systems unfortunately are still at the same level they were decades ago. Today we work in highly distributed environments that scale dynamically, and finding symptoms is a key problem. That is why a lot of people alert on causes or anomalies. In reality they should just detect them and log them for further dependency analysis once a real problem is found. We for example differentiate between three levels of alerts: infrastructure only, application services and users. Our approach is to have NO alerts at all but monitor a ton of potential anomalies. Once these anomalies have user impact we report back problem dependencies.

If you are interested you can also get my point of view from my Velocity talk on Monitoring without alerts. https://www.youtube.com/watch?v=Gqqb8zEU66s. If you are interested also check out www.ruxit.com and let me know what you think of our approach.


This is huge. One of the big things dev teams get from bringing an SRE team onto their project is learning things like this and how to run a sustainable oncall rotation.


My startup http://usetrace.com is a web monitoring (+regression testing) tool with the "monitor for your users" philosophy mentioned in Rob's article. Monitoring is done on the application/feature level -> alerts are always about a feature visible to the users.


This was very informative, I like the idea of monitoring symptoms that are user-facing rather than causes which are devops/sysadmin/dev-facing. I'm just thankful that my next project doesn't require pager duty.


Can't access the site, seems like there's some quota on docs.google.com... Does anyone have a cached version? (WebArchive can't crawl it due to robots.txt)


So I guess the author uses a smart phone as a pager, but given his passion for uptime, reliability, latency etc. I wonder if he has experimented with an actual pager.


We use a variety of escalation techniques. As mentioned down thread, pagers are actually very unreliable. Some SREs carry pagers and mobiles. Most SREs carry phones with escalation via SMS, an actual telephone call from an automated system (after a delay), and/or escalation via a network connection over the data network. Phone calls are way way more reliable than pagers.

Unacknowledged pages escalate to a secondary oncaller (e.g. if the oncall is out of range/in a tunnel, under a bus) and tertiary depending on configuration (and then it loops, or falls to another rotation, again depending on configuration). The code and services that do escalations are deliberately and carefully vetted to have minimal overlap with production systems (whose failure they might be alerting people to).


This sort of mirrors how we do things at PagerDuty. Phone may be more reliable than SMS, but every telephony/messaging gateway fails sometimes. We use something like a dozen different phone/SMS gateways to prevent single points of failure, and do end-to-end testing of our SMS providers to check their uptime and latency.

Responders can customize their notification methods (push, SMS, phone, and email) and rules, so you can do things like get a lightweight push notification when an alert happens, and then a phone call 2 minutes later if you haven't acknowledged the incident. Teams get escalation timeouts that forward alerts up the chain if the primary hasn't responded after a period of time.
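
Expressed as data, that kind of policy looks roughly like the following (a made-up sketch, not the real configuration schema):

    # Made-up sketch of per-responder notification rules plus a team
    # escalation chain; field names are illustrative only.
    notification_rules = [
        {"after_minutes": 0, "method": "push"},
        {"after_minutes": 2, "method": "phone"},   # only if still unacknowledged
    ]

    escalation_policy = [
        {"level": 1, "target": "primary_oncall",      "timeout_minutes": 5},
        {"level": 2, "target": "secondary_oncall",    "timeout_minutes": 5},
        {"level": 3, "target": "engineering_manager", "timeout_minutes": 10},
    ]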


It turns out that pagers are often unreliable; while I've never had a message not get through eventually, I have frequently seen messages get delayed for hours or even until the pager was power-cycled.

On the other hand, if you set up an IMAPS account for each person's alert address, and then have their smartphone use that account to check email, reliability is quite high. (Strangely, sysadmins tend to have wifi in their houses and good data plans, especially when the company pays for it.)


Ahhh I thought the medical profession had it figured out, but these unreliable pagers could be why the doctor is always late for the delivery.


"Any alert should be for a good cause" sounds good to me.


I just want to say: HN is bursting with great articles today.


Great article @robewaschuk :)

-- Marcin, former Google SRE


In today's world, 90% of bloggers rely on Google for their living.


[flagged]


Stop it. Your spam is obvious.


You win some you lose some. Hacker News is literally 100% spam, that's what a link aggregator is by definition. Some stick and some get called out.

Something gets called spam just because you post it yourself, but if "someone else" posts it then it isn't spam anymore. Magic! Just like your past submissions

I've been on these boards lurking for many many years and I've seen posts come and go, all of them are what you would call spam. Just people pitching their products


The idea that a link aggregator is "100% spam" may help you feel better, but has no bearing on reality. You tried to promote something you work on (without even being decent enough to admit it) and got called for it.

Don't be a jerk.


Very cool



