
My Philosophy on Alerting: Observations of a Site Reliability Engineer at Google - ismavis
https://docs.google.com/a/gravitant.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/preview?sle=true
======
beat
This reminds me of an excellent talk my friend Dan Slimmon gave called "Car
Alarms and Smoke Alarms". He relates monitoring to the concepts of sensitivity
and specificity in medical testing
([http://en.wikipedia.org/wiki/Sensitivity_and_specificity](http://en.wikipedia.org/wiki/Sensitivity_and_specificity)).
Sensitivity is about the likelihood that your monitor will detect the error
condition. Specificity is about the likelihood that it will _not_ create false
alarms.
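
To make the two measures concrete, here's a quick sketch in Python (the
counts are invented for illustration):

    def sensitivity(true_pos, false_neg):
        # Of the real incidents, how many did the monitor catch?
        return true_pos / (true_pos + false_neg)

    def specificity(true_neg, false_pos):
        # Of the quiet periods, how many stayed correctly silent?
        return true_neg / (true_neg + false_pos)

    # Hypothetical smoke alarm: catches 99 of 100 fires, 90 false alarms in 1000 quiet days.
    print(sensitivity(99, 1), specificity(910, 90))  # 0.99 0.91
    # Hypothetical car alarm: catches most break-ins but cries wolf constantly.
    print(sensitivity(9, 1), specificity(200, 800))  # 0.9 0.2 -- hence ignored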

Think about how people react to smoke alarms versus car alarms. When the smoke
alarm goes off, people mostly follow the official procedure. When car alarms
go off, people ignore them. Why? Car alarms have very poor specificity.

I'd add another layer: car alarms are Not My Problem. But that's just me and
not part of Dan's excellent original talk.

~~~
Moru
We just got new fire alarms installed in our building. They go off 2-4 times
per day and everyone just ignores them. Toasting your bread lightly brown is
enough to set them off, and you can easily hear the neighbours' fire alarms.
There is no way to remove the battery in the new alarms; the only thing you
can do is take one down and stuff it in a drawer somewhere. Not even pushing
the button helps. This makes them totally useless, even more so than a car
alarm. We got ours exchanged and they go off much less often now, luckily.

~~~
scoot
> Not even pushing the button helps.

If it's like every other smoke detector I've encountered, the button is to
test the alarm, not to cancel it.

~~~
cdr
All the storebought smoke alarms I've seen in the past few years have two
buttons (or one button with two functions) - one to test, one to shut it off
for a few minutes when something besides a housefire is making it go off.

It may vary by country I suppose.

------
falcolas
> Err on the side of removing noisy alerts – over-monitoring is a harder
> problem to solve than under-monitoring.

Absolutely this. Our team is having more problems with this issue than
anything else. However, there are two points which seem to contradict:

    - Pages should be [...] actionable
    - Symptoms should be monitored, not causes

The problem is that you can't act on symptoms, only research them and then act on
the causes. If you get an alert that says the DB is down, that's an actionable
page - start the DB back up. Whereas, being paged that the connections to the
DB are failing is far less actionable - you have to spend precious downtime
researching the actual cause first. It could be the network, it could be an
intermediary proxy, or it could be the DB itself.

Now granted, if you're only catching causes, there is the possibility you
might miss something with your monitoring, and if you double up on your
monitoring (that is, checking symptoms as well as causes), you could get
noise. That said, most monitoring solutions (such as Nagios) include
dependency chains, so you get alerted on the cause, and the symptom is
silenced while the cause is in an error condition. And if you missed a cause,
you still get the symptom alert and can fill your monitoring gaps from there.
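
A minimal sketch of that dependency-chain behaviour in Python (the check
names and structure are mine, not Nagios syntax):

    # Map each symptom check to the cause checks it depends on.
    DEPENDENCIES = {
        "db_connections_failing": ["db_process_down", "network_partition"],
    }

    def should_notify(check, failing_checks):
        """Silence a symptom alert while one of its known causes is firing."""
        for cause in DEPENDENCIES.get(check, []):
            if cause in failing_checks:
                return False  # the cause alert pages instead
        return True  # no known cause firing: page on the symptom (a gap to fill)

    print(should_notify("db_connections_failing",
                        {"db_process_down", "db_connections_failing"}))  # False
    print(should_notify("db_connections_failing",
                        {"db_connections_failing"}))  # True: missed a cause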

Leave your research for the RCA and the follow-up development to prevent future
downtime. When stuff is down, an SA's job is to get it back up.

~~~
ithkuil
"Every page should require intelligence to deal with: no robotic, scriptable
responses."

This omits the implicit "because robotic, scriptable responses should be dealt
with by robots and scripts".

You should have _monitoring_ for "DB is down". But you should _page_ on "product
is not working". Having a monitoring system that lets you quickly find what's
wrong is extremely important, but you shouldn't be woken up or distracted
for unactionable alerts.
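
In other words: script the robotic response, record it, and page only when the
script fails. A rough sketch, assuming a systemd-managed database (the unit
name and helper functions are hypothetical):

    import subprocess

    def log_event(msg):
        print("EVENT:", msg)  # stand-in for the monitoring system's event log

    def page_oncall(msg):
        print("PAGE:", msg)   # stand-in for the real paging hook

    def handle_db_down():
        # The robotic, scriptable response a robot should perform.
        subprocess.run(["systemctl", "restart", "mydb"])
        healthy = subprocess.run(["systemctl", "is-active", "--quiet", "mydb"])
        if healthy.returncode == 0:
            log_event("mydb restarted automatically")  # nobody gets woken up
        else:
            page_oncall("mydb is down and automatic restart failed")  # needs a human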

~~~
gfodor
can't upvote enough. not every monitoring check has to page. this is the key
insight -- more visibility is always better, more alerting is not.

------
praptak
Having your application reviewed by SREs who are going to support it is a
legendary experience. They have no motivation to be gentle.

It changes the mindset from _"Failure? Just log an error, restore some
'good'-ish state and move on to the next cool feature."_ towards _"New cool
feature? What possible failures will it cause? How about improving logging and
monitoring on our existing code instead?"_

~~~
lamontcg
I like making the devs who write the application the SREs who are going to be
supporting it and making them carry a pager.

If it's worth it to the team to write code with edge conditions that will
errantly wake them up in the middle of the night occasionally, they can make
that decision to put that tax on their own lives, not someone else's.

Then the SREs are in every single code review.

~~~
quanticle
This works only as long as you have a single application, or a bunch of
independent applications, each with their own team. With any kind of
"platform" or "framework" or shared service architecture, the incentive will
be for the devs on the application teams to do as little logging as possible,
because what will inevitably happen is that failures and errors with "unknown"
causes will be marked as platform failures and will wake the platform team in
the middle of the night. At that point, you're back to where you started. I've
been in that scenario multiple times, and believe me, there are few things
worse than trying to debug someone else's code at three in the morning to try
to determine if the page that woke me up was a legitimate platform issue or if
it's due to application code that's misbehaving or misusing the platform.

Making devs carry pagers certainly _helps_ , but it's a mistake to think that
it's a panacea, and that forcing devs to carry pagers will suddenly make them
write code with perfect logging and perfect error handling.

~~~
dredmorbius
Nobody's arguing that it's a panacea. Only that raising awareness of
operational issues, reliability, and lost sleep does tend to put feet to the
fire.

------
ChuckMcM
Great writeup. Should be in any operations handbook. One of the challenges
I've found has been dynamic urgency, which is to say something is urgent when
it first comes up, but now that it's known and being addressed it isn't urgent
anymore, unless there is something else going on we don't know about.

Example: you get a server failure which affects a service, and you begin
working on replacing that server with a backup. But a switch is also dropping
packets, so you are getting alerts on degraded service (symptom) while
believing you are fixing the cause (the down server), when in fact you will
still have a problem after the server is restored. So my challenge is figuring
out
how to alert on that additional input in a way that folks won't just say "oh
yeah, this service, we're working on it already."

~~~
orangesareok
This is definitely the biggest problem where I'm working now. We have a lot
of monitoring via Wily Introscope, but the biggest thing is relating failures
of different components together. E.g. one service layer fails so some queue
gets backed up so some application server starts timing out.

The amount of noise that starts coming in when there is some major outage (say
some mainframe system fails) is ridiculous.
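
One blunt way to trim that noise (a sketch, nothing Introscope-specific) is to
fold alerts that fire close together into a single incident:

    from datetime import datetime, timedelta

    def group_into_incidents(alerts, window=timedelta(minutes=5)):
        """Fold alerts firing within `window` of the previous one into one incident."""
        incidents = []
        for ts, name in sorted(alerts):
            if incidents and ts - incidents[-1][-1][0] <= window:
                incidents[-1].append((ts, name))
            else:
                incidents.append([(ts, name)])
        return incidents

    t0 = datetime(2015, 1, 1, 3, 0)
    cascade = [(t0, "mainframe_down"),
               (t0 + timedelta(minutes=2), "queue_backlog"),
               (t0 + timedelta(minutes=4), "appserver_timeouts")]
    print([[n for _, n in i] for i in group_into_incidents(cascade)])
    # [['mainframe_down', 'queue_backlog', 'appserver_timeouts']] -- one page, not three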

Right now where I work they solve it by throwing manpower at the problem tbh.

It takes a lot of work by the application owners all working together to
really get a coherent picture of how the services are interdependent, but the
applications are so large and the code so old (normal problems I guess a lot
of companies face) that it's almost impossible to find people who have a
complete end-to-end understanding of most transactions.

Side note: my only monitoring experience is with Wily - anyone have opinions
to share on it?

------
jakozaur
That's a harder problem than I originally realized. It's easy to write noisy
alerts, and super easy to not have them at all (and miss some issues).

It's hard to tune them so the signal-to-noise ratio is high.

~~~
dchichkov
Yep. Extracting meaningful information out of logs automatically is probably
an AI-complete problem...

Correct me if I'm wrong, but AFAIK the current state-of-the-art solution to
the alerts/log-filtering problem is: "log everything & feed these logs into a
real-time search engine that produces dashboards/alerts", like
_elasticsearch/kibana_. No? Curious, is that the approach being used
internally at Google right now? BTW, the article stated the problem and
desired outcome, but not the solution. (?)

~~~
praptak
Alerting and monitoring is not about logs. Applications export interesting
signals directly in a way understood by a monitoring service like Nagios,
which stores the samples, draws nice graphs and supports flexible alert
definition logic.
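
For instance, an application can keep its interesting counters in memory and
serve them over HTTP for the monitoring system to poll. A minimal sketch (the
/varz path and the stat names are my invention):

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    STATS = {"requests": 0, "errors": 0, "queue_depth": 0}

    class StatsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/varz":  # hypothetical stats endpoint
                body = json.dumps(STATS).encode()
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.end_headers()
                self.wfile.write(body)
            else:
                self.send_error(404)

    # The app increments STATS inline as it serves traffic; the monitoring
    # system polls /varz, stores the samples, graphs them, and alerts.
    if __name__ == "__main__":
        HTTPServer(("", 8000), StatsHandler).serve_forever()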

~~~
dchichkov
Well, to me "applications export _interesting_ signals directly in a way
understood by monitoring service" feels like a legacy approach. It places the
burden of decision "what is an _interesting_ alert signal" and burden of
_structuring_ the log file output on the software developer! And it places
that burden at an inconvinient time, when the system is still in the making.

On the other hand, by logging everything as text and then running an
intelligent/structuring real-time search engine over the logs, one can make
or modify these decisions at a later time. And it can be done by both devs
and ops, without touching the source code!
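
A toy version of that schema-on-read idea: the application writes plain text,
and the structure lives in a pattern you can change at query time without
touching the source (the log format below is invented):

    import re

    LOG_LINES = [
        "2015-01-01T03:00:01 WARN db connect failed host=db01 latency_ms=5004",
        "2015-01-01T03:00:02 INFO request ok host=app01 latency_ms=12",
        "2015-01-01T03:00:03 WARN db connect failed host=db01 latency_ms=5010",
    ]

    # Structure is imposed at read time; edit the pattern, not the application.
    pattern = re.compile(r"(?P<ts>\S+) (?P<level>\w+) (?P<msg>.+?) "
                         r"host=(?P<host>\S+) latency_ms=(?P<ms>\d+)")

    slow = [m.groupdict() for m in map(pattern.match, LOG_LINES)
            if m and int(m.group("ms")) > 1000]
    print(len(slow), "slow events on", {e["host"] for e in slow})  # 2 slow events on {'db01'}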

~~~
thrownaway2424
That seems silly though. I can't replace stats on a thing that normally takes
50 usecs with a log line, because it will take more than that long just to log
the fact, and an insane amount of CPU to analyze such a thing. The large-scale
systems that I personally operate produce a few KB per minute in structured
stats, a few MB per second in structured logs, and hundreds of MB per second
in unstructured text logs. I know which of these I'd rather use for
monitoring.

~~~
dchichkov
To thrownaway2424: what seems silly is that processing a few MB per second of
unstructured text logs with a real-time search engine seems impossible to you.
Think web crawlers. Search engines are efficient...

~~~
thrownaway2424
What do you use to monitor the "real time search engine"?

~~~
dchichkov
Is that a joke question? The one that I've used is elasticsearch/kibana. And
usually one would be using elasticsearch to monitor the elasticsearch :)

That's the good thing about this setup, you have all the logs from all your
applications (think custom text logs from your routers, your custom
applications, temperature sensors, syslogs, windows servers) aggregated in one
place. And when something happens (at a particular moment in time, or with a
particular machine, or with a particular key) suddenly you are able to
search/drill down and locate the actual cause. And maybe even configure a
dashboard or make a plot that would show when this problem was showing up.

A scalable real-time search engine with the ability to create
trends/dashboards is one powerful toy ;) It may be ridiculous and silly, but
it is an immensely powerful approach.

------
jonbarker
Where I work, at a mobile ad network, they put everyone on call on a rotating
basis even if they are not devops or server engineers. We use Pager Duty and
it works well. Since there is always a primary and secondary on call person,
and the company is pretty small and technical, everyone feels "responsible"
during their shifts, and at least one person is capable of handling rare,
catastrophic events. I often wonder which is more important: good docs on
procedures for failure modes or a heightened sense of responsibility. A good
analogy may be the use of commercial airline pilots. They can override
autopilot, but I am told rarely do. The safest airlines are good at
maintaining their heightened sense of vigilance despite the lack of need
for it 99.999% of the time.

------
leef
"If you want a quiet oncall rotation, it's imperative to have a system for
dealing with things that need timely response, but are not imminently
critical."

This is an excellent point that is missed in most monitoring setups I've seen.
A classic example is some request that kills your service process. You get
paged for that, so you wrap the service in a supervisor-like daemon. The
immediate issue is fixed and, typically, any future causes of the service
process dying are hidden unless someone happens to be looking at the logs one
day.

I would love to see smart ways to surface "this will be a problem soon" on
alerting systems.

~~~
peterwwillis
Re: "this will be a problem soon"? Metrics trending. Look for changes in your
metrics to spot potential problems and plan for the future. This is done quite
often in QA, for example, to look for issues between releases, and it can be
done at both the macro and micro level on continuous delivery services'
metrics.
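
A minimal sketch of such trending: fit a line to recent samples and estimate
when a metric will cross its limit (numbers invented):

    def hours_to_limit(samples, limit):
        """Least-squares slope over (hour, value) samples; time until `limit`."""
        n = len(samples)
        mean_t = sum(t for t, _ in samples) / n
        mean_v = sum(v for _, v in samples) / n
        slope = (sum((t - mean_t) * (v - mean_v) for t, v in samples) /
                 sum((t - mean_t) ** 2 for t, _ in samples))
        if slope <= 0:
            return None  # not growing; nothing to forecast
        return (limit - samples[-1][1]) / slope

    # Disk usage in GB, sampled hourly, on a 500 GB volume.
    disk = [(0, 400), (1, 405), (2, 411), (3, 416), (4, 421)]
    print(f"disk full in ~{hours_to_limit(disk, 500):.0f} hours")  # ~15: ticket, not page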

~~~
robewaschuk
I think anything that requires "spotting potential problems" is only a partial
solution. I've never seen a compelling system that can look at all the metrics
and (with reasonable precision and recall) spot and summarize changes that are
actually problematic and surprising to humans. It's definitely a necessary
part of observing what's going on (and quickly eliminating hypotheses like
"maybe we're out of CPU!"), for sure.

The subcritical alerts I think of are more things like "Well, the database is
_getting_ full, but it's not full yet." Or to borrow someone else's example,
"we put in this daemon restarter when it was dying once a week; now it's dying
every few minutes and we're only surviving because our proxy is masking the
problem but soon it's going to take the whole site down."

~~~
dsr_
These subcritical alerts deserve better but different handling: they can
almost always be delivered to a non-paging email address, either a relevant
internal mailing list or a ticket queue, where they can be investigated during
normal office hours.

The other useful tip I have is to put URLs to internal wikis and/or tickets in
the alert body. We write documentation for these to a 3AM standard: if I can't
understand it immediately after being woken up at 3AM, it's not clear or
actionable enough.
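
A sketch of both tips together: route by severity, and bake the runbook URL
into the alert body (the URL and helper functions are placeholders):

    RUNBOOKS = {"db_disk_filling": "https://wiki.example.internal/runbooks/db-disk"}

    def send_page(body):
        print("PAGE:\n" + body)    # wakes a human: urgent, important, actionable, real

    def file_ticket(body):
        print("TICKET:\n" + body)  # subcritical: handled during office hours

    def route(alert, severity):
        body = f"{alert}\nrunbook: {RUNBOOKS.get(alert, 'none (write one!)')}"
        if severity == "page":
            send_page(body)
        else:
            file_ticket(body)

    route("db_disk_filling", "ticket")  # getting full, but not full yet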

------
peterwwillis
Most of this appears to be just end-to-end testing, and the question of
whether you're alerting on a failure of the entire application stack or just
individual components. He probably got paged on too many individual alerts
rather than an actual failure to serve data, which I agree would be annoying.

In a previous position, we had a custom ticketing system that was designed to
also be our monitoring dashboard. Alerts that were duplicates would become
part of a thread, and each was either its own ticket or part of a parent
ticket. Custom rules would highlight or reassign parts of the dashboard, so
critical recurrent alerts were promoted higher than urgent recurrent alerts,
and none would go away until they had been addressed and closed with a
specific resolution log. The whole thing was designed so a single noc engineer
at 3am could close thousands of alerts per minute while logging the reason
why, and keep them from recurring if it was a known issue. The noc guys even
created a realtime console version so they could use a keyboard to close
tickets with predefined responses just seconds after they were opened.
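
The deduplication core of a system like that fits in a few lines; a sketch
with invented field names:

    open_tickets = {}  # (host, check) -> ticket

    def ingest(host, check, detail):
        key = (host, check)
        if key in open_tickets:
            open_tickets[key]["thread"].append(detail)  # duplicates join the thread
        else:
            open_tickets[key] = {"thread": [detail]}    # first occurrence opens a ticket

    def close(host, check, resolution):
        ticket = open_tickets.pop((host, check))  # a recurrence reopens fresh
        ticket["resolution"] = resolution         # kept for the resolution log

    ingest("db01", "disk_full", "97%")
    ingest("db01", "disk_full", "98%")  # threaded, not a second ticket
    print(len(open_tickets))            # 1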

The only paging we had was when our end-to-end tests showed downtime for a
user, which were alerts generated by some paid service providers who test your
site around the globe. We caught issues before they happened by having
rigorous metric trending tools.

~~~
robewaschuk
I don't think it's end-to-end testing because "testing" to me implies a
synthetic environment. This is about instrumenting and monitoring the
production system at scale, and learning about the right things at the right
time.

It certainly shares some things with end-to-end testing, and blackbox
monitoring is very useful for finding high level problems with any complex
networked system.

~~~
peterwwillis
So he's talking about system testing and not end-to-end testing. I suppose if
your application is really simple, system testing is fine. But if your QA
group ever starts automating tests, it's time to re-evaluate.

Blackbox monitoring is (imho) only appropriate for 3rd parties. If it's part
of your company, it shouldn't be a black box; that means someone got lazy and
didn't demand the devs provide an API.

Also, I'm sorry but this really gets to me: at what point are we talking about
'at scale' ? I think it's whenever tons of money is riding on your site's
availability and an unexpected failure causes customers to complain.
Immediately VPs start screaming "WE NEED TO SCALE UP!!" and then they mandate
some half-assed implementation of the comprehensive monitoring solution they
claimed was unnecessary just a month before. But maybe I'm just jaded.

------
shackattack
Thanks for posting this! I'm on the product team at PagerDuty, and this lines
up with a lot of our thinking on how to effectively design alerting + incident
response. I love the line "Pages should be urgent, important, actionable, and
real."

~~~
robewaschuk
I'm always happy to chat about this topic; feel free to drop me a line.

------
gk1
Here's another good writeup on effective alerting, by a former Google Staff
Engineer:
[http://blog.scalyr.com/2014/08/99-99-uptime-9-5-schedule/](http://blog.scalyr.com/2014/08/99-99-uptime-9-5-schedule/)

------
Someone1234
Why does a company the size of Google even have call rotations? Shouldn't they
have 24/7 shifts of reliability engineers who can manually call in additional
people as and when they're needed?

I can totally understand why SMBs have rotations. They have fewer staff. But a
monster corporation? This seems like lame penny pinching. Heck for the amount
of effort they're clearly putting into automating these alerts, they could
likely use the same wage-hours to just hire someone else for a shift. Heck
with an international company like Google they could have UK-based staff
monitoring US-based sites overnight and vice versa. Keep everyone on 9-5 and
still get 24 hour engineers at their desks.

~~~
twp
Google does spread oncall rotations across multiple timezones. Most SREs are
oncall only during the day, with the local nightshift being somebody else's
dayshift.

For a more detailed look at Google's SRE operations, watch Ben Treynor's
excellent talk "Keys to SRE":
[https://www.usenix.org/conference/srecon14/technical-sessions/presentation/keys-sre](https://www.usenix.org/conference/srecon14/technical-sessions/presentation/keys-sre)

~~~
retroencabulato
That was an insightful talk, thanks.

------
ecaron
Here's the link to it as a PDF for anyone else wanting a printable copy to pin
to their wall:
https://docs.google.com/document/export?format=pdf&id=199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q

------
AloisReitbauer
Good article. Alerting systems unfortunately are still at the same level they
were decades ago. Today we work in highly distributed environments that scale
dynamically, and finding symptoms is a key problem. That is why a lot of
people alert on causes or anomalies. In reality they should just detect them
and log them for further dependency analysis once a real problem is found. We,
for example, differentiate between three levels of alerts: infrastructure
only, application services, and users. Our approach is to have NO alerts at
all but to monitor a ton of potential anomalies. Once these anomalies have
user impact we report back problem dependencies.
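
A toy version of that layering (the layer names and wiring are mine, not
Ruxit's):

    anomalies = []  # (layer, name); layers: "infra", "service", "user"

    def record(layer, name):
        anomalies.append((layer, name))  # detect and log, but don't alert...
        if layer == "user":              # ...until there is user impact
            context = [a for a in anomalies if a[0] != "user"]
            print(f"PROBLEM: {name}; suspected dependencies: {context}")

    record("infra", "node3 high GC pauses")            # logged silently
    record("service", "checkout queue backing up")     # logged silently
    record("user", "checkout latency over threshold")  # opens a problem, with context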

If you are interested you can also get my point of view from my Velocity talk
on Monitoring without alerts.
[https://www.youtube.com/watch?v=Gqqb8zEU66s](https://www.youtube.com/watch?v=Gqqb8zEU66s).
Also check out www.ruxit.com and let me know what you think of our approach.

------
icco
This is huge. One of the big benefits dev teams get from bringing an SRE team
onto their project is learning things like this and how to run a sustainable
oncall rotation.

------
dhpe
My startup [http://usetrace.com](http://usetrace.com) is a web monitoring
(+regression testing) tool with the "monitor for your users" philosophy
mentioned in Rob's article. Monitoring is done at the application/feature
level, so alerts are always about a feature visible to the users.

------
omouse
This was very informative, I like the idea of monitoring symptoms that are
user-facing rather than causes which are devops/sysadmin/dev-facing. I'm just
thankful that my next project doesn't require pager duty.

------
annnnd
Can't access the site, seems like there's some quota on docs.google.com...
Does anyone have a cached version? (WebArchive can't crawl it due to
robots.txt)

------
0xdeadbeefbabe
So I guess the author uses a smart phone as a pager, but given his passion for
uptime, reliability, latency, etc., I wonder if he has experimented with an
actual pager.

~~~
apposite
We use a variety of escalation techniques. As mentioned down thread, pagers
are actually very unreliable. Some SREs carry pagers and mobiles. Most SREs
carry phones with escalation via SMS, an actual telephone call from an
automated system (after a delay), and/or escalation via a network connection
over the data network. Phone calls are way way more reliable than pagers.

Unacknowledged pages escalate to a secondary oncaller (e.g. if the oncall is
out of range/in a tunnel, under a bus) and tertiary depending on configuration
(and then it loops, or falls to another rotation, again depending on
configuration). The code and services that do escalations are deliberately
and carefully vetted to have minimal overlap with production systems (whose
failure they might be alerting people to).
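
A sketch of that escalation walk (the rotation names and hop limit are
invented, and the ack-timeout mechanics are elided):

    import itertools

    ROTATION = ["primary", "secondary", "tertiary"]
    ACK_TIMEOUT_S = 300  # each oncaller's window to ack before the page moves on

    def page(alert, is_acked, max_hops=6):
        """Walk the rotation, looping, until someone acknowledges the page."""
        for hop, oncaller in enumerate(itertools.cycle(ROTATION)):
            if hop >= max_hops:
                return None  # in reality: fall through to another rotation
            print(f"notify {oncaller}: {alert} (ack within {ACK_TIMEOUT_S}s)")
            if is_acked(oncaller):
                return oncaller

    # Primary is in a tunnel; the secondary picks it up.
    print(page("frontend 500s", is_acked=lambda who: who == "secondary"))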

~~~
shackattack
This sort of mirrors how we do things at PagerDuty. Phone may be more reliable
than SMS, but every telephony/messaging gateway fails sometimes. We use
something like a dozen different phone/SMS gateways to prevent single points
of failure, and do end-to-end testing of our SMS providers to check their
uptime and latency.

Responders can customize their notification methods (push, SMS, phone, and
email) and rules, so you can do things like get a lightweight push
notification when an alert happens, and then a phone call 2 minutes later if
you haven't acknowledged the incident. Teams get escalation timeouts that
forward alerts up the chain if the primary hasn't responded after a period of
time.
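
Those per-responder rules could be modelled like this (the two-minute phone
call comes from the comment above; the rest is invented):

    # (delay_seconds, method), tried in order until the incident is acknowledged.
    RULES = [(0, "push"), (120, "phone_call")]

    def notifications(incident, acked_after=None):
        for delay, method in RULES:
            if acked_after is not None and acked_after <= delay:
                break  # acknowledged before this rule fires
            print(f"t+{delay}s: {method} for {incident}")

    notifications("api 5xx spike", acked_after=30)  # push only
    notifications("api 5xx spike")                  # push, then a call at 2 minutes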

------
sabmd
"Any alert should be for a good cause" sounds good to me.

------
lalc
I just want to say: HN is bursting with great articles today.

------
wanted_
Great article @robewaschuk :)

\-- Marcin, former Google SRE

------
zubairismail
In today's world, 90% of bloggers rely on Google for their living

------
djclegit
Very cool

