
Incident management at Google – adventures in SRE-land - kungfudoi
https://cloudplatform.googleblog.com/2017/02/Incident-management-at-Google-adventures-in-SRE-land.html
======
no_wizard
I find this bit to be particularly insightful:

 _"Can I handle this? What if I can’t?" But then I started to work the
problem in front of me, like I was trained to, and I remembered that I don’t
need to know everything — there are other people I can call on, and they will
answer. I may be on point, but I’m not alone_

It might be because I'm currently training people in this realm, and this is
one of their biggest fears, or maybe because it was my biggest fear, but it's
so true. We're a team. We're here to help. At least if your SRE org is any
good. Never be afraid to ask for help, and never be afraid to admit you don't
know something or it might be outside of your comfort zone.

I'll take someone who is willing to learn and readily admits knowledge gaps
over someone who isn't any day of the week. Great book they're working on, great
article on this. So many gems, but this one stuck out for me, and it's pretty
relevant to me right now.

~~~
AdmiralAsshat
I notice this a lot with the team I work with. Our team's tickets are not
assigned automatically: there's simply a queue and people are encouraged to
grab what they can. Unfortunately, what I've often seen is that people see
something in the description that they're not familiar with and refuse to
touch the ticket because they don't want to ask for help, which means that it
languishes in the queue. The end result is that the same person ends up
taking the same kind of ticket over and over because they're the only one who
has any familiarity with the program in question.

~~~
no_wizard
I've seen this be the case in other places. We used to have a queue-based
system. Then we went to an auto-assignment process. One of the things it has
done is open up communication on our team, since when we did it, we made it a
policy that if you aren't familiar with something that gets assigned to you,
you first ask for help before trading (we have a formal mechanism for
soliciting this).

1\. It encourages everyone to be aware that we're here to help

2\. It encourages learning; the number of requests for help decreases over time
once experience and familiarity ramp up

3\. It exposes everyone to different types of work.

There are exceptions of course. Our P0 bucket has a dedicated set of people
that handle it, and they are hand-picked because those are house-on-fire
situations that need the experience. It's also the one we put the juniors on
when they know the ropes and are ready to take on the critical tasks so they
can advance themselves (it's good experience for career and personal growth, I
feel).

The other thing I like about this is that as a manager, I can actively
encourage behavior by claiming a ticket, or helping someone with a ticket. I
really want that culture effect to happen from the top down.
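
A minimal Python sketch of the auto-assignment policy described in this comment, assuming a simple round-robin rotation (the comment does not say how tickets are actually distributed). All names here (`Ticket`, `Assigner`, `TradeNotAllowed`) are hypothetical; the only rule taken from the comment is that you ask for help before trading a ticket away.

```python
from collections import deque
from dataclasses import dataclass
from typing import Iterable, Optional


class TradeNotAllowed(Exception):
    """Raised when someone tries to trade a ticket without asking for help first."""


@dataclass
class Ticket:
    ticket_id: str
    area: str
    assignee: Optional[str] = None
    help_requested: bool = False


class Assigner:
    """Round-robin auto-assignment (an assumption, not the commenter's system)."""

    def __init__(self, members: Iterable[str]) -> None:
        self._rotation = deque(members)
        if not self._rotation:
            raise ValueError("at least one team member is required")

    def assign(self, ticket: Ticket) -> str:
        # Take the member at the front, then move them to the back of the rotation.
        member = self._rotation[0]
        self._rotation.rotate(-1)
        ticket.assignee = member
        return member

    def request_help(self, ticket: Ticket) -> None:
        # Stand-in for the "formal mechanism" the comment mentions.
        ticket.help_requested = True

    def trade(self, ticket: Ticket, new_assignee: str) -> None:
        # Policy from the comment: ask for help before trading.
        if not ticket.help_requested:
            raise TradeNotAllowed("ask for help before trading this ticket")
        ticket.assignee = new_assignee
```

The point of the sketch is only that the trade path is gated on a help request, which is what nudges people toward unfamiliar work instead of around it.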

------
mikecb
The coolest thing I took away from the SRE book was this progression of system
operations from manual, to scriptable, to automated, to a fourth category I
hadn't even known existed: autonomous. The idea that you can keep moving up
this hierarchy of exception management beyond even Chef and Puppet, and
systems will be able to heal themselves, is a pretty cool one.

As a manager, this made the concept of 20% time a lot more clear. These are
people with the knowledge and incentive to build a hierarchy of systems that
progressively remove risk from their work. This is in fact their primary
business objective. And we need to make sure they have time to do that, vs
working them to death with manual remediation. It's a great lesson.

Incidentally, Stackdriver contains a simple alerting and incident management
tool that's really nice to use. Hopefully it gets more robust as time goes on
and larger and more complex orgs move to their cloud. Edit: not Outalator.

~~~
asuffield
(I'm a Google SRE. My opinions are my own.)

That's not what our 20% time is for, and 20% is way too small a number for
that purpose. "20% time" (the way we use the term) is for personal/career
growth/scratching itches.

Time spent on building systems that make our service better is my primary job.
Manual remediation ("toil") is something to be tracked as a dangerous
antipattern that must not be allowed to take over.

Toil and oncall response should be less than 20% of my time, together. At
least half my time should go into engineering projects. If the level of toil
is in excess of 50% of team activity then I would expect only percussive
intervention to get the team out of this situation.
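
The thresholds in this comment can be written down as a small check. This is a hedged Python sketch, not anything Google publishes: the `TimeAllocation` class, its field names, and `check_allocation` are hypothetical, and the 50% toil threshold (stated for a team in the comment) is applied here to whatever record is passed in, person or team.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TimeAllocation:
    """Hours spent per category over some period (hypothetical structure)."""
    toil_hours: float
    oncall_hours: float
    engineering_hours: float
    other_hours: float = 0.0

    @property
    def total(self) -> float:
        return (self.toil_hours + self.oncall_hours
                + self.engineering_hours + self.other_hours)


def check_allocation(a: TimeAllocation) -> List[str]:
    """Return a list of findings; an empty list means all thresholds are met."""
    if a.total <= 0:
        raise ValueError("total hours must be positive")

    findings = []
    reactive = (a.toil_hours + a.oncall_hours) / a.total
    engineering = a.engineering_hours / a.total
    toil = a.toil_hours / a.total

    # "Toil and oncall response should be less than 20% of my time, together."
    if reactive >= 0.20:
        findings.append(f"toil + oncall is {reactive:.0%}, not under 20%")
    # "At least half my time should go into engineering projects."
    if engineering < 0.50:
        findings.append(f"engineering is {engineering:.0%}, under 50%")
    # "If the level of toil is in excess of 50% ..."
    if toil > 0.50:
        findings.append(f"toil is {toil:.0%}, over 50%: needs intervention")
    return findings
```

For a 40-hour week with 3 hours of toil, 4 of oncall, 25 of engineering, and 8 of everything else, the check returns no findings; a week with 24 hours of toil trips all three.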

~~~
mikecb
Great comment, thanks for the clarification. I wasn't trying to say that 20% is
a magic number, just that it cemented the idea for me that engineering time,
and self-directed engineering time, is incredibly valuable for everyone, can be
justified, and should be zealously protected.

------
WestCoastJustin
FYI - it's linked to in the post, but in case it's not obvious, they have
posted the SRE book for free at
[https://landing.google.com/sre/book.html](https://landing.google.com/sre/book.html)

I'd highly recommend it if you're in the Ops field. Probably the best book out
there on current large-scale Ops practices.

~~~
ben_jones
It's a great book, very well written and fair. But the first time I read it I
suffered a certain amount of zealotry: "Google is amazing! I should rewrite
everything to be more like them!". Really, a caveat everyone should keep in
mind is that the book describes how Google built systems for Google. YMMV.

~~~
Tushon
Currently reading the book, and I could see the zealotry come out, but they
mention, repeatedly, that your own systems may not need the level of service
that Google built in, or your teams may not be big enough to justify it, etc. I
don't think it is quite fair to imply that they didn't give that thought. I
totally agree that it is a great book and should be encouraged throughout ops
orgs, devops roles, and developers who want to plan for the future, but not
every company needs the Google way to 100%, or maybe even 50% (as you
indicated as well).

------
twosheep
Maybe it's just me but I found the constant in-line plugs for the book to be
distracting -- footnotes would have been better.

Interesting write-up though

~~~
daenney
I had a similar reaction. I was a bit irked by it b/c it felt very pushy
towards the SRE book and broke me out of the flow of the article a few times.
10/10 on the book though, would recommend anyone to read it.

------
vgy7ujm
Is it just me, or are we seeing a trend, almost before the "new" SRE role has
become mainstream, of SREs turning into support technicians because that is
what is needed at most places that are not Google scale? The devaluation of
the sysadmin took some time; this is happening much faster. What will be the
next title when SRE can't get you a decent salary anymore? And why don't we
see the same with the SWE role? Is it just that business leaders see Ops as a
cost no matter what name it has?

------
saycheese
Anyone able to compare and contrast Google's "Wheel of Misfortune" with
Netflix's "Chaos Monkey" both in terms of the systems that enable them and the
operations that relate to them?

~~~
thesandlord
Chaos Monkey is more like Google's DiRT

[http://queue.acm.org/detail.cfm?id=2371516](http://queue.acm.org/detail.cfm?id=2371516)

~~~
Kostchei
and Dust

~~~
grf
Hasn't been a thing for several years, i.e. name consolidation. Also, a good
chunk of DiRT is now continuous and automated (not autonomous though).

Disclaimer: I work at Google and ran the DiRT team for a few years incl.
incident management itself.

