Yes this is also how it's done at other large orgs. But one key to a quick respo...

matwood · on June 7, 2019

> fingers are never publicly/embarrassingly pointed nor are people blamed

The other problem is that it is almost never a single person's or team's fault. The reality is that it is everyone's fault, and as soon as people accept that, they can prevent it in the future.

Let's take a contrived case where I introduce a bug that floods the network with packets and takes down the network. Is it my fault? Sure. But what about pre-deployment testing? What about monitoring - were there no alarms set up to detect high network load? What about automatic circuit breakers that should have taken the machine offline, and instead let a single machine take down the whole system?

The point is that blaming the person who introduced a code bug is lazy, and does nothing to prevent the issue in the future. When a failure like the one at Google occurs, it is an organizational failure, not the failure of a single person or team. That is why blaming people is generally not productive.
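To make the circuit-breaker layer from the contrived case above concrete, here is a minimal sketch. It assumes a per-host packet rate is sampled periodically; the class name `NetworkLoadBreaker`, the threshold, and the window size are all hypothetical and not taken from any real system discussed in the thread.

```python
from collections import deque


class NetworkLoadBreaker:
    """Hypothetical guard that isolates a host after sustained high packet rates."""

    def __init__(self, max_packets_per_sec, window):
        self.max_packets_per_sec = max_packets_per_sec  # assumed threshold
        self.samples = deque(maxlen=window)  # keeps only the most recent samples
        self.tripped = False

    def record(self, packets_per_sec):
        """Record one sample; return True while the host should stay online."""
        if self.tripped:
            return False  # once tripped, stay offline until a human resets it
        self.samples.append(packets_per_sec)
        window_full = len(self.samples) == self.samples.maxlen
        # Trip only when every sample in a full window exceeds the threshold,
        # so a single short spike does not take the machine offline.
        if window_full and all(s > self.max_packets_per_sec for s in self.samples):
            self.tripped = True
        return not self.tripped
```

The point of the sketch is the layering: even if the bug ships and monitoring misses it, a guard like this limits the damage to one machine.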

ethbro · on June 7, 2019

I've only been tangentially pulled into high severity incidents, but the thing that most impressed me was the quiet.

As mentioned in this thread, it's a lot like listening to air traffic comm chatter.

People say what they know, and only what they know, and clearly identify anything they're unsure about. Informative and clear communication matters more than brilliance.

Most of the traffic is async task identification, dispatch, and then reporting in.

And if anyone is screaming or gets emotional, they should not be in that room.

tetha · on June 7, 2019

Someone at our place recently commented that the ops team during an incident strongly feels like NASA mission control in critical moments[1]. I wanted to protest, but that's surprisingly accurate.

> And if anyone is screaming or gets emotional, they should not be in that room.

If someone starts yelling in my incident war room for no reason, they get thrown out. I'm a calm and quiet person, but people bugging others during a major incident is one of the few things that make me mad.

1: https://youtu.be/Y0yOTanzx-s?t=3059

di4na · on June 7, 2019

It is not surprising at all. Mission Control was forged in the fire (literally for Apollo 1) and they are one of the most visible "incident teams" we know about.

I highly recommend reading Gene Kranz's memoir "Failure Is Not an Option" if you work in that kind of environment.

ethbro · on June 7, 2019

I heard recently that he never said that.

Apparently it was mentioned when the Apollo 13 script writers were gathering stories at NASA, they liked it, and then gave it to the Kranz character.

Who then decided, "Hey, if everyone thinks I said it..." and used it as the title of his memoir.

di4na · on June 7, 2019

Yep exactly.

Moru · on June 7, 2019

When the incident is over, does it look like 55:50 in that video? :-)

tetha · on June 7, 2019

We recently had a 15-month-long project almost fail due to some stupid shit and a really wonky error that no one has understood so far. <Almost fail> as in "Keep the customer on the phone, we have 3 possible things to try, and don't hang up! No one leaves that call until I'm out of hacks to deploy!" That evening we had the entire ops team in Houston mode for several hours.

And yes, once we had a workaround in place that the customer accepted, we reacted like that. Except we also had our critical-incident whiskey go around. Then the CEO walked in to congratulate us on that project. Whoops. But he's a good sport, so good times. :)

throwaway_ac · on June 7, 2019

I have mixed feelings about the finger pointing/public embarrassment thing. Usually the SREs are mature enough because they have to be; however, the individual teams might not be when it comes to reacting to or handling the incident report/postmortem.

On a slightly different note, "low-level team to have at least one engineer on call at any given time" - this line itself is so true and at the same time it has so many things wrong. I'm not sure of the best way to put this modern-day slavery into words, given that I have yet to see any large org give days off to a low-level team's engineer just because they were on call.

KirinDave · on June 7, 2019

Having recently joined an SRE team at Google with a very large oncall component, fwiw I think the policies around oncall are fair and well-thought-out.

There is an understanding of how it impacts your time, your energy and your life that is impressive? To be honest, I feel bad for being so macho about oncall at the org I ran and just having the leads take it all upon ourselves.

wikibob · on June 7, 2019

What are the policies exactly? I’ve heard it’s equal time off for every night you are on call?

masto · on June 7, 2019

The SRE book (https://landing.google.com/sre/sre-book/chapters/being-on-ca...) says that engineers are compensated for being on call in the form of cash or time off.

Personally I think this is a fair system, and I would hardly call it slavery.

(disclaimer: am Google SRE)

virgilp · on June 7, 2019

It was paid or time off where I worked before. It's just being established where I work now, but what's discussed is 2x regular pay for working outside your work hours due to an incident. Doesn't seem like "slavery" to me.

Moru · on June 7, 2019

At the places I have worked (lots of different types of jobs), overtime used to be 2x pay or 2x off. None of them were IT-related though.

dbcurtis · on June 7, 2019

At one Large Org where I worked, the Pager Bearer was paid 25% time for all the time they were on the pager, and standard overtime rates (including weekend/holiday multipliers) from the time the pager went off until they cleared the problem and walked out the plant door, or logged out if the problem was diagnosed/fixed remotely.

25% time for carrying the pager was to compensate for: 1) Requirement to be able to get to the plant in 30 minutes. Fresh snow? Too bad, no skiing for you this weekend. 2) You must be sober and work-ready when the pager goes off. At a party? Great, but I hope you like cranberry juice.

As the customer who signed the time cards for the pager duty, I thought that was not only fair, but it also drove home to me as a manager that the cost was real and was coming out of my budget, not some general IT budget that someone else took the hit for. This is one case where "You want coverage for your service? Give me a charge code for the overtime." was not just senseless bureaucratic friction; it led to healthier, business-driven decisions.
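The arithmetic of the pager scheme described above can be sketched roughly as follows. Only the 25% standby rate comes from the comment; the function name, the hourly rate, and the 1.5x overtime multiplier are hypothetical placeholders. The sketch also assumes that standby pay stops while a callout is being paid at overtime, which the comment does not specify.

```python
def pager_pay(hourly_rate, on_pager_hours, callout_hours,
              overtime_multiplier=1.5, standby_fraction=0.25):
    """Estimate pay for one pager shift (illustrative only).

    standby_fraction=0.25 reflects the "25% time" in the comment above;
    overtime_multiplier is an assumed value, not one from the thread.
    """
    if callout_hours > on_pager_hours:
        raise ValueError("callout hours cannot exceed hours on the pager")
    # Assumption: hours spent on a callout are paid at overtime only,
    # not at overtime plus standby.
    standby_hours = on_pager_hours - callout_hours
    standby_pay = hourly_rate * standby_fraction * standby_hours
    callout_pay = hourly_rate * overtime_multiplier * callout_hours
    return standby_pay + callout_pay
```

With a made-up rate of 40 per hour, a 48-hour weekend on the pager with one 4-hour callout comes to 440 in standby pay plus 240 in overtime, which is the kind of visible, per-service cost the comment argues a manager should see.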