Hacker News

I don't think engineers can believe in no-blame analysis if they know it'll harm career growth. I can't unilaterally promote John Doe, I have to convince other leaders that John would do well the next level up. And in those discussions, they could bring up "but John has caused 3 incidents this year", and honestly, maybe they'd be right.



Would they? Having 3 outages in a year sounds like an organizational problem: not enough safeguards to prevent very routine human errors. But instead of worrying about that, we just assign a guy to take the fall.


If you work in a technical role and you _don't_ have the ability to break something, you're unlikely to be contributing in a significant way. Likely that would make you a junior developer whose every line of code is heavily scrutinized.

Engineers should be experts and you should be able to trust them to make reasonable choices about the management of their projects.

That doesn't mean there can't be some checks in place, and it doesn't mean that all engineers should be perfect.

But you also have to acknowledge that adding all of those safeties has a cost. You can be a competent person who requires fewer safeties, or a less competent one who requires more.

Which one provides more value to an organization?


The tactical point is to remove sharp edges. E.g., say there's a tool that optionally takes a region argument:

    network_cli remove_routes [--region us-east-1]
Blaming the operator, saying they should have known that running

    network_cli remove_routes
will take down all regions because the region wasn't specified, is exactly the kind of thing being called out here.

All of the tools need to not default to breaking the world. That is the first and foremost thing being pushed here. If an engineer is remotely afraid to come forward after an incident (beyond self-shame/judgement) and say "hey, I accidentally did this thing", then the situation will never get any better.

That doesn't mean that engineers don't have the ability to break things, but it means it's harder (and very intentionally so) for a stressed out human operator to do the wrong thing by accident. Accidents happen. Do you just plan on never getting into a car accident, or do you wear a seat belt?
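A minimal sketch of what "not defaulting to breaking the world" can look like. `network_cli` is the fictional tool from the example above, and argparse stands in for whatever the real tool uses; the point is simply that destructive scope must be stated explicitly rather than inferred from an omitted argument:

```python
import argparse

def build_parser():
    # Hypothetical, safer interface for the fictional network_cli tool:
    # the blast radius is a required, explicit choice.
    parser = argparse.ArgumentParser(prog="network_cli")
    sub = parser.add_subparsers(dest="command", required=True)
    remove = sub.add_parser("remove_routes")

    # Exactly one scope flag must be given; omitting both is an error.
    scope = remove.add_mutually_exclusive_group(required=True)
    scope.add_argument("--region",
                       help="remove routes in this one region only")
    scope.add_argument("--all-regions", action="store_true",
                       help="explicit opt-in to removing routes everywhere")
    return parser
```

With this shape, a bare `network_cli remove_routes` fails fast with a usage error instead of silently taking down every region; the operator has to type `--all-regions` to get the dangerous behavior.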


> Which one provides more value to an organization?

Neither; they both provide the same value in the long term.

Senior engineers cannot execute on everything they commit to without having a team of engineers they work with. If nobody trains junior engineers, the discipline would go extinct.

Senior engineers provide value by building guardrails to enable junior engineers to provide value by delivering with more confidence.


Well, if John caused 3 outages and his peers Sally and Mike each caused 0, it's worth taking a deeper look. There's a real possibility he's getting screwed by a messed-up org; he could also be doing slapdash work, or he seriously might not understand the seriousness of an outage.


John’s team might also be taking more calculated risks and running circles around Sally and Mike’s teams with respect to innovation and execution. If your organization categorically punishes failures/outages, you end up with timid managers who only play defense, probably the opposite of what the leadership team wants.


Worth a look, certainly. It's also very possible that John is upfront about honest postmortems and, like a good leader, takes the blame, whereas Sally and Mike are out all day playing politics, looking for ways to shift blame so nothing has their name attached. At most larger companies, that's how it goes.


Or John's work is in frontline production use and Sally's and Mike's is not, so there's different exposure.


You're not wrong, but it's possible that the organization is small enough that it's just not feasible to have enough safeguards that would prevent the outages John caused. And in that case, it's probably best that John not be promoted if he can't avoid those errors.


Current co is small. We are putting in the safeguards from day 1. Well, okay, technically like day 120; the first few months were a mad dash to MVP. But now that we have some breathing room, yeah, we put a lot of emphasis on preventing outages, detecting and diagnosing them promptly, documenting them, doing the whole 5-whys thing, and preventing them in the future. We didn't have to; we could have kept mad dashing and growth hacking. But very fortunately, we have a great culture here (the founders have lots of hindsight from past startups).

It's like a seed for crystal growth. Small company is exactly the best time to implement these things, because other employees will try to match the cultural norms and habits.


Well, I started at the small company I'm currently at around day 7300, where "source control" consisted of asking the one person who was in charge of all source code for a copy of the files you needed to work on, and then giving the updated files back. He'd write down the "checked out" files on a whiteboard to ensure that two people couldn't work on the same file at the same time.

The fact that I've gotten it to the point of using git with automated build and deployment is a small miracle in itself. Not everybody gets to start from a clean slate.


> I have to convince other leaders that John would do well the next level up.

"Yes, John has made mistakes and he's always copped to them immediately and worked to prevent them from happening again in the future. You know who doesn't make mistakes? People who don't do anything."


You know why SO-teams, firefighters and military pilots are so successful?

-You don't hide anything

-Errors will be made

-After training/mission everyone talks about the errors (or potential ones) and how to prevent them

-You don't make the same error twice

Being afraid to make errors and learn from them creates a culture of hiding, a culture of denial and especially being afraid to take responsibility.


You can even make the same error twice, but you'd better have a much better explanation the second time around than you had the first, because by then you already knew that what you did was risky and/or failure-prone.

But usually it isn't the same person making the same mistake; usually it's someone else making the same mistake, because nobody thought of updating the processes/documentation to the point that the error would have been caught in time. Maybe they'll fix that after the second time ;)


Yes. The AAR process in the army was good at this up to the field-grade level, but got hairy on G/J-level staffs. I preferred being an S-6 to a G-6 for that reason.


There is no such thing as "no-blame" analysis. Even in the best organizations making the best effort to avoid it, there is always a subconscious "this person did it". It doesn't help that these incidents serve as convenient leverage for others to climb their own career ladder at your expense.



