
It's about what broke, not who broke it - rodrodrod
https://rachelbythebay.com/w/2018/03/27/whowhat/
======
_n_b_
I work in the nuclear industry, where most places are pretty good about
maintaining a "blame-free" culture. You focus on what processes and procedures
failed, what controls were missing, etc., that allowed somebody to make a
mistake.

As this attitude was adopted, things shifted too far (at least in the opinion
of industry groups, and my observation) to the point where people
underperforming to the point of negligence weren't blamed, and the corrective
actions to prevent reoccurrences of problems they caused ended up being
cumbersome and expensive without really improving safety. (And in this
industry, everything relates back to safety.)

In recent years, things have shifted back towards a more pragmatic middle
ground. There are tools to assess if a problem was organizational (and it
still almost always is) or if there was some element of personal negligence
involved. This follows an industry-wide trend of trying to fix the real
problems that affect safety and operations, not over-engineer cumbersome
corrective actions.

~~~
lostcolony
Well, that's just it, the fact that underperformers aren't recognized is also
a symptom of the process, and the fix is fixing the process, not throwing away
the process.

Every problem is organizational, even those caused by individuals, because
it's the organization's job to recognize and remove those individuals where
appropriate.

~~~
lev99
In a bad organization many people that would perform to satisfactory levels
begin to underperform. One signal of this is when individual responsibility is
removed from the equation.

I enjoy organizations where individuals' expectations are clearly defined, and
I prefer if there are consequences for missing those expectations because I
feel like it increases the reliability of the team.

------
nevatiaritika
My manager at work especially has the reverse attitude where the person who
broke it is more significant than what broke/how we fixed it/ how to avoid it
in the future. I have seen people get taunted for a bug they caused two years
ago, a bug which didn't affect any revenue or was pretty easy to fix. And of
course it still gets pointed out during appraisals.

It's a nightmare, because there's no room for experiment left anymore. Everyone
just sticks to the template, afraid to do more than required, never deleting
unused code etc. An attitude like this never ever helps!

~~~
smallbigfish
We don't touch production. We don't upgrade. We are an X million company; we
can't afford the risks.

These are some of the excuses they put up.

And then they sit 10 years or more with that bad stuff in there, build even
uglier ways around it.

But the time comes to actually do something about it. And what was once a one-day
job becomes "we will hire a consultancy firm to guide us".

~~~
bigiain
Ha. "Outsourcing of blame"!

Outsourcing of blame - as a Service. Where's my VC???

~~~
leoc
Accenture. You've invented Accenture.

~~~
StavrosK
What do they do, exactly?

~~~
shoo
[http://exposingevilempire.com/accenture/bigtime-consulting/](http://exposingevilempire.com/accenture/bigtime-consulting/)

[http://exposingevilempire.com/accenture/bigtime-consulting2/](http://exposingevilempire.com/accenture/bigtime-consulting2/)

[http://exposingevilempire.com/accenture/bigtime-consulting3/](http://exposingevilempire.com/accenture/bigtime-consulting3/)

------
csours
I took down an assembly plant by clicking on a Network status icon from a
particular hardware supplier.

Over the weekend, firmware patches were applied, and the server rebooted.
After reboot, everything worked fine, so the tech marked the change successful
and went home.

Well, apparently the NICs would work just fine, but not all settings were
applied until you opened the UI provided by the vendor. When you opened the
UI, the final settings would be applied, and the NICs would reboot, just long
enough to kill TCP connections.

That loss of TCP connection killed the parent system, and then all the other
children systems also died when the parent died.

So who would you even blame there? The guy who set the tripwire? The guy who
tripped on the tripwire? The guy who designed a system that could be brought
down by a momentary loss of connection?

I'm lucky that my boss wasn't the type to point fingers, because I was the guy
who was there when it happened, and it sure got a lot of attention.

~~~
dozzie
> [...] not all settings were applied until you opened the UI provided by the
> vendor. [...] the NICs would reboot, just long enough to kill TCP
> connections.

The UI part suggests that it was Windows, and if it was, it's not quite a
case of "just long enough" to kill TCP connections, since you normally need
quite a lot of downtime to terminate a typical TCP session.

In Windows, if a NIC goes down, all the TCP connections that use the NIC get
closed immediately. (Or at least this was the case a few years ago. I had a
similar system with similar drawbacks deployed back then, though it was an
automated warehouse, not an assembly plant.)

> So who would you even blame there?

The idiots who designed the system to run on a non-industrial-grade operating
system. Windows was never a good choice to control industrial installations.

~~~
dfox
Windows is often the only vendor-supported choice for interfacing your
computer applications to PLCs and such things. Also most of the proprietary
protocols run over industrial ethernet are some kind of legacy serial (232,
485..) bytestream format wrapped in TCP and the software usually does not
handle loss of the TCP connection particularly gracefully. (on multiple
occasions I've seen rules like "reboot the whole installation on every shift
change" to "handle" the obvious reliability issues of such systems)

It is not about some small and well-defined set of "idiots"; it is essentially
an industry-wide design mistake.
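
A minimal sketch of what handling a dropped TCP connection more gracefully could look like, in Python. Everything here is an assumption for illustration: `connect_with_retry` is a hypothetical helper, not part of any vendor's software, and real PLC protocols would also need to re-establish session state after reconnecting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(connect: Callable[[], T], attempts: int = 5,
                       base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `connect` until it succeeds, backing off exponentially.

    `connect` is any zero-argument callable that opens the link and
    raises OSError (or a subclass such as ConnectionResetError) on failure.
    """
    for attempt in range(attempts):
        try:
            return connect()
        except OSError:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the failure to the caller.
            # Wait 0.5s, 1s, 2s, ... (with the defaults) before retrying.
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

With the standard library this might be called as `connect_with_retry(lambda: socket.create_connection((host, port), timeout=3))`, where `host` and `port` are placeholders. The point is only that a brief NIC reset becomes a short pause rather than a dead parent process.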

~~~
dozzie
> Windows is often the only vendor-supported choice for interfacing your
> computer applications to PLCs and such things.

Which is not a problem by itself, since a PLC, being industrial equipment,
should operate independently from non-industrial equipment. The problem is
idiots who think a desktop PC can reliably control a PLC in real time.

~~~
dfox
The problem is when you have some kind of process that is inherently controlled
not by the logic in the PLC, but by some external system (either because the
required data will not fit into the PLC's data memory or because it constantly
changes based on some external business processes).

A reasonable architecture for this kind of problem would be attaching some
server to the PLC as a peripheral, but it tends to be done the other way around.
As for the reasons, I speculate that it is simply the inertia of the typical PLC
programmer, compounded by reasoning along the lines of "nobody does that, so
it is not tested and we will hit unknown bugs in the PLC firmware itself."

------
aytekin
We have put in place a rule that has made our system very strong over the
years: We don’t care if you broke the site, just fix it quickly and, more
importantly, write a test that will catch the same problem if it happens again.

Every time someone breaks something, we get harder to break.
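
A minimal sketch of that kind of regression test, in Python. The function `parse_quantity` and the blank-input bug are hypothetical, made up purely for illustration; the idea is that the test reproduces the original failure and then stays in the suite for good.

```python
import unittest


def parse_quantity(raw: str) -> int:
    """Hypothetical function that once broke the site on blank input."""
    text = raw.strip()
    if not text:
        # The fix: treat blank input as zero instead of raising ValueError.
        return 0
    return int(text)


class ParseQuantityRegressionTest(unittest.TestCase):
    def test_blank_input_does_not_crash(self):
        # Reproduces the original failure; it must keep passing from now on.
        self.assertEqual(parse_quantity("   "), 0)

    def test_normal_input_still_works(self):
        # Guards against the fix breaking the ordinary case.
        self.assertEqual(parse_quantity(" 42 "), 42)
```

Run under any unittest-compatible runner, a later change that reintroduces the blank-input crash fails the build before it reaches the site.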

~~~
taneq
Sounds like your system is antifragile.

~~~
gowld
Robust. The word is "robust". We don't need to promote buzzwords.

~~~
taneq
I think there's a worthwhile distinction between 'robust', meaning 'able to
resist stresses', and 'antifragile', meaning 'able to react to stresses and
become stronger.'

------
CoolGuySteve
I used to think this way until I started working with someone who was _nearly
always_ the one who broke it. At some point we just had to face the fact that
his work was unreliable even after significant mentoring.

If the tasks were difficult that would be one thing, but I'm talking about
stuff like committing code to prod that was clearly never even executed once.

~~~
altano
Sounds like you have a code review and automated testing problem and not a bad
coworker problem.

~~~
lawn
Sounds like both.

------
ComputerGuru
I do a lot of open source work and unfortunately a very common poison is
focusing on “who broke it,” which is especially disparaging when done in
public. A particularly nasty habit is when outsider Alice opens a GitHub
issue saying “xxxx is broken” and developer Bob replies with “yup, @Charlie’s
commit fubar’d everything.”

Unfortunately both very demoralizing and very common.

~~~
silveroriole
Demoralizing - why? That seems like an attitude problem on the part of
“Charlie”, not “Bob”. If “Charlie” is going to slink off with his tail between
his legs every time he makes a mistake, he’ll have a tough time of it - it’s
not like everyone can’t SEE that he broke it through version control anyway!

I just don’t really get it. Even when I was a junior, if I overheard “this
thing is broken,” I was the first to pop up and say “oh, I bet that was me,
let me have a look.”

~~~
ComputerGuru
I’m with you 100% except you’re not taking into account what I said about this
being in public and Alice not being a part of the project. Internally
assigning blame isn’t the issue, it’s about the “team” facade being shattered
when dealing with the outside. If you’ve accepted Charlie into the
organization then from without it isn’t about Charlie or Bob, the answer
should be “yes, we’re aware; a recent commit broke that functionality and
we’re working on fixing it.” I’m not even talking about a dev mailing list or
GitHub PR discussion, I’m talking about the specific case of badmouthing a
developer to an enduser.

Imagine if Apple came out and said “yeah, that blank root password bug, it was
all because of John Smith and his crap patch that caused this.”

Outsiders don’t have the same perspective as insiders. If Charlie’s commit
message read “implementing the really difficult thing we talked about,” the
team might be aware of mitigating factors that Alice won’t. But even without
those mitigating factors, all you’ve done is badmouth your own devs to the
public. Additionally, you are not considering whether Charlie is an otherwise
stellar developer that has never had a bad patch before. Alice may incorrectly
presume that the only reason he’s being called out is because this is a habit
of his, perhaps.

~~~
lev99
I often compare open source projects based on what's visible on the github
page.

Drama around a volunteer team in the open is a bad smell.

Edit: I'd like to explain why.

* Open source projects with lots of drama often don't attract new talented developers, and if talent happens to depend on that codebase they are more likely to fork and start a new community, or fork and not submit pull requests.

* If I need to interact with the team for pull requests or bug/support tickets I'd like to feel assured we can do so respectfully and professionally.

* If a community has drama in it I am less likely to recommend the software to a friend or blog about the software because I won't want to be associated with it. I'm more likely to stop using it and switch to a different solution.

------
userbinator
_I had to then tell them that this person still worked there._

The old IBM story is worth mentioning in relation to this:
[http://www.mbiconcepts.com/watson-sr-and-thoughtful-mistakes...](http://www.mbiconcepts.com/watson-sr-and-thoughtful-mistakes.html)

------
kosei
When someone makes a mistake, that's an incredible investment in them. I'm
always surprised* when people try to throw it away by firing them or making
them want to quit. Help them learn from it and apply that knowledge moving
forward. Otherwise they're just taking that knowledge and using it to help
another company.

*Obviously with the caveat that some people are repeat offenders who are careless or just not good employees

~~~
lev99
In other professions some mistakes cost the professional real money (doctor
malpractice) or cause them to lose their license (drinking and driving with a
commercial vehicle license).

As an industry we don't have a response to a truly neglectful mistake yet.

------
ashleyn
Reminds me of when someone ran "rm -rf /" at Pixar and deleted all of Toy
Story 2.

The backups were crap and the only reason it survived was because someone took
a server to work from home.

When all was said and done, they never really found who did it, they just made
organisational changes to ensure it didn't happen again. No blame game.

~~~
andrewmcwatters
When I worked with my first non-remote team in Phoenix, I basically did this
to our mobile app codebase with an in-house git repository due to some faulty
rsync changes to a grunt task.

To the old NPL team, sorry about that. Culture is important.

------
partycoder
If in soccer the opposing team scores, who is to blame? The goalkeeper? The
defense? The coach? The whole team? The referee? Nobody?

Preventing goals means that the strategy needs to ensure good ball possession,
and staying on the offense, to reduce the burden on the defense, to reduce the
burden on the goalkeeper, who is the last line of defense.

If the last line of defense fails that's not an individual failure but a team
failure, coach included, since the coach selects who gets to play, when and
their roles.

Same in software: bad management passes the burden to developers, bad
development passes the burden to testers, bad testing passes the burden to
release management.

~~~
partycoder
Now, there are cases when everyone knows what to do, steps are taken so
everyone is informed of it, but someone still decides to go against it. In
that case the individual is at fault.

------
zer00eyz
It's not about what's broken, it's about what you DO when it is broken.

This is my favorite interview question to ask candidates:

"What is your all-time biggest screw-up, and how did you come back from it?" -
I then tell them the story of me losing several hundred thousand dollars and
the funny things that happened around it to set the tone. If you have been in
tech for any length of time you have one of these stories (if not a few). I
have heard some great ones by simply asking and it gives great insight into a
candidate (humor, stress response, the things you have seen).

------
dancek
I think this is an important piece of organization culture. If the first
reaction to problems is blame and punishment, issues are covered up. But if
finding bugs and fixing them is considered valuable, there will be fewer issues
in the long run.

Of course I write enough stupid bugs myself that I'm bound to think this way.

~~~
tzhenghao
This is so true. Providing the incentive to squash bugs rather than punishing people
for making them is the driving force for innovation in a team. Take that away,
and you get a toxic culture where everybody starts finger pointing when an
issue arises.

------
PeterStuer
I found this to be the touchstone of spotting a dysfunctional enterprise.
There it is all about the 'who', never about the fix. In those environments
every new project is CYA from day 1. The disconnect between daily activities
and the success of the company is so large, that all actions and projects are
just about personal politics. A failure that can be blamed on the right target
is often even a preferred outcome as eliminating a competitor for a promotion
is even better than not having failed. If you find yourself in such an
environment, try to leave asap.

------
silveroriole
Sure, if you have a huge company and a revolving door, the solution is a bunch
of processes and idiot-proof safety nets, and no one person is to blame for
most bugs. If you’re in a small company, the solution is to teach the devs by
showing them what mistakes they made. I don’t think that’s a bad thing; if you
write code, that code is your responsibility, and you shouldn’t be sensitive
about people telling you your code is broken.

Also, focusing on the code itself, for me at least, easily leads to thoughts
like “this function is crap! What idiot wrote this!?”. Finding out who broke
it leads to thoughts like “I see John introduced this buggy function. I should
go check with him, maybe he had a good reason.”

------
gjvc
Mishaps occur on a spectrum, and may be categorised from mistakes,
carelessness, recklessness, through to malicious intent, and any combination
of the above all along said spectrum.

Though these categories may seem like they are orientated on individuals'
actions, they may be used to determine where the risk lies in systems (and
people's use thereof) and how measures can be taken to avoid the same problems
being repeated.

Much of the time, the complexity of systems (using the term in the widest
possible sense) is under-estimated, and automated integrity checks are not
used as religiously as they may be.

------
red_admiral
I'm 90% in agreement. Her workplace definitely sounds like somewhere I'd
consider working myself (if I were looking for a job).

There are some things that I consider basic competence standards, like not
storing passwords in plain text in any system you're building. I wouldn't fire
an intern for getting that wrong but I also wouldn't let an intern near a
production authentication system without some oversight.

If someone is a security engineer with a responsibility to know these kinds of
things as part of their job role and certification, then if they'd implemented
passwords-in-clear to cut corners somewhere, even if it's to meet a really
important deadline, I'd be extremely unhappy. Of course I'd establish the
general pattern of what had gone wrong first, and if it was a superior being
abusive to the security engineer to get the product launched on time I'd still
be really unhappy but not at the engineer.

Occasionally one does follow the chain of causes back though and finds not the
organisation's culture but an individual who really should have known better.

~~~
rachelbythebay
If you can go back in time, join me in 2013 and you can enjoy the ride for a
few years, too. I'm sorry to say that I don't think you'll get the same
experience in 2018.

------
jancsika
The answer requires context, at least for FLOSS projects.

If unlucky dev #13 broke something because humans can no longer reason about
the relevant part of the system, then it doesn't matter that #13 was the one
who broke something. What really matters is that people get busy removing the
sandtraps from their software.

However, many FLOSS projects run on the sheer joy and freedom that comes with
maintaining a particular subsystem or area of the code. Most devs have a quick
understanding of the responsibilities associated with that. But in cases where
that responsibility doesn't come naturally, _who_ broke becomes the focus.
Addressing that issue will determine whether or not future breakages occur.

------
koliber
It isn't about who broke it. But if there is a person on the team who
continually breaks things, does not learn from their mistakes and repeats
them, or is not truthful when they break things, the team should react
appropriately.

------
hennsen
It’s also about how it broke. And who broke it is sometimes the person who can
say a lot if not most about that. Therefore I don’t recommend teaching people
never to talk about the person who took an action that led to a disaster, but
rather encouraging a culture where admitting having taken a wrong step doesn’t
lead to punishment, neither financial nor social. Who broke it is an important
part of the analysis, helping the organization to learn from each other’s
errors. Making it a taboo talking about it is missing a chance for
development...

------
pronoiac
Ooh, this is good. Part of it's covered under the name of "blameless post-
mortems," but I don't remember searching for similar breakage, which is a
great idea.

------
iramiller
This seems like a classic case of applying the Five Whys
[[https://en.m.wikipedia.org/wiki/5_Whys](https://en.m.wikipedia.org/wiki/5_Whys)]
methodology for root cause analysis.

------
drdeadringer
I don't see how this is not "better mousetrap, better mouse". Phrases from
"they build a better fool" to "they build a better US Navy crewman" are a
hundred a penny, and yes I've experienced the other side of this.

The best programmer vs the worst user, and every mix in between, shall produce
situations needing attention this article addresses.

I've been in this situation on both sides. "Of course it should be clear what
this phrase means, how could they fuck this up?" ... and ... "I have no idea
what this means, both choices could mean what I want but either choice ends me
up on the wrong page of this bullshit 'choose my own adventure' that I'll have
to repeat if I'm wrong".

I'm interested in finding out if I'm understanding this wrong, and/or other
thoughts.

------
gowld
The SRE Book teaches a lot of the lessons that this blog teaches.
[https://landing.google.com/sre/book.html](https://landing.google.com/sre/book.html)

------
donttrack
I totally agree. It's usually the hallmark of a good team, if they have the "we
are in this together" attitude.

------
lkrubner
There is the risk of conflating two separate types of problem. There are
problems that arise from the complexity of the code, and problems that arise
from particular people.

If a programmer has a habit of sloppy code, or violates the team's standards
in some ways, then a good leader will keep track of the fact that one person
is responsible for a recurring pattern of mistakes.

I absolutely agree with Rachel By The Bay, that many bugs arise from the
complexity of the situation, and it would be wrong to blame the person who
just happens to trip over that bug. But a good leader should take action
against anyone who repeatedly screws up, and who seems unwilling to improve.

I've written about this before. This is from "How To Destroy A Tech Startup In
Three Easy Steps":

\----------------------

Wednesday, July 15th, 2015

I got to work at 11:00 a.m. John announced that our demo had stopped working.
Sipping my coffee, I logged into the server to find out what the problem was.
I looked at the error log for the API app, but it seemed okay. Then I checked
the error log for the NLP app.

    java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1955)
    at Celolot.nlp.Extractor.fuckBitchesGetMoney.java:87

What the hell was this?

“FuckBitchesGetMoney”?

What kind of name is that for a function?

A computer programmer can name their functions anything, but there are some
“best practices” regarding names, and this particular function name violated
all of them.

I asked Sital why he had given this name to his function. He looked at me
straight, shrugged, and stated that the name was from the 1995 song by The
Notorious B.I.G., “Get Money.” I replied that rap lyrics were not part of our
naming conventions. He promised that he would change it.

Coming from anyone else, I might have interpreted the function name as an act
of angry rebellion, but Sital was too forthright for that. Apparently, he
thought the name was funny and went with it because he wanted to add some
humor to his code. Never did he stop to think it might be unprofessional.

I looked through his code and found several other functions that had
inappropriate names. I sent him a list and asked him to change their names to
something standard.

A week later the function was still there. FuckBitchesGetMoney. Yet I don’t
think that any of this was a deliberate act of rebellion. He was just oddly
forgetful and disorganized.

[https://www.amazon.com/Destroy-Tech-Startup-Easy-Steps/dp/09...](https://www.amazon.com/Destroy-Tech-Startup-Easy-Steps/dp/0998997617/)

~~~
itronitron
if the function was still there, I think it is also likely that the old jar or
class file (with the function) was still lurking in the classpath or your
version control and build system weren't using his revision

~~~
lkrubner
The point is, he failed to make any revisions. He was oddly disorganized. Even
with quite a bit of coaching, he was unable to do what we needed.

------
teddyh
What’s that old saying; “ _Fix the problem, not the blame_ ”?

------
nstj
I like this site and hadn't really read much from it - it's interesting how
much it's been front paged over the last couple of weeks:
[https://news.ycombinator.com/from?site=rachelbythebay.com](https://news.ycombinator.com/from?site=rachelbythebay.com)

~~~
krallja
Rachel is an excellent writer who was on a long break from writing. Seems like
HN is happy to read her posts again.

~~~
rachelbythebay
Thanks! I was working a "real job" from about mid 2013 and am no longer, so my
cycles are now all mine again. I was too tired to write most of the time
before.

Also, there are many more stories to be told now!

~~~
pnathan
I am really looking forward to reading the new stories - I bought your
collection of stories too. :)

------
BrissyCoder
I don't know. Where I work no discernible pattern can be found with the "what"
that broke.

It's always the same f***ing people that break it though!

~~~
pbhjpbhj
It amuses me that the sibling comments appear unable to imagine the
possibility that someone is incompetent.

Of course there are other possibilities - the people breaking things are doing
the hard bits that no one else dares to.

~~~
jspash
But wouldn't that imply that the "daring, thing-breaking" people are actually
incompetent to some degree? Otherwise they would mitigate the risk before
performing any dangerous operations on a live system.

"Bravado is no excuse for lack of preparation." - Leeroy Jenkins

------
erikb
It makes sense from a logical perspective, but in practice that's not how it
works.

In reality if something breaks, and you are stupid enough to mention it, then
(a) you are considered an a-hole for blaming <responsible-person-for-topic>
even if you didn't and (b) responsible for fixing it.

So your main job is to somehow make your stuff work despite all the other stuff
that doesn't work and all the other people that try to stop you, silently. The
less you criticize the better. What you get in return is that if you fuck up,
people will try to avoid blaming you as well. Also if you don't succeed at
making anything happen you get a little arrogant smile from your manager and a
mediocre feedback round. But otherwise nothing happens.

The only change to that pattern happens when you piss off your manager or your
manager's manager. Then suddenly each and every activity you do will be
scrutinized and if there's a problem it will be used against you. The best
hope they have is that you go away by yourself.

~~~
al2o3cr
"The best hope they have is that you go away by yourself."

I'd recommend you satisfy their hope maximally by running the hell away from
that dumpster fire of bullshit office politics.

