
I've screwed up plenty of things too - Lammy
https://rachelbythebay.com/w/2020/02/10/broken/
======
duxup
One of the first things my boss told me at my first "real" (good career type
money) job was:

"Everyone screws up, you will too, just be honest about it and tell me and it
will be good."

It was a job working with this memory mapped wonky hardware tied to
mainframes. There was no "undo" and as soon as you wrote to memory there it
was. It was inevitable that you would typo something sometime in an important
system.

Finally, 3 years later, a buddy is talking to me over the cube wall: "Hey, was
that #3 you were working on yesterday?" I of course am typing away while
talking and say out loud, "3? Um..."

So I type something like -RESET SYSTEM 3-

I meant to type 9, a non-critical system. 3 was tied to a data replication
system that absolutely had to be running, otherwise all transactions would
stop (well, for a bit, until backups took over).

So if you couldn't use an ATM for a huge bank for a little while decades ago
(fortunately it was the middle of the night), that was me ;)

I went to my boss the next morning and told him what I did, and he says "This
is like your first in 3 years, that's a record or something, it's usually a
few within like 6 months. Nice job!"

It was a great place to work, no finger pointing, if you screwed up no big
deal, everyone stuck around working with that team for decades.

When there were conference calls it was rarely stated (if ever) who actually
did the thing. It was just accepted that it happened and we could discuss how
to prevent it and such. "The engineer" or "the support team" and such were
common phrases.

Inevitably folks would ask "who was it" and the answer usually was something
like "it doesn't matter".

~~~
praptak
This approach has been adopted by the SRE movement under the "blameless
postmortem culture" alias:
[https://landing.google.com/sre/sre-book/chapters/postmortem-culture/](https://landing.google.com/sre/sre-book/chapters/postmortem-culture/)

~~~
twoquestions
How do they convince people not to use failures as political capital against
other teams? It's all well and good to say "We promise we won't do it!", it's
quite another to _actually_ not call out the other team's failures when
they're competing with you for budget/headcount, or against other individuals
when stack-ranking.

~~~
suchire
You do it by management not making things zero sum, and by
correcting/penalizing people for blaming behavior (e.g. saying a root cause is
that a team/person made a mistake, vs a system not preventing or mitigating
the mistake)

------
ChuckMcM
I find the premise[1] of this post amusing. The sort of comment "You always
criticize others, but what about you?" seems to nearly always signal emotional
damage on the part of the commentator who felt blamed for something. And that
is nearly always a sign of bad management practice.

People don't always get things right. They screw up; they do stupid things for
good reasons, and sometimes good things for stupid reasons. As a manager I
always want folks to be observant and thoughtful, and I try to keep such
discussions not about "who" screwed up but about how that screw-up came to be
(the good or stupid reasons), and how one might have thought about the action
ahead of time in a way that would have flagged the potential problem.

And the key of all that is making the discussion about how to think so you
don't have the problem in the first place, rather than making it a blame-fest
on some hapless engineer who chose poorly.

I was fortunate to have a manager early in my career who was very proactive at
solving problems and moving forward, not affixing blame. He would say
"Ignorance is the natural state before learning, only if it persists in the
presence of learning opportunities does it become a problem."

I've always tried to learn by what I observe and what I do, which is why I
enjoy Rachel's stories of finding root causes. They teach the principles that
needed to be understood prior to the action. All without experiencing the
feeling of dread that you've just taken production off line :-).

[1] That being: commenters feeling bad that the author doesn't seem to show
their own flaws in the stories.

~~~
twoquestions
That's something I gotta ask on my way to my next job. My company is extremely
conservative in every meaning of the word, and is not unlike patio11's
description of Japanese megacorps in mannerisms.

"What happened the last time production went down?" should produce a quite
illuminating answer. Do they go through a detailed root-cause analysis? Do
they answer with marketing-speak meant for a legally minimal disclosure? Do
they blame "that moron" whom I'm meant to replace?

As it stands here, the official corporate policy is everything happens
perfectly until someone who shouldn't be there messes things up, and the
problem is best solved with a public and angry firing letter. Quoth our
business partner: "We're allowed to change our minds, but you're not allowed
to be in error. Even if we give you bad data, you're expected to infer proper
data and give us proper output. BTW we're not paying for testing"

------
ed_blackburn
Here is a screw-up/near miss share. I was migrating a pensions service to a
cloud vendor. Part of this involved a very large ETL. Practice makes perfect,
so we ran the custom process regularly to ensure it'd go smoothly on the big
day. The last time we ran the practice, I somehow got my prod creds mixed up
and started restoring a week-old backup over the top of production. Thankfully
the first part of the process is a disk check; I realised my mistake and
cancelled the job before any destructive actions happened. I was minutes away
from destroying the pensions records of two FTSE 100 businesses.

Everyone makes mistakes! It's how we learn :-)

~~~
notimetorelax
Did you make any changes to the setup after this near miss to avoid doing this
again?

~~~
ed_blackburn
Yes. I reconfigured the creds and grants so it wasn't possible to repeat the
mistake. The lesson learned was about isolation and diligence.

------
jsolson
I was once (partially) responsible for the deaths of dozens of virtual
machines at a distance of about three and a half years.

Fun fact: none of these VMs had rebooted in that time, or they wouldn't have
crashed.

Anyway, back in 2014 or so I dropped a bunch of transmit packet completions.
In most cases I also double completed packets which was immediately fatal.
Kernels get mad about that sort of thing.

Turns out, not all of the affected VMs died. Some of them lived on with head
indices forever unequal to tail indices (until they rebooted).

In 2018 a developer realized there was a potential bug in waiting for VMs
entering a quiescent state -- a truly idle networking stack had retired all Tx
packets that it had admitted. Having unequal indices was impossible under
correct operating conditions. They fixed the glitch.

This change rolled out gradually.

Gradually, the kernel panics appeared.

The change rolled back, halting the impact, but then the analysis began. What
had we broken?

Another fun fact: Linux often includes an uptime in dmesg logs.

Slowly a pattern appeared. The dmesg logs included unusually large numbers for
uptimes. Plotting these, there was a clear cliff in terms of a minimum uptime.
Historical deployment logs showed a noteworthy release at that date, years
past. Noteworthy in that it was rolled back for my bug, years prior.

On the plus side, I realized this was almost certainly my years prior fuckup
slightly sooner than anyone else, so at least I got to call myself out :)

------
foobarbecue
I accidentally recursively chown'ed / to myself on a server in Antarctica that
was a critical gateway for our geophysical network, the night before we were
planned to leave the ice. Luckily, my flight was delayed... I spent the next
day wiping the server and setting it up from scratch, after giving up on
trying to recover from all the bizarre problems that stem from owning all the
system files (including broken ssh). You should try it sometime!

~~~
Balgair
> You should try it sometime!

An overnight mission at McMurdo or a server reset without internet access on
another device?

I'll take the overnight.

~~~
bostik
> _An overnight mission at McMurdo_

Depending on the time of year, perhaps. What's the longest possible time from
sunset to sunrise at those latitudes?

~~~
Balgair
179 'days' (4296 hours total). Missions vary in length though.

------
Insanity
Screwing up things is normal. One thing I started doing is that when a junior
member of the team "screws up", I'd laugh it off and tell them about a major
screw up of mine.

The thing is that often my screw-up as a junior was worse (short version:
broke a key part of the 'boot' system, was detected friday evening, and we had
a major scheduled release on monday morning) and it just puts people at ease.
I'll tell it in a humorous way as well. It's important (I think) that they
don't feel bad about it.

I've often had colleagues join in on the conversation as well. We're human,
we'll make mistakes, no need to stress out over it.

EDIT: added a bit more explanation of _why_ I do so.

~~~
foobarbecue
Great, sounds like you are fostering a culture where mistakes are shared
rather than hidden.

------
sulam
When I worked at Twitter we wanted to find out if we were finally going to
survive New Years. Our head of SRE wanted me to test in prod, which horrified
me but he convinced me in the end. In order to simulate the population of
Japan I had to make a bunch of fake users. I spent a fair amount of time
making sure they wouldn’t get caught up in any analytics, throwing off our
active user numbers, and I managed to peg the ‘follows’ service getting them
all to follow each other in a reasonable distribution. I also needed to bypass
the rate limiter, but since I was in prod I could just reset my own counter in
prod and effectively be totally limitless.

Two things broke in a visible way during all of this. During testing
everything was wired up to my personal account. I managed to spam all my
followers with thousands of happy new year tweets in a couple seconds since I
wasn’t subject to the rate limiter. I deleted all but one of those, which I
left to remind myself that with great power comes great stories of things
going wrong.

The other thing was a bit more dramatic, albeit short-lived. The first big
test had everyone ready to go. I hit enter on the job, and at the time (maybe
still) I had no way to get metrics out of production at a granularity less
than one minute. A very worried minute goes by, and then we realize I’ve
DDOSed the authentication service. All my fake accounts needed to auth to
actually tweet, and naturally they did that first. Since the whole point of
the test was for load to all hit in roughly the same second, the auth load
also all arrived in the same second. Oops.

We decided that was an unfair test, I spent a few hours getting auth tokens
for my fake users, and we tried again. That time everything worked, and we
also survived New Year’s... But it was fun getting there.

~~~
john-trammell
The auth DDOS sounds familiar. Was there a blog post or previous HN about
this?

~~~
sulam
Not from me, and I don't remember one. Honestly while it did cause a brief
problem, it didn't seem particularly noteworthy to the public.

------
jerkstate
I once accidentally deleted a large part of a production private nntp spool
(used as a tech support forum for a commercial product) while trying to get
replication working to a backup. The same day, I did a recursive chown on a
different production server from /. Worse yet, I was so distraught that I left
the office without telling anyone in a position to fix it. Since then (over 20
years ago now) I always double check where and who I am before doing something
destructive and dry run if possible, but more importantly, clearly and quickly
communicate when I screw up.

I ask junior folks what their biggest (technical) screw up has been, in
interviews. I think it's a bad sign if they won't admit to it or claim they've
never screwed up big-time.

~~~
throwaway1777
Hopefully you ask this to senior folks also. Folks without much experience or
responsibility may have not screwed up too badly yet, but I agree it would be
a bad sign for anyone with significant experience to have never broken
something.

------
astannard
I ran a SQL script that migrated a database. It all looked as if it had worked
perfectly, but the category IDs had changed. The main website handled this
fine, but I found out that a separate system that sent out daily offers by
email proudly advertised Domestos Bleach as the drink of the day.

------
zeequreshi
Well as an idiot who screws up a lot, I can tell you it is a blessing in
disguise.

Screw-ups lead to uncertainty, and research suggests we learn best in
uncertainty:
[https://www.aau.edu/research-scholarship/featured-research-topics/uncertainty-helps-us-learn-study-shows](https://www.aau.edu/research-scholarship/featured-research-topics/uncertainty-helps-us-learn-study-shows)

------
dws
Oh god. I did the binary tree thing, too, and also in C. We needed a
symbol table for a thing, and I assumed symbols would come in in random order.
Oops. Someone suggested AVL trees. The reference I used (which may have been
Knuth) left delete "as an exercise for the reader." That led to my next big
oops: Pondering how to delete from an AVL tree while slicing onions for
dinner. Lots of blood. I still have the scar.

------
loafoe
"ifconfig eth0 down" on the production bastion host, instead of on my
localhost terminal -- and no hands on in the datacenter which was 160km away.
Of course the bastion host was the only one not hooked up to remote power
reset services.. and only 2 hours left in the service window.. sinking feeling

~~~
throwaway8941
For tmux users: put something like this in your .tmux.conf on production
servers:

    set -g window-style 'fg=red,bg=black'

It will color the text red, hopefully reminding you to be extra careful.
Adjust according to your preferences.

One other "defensive scripting" trick I frequently use is starting any `rm`
command with `ls`, double checking its output (or triple checking if it's a
recursive one), and then replacing `ls` with `rm`. It barely takes any extra
time if you're proficient with emacs-style readline hotkeys:

    C-a M-d rm C-m
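
The same habit can also be wrapped into a tiny helper. A minimal sketch of the
idea; the `saferm` name and prompt wording are mine, not the commenter's:

```shell
# Preview the targets with ls, then ask before handing the exact
# same arguments to rm. The -- guards against filenames that
# start with a dash.
saferm() {
    ls -ld -- "$@" || return 1
    printf 'really rm -rf %s? [y/N] ' "$*"
    read -r ans
    [ "$ans" = y ] && rm -rf -- "$@"
}
```

Either way, the point is the same: put the candidate list in front of your
eyes before anything destructive runs.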

~~~
gryfft
In this vein, I set my PS1 to bold red capital letters on bastion hosts and
alias sudo='echo "You are on a jump box, moron :p"'

I do the tmux color trick too-- color coded by environment for each bastion.
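
A sketch of what that can look like in a bastion's shell rc; the prompt text
and alias message here are illustrative, not gryfft's exact setup:

```shell
# Bold red prompt so a production bastion is unmistakable.
export PS1='\[\e[1;31m\]PROD BASTION \u@\h:\w\$ \[\e[0m\]'

# Neutered sudo as a speed bump: real admin work should happen
# on the target hosts, not the jump box.
alias sudo='echo "You are on a jump box :p"'
```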

------
giu
I once ran a script on production to re-push some old data for a customer
based on log entries. This script used the log timestamps to decide which data
to re-push. Didn't realize that the timestamps in the log files were UTC, and
I just ran it with the default timezone provided by the library (which is the
one the host system uses). Lucky for me, the system's default timezone was
also UTC, but nonetheless, the moment I realized it and the 10 minutes it took
me to read the documentation and to check the host system's timezone felt like
hours.

You live and you learn, I'd say :)

------
vikingcaffiene
Early versions of the Phoenix framework's ORM would select every record if you
didn't pass it an ID. I didn't know that and wrote a deletion endpoint,
forgetting to put said ID in. Tests passed (I mean, it did delete...) and off
to prod it went. Long story short: I deleted data for all our users. Thank God
for backups.

~~~
x0x0
I've seen this in 2019! One of our api partners has a deletion endpoint. It's

    .../delete/:id

If you don't pass an id... it deletes all records. Because _that_ is a thing
you would want, rather than a bug where you somehow got a null id.

~~~
andrewflnr
Yikes. So much for failing fast.

~~~
Volundr
I dunno. I bet a lot of things failed very fast.

------
packetslave
I managed to send the entire Google datacenter backbone through one 20gb link
in Finland. This did not spark joy.

Search SRE had a 5lb bag of shredded money as a "gag gift" that was given to
whoever caused the most recent outage that impacted search ads.

------
chandra381
I used to work in Market Research R&D, in a non-technical role as a project
manager. I was deployed to a project that tried to use affective computing
(i.e. emotion recognition) to understand consumer responses to advertisements.
It was a total disaster.

We'd hook respondents up to a webcam and record their facial expression as
they would watch a series of videos. The vendor's emotion recognition machine
learning software would then basically assign scores saying that at this
second, the viewer expressed xyz emotion.

The project failed for 2 reasons. One was that the theoretical link between
what expressions people were presenting and their actual emotions toward a
particular piece of media was not fully proven - which meant the model output
was not particularly helpful from the beginning.

Secondly, and this is really important - the model was trained on images of
western faces (i.e. white people). Because our target audience - southeast
asians - emote very differently, a substantial chunk of the output data had to
be trashed (it couldn't process darker faces well, it interpreted a grimace as
a smile, etc.)

So there you have it - this was something I should at least have anticipated,
and I got in a lot of trouble.

------
Jaruzel
My biggest screw up ever:

[http://www.jaruzel.com/blog/dont-screw-up-a-vaxcluster-tale](http://www.jaruzel.com/blog/dont-screw-up-a-vaxcluster-tale)

------
sethammons
"I'm an expert because I've made all the mistakes you can in a narrow field."

~~~
hinkley
Are you Niels Bohr?

------
wolfspider
I've accidentally taken out a television station before. I was remoted in,
being walked through a new platform, and clicked on the wrong thing while
exploring. All of a sudden MeTV in the Dallas/Fort Worth area went down in the
middle of the afternoon and a LOT of very angry people began calling in, but
we had it running again moments later. If you're doing a demo of a live
enterprise solution -- probably shouldn't click around to see what happens ;)

~~~
Thaliana
The way I learned to write the WHERE clause of a SQL UPDATE statement first
was by updating an entire column of a very important table in a TV station's
automation software database.

I also took CNBC off air briefly, although that was their man's fault, as he
told me to unplug the wrong video server.
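
One belt-and-braces version of that lesson, sketched in SQL. Table and column
names here are illustrative, and the exact transaction and row-count syntax
varies by database:

```sql
BEGIN;

-- Write and test the WHERE clause on a harmless SELECT first:
SELECT COUNT(*) FROM playlist_events WHERE event_id = 42;  -- expect 1

-- Only then attach the same WHERE clause to the UPDATE:
UPDATE playlist_events SET start_time = '18:30:00' WHERE event_id = 42;

-- If the reported row count isn't what you expected, ROLLBACK instead.
COMMIT;
```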

------
jake_morrison
One of my favorite job interview questions for sysadmins is asking about a
time that they screwed up and broke production. If they don't have one, then
it makes me nervous. Either they are lying, or they don't have enough
experience, or they will be too conservative and will block all progress.

------
CapmCrackaWaka
I once forgot that a vendor added their own adjustment to a bidding algorithm
we had in place. It was significant for certain regions. I created a bidding
model without taking the adjustment into account, pushed it to production, and
spent ~30kUSD extra in a few hours before anyone noticed the unusually high
bids coming from our vendor. We put controls in place to prevent this
afterwards ;)

------
C1sc0cat
One of my screw-ups was when I was SYSAD (head admin) for the Prime 550 at
the UK office of a large consulting engineers.

We had our field engineer in doing a PM, and he needed a scratch disk. I said
"oh, you can use xxxx" and pointed at the sticky label which had all the disk
IDs on it.

Turns out that someone had been using it for a big GIS project in Amman, and
we ended up wiping 6 months' work.

~~~
flyinglizard
Oh no. Did anyone manage to salvage any of that?

~~~
C1sc0cat
Ah well, we had some maps printed out, so we could redo it from those without
having to fully redo all the work.

------
Lex-2008
> I now put a mollyguard over those things any time there's any chance of them
> being exposed and having unscheduled activations.

that's what differentiates a good engineer from a not-so-good one - they learn
from their own mistakes!

~~~
C1sc0cat
I recall that at my first job, one of our computer rooms had the emergency
stop button on the wall - just at head height.

One time I or my boss (I can't recall who) stepped backwards and hit the off
button with his head - we had our electrician fit a molly guard after that.

~~~
packetslave
Several jobs ago, we had a datacenter where the PDUs for the racks were
mounted at the very top. Normally, this wasn't an issue as this was well above
head height...

One day we hired an engineer who was a Sikh. Turns out the PDUs were almost
exactly at turban (dastar) height. Cue the outage alerts (and the installation
of mollyguards).

------
floatingatoll
I once removed a single unused DNS record, which resulted in a 10gbit/s DDoS
consisting of lookups for “a”, “aab”, “aabaaaa”, “aabaaac”, etc. (Hint: Perl.)

~~~
rpm91
My perl is super rusty, but you've got me curious...what happened?

------
caseysoftware
One of my favorite questions to ask technical candidates is:

 _Tell me about a time you made a mistake that you thought was going to get
you fired._

1. Everyone has one. If you don't, you haven't been doing this long enough
and I want you to make a couple of those mistakes elsewhere first.

2. If you didn't learn anything from it, you're going to make that and bigger
mistakes in your hubris. I'd rather you do that elsewhere.

~~~
Humdeee
The younger you are when you make the mistake, generally the more fearful you
are of its result. That same mistake might get a "well, that sucks" from
someone who's been around the block.

I've been in tech for 12 years and I've never made a mistake disastrous enough
to be fearful for my job. The worst cost ~$20k in hardware (a couple of server
CPUs). I told my manager right after, without hesitation (this was also at a
startup).

I would not stress that much anyway now if it were to occur. Having been
through mass layoffs from startups twice before, you change and become
hardier. I will be careful but will never be fearful of employment. Short of
doing a Desk Pop[0], I'm falling asleep every night with both eyes closed.
Life is too short as is. Let me go and I will spend my next morning on a
nearby beach with a good book.

[0]
[https://www.urbandictionary.com/define.php?term=Desk%20Pop](https://www.urbandictionary.com/define.php?term=Desk%20Pop)

~~~
Nextgrid
Just curious, how did you destroy the CPUs? Was it physical maintenance or did
you set the wrong voltage/etc (usually this happens during overclocking in
consumer-grade machines which is why I’m curious how this happens on server
machines).

------
mdholloway
A little over a month into my first engineering job, I decided to go for a
weekend stroll. I threw my work laptop into my messenger bag on the off chance
I'd wind up in a cafe and feel like checking email or poking at some code.
Started the day right with a hearty breakfast burrito, popped into the health
food store down the street to pick up a couple bottles of local kombucha for
later (gotta have those probiotics), and off I went.

After two or three hours of exploring, I noticed something weird: it was a
sunny LA afternoon, but I felt something like a drop of liquid hit the back of
my leg. I kept walking, but felt another drop, so I stopped and checked. Yep,
definitely real and definitely liquid. Also, it smelled like vinegar. Where
was it coming from? Who would do such a thing, and how?

Perplexed, I walked on, until my bag started emitting a drawn-out Mac startup
tone, and I realized just what I'd done. I opened it up, and sure enough: the
seal on one of my kombucha bottles had failed, and its entire contents had
emptied into my new work laptop.

------
froindt
Every couple years a "failure resume" gets trending on LinkedIn or reddit, and
I always love reading the comments.

It's also a refreshing reminder that "just because someone is successful and
has a great resume doesn't mean they're flawless".

Resumes and LinkedIn profiles are like Instagram posts - enhanced to bring
out the best aspects, with enough photoshop/makeup to hide the worst.

------
scottlamb
I once wrote a new version of a config generator and pusher for a small part
of a major service. I knew data pushes were the largest global outage vector
at my company, so I wrote carefully conservative validation logic and unit
tested it. But I never tested what the caller did when the validation failed,
and I had a dumb mistake there. It pushed an empty file, which was worse than
pushing the allegedly-invalid config. Oops. That was a ~30 minute outage of
the aspect of the service controlled by this config.

Of course an outage is never caused by one mistake. That mistake was mine, so
I felt badly about it. There were also mistakes in code reviews, validation in
the part receiving the config, and operational procedures. And then the big
one: the company as a whole was in this awkward phase where everyone knew
quick global pushes were bad but there wasn't good common tooling to support
doing staged config files easily. That was the worst mistake behind dozens if
not hundreds of major outages.

------
jjeaff
I was once working on a live MRP server after hours. It was needed to do
everything from customer service to shipping to tracking work in progress. So
if it goes offline, they would basically have to shut down until it was back.

I needed to reboot at one point and when I did, it started giving me "boot
disk not found". I couldn't get it to boot, at all. It seemed the boot disk
was corrupted.

I was literally in a cold sweat for 2 hours, late into the night, until I
finally noticed that I had left a diskette in the drive, which was causing the
BIOS to try to boot from it first.

I have had plenty of other cases where I actually messed something up. But
that feeling you get when you think you have irreparably broken something is
so terrible.

------
RmDen
Some of mine

Was testing code and pushed a file to FTP 2 days early... vendor picked up,
processed file.. the people who signed up in the next 2 days were in the file
pushed later... but the vendor already processed the earlier file so they
didn't get their metro cards that month

Somehow managed to rebalance underlying components for a Trendpilot ETF
monthly instead of quarterly... daily audit that compares the values on NYSE
vs in our DB caught it.. lucky for me there was no money in it yet

dropped a table once at lunch time right before taking a bite of my
sandwich... did restore table within 10 minutes , didn't eat lunch that day
... lost appetite...

In ETL tool hardcoded something to test.... left it there when running for
real

------
davinic
Luckily I learned from a young age. When I was 7 or 8 I was using the computer
my dad used to run his company. It had dual 3.5" floppy drives, and a new (to
me) hard drive. Needing to format a floppy, I opened the format utility and
for some reason I thought I should choose "hard disk" (because it wasn't a
floppy?!? hmm) when prompted.

So I format the "hard disk" and for some reason my 3.5" wasn't formatted. So I
tried again and again to no avail and gave up.

The production manager came in to work Monday morning to a fresh hard drive.
Some things were backed up and some things had to be recreated.

The outcome of this necessitated learning a new skill: bypassing passwords.

------
ajdecon
One of my favorite recurring team conversations is the one where everyone
shares stories of the outages they've caused or the systems they've broken.
This conversation has happened eventually on every SRE
(sysadmin/PE/devops/whatever) team I've joined, usually when a junior team
member causes their first outage and is having an emotional meltdown. I
remember my own meltdown of that form, and I remember it helped hearing about
the terrible problems my friends and mentors had caused in their turn.

The first outage where I thought I was going to get fired: I was working on a
system that had a single-point-of-failure server, and through a mishap with
rsync I accidentally destroyed the contents of /etc. That SPOF also had no
backups. (I'm not claiming it was well-designed...) Thankfully the job that
depended on that server would not kick off until morning, so my team slowly
reconstructed its functions on a separate machine and swapped it in behind the
scenes. I helped as much as I could while vibrating with anxiety, and my team
was incredibly kind throughout. I was not in fact fired. :-)

The most recent outage I caused? Yesterday! I accidentally rebooted most of
the machines in a development cluster. It's a dev system, there's no SLA, on
the whole I don't feel horrid, but it definitely ruined a few people's work
for an hour. This morning I spent a few minutes putting in a guard rail to
prevent that particular mistake again...

If you're in this job long enough, everyone breaks things -- it just happens.

------
JshWright
Adam Savage and Matt Parker recently had a conversation that spent a lot of
time covering the topic of "screwing up" and how we should respond when we do
(Matt's new book is about math screw ups that have had real world
consequences). It's a great interview in general, in my opinion.

[https://youtu.be/ig-2xlXfex4](https://youtu.be/ig-2xlXfex4)

------
GauntletWizard
In a major production launch, we moved traffic between two versions of a
backend with a blue/green deploy. The new version was hosted on Kubernetes,
and I was pretty new to using it in production. The changeover went well,
pretty great, actually. The problem came up the first time we deployed to the
new infrastructure - We saw a huge spike of connection disconnects. We did not
get a good answer why at the time, except the vague sense that the deployment
had gone a lot faster than we intended.

The second time we deployed, I happened to glance at the deployment size
immediately after deploying. For about five seconds, our deployment size went
from 100 down to 2. The reason for this was simple: The "Replicas" count was
specified in the deployment spec, and it was set to the size we used in our
staging infra. That had been fine in prod, and was quickly overridden by our
autoscaling configuration, but it did cause the Kubernetes infrastructure to
take down every existing pod (minus two), then bring up a bunch of new pods
very quickly.

------
inopinatus
The true measure of experience is the depth and variety of our screw-ups, and
the quality of one's character is illustrated by what we take away from them.

------
Traster
One thing I really pride myself on is that because I screw up so often I have
a really good intuition for how things get screwed up.

------
mcguire
So then there was the one time I was engaged in sysadminery and had my cow-
orker sitting next to me while we were trying to debug some issue. She says,
"Hey, is there anything useful in the README file?"

I immediately typed "rm README" and hit enter.

Then I crawled under my desk and wouldn't come out until we'd gotten the file
restored from backups. Naturally, it had no useful information in it.

Then there was the time, for no readily apparent reason, when I typed "DELETE
* FROM table" (in the dev database). Fine, I thought, it's time to go home
and submit a request to get the DB restored.

It turns out that they kept one (1) day's worth of backups, which they took at
6:30pm or so. I submitted the request at about 6:00pm and the DB guy had
already gone home; he did the restore about 7:00am the next morning. Yes, he
restored an empty table.

------
scottlamb
In another comment, I pointed out a mistake of mine that was a major factor in
an outage.

I also screw up all the time in ways that would cause outages, except we have
automated tests, tsan/asan, code reviews, a staging environment, various
safety checks, experiment gates, pre-mortems, slow rollout procedures, an
alert on-duty SWE and on-call SRE, etc.

Today one of my mistakes was caught early in the prod phase of our push.
That's much later than I would like but still before it did any real damage. I
submitted the bad code last Wednesday and have been out sick with the flu (and
caring for my preschool-aged kids) since then, so my awesome team handled my
problem for me.

------
thewebcount
> I was home alone as a kid, watching some movie on TV. I saw some guy grab a
> beer can and do that thing where you jam a pen in the side to make a hole,
> and then crack open the top.

Given what I think is her age (judging from using C64s and whatnot) I'm going
to go out on a limb and guess this was "The Sure Thing" [0] with John Cusack
and Daphne Zuniga. It's a great movie if you haven't seen it.

[0]
[https://www.imdb.com/title/tt0090103/?ref_=nv_sr_srsg_0](https://www.imdb.com/title/tt0090103/?ref_=nv_sr_srsg_0)

~~~
rachelbythebay
Okay now I have to go check. Thanks for the tip!

------
bradknowles
Well, there was this one:
[https://www.theregister.co.uk/2018/04/16/who_me/](https://www.theregister.co.uk/2018/04/16/who_me/)

Then there was the time I broke e-mail for Global Network Navigator, which was
a partnership between O'Reilly and AOL. Lost all e-mail for over a million
users on what was then the first nationwide ISP. I submitted that one to The
Register as well, but they haven't published it, at least not yet.

------
_rend
Another small screw-up anecdote: I once tried symlinking a file into my home
directory, only to realize I had actually symlinked it into my current
directory as a file literally named '~'. I did the only sensible thing I could
think of, which was to run `rm -rf ~` to get rid of it... After about half a
second I realized what I had done, but by then enough of my home directory had
been wiped clean that I needed to restore from backup.

Always a fun one to share. :)
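
The footgun reproduces easily: a quoted tilde is not expanded by the shell, so it becomes a literal filename. A minimal sketch in a scratch directory (filenames hypothetical):

```shell
#!/bin/sh
set -eu
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch somefile

# A quoted tilde is NOT expanded by the shell, so this creates a symlink
# literally named '~' in the current directory, not a link in $HOME:
ln -s somefile '~'
ls -l './~'

# Safe cleanup: name the file with an explicit path prefix. Running
# 'rm -rf ~' unquoted would expand '~' to $HOME and delete that instead.
rm './~'
```

`rm -- '~'` works too; the key is making sure the shell never gets a chance to expand the tilde into your home directory.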

------
rhacker
My password update script for one site used this SQL:

UPDATE Users SET Password=?

We had backups. Selective restore. 7 accounts that were new, not in the
backup, got a special flag that required a reset.
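
The missing `WHERE` clause is easy to demonstrate with an in-memory SQLite table (schema and names hypothetical, not the poster's actual system):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (id INTEGER PRIMARY KEY, Password TEXT)")
conn.executemany("INSERT INTO Users (Password) VALUES (?)",
                 [("alpha",), ("bravo",), ("charlie",)])

# The buggy statement: no WHERE clause, so every row is rewritten.
conn.execute("UPDATE Users SET Password = ?", ("newhash",))
clobbered = conn.execute(
    "SELECT COUNT(*) FROM Users WHERE Password = 'newhash'").fetchone()[0]
print(clobbered)  # 3 -- every account now shares the reset password

# What was intended: scope the update to a single account.
conn.execute("UPDATE Users SET Password = ? WHERE id = ?", ("otherhash", 1))
scoped = conn.execute(
    "SELECT COUNT(*) FROM Users WHERE Password = 'otherhash'").fetchone()[0]
print(scoped)  # 1 -- only the targeted row changed
```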

~~~
Nextgrid
Very similar situation here although the query had a syntax error which meant
it didn’t go through, but initially I didn’t realise that and had to post the
dreadful question on Slack “do we have backups of the production DB?”.

Years later I still consider it my biggest screw up. Everything else can be
explained by bad processes, documentation, etc but this one is just me being
stupid.

------
erikerikson
I love the depiction of 'The One'. Thank you. It seems like the person you're
often asked to prove you are in interviews if you want to be employed: your
publicly available code, depended upon by serious people across the world, is
overridden by your performance in one short, high-stakes moment we've ginned
up. I've tried to do better. It's hard, and I've made some bad hiring choices.

------
reidjs
Non tech, but leadership related screw up for me. I started a surprisingly
popular motorcycle group ride. Around 20 people show up to the first ride and
I'm a bit nervous. I forget the route pretty quickly and we all get lost and
separated. One person crashed. One person got a big ticket for
speeding/improper parts. The ride back was nice though.

Overall one of the craziest days of my life.

------
skytreader
Remember when Gitlab had their famous DB incident? From that we had some sort
of an inside joke in my then-workplace. If you're gonna do something big and
potentially prod breaking just "don't be _that_ guy" (said in the same spirit
as "break a leg").

I became _that_ guy.

My then-workplace didn't always have enough funds, though as an employer they
were generally generous, especially considering their actual finances. This is
relevant to the story because this employer:

1\. was very lenient when it came to office attendance. So we frequently
worked remotely at odd hours; that was normal. But as a matter of
professionalism, I always tried to be conscientious about the hours I put in.
Most weeks I probably did more than usual, the merits of which are another
discussion entirely.

2\. periodically organized events to promote the business. But being short on
funds, they didn't have money to hire an actual photographer. So they'd ask me
to shoot because I was interested enough in photography to, at the very least,
have the gear for it.

The day I became _that_ guy, they had an event I was supposed to shoot, but
they communicated the time really badly to me. I expected to be able to do at
least three, maybe four, hours of work before I was needed with my camera.
This is what I communicated to my TL.

Turns out they needed me _earlier_, such that I only had an hour of work done
so far. Again, office culture was lenient about such things so my TL didn't
really mind if I left then. The event was some kind of a big deal besides.

I'd generally start my "hours" in the afternoon, way after lunch. So by the
time this event was done, it was already pretty late in the evening. I had my
dinner and received a message from my TL. Nonverbatim:

"Hey can you update PostgreSQL (9->10) tonight? It shouldn't take too long and
here's the steps..."

It was still within my "usual" working hours, but a couple of things that
night made this request end in disaster:

1\. I was tired from the event. Honest to goodness tired. I should've called
it off when I couldn't even entertain myself enough to stay awake waiting for
one of the given steps to finish. But I didn't because...

2\. I didn't have the heart to beg off this task when I'd only done one hour
of technical/engineering work for the day. To be fair, my TL always abided by
the rule "Don't touch prod when tired; you will make things worse". Pretty
sure he would've understood if I'd explained the state I was in. We could've
done it the next night. But when you're tired and embarrassed at having only
done one hour of work for the day so far, your decision making is
exceptionally unsound, for lack of a stronger adjective.

Unfortunately the technical bits of this story get fuzzy; it's been two years.
But back then we had just migrated to Kubernetes, and a couple of months in
the team was still adjusting their mental models from servers to
containers/deployments/statefulsets/pods, from thinking about HDD vs SSD
tradeoffs to Persistent Volume architecture issues. This is also why upgrading
Postgres was such an ad hoc process for us then. We simply didn't know better
(if something not "ad hoc" even exists).

Part of the instructions was to "delete the old data directory of Postgres"
(cue: I have read this in a postmortem before...). Because I was tired and
lazy I wrote a script so the update could go without my (much needed!)
supervision. The instructions were sound, and the deletion would've been safe
-- assuming all the steps prior to it had finished successfully. They did not,
and I did not use `set -e`. Which meant I had just deleted all the prod data
on the master. I was efficient. The realization woke me up harder than sugar
ever did.
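
The mechanism is worth spelling out: without `set -e`, a shell script keeps executing after a failed step. A minimal sketch with hypothetical stand-ins for the real migration steps (nothing here touches a real database):

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Without set -e, the shell keeps going after a failed step:
cat > "$workdir/upgrade.sh" <<'EOF'
#!/bin/sh
false                              # stand-in for a failed upgrade step
touch "$1/old-data-deleted"        # the destructive step runs anyway!
EOF
sh "$workdir/upgrade.sh" "$workdir"

# With set -e, the script aborts at the first failure, before the deletion:
cat > "$workdir/upgrade_safe.sh" <<'EOF'
#!/bin/sh
set -e
false
touch "$1/old-data-deleted-safe"
EOF
sh "$workdir/upgrade_safe.sh" "$workdir" || echo "aborted before deletion"
```

Checking for the marker files afterward shows the first script "deleted" the data despite the failure, while the `set -e` version never reached that line.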

To cut this already long story short, I at least had the sense to concede at
that point and wake up my TL with the bad news. Much like the rest of this
story, what saved me that night came in twos:

1\. I at least had the sense to put the site into maintenance mode.

2\. I used `rm -rf`, as opposed to issuing DROP statements to psql. Which
meant that my fuck-up did not replicate. So we just promoted the replica to
master and downgraded the master to replica and monitored replication.

These two together ensured no data loss. Apocalypse canceled. Everyone in the
company went to work in the morning none the wiser.

This story actually had a less fortunate sequel but that story is not for me
to tell. And besides, I've written long enough.

------
downerending
I've only made one technical screwup in my career, and a minor and fixable one
at that, but it left a deep impression.

These days, when running as root, I concentrate hard on every command, asking
myself whether this is really exactly the right thing in every respect.

When my hands start shaking, I know my mind is in the right place.

------
cstuder
What does SEV stand for?

~~~
scarejunba
SEVs are severe on-call issues at FB. They look like SEV3, SEV2, SEV1, etc.

Other places may use similar terminology but OP is at FB.

~~~
trollied
> Other places may use similar terminology but OP is at FB.

She's not worked there for nearly 2 years now:
[https://rachelbythebay.com/w/2018/03/10/free/](https://rachelbythebay.com/w/2018/03/10/free/)

~~~
packetslave
yeah, she's at Lyft, I believe

------
JshWright
Either my memory is much worse, or my level of screwing up must be much more
impressive. I feel like many of those wouldn't be notable enough to stick in
my memory years later. There's no way I'd remember some random electric shock
from decades ago...

------
dwd
On blowing the C64 fuse, there was a commonly known soft reset technique that
involved crossing two connectors on the game port. Cross the wrong ones and
you would blow the fuse.

------
juststeve
i love this blog

------
csours
I used to do plant floor support in an automotive assembly plant. Think
desktop support, but super extra. Think multiple serial single points of
failure, on a one minute metronome. Here are some things I've screwed up.

\---

A simple, early one: We used VNC for remote desktop support of line-side
production computers. One of my team leads was walking me through what was on
the screen and what was going on. I was used to right clicking on these
screens to see more of what was going on, but this one happened to be running
a script that my right click interrupted. My team lead freaked out, and the
operator on the floor freaked out and started moving the
mouse themselves and clicking everywhere. After a while they started just
doing their job again, but shortly after we got a call from a supervisor.

\---

I got a call saying tracking was off on a production conveyor. This means that
operators were getting incorrect instructions and work was being recorded on
incorrect units. I adjusted tracking to match what I was being told. All good.

I shortly got a call from the same conveyor saying tracking was off. I told
them "Yes, I just fixed it". "Well it is wrong now, it was fine a minute ago".
So I adjusted it to match what I was being told now. Who knows what that other
guy was smoking.

Right as I finished re-adjusting tracking I got another frantic, high energy,
expletive filled call saying the tracking was off.

Dear reader you may have guessed what was wrong.

Since I got multiple sets of contradictory information, I decide to go out to
the floor. This is what I see (simplified):

The footprints (FP1, FP2, FP3, etc.) and the unit numbers on them looked like
this:

    FP1 FP2 FP3 FP4 FP5 FP6 FP7 FP8
    008 007 006 ___ 005 004 003 002

You see, the first person was on the first half of the conveyor, and the
second person was on the other half. They were both correct, but neither had
the full story. There was an empty carrier in the middle of the conveyor.

\---

Last one. One of our weekly ops tasks was to verify that the 3 (three!!)
scheduling services agreed with one another and also the production schedule.
Unfortunately sometimes we got late breaking schedule changes, like running
extra time or extending or moving lunch. On a Friday night/Saturday Morning, I
got one such change. We were going to run an hour extra to make up for earlier
lost units.

I made the requested change and went back to "compiling code" etc. (Perks of
night shift)

Some time later... I get a call on one of my radios (Nextel) saying the lights
were off in the back of the shop. I say, "hmm, that's odd" and go to the
screen to turn the lights on in that area. I get a call on my other radio,
saying the lights were off in the middle of the shop. Oh sh*t. For context,
the lights were now off for over 200 pissed off people who just wanted to
finish their overnight shift and go home. I continue to press buttons to turn
lights on, hindered by the fact that the lighting controllers were on a very
very slow daisy chained serial bus. My radios continue to go off with people
urgently and excitedly informing me that the [expletive deleted] lights were
off. I also got a visit from the Plant Shift Lead (2 or 3 steps down from the
plant manager). I was pretty surprised to see her, as I was kind of wedged in
a corner with a bookcase blocking half of the entryway to my cubicle.

Anyway, I eventually got the lights turned back on. Looking at the schedule
changelog, I had successfully extended the shift, but for the wrong day. I had
done it for Saturday, as the clock was past midnight when I edited the
schedule. Oops.

\---

These were all relatively early in my career, but I think they're pretty
colorful.

------
m463
> (literal and electrical) ground

:)

------
quickthrower2
Maaan her posts always fly on HN

~~~
rcarmo
Real life stories tend to be more interesting than startup navel-gazing :)

~~~
mc3
True, although startup navel gazing is not as common as one would expect.
Current top 10:

    
    
      Swift Playgrounds for macOS (apps.apple.com)
      Judge Orders Navy to Release USS Thresher Disaster Documents (usni.org)
      Where are all the animated SVGs? (getmotion.io)
      Stage is a minimalistic 2D, cross-platform HTML5 game engine (piqnt.com)
      How the CIA used Crypto AG encryption devices to spy on countries for decades (washingtonpost.com)
      N26 will be leaving the UK (n26.com)
      The coming IP war over facts derived from books (abe-winter.github.io)
      Growing Neural Cellular Automata: A Differentiable Model of Morphogenesis (distill.pub)
      A popular self-driving car dataset is missing labels for hundreds of pedestrians (roboflow.ai)
      Investigating the Performance Overhead of C++ Exceptions (pspdfkit.com)

------
hoistbypetard
> Prepare for maximum navel-gazing!

_There_ is some truth in advertising!

I feel like so many of the posts I read and enjoy could lead with that
statement.

------
known
"Everybody is a genius. But if you judge a fish by its ability to climb a
tree, it will live its whole life believing that it is stupid" \--Einstein

~~~
goodcanadian
Cute, but probably not Einstein:
[https://www.macleans.ca/education/uniandcollege/why-we-
shoul...](https://www.macleans.ca/education/uniandcollege/why-we-should-
forget-einsteins-tree-climbing-fish/)

------
maest
I admire that the author actually responded to some of criticism, accepted it
and took it in stride. It's something I find myself having difficulties with
more often than I'd like.

However, "even in turds you can sometimes find a peanut". I mean, come on...

~~~
sokoloff
I thought that line was amusing. I wouldn’t read it too literally.

~~~
JshWright
It's definitely off-putting. I'm not sure what you mean by not reading it
"too literally". Obviously no one thinks the author is speaking literally...
It's still reasonable to think the phrasing is gross.

~~~
striking
Sometimes the comments are gross.

~~~
JshWright
Sure? I'm not sure how that's relevant...

I'm confused about why folks seem so upset by people expressing this opinion.

~~~
striking
Personally, I'm confused about why people insist on dissecting every single
sentence in this article and some others like it.

~~~
maest
Is it bad form to criticise an article?

~~~
striking
What is your criticism exactly? You quoted the article and said simply, "come
on".

IMO this is only validating the criticism the article levels at comment
sections like HN's. You have picked out some random sentence and expressed no
more than idle disagreement. Maybe I personally wouldn't compare your comment
to a turd, but there's not a whole lot of nutritional value in it either.

Perhaps the reason the article expresses this concern in this specific way is
because it is warranted. Because people insist on disassembling articles
coming from this domain sentence by sentence and posting comments that really
don't say anything helpful or sometimes anything at all.

~~~
maest
Do I really need to spell out why that phrasing is

1\. unpalatable

2\. indiscriminately rude towards an entire community?

> Because people insist on disassembling articles coming from this domain
> sentence by sentence

I feel like I may have walked into something where I don't have much context.
I'm not sure what you mean by that.

Also, I find it strange people are so fixated on my criticism, and nobody has
commented anything about the praise I made in the very same post.

