
Ask HN: What's the worst you've ever screwed up at work? - kadabra9
We've all been there (most of us, at least). What did you do (or not do), how did you first react, and how did you handle it?

Bonus points for sharing what you learned/key takeaways from the experience.
======
patio11
I've only cried literal tears once in the last ten years, over business. Due
to inattention while coding during an apartment move, I pushed a poorly
considered change to Appointment Reminder. It didn't cause any
immediate problems and passed my test suites, but the upshot is it was a time
bomb that would inevitably bring down the site's queue worker processes and
keep them down.

Lesson #1: Don't code when you're distracted.

Some hours later, the problem manifested. The queue workers came down, and AR
(which is totally dependent on them for its core functionality) immediately
stopped doing the thing customers pay me money to do. My monitoring system
picked up on this and attempted to call me -- which would have worked great,
except my cell phone was in a box that wasn't unpacked yet.

Lesson #2a: If you're running something mission critical, and your only way to
recover from failure means you have to wake up when the phone rings, make sure
that phone stays on and by you.

Later that evening I felt a vague unease about my earlier change
and checked my email from my iPad. My inbox was full of furious customers who
were observing, correctly, that I was 8 hours into an outage. Oh dear. I
ssh'ed in from the iPad, reverted my last commit, and restarted the queue
workers. Queues quickly went down to zero. Problem solved, right?

Lesson #3: If at all possible, avoid having to resolve problems when
exhausted/distracted. If you absolutely must do it, spend ten extra minutes to
make sure you actually understand what went wrong, what your recovery plan is,
and how that recovery plan will interact with what went wrong first.

AR didn't use idempotent queues (Lesson #4: Always use idempotent queues), so
during the outage, a cron job added one reminder to the queue every 5 minutes
for every person who was supposed to be contacted that day. Fortunately, AR
didn't have all that many customers at the time, so only 15 or so people were
affected. Less fortunately, those 15 folks had 10 to 100 messages
queued, each. As soon as I pressed queues.restart() AR delivered all of those
phone calls, text messages, and emails. At once.
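
The idempotency lesson can be sketched in a few lines of shell. Everything here is made up for illustration (the paths, the marker scheme, the reminder-ID format), not how AR actually works:

```shell
# Hypothetical sketch of an idempotent enqueue: the cron job records a
# marker per reminder, so re-running it during an outage queues nothing new.
QUEUE=/tmp/ar-queue
MARKERS=/tmp/ar-queued
rm -rf "$QUEUE" "$MARKERS"            # clean slate for the demo
mkdir -p "$MARKERS"

enqueue_once() {
  reminder_id="$1"                    # e.g. "cust42-2014-01-17-0900"
  if [ ! -e "$MARKERS/$reminder_id" ]; then
    touch "$MARKERS/$reminder_id"
    echo "$reminder_id" >> "$QUEUE"   # stand-in for the real enqueue
  fi
}

enqueue_once cust42-2014-01-17-0900
enqueue_once cust42-2014-01-17-0900   # cron re-run: no duplicate queued
```

However many times the cron fires during an outage, each reminder lands in the queue at most once, so restarting the workers can't unleash a backlog of duplicates.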

Very few residential phone systems or cell phones respond in a customer-
pleasing manner to 40 simultaneous telephone calls. It was a total DDOS on my
customers' customers.

I got _that_ news at 3 AM Japan time, at my new apartment,
which didn't have Internet sufficient to run my laptop and development
environment to see e.g. whose phones I had just blown up. Ogaki has neither
Internet cafes nor taxis available at 3 AM. As a result, I had
to put my laptop in a bag and walk across town, in the freezing rain, to get
back to my old apartment, which still had a working Internet connection.

By the time I had completed the walk of shame I was drenched, miserable, and
had magnified the likely impact that this had on customers' customers in my
own mind. Then I got to my old apartment and checked email. The first one was,
as you might expect, rather irate. And I just lost it. Broke down in tears.
Cried for a good ten minutes. Called my father to explain what had happened,
because I knew that I had to start making apology calls and wasn't sure prior
to talking to him that I'd be able to do it without my voice breaking.

The end result? Lost two customers, regained one because he was impressed by
my apology. The end users were mostly satisfied with my apologies. (It took me
about two hours on the phone, as many of them had turned off their phones when
they blew up.)

You'd need a magnifying glass to detect it ever happened, looking on any chart
of interest to me. The software got modestly better after I spent a solid two
weeks on improved fault tolerance and monitoring.

Lesson the last: It's just a job/business. The bad days are usually a lot less
important in hindsight than they seem in the moment.

~~~
d0m
Couldn't agree more with Patrick; mine is formulated a bit differently:

Lesson #1: Don't push code on Friday afternoon.

Lesson #2: Beer, Code and Commit is totally fine. Just don't push! Wait until
the next day to review and push/deploy it.

~~~
jwdunne
Lesson 1 rings a mad bell. I distinctly remember a text from a colleague
saying "yeah, I'm never launching anything on a Friday".

As a general rule, the best time to launch is first thing on a Tuesday. Why
Tuesday morning? We're all well aware of "Monday morning madness" - planning
a launch for Monday morning is akin to masochism.

~~~
steveklabnik
I wonder if this is the rationale for Patch Tuesday.

~~~
mdisraeli
Pretty much. It's arbitrary, but it's fixed: not the first or last week of
the month (often reporting periods), not a Monday, and not the weekend. If you
then assume that organisations test the patches, it means they can deploy
them by Thursday morning.

------
yan
Not the worst at all, but probably the one I found most amusing. One of my
jobs included some sysadmin tasks (that wasn't the position, but we all did
devops), among my other responsibilities. I spent half a day going through
everything with the person responsible for most of the admin tasks at the
time. She was an extremely diligent and competent admin: she did absolutely
everything through configuration management and kept very thorough personal
logs and documentation on the entire network. One of my first tasks was to
change the backup frequency (or some similarly small change), and going by how
I usually did things at the time, I just sudo'ed a vi session, changed the
frequency and restarted the service.

She found out about it pretty quickly due to having syslog be a constant
presence in one of her gnu screen windows and gave me a look. She quickly
reverted what I did, updated our config management tool, tested it, then
deployed it, while explaining why this was the right way to do things. I
slowly came around to doing things the right way and hadn't thought much
about the initial incident until we found the personal logs that she archived
and left on our public network share for future reference.

In the entries for the day that I started, we saw the following two lines:

    [*] 2007/09/09 09:58 - yan started. gave sudo privs and initial hire forms.
    [*] 2007/09/09 10:45 - revoked yan's sudo privs.

~~~
sillysaurus2
_She found out about it pretty quickly due to having syslog be a constant
presence in one of her gnu screen windows_

I'm amazed that this is possible. How would I set something like that up? A
realtime log of only the most significant events of a remote system?

In fact, I'd like to take this opportunity of ignorance-admitting to ask the
community for general linux/bsd sysadmin resources. What books should I read,
or what topics should I study? I want to become an expert at modern sysops.
Modern deployment, hardening, backup, managing dozens of boxen, etc.

I've been thinking of going through any MIT OCW on the subject, but it seems
like hard-earned experience might not necessarily translate well to an
academic setting. What would you recommend I do?

~~~
sentenza
What position are you starting from? My old workplace was a university group
where we (admins) were recruited from the available pool of PhD students. So
I'm used to guiding people from "no knowledge" to "enough knowledge to be
dangerous". The first step was to force the prospective admins to run a
specific system on their "productive machine" and keep it in such conditions
that _everything_ works.

This way, a complete admin newbie would learn about digging through the
systems by working out the kinks of practical everyday problems. Remember,
this is only the most basic instruction, nowhere near enterprise-grade.

If there was a "prospective admin" who had never before run Linux, I'd tell
them to install and use Ubuntu/Mint. (Those guys would usually only be
trained to be a helping hand for a "senior" admin.)

If they'd already used Ubuntu at home, I'd tell them to start using Debian,
work out how to set up an SSH server, and set up their home machine so they
could access it remotely.

If they had dabbled with Debian, Fedora, SuSE or something similar, I'd tell
them to install Arch and set up some "interesting things", like a mail server
or an NIS server.

If they were using Arch or Gentoo at home, I'd just personally show them the
important things about our system and have them wingman with me for a few
days.

If you are already an advanced Linux or BSD user, my approach is of course not
appropriate. Instead, I'd recommend picking skills that you want to learn
(iptables? Exim?) and setting them up. Read manuals! Read RFCs!

Best of luck.

~~~
asdasf
>If they were using Arch or Gentoo at home,

If they are using gentoo, you should be finding someone else. Gentoo users are
typically the most dangerous combination of profoundly ignorant, yet absurdly
overconfident in their abilities. Seeing a bunch of autotools and gcc output
scroll by does not teach you anything. But the mistaken reputation as an
"advanced" distro makes people think that by using gentoo, they are therefore
"advanced".

~~~
Phlarp
Because large groups of people can always be prejudged by which technologies
they deploy!

~~~
asdasf
No, specifically gentoo users can. The distro literally serves no real
purpose, nobody with any unix experience would consider using it. It is quite
literally the distro for people who don't know what they are doing, but want
to feel "advanced" by watching stuff they don't understand scroll by.

~~~
Phlarp
Your comments tell me more about you than they do Gentoo users.

------
gmays
In late 2008, when I was in the Marines and deployed to Iraq, I was following
too closely behind the vehicle in front of us while crossing a wadi, and we
hit an IED (the first of 3 that day).

Nobody was killed, but we had a few injured. Thankfully the brunt of it hit
the MRAP in front of us. If it hit my vehicle (HMMWV, flat bottom) instead I
probably wouldn't be here.

That was the first major operation on my first deployment, too. Hello, world!

My takeaway? Shit just got real.

We ended up stranded that night after the 3rd IED strike (our "rescuers" said
it was too dangerous to get us). It was the scariest day of my life, but in
similar future situations it was different. I still felt fear and the reality
of the existential threat, but I accepted it. It was almost liberating.
Strange.

I deployed for another year after that (to Afghanistan that time). After
Afghanistan I left the Corps and started my company. Because if it fails,
what's the worst that can happen? Lulz.

~~~
kadabra9
This really puts some of the boneheaded moves I've made in my career in
perspective. One thing that's always kept me pretty even keeled after a blowup
is to take a breath and tell myself that no matter how bad I've screwed up,
I'm still here, still breathing, and there (most likely) is some way out of
the hole I've dug, no matter how painful.

Depending on the industry, that might not be the case though. Thanks for your
service.

------
ggreer
One summer in college, I got an internship at a company that made health
information systems. After fixing bugs in PHP scripts for a couple weeks, I
was granted access to their production DB. (Hey, they were short on talent.)
This database stored all kinds of stuff, including the operating room
schedules for various hospitals. It included who was being operated on, when,
what operation they were scheduled for, and important information such as
patient allergies, malignant hyperthermia, etc.

I was a little sleepy one morning and accidentally connected to prod instead
of testing. I thought, "That's weird, this UPDATE shouldn't have taken so
long... _oh shit_." I'd managed to clear all allergy and malignant hyperthermia
fields. For all I knew, some anesthesiologist would kill a patient because of
my mistake. I was shaking. I immediately found the technical lead, pulled him
from a meeting, and told him what happened. He'd been smart enough to set up
hourly DB snapshots and query logs. It only took five minutes to restore from
a snapshot and replay all the logs, not including my UPDATE.

Afterwards, my access to prod was not revoked. We both agreed I'd learned a
valuable lesson, and that I was unlikely to repeat that mistake. The tech lead
explained the incident to the higher-ups, who decided to avoid mentioning
anything to the affected hospitals.

If it's any consolation, the company is no longer in business.

Just remember when you screw things up: Your mistake probably won't get anyone
killed, so don't panic too much.

~~~
munificent
You didn't screw up here. The entire infrastructure, org chart, and policies
that allowed you to _accidentally modify a production database containing
critical medical information_ screwed up.

Blaming yourself here is like blaming yourself for being hurt after being told
to drive a car with no seatbelt or brakes.

~~~
ggreer
Sure there's plenty of blame to spread around, but I still would have felt
terrible if someone had been hurt or killed.

What system would you put in place to prevent this? The issue was that I
connected to prod when I thought I was connecting to a test DB. We each had
different credentials for prod vs everything else, but the SQL client
remembered my username and password. Anyone with prod access could have made
the same mistake.

~~~
canadev
In the past, I've set up big MOTD-style messages that say "PROD" in fancy
ASCII graffiti when I ssh/connect a DB client/whatever to production. I think
I will set one of those up now for my current setup.
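
A minimal sketch of that kind of guard, assuming hostnames contain "prod" (the naming convention, colors, and function name are all illustrative):

```shell
# Print a loud red banner whenever the target looks like production.
warn_if_prod() {
  case "$1" in
    *prod*) printf '\033[41;97m*** PRODUCTION: %s ***\033[0m\n' "$1" ;;
  esac
}

warn_if_prod prod-db-1   # prints the banner
warn_if_prod test-db-1   # prints nothing
```

A wrapper like `connect() { warn_if_prod "$1"; ssh "$1"; }` would surface it on every login.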

Also, sort of related, I'm using MacOS, and in the back of my head I've wanted
to create a tool that will change the color of the menu bar (at the top of the
screen) to, say, bright yellow, when I'm connected to the VPN so that I don't
accidentally visit a porn site while still connected to work.

That said, neither of these systems is even close to fail-proof :)

~~~
bluedino
It doesn't really solve anything, but I've done bright blue prompts for
staging, red for production, and green for development.

~~~
blowski
This is essentially what I do - Black on White for production, White on Black
for development. If I'm running development commands on a Black on White
screen, something doesn't feel right. It isn't a life-or-death application, so
this is enough.
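
The color-coded prompt idea can be wired up like this, assuming bash and a per-machine APP_ENV variable (both the variable name and the colors are illustrative):

```shell
# Pick a prompt color by environment: red for production, blue for staging,
# green for development.
env_prompt() {
  case "$1" in
    production) printf '%s' '\[\e[41;97m\][PROD]\[\e[0m\] \u@\h:\w\$ ' ;;
    staging)    printf '%s' '\[\e[44;97m\][STAGE]\[\e[0m\] \u@\h:\w\$ ' ;;
    *)          printf '%s' '\[\e[42;30m\][DEV]\[\e[0m\] \u@\h:\w\$ ' ;;
  esac
}

PS1="$(env_prompt "${APP_ENV:-development}")"   # e.g. in ~/.bashrc
```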

------
hluska
A local Subway franchise was the very first company that hired me. I was
extremely young, shy, and intensely socially awkward, yet excited to join the
workforce (as I had my eyes set on a Pentium processor).

When I worked at Subway, the bread dough came frozen; you would put loaves
in a proofer, proof them for a certain amount of time, and then bake them. My
first shift, however, got busy and I left several trays in the proofer for a
very, very long time. Consequently, they rose to roughly the size of loaves of
bread, as opposed to the usual buns.

It was my very first shift alone at any job in my life, so I did the most
logical thing I could think of and put the massive buns in the oven. They
cooked up nicely enough and I thought I was saved. Until I tried to cut into
one.

Back in the day, Subway used to cut those silly u-shaped gouges out of their
buns. In retrospect, I think this was most likely a bizarre HR technique
designed to weed out the real dummies, but at the time I was oblivious (likely
because I was one of the dummies they should have weeded out). When I ran out
of the normal bread, I grabbed one of my monstrosities, tried to cut into it,
and discovered that it was not only rock hard, but the loaf broke apart as I
tried to cut it.

That night, my severe shyness and social awkwardness had their first run-in
with beasts known as angry customers. I was scared I would get fired, so I
promptly made new buns, but spent the rest of my shift trying to get rid of my
blunder. I discovered some really interesting things about people that night.
First, you'd be surprised how incredibly nice customers are if you are
straight up with them. Customers I'd never met before treated the big, crumbly
buns as an adventure and, in doing so, helped me sell all the ruined buns.

In the end, I came clean (and didn't get fired). That horrible night was a
huge event in the dismantling of my shell. It taught me an awful lot about
ethics. And frankly, that brief experience in food service forever changed how
I deal with staff in similar types of jobs.

~~~
canadev
I gotta say, that's a pretty awesome story. Didn't expect that to be the seeds
of transformation.

------
Smerity
I was testing disaster recovery for the database cluster I was managing. Spun
up new instances on AWS, pulled down production data, created various
disasters, tested recovery.

Surprisingly it all seemed to work well. These disaster recovery steps weren't
heavily tested before. Brilliant! I went to shut down the AWS instances. Kill
DB group. Wait. Wait... The DB group? Wasn't it DB-test group...

I'd just killed all the production databases. And the streaming replicas.
And... everything... All at the busiest time of day for our site.

Panic arose in my chest. Eyes glazed over. It's one thing to test disaster
recovery when it doesn't matter, but when it suddenly does matter... I turned
to the disaster recovery code I'd just been testing. I was reasonably sure it
all worked... Reasonably...

Less than five minutes later, I'd spun up a brand new database cluster. The
only loss was a minute or two of user transactions, which for our site wasn't
too problematic.

My friends joked later that at least we now knew for sure that disaster
recovery worked in production...

Lesson: When testing disaster recovery, ensure you're not actually creating a
disaster in production.

~~~
grecy
I think you just started a new Agile-like trend:

'hyper-committed disaster recovery testing'

------
michh
Classic forgetting the WHERE clause of a manual UPDATE query on a
production system. The worst part is that you know you fucked up the
nanosecond you hit enter, but by then it's already too late. Lesson learned?
Avoid doing things manually even if a non-technical co-worker insists
something needs to be changed right away. And if you must: wrap it in a
transaction so you can roll back, and leave in a deliberate syntax error that
you only remove once you've finished typing the query.
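
The transaction habit looks like this in practice; the sketch uses sqlite3 so it runs anywhere, and the `accounts` table is made up for illustration:

```shell
# Run the risky UPDATE inside a transaction: check the damage, then
# ROLLBACK (or COMMIT only once the row count looks right).
rm -f /tmp/demo.db
sqlite3 /tmp/demo.db <<'SQL'
CREATE TABLE accounts (id INTEGER PRIMARY KEY, status TEXT);
INSERT INTO accounts (status) VALUES ('active'), ('active');
BEGIN;
UPDATE accounts SET status = 'disabled';   -- oops: forgot the WHERE
SELECT 'rows touched: ' || changes();      -- sanity-check before committing
ROLLBACK;                                  -- wrong count, so nothing sticks
SQL
```

After the ROLLBACK both rows are still 'active'; without the BEGIN, the mistake would have been permanent the moment you hit enter.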

~~~
gargarplex
I always add a LIMIT even when not necessary.

Why doesn't MySQL have version control baked in? Even if it preserved just
the last n hours of state...

~~~
icedchai
It kind of does. It's called the binary log.

~~~
gargarplex
Didn't know about this! Thanks.

[https://dev.mysql.com/doc/refman/5.0/en/binary-log.html](https://dev.mysql.com/doc/refman/5.0/en/binary-log.html)

[https://dev.mysql.com/doc/refman/5.0/en/point-in-time-recovery.html](https://dev.mysql.com/doc/refman/5.0/en/point-in-time-recovery.html)

------
yen223
I wrote a piece of code controlling an assembly line machine. These machines
require manual operation and come with a light curtain, which detects when
someone places a hand near the moving parts and should temporarily stop the
machine.

A relatively minor bug in the software that I wrote caused the safety curtain
to stop triggering when a certain condition was met. We discovered this bug
after an operator was injured by one of these machines. Her hand needed
something like 14 stitches.

Lessons learnt:

1. Event-driven code is hard.

2. There's no difference between a 'relatively minor' bug and a major one.
The damage is still the same.

~~~
BuildTheRobots
Couldn't read your comment without a shudder and my brain going straight to
the Therac-25 incident(s).

~~~
dkokelley
I hadn't heard about this so I looked it up. Chilling.

[http://en.wikipedia.org/wiki/Therac-25](http://en.wikipedia.org/wiki/Therac-25)

~~~
kohanz
We just recently reviewed the Therac-25 case study as my organization is
working towards ISO 13485 certification. I wonder whether the OP's
organization was using ISO development practices.

------
jawns
I run Correlated.org, which is the basis for the upcoming book "Correlated:
Surprising Connections Between Seemingly Unrelated Things" (July 2014,
Perigee).

I had had some test tables sitting around in the database for a while and
decided to clean them up. I stupidly forgot to check the status of my backups;
because of an earlier error, they were not being correctly saved.

So, I had a bunch of tables with similar names:

    users_1024
    users_1025
    users_1026

I decided to delete them all in one big swoop.

Guess what got deleted along with them? The _actual_ users table (which I've
since renamed to something that does not even contain "users" in it).

So, how do you recover a users table when you've just deleted it and your
backup has failed?

Well, I happened to have all of my users' email addresses stored in a separate
mailing list table, but that table did not store their associated user IDs.

So I sent them all an email, prompting them to visit a password reset page.

When they visited the page, if their user ID was stored in a cookie -- and for
most of them, it was -- I was able to re-associate their user ID with their
email address, prompt them to select a new password, and essentially restore
their account activity.

There was a small subset of users who did not have their user IDs stored in a
cookie, though.

Here's how I tackled that problem:

Because the bulk of a user's activity on the site involves answering poll
questions, I prompted them to select some poll questions that they had
answered previously, and that they were certain they could answer again in the
same way. I was then able to compare their answers to the list of previous
responses and narrow down the possibilities. Once I had narrowed it down to a
single user, I prompted them to answer a few more "challenge" questions from
that user's history, to make sure that the match was correct. (Of course, that
type of strategy would not work for a website where you have to be 100% sure,
rather than, say, 98% sure, that you've matched the correct person to the
account.)

~~~
10feet
Ha, nice one. Whenever I start a new job, the first thing I do is create a
backup of the database, because I have done something similar before. Back up
everything on the first day, onto your own machine.

------
leothekim
Not the worst, but certainly most infamous thing I've done: I was testing a
condition in a frontend template which, if met, left a <!-- leo loves you -->
comment in the header HTML of all the sites we served. Unfortunately the
condition was always met and I pushed the change without thinking. This was
back in the day when bandwidth was precious and extraneous HTML was seriously
frowned upon. We didn't realize it was in production for a week, at which
point several engineers actually decided to leave it in as a joke. Then
someone higher up found out and browbeat me into removing it, citing bandwidth
and disk space costs.

Now, if you go to a CNET site and view source, there's a <!-- Chewie loves you
--> comment. I like to think of that as an homage to my original fuckup.

~~~
strozykowski
Haha! I had to go out and check, and sure enough, there it is in CNET.com's
source.

------
itwasme
I once worked for a company that schedules advertising before films. This
wasn't in the US and the company had a monopoly over all of the ads shown
across the country. It was my first programming job and done during university
holidays, so I was there for a couple of months and then back to university.
Toward the end of the following year I got a phone call: something was wrong
with the system; it was allowing agents to overbook advertising slots. I
diagnosed the problem over the phone and they put a fix in but management
decided it was too late for the company to go back and cancel all of the ads
that were already booked. This was not surprising as it was the most money
they'd ever made. Conveniently, the parent company owned the cinemas so they
did a deal where they just showed all of the ads that were booked.

Because of me, one December, everyone in the country who went to the cinema
got to watch anywhere between 30 and 45 minutes of ads before the main
presentation started.

Lesson learned: write more tests, monitor everything.

~~~
mcintyre1994
That time isn't normal where you are? How much did you make selling your
system to the UK cinema industry? :)

~~~
itwasme
Haha. It was quite a long time ago otherwise I would have remembered the usual
maximum booking time. I wouldn't be surprised if they exported the bug, given
its success.

------
snikch
Sigh, I cringe even remembering this one.

We were storing payment details sent from a PHP system into a Ruby system, I
was responsible for the sending and receiving endpoints. Everything was
heavily tested on the Ruby end but the PHP end was a legacy system with no
testing framework. Since the details were encrypted on the Ruby end, I didn't
do a full end-to-end test AND decrypt the stored results.

Turns out that for two months we were storing the string '[Array]' as
people's payment details.

Takeaway: If you're doing an end to end test, make sure you go all the way to
the end.

------
discardorama
I bet > 66% of these are something to do with databases. :-)

My story (though I wasn't directly responsible): we were delivering our
software to an obscure government agency. Based on our recommendation, they
had ordered a couple of SGI boxes. I wrote the installation script, which
copied stuff off the CD, etc. Being a tcsh aficionado, I decided to write it
in tcsh with the shebang line

    #!/usr/local/bin/tcsh

Anyways: we send them the CD. Some dude on the other side logs in as root,
mounts the CD, and tries to run "installme.csh". "command not found" comes the
response. So he peeks at the script, and sees that it's a shell script. He
knows enough of unix that "shell == bash". So he runs "bash installme.csh" . A
few minutes go by, and lots of errors. So he reboots; now the system won't
come up. The genius that he is, he decides to try the CD on the second SGI
box. Same results.

In the script, the first few lines were something like:

    set HOME = "/some/location"
    /bin/rm -rf $HOME/*

Hint: IRIX didn't ship with /usr/local/bin/tcsh. And guess what's the value of
"HOME" in bash?

~~~
dllthomas
_And guess what's the value of "HOME" in bash?_

In the rm line of the snippet above, "/some/location". Magic variables in bash
tend to lose their magic once set.

~~~
Robin_Message
I assume `set HOME = /some/location` is the tcsh syntax to set a variable.

In Bash, it doesn't do anything useful.

~~~
kps

_In Bash, it doesn't do anything useful._

In sh and derived shells, it sets the arguments ($1, $2, and so on). In this
case you end up with $1 being ‘HOME’, $2 being ‘=’, and $3 being
‘/some/location’.
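
That behaviour is easy to demonstrate in any POSIX shell; only the path from the story is reused here:

```shell
# In sh/bash, `set` replaces the positional parameters and leaves $HOME alone.
set HOME = "/some/location"
first=$1; second=$2; third=$3
echo "positional: $first $second $third"   # positional: HOME = /some/location
echo "HOME is still: $HOME"                # your real home directory, untouched
```

So in the snippet above, the rm line expanded $HOME to whatever the operator's real home directory was, not "/some/location".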

------
jboggan
I love these topics.

~ 2007, working in a large bioinformatics group with our own very powerful
cluster, mainly used for protein folding. Example job: fold every protein from
a predicted coding region in a given genome. I was mostly doing graph analysis
on metabolic and genetic networks though, and writing everything in Perl.

I had a research deadline coming up in a month, but I was also about to go on
a hunting trip and be incommunicado for two weeks. I had to kick off a large
job (about 75,000 total tasks) but I figured spread over our 8,000 node
cluster it would be okay (GPFS storage, set up for us by IBM). I kicked off
the jobs as I walked out the door for the woods.

Except I had been doing all my testing of those jobs locally, and my Perl
environment was configured slightly differently on the cluster, so while I was
running through billions of iterations on each node I was writing the same
warning to STDOUT, over and over. It filled up the disks everywhere and caused
an epic I/O traffic jam that crashed every single long-running protein folding
job. The disk space issues caused some interesting edge cases and it was
basically a few days before the cluster would function properly and not lose
data or crash jobs. The best part was that I was totally unreachable and thus
no one could vent their ire, causing me to return happy and well-rested to an
overworked office brimming with fermented ill-will. And I didn't get my own
calculations done either, causing me to miss a deadline.

Lessons learned:

1) PRODUCTION != DEVELOPMENT, ever ever ever ever

2) Big jobs should be preceded by small but qualitatively identical test jobs

3) Don't launch any multi-day builds on a Friday

4) Know what your resource consumption will mean for your colleagues in the
best and worst cases

5) Make sure any bad code you've written has been aired out before you go on
vacation

6) Don't use Perl when what you really needed was Hadoop

~~~
lambdaphage
Nice. I once needed to do reciprocal BLAST for the complete genomes of about
300 bacterial species. That's on the order of half a billion queries, but the
work was embarrassingly parallel, and each discrete job only took about 90
seconds. I wrote a little shell script to kick them off on the cluster, and
went home.

I woke up the next morning to several inbox screens' worth of messages from
angry people I didn't know, demanding explanations for what I did to their
jobs and their cluster. I don't think I have ever biked to the lab faster.

After multiple rounds of palm-drenching emails with the cluster sysadmins and
the computational mathematics group PI (and my own boss agonizingly cc'ed), we
determined the cause. The cluster sysadmins, lacking imagination for the
destructive naivete of their users, had not foreseen that anyone would want to
submit more than 10^4 jobs at once. That broke the scheduler, preventing other
people from running jobs and me from canceling them. Meanwhile the blast jobs
blew past the disk quota, leading to a Hellerian impasse where I somehow
lacked the space to delete files so I could create space. I still don't fully
understand it.

I believe it took a full day to get the cluster back online.

------
admiraltbags
Lurker turned member to post this.

Second web-related job, at an insurance company; I was 20 years old at the
time. We were heavy into online advertising, mostly banners at the time (this
was right around when AdWords started to get big). The company had just bought
out all of the MSN finance section of their site for the day; it was a pretty
big campaign ($100,000). We drove all the traffic to a landing page I had
created with a short form to "Get a quote".

IT had given me permissions to push things live for quick fixes and such. I
made a last-minute design tweak and, you guessed it, broke something. I was
checking click traffic and inbound leads and realized traffic was through the
roof but leads were non-existent. This was about 45 minutes after the campaign
was turned on. I jumped on the page and tested it out and got an error on
submit. FUCK. I literally started to perspire INSTANTLY.

Jumped into my form and quickly found the bug, can't recall what it was but
something small and stupid, then pushed it live without telling a soul.
Tested, worked, re-tested, worked. Ran some quick numbers to get a ballpark
estimate on the damage I caused... several thousand.

Stood up and walked over to the two IT guys, mentioned I borked things and
that I had fixed it... what should I do? I can still see the look on their
faces. Shock, then smiles. Walked back to my desk and about 10 minutes later
my two bosses show up (I worked for both dev & marketing managers).

They said thanks for catching the problem, not to worry. I did good for
finding it myself, fixing it, and pushing it live. I was still sweating and
shaking. They walk off and later that day marketing manager informs me MSN
will refund us for the 45 minutes of clicks.

It took about a month before I felt competent enough to touch our forms again.

------
nostromo
I was once in charge of running an A/B test at my work. Part of the test
involved driving people to a new site using AdWords.

After the test was complete, I forgot to turn off the Adwords. (Such a silly
mistake...) Nobody notices until our bill arrives from Google, and it's
substantially higher than normal. When my coworker came to ask me about it,
"are these your campaigns?!?" I just sank in my chair.

I think it cost the company $30k. I suppose it's not that much money in the
grand scheme of things, but I felt very bad.

------
byoung2
When I worked at ClearChannel back in 2010, we rebuilt Rush Limbaugh's site.
When migrating over the billing system, I discovered a flaw that granted at
least 20,000 people free access to the audio archive ($7.95/month). The
billing provider processed the subscriptions, but their system would only sync
with our authentication database once a week with a diff of accounts added or
removed in the past 7 days. You got the first 7 days free for this reason. If
this process failed (e.g. due to a connectivity issue, timeout, or SQL error),
all accounts after the error would not be updated. Anyone with a free trial or
people who cancelled during a week with an error would get a permanent free
trial. I rewrote the code to handle errors and retry on failure so that errors
wouldn't happen in the future, but my downfall was running a script that
updated all accounts to the correct status. Imagine angry Rush Limbaugh fans
used to getting something for free now getting cut off (even though it
shouldn't have been free). Management quickly made the decision to give them
free access anyway, so I rolled back the change.
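
The "retry on failure" part of the fix can be sketched as a small shell wrapper (all names here are made up; the real sync job was presumably far bigger), so that one transient connectivity error or timeout doesn't poison a whole week's diff:

```shell
# Retry a flaky command a few times, pausing between attempts.
retry() {
  attempts=0
  until "$@"; do
    attempts=$((attempts + 1))
    if [ "$attempts" -ge 5 ]; then
      return 1            # give up after five failed tries
    fi
    sleep 1               # a real job would back off longer
  done
}
# retry sync_weekly_account_diff   # hypothetical sync command
```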

------
killertypo
During a server migration for our web based file sharing system our lead
engineer (at the time) forgot to ensure that all cron jobs (for cleaning up
files and sending out automated emails) had been turned back on.

Cue me 7 months later, reviewing the system and realizing that critical jobs
were no longer running and that our users were all essentially receiving 100%
free hosting for however much storage they wanted. SOOOO I turned the jobs
back on.

The lead engineer before me left no documentation of what the jobs did other
than that they should be run. In my stupor I did not review the code. The jobs
sent out a blast of emails warning that files would be deleted if not cleaned
up or maintained. Then seconds later deleted said files...

We nuked around 70GB worth of files before we realized what happened. WELL GET
THE TAPES! Turns out our lead engineer ALSO forgot to follow up w/ system
engineers and the backups were pointed at the wrong storage.

No jobs lost; thankfully the manager at the time was a wordsmith of the
highest degree and could play political baseball like a GOD.

------
tptacek
I once accidentally ruined the Internet.

[https://www.google.com/search?q=ptacek+kaminsky+leak](https://www.google.com/search?q=ptacek+kaminsky+leak)

------
tilt_error

      # cd /etc
      # emacs inetd.conf
      # ls
      ...
      ... inetd.conf
      ... inetd.conf~
      ...
      # rm * ~
      # ls
      # ls
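
For anyone who hasn't met this one: emacs leaves backup files whose names end in `~`, and the intent was presumably `rm *~`. The stray space turns it into `rm *` (every file) plus a file literally named `~`. A harmless way to see the difference in a scratch directory:

```shell
# The intended command deletes only the emacs backups (names ending in ~);
# the typo'd "rm * ~" expands to every file plus a file literally named "~".
cd "$(mktemp -d)"
touch inetd.conf inetd.conf~ passwd passwd~
rm -f *~       # what was meant: only the ~ backup files go
ls             # only inetd.conf and passwd remain
```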

~~~
yankoff
Man, for some reason that double 'ls' made me laugh for 5 minutes. Just tried
to imagine the surprised look on your face.

~~~
jonalmeida
irl, that would have been:

    
    
      $ ls
      $ sleep 5
      $ ls

------
zimpenfish
Many years ago, when I was but a fresh faced idiot, the partition that
contained the mSQL database which had All The Data filled up. I moved it into
/tmp because there was plenty of space.

On a Solaris box.

Hilarity ensued when we next rebooted it.

~~~
ambiate
For those who don't know, solaris uses tmpfs for /tmp. It is a virtual
memory/swap based file system. Anything in /tmp is actually temporary if the
machine reboots/powers off.
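
A quick way to check how /tmp is backed on whatever box you're logged into (the `-T` flag is the Linux spelling; other systems differ, hence the fallback):

```shell
# Print the filesystem backing /tmp; "tmpfs" in the Type column means it
# is RAM/swap backed and the contents vanish on reboot.
df -T /tmp 2>/dev/null || df /tmp   # fall back where -T means something else
# On Solaris, "mount | grep /tmp" shows swap as the source.
```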

~~~
scott_karana
I like setting this up on Linux machines too. There are tons of ephemeral
files that get written there, depending on your use case, and I'd rather not
waste the IO for writing pids to lockfiles. Disk is cheap, but RAM is fast and
cheap. :)

------
alexmarcy
My worst would have been catastrophic if I had waited one minute to make my
mistake.

I was commissioning a new control system at a power plant's water treatment
facility. I was fairly new to the industry and had mostly trained on the job
by looking over the shoulder of the guy who did the bulk of the work.

This particular day the guy was out sick and we had to finalize a couple of
things before we ran through the final tests.

There was an instruction to open a valve to fill a tank and it had the wrong
variable linked to it. The problem was to maintain the naming standards I had
to do a download to the processor to make the change. When I had been doing
work in the office this was not a big deal, download the program to the
processor, it stops running for a moment while it loads the new logic into
memory and starts back up.

Not thinking through the implications of the processor shutting down while the
process was up and running, I made the code changes, hit download, and about
30 seconds later an operator came running over looking like he had seen a
ghost. And he was pissed.

While I was making my code changes the operator was hooking up a hose to drain
a rail car of some chemicals. The way the valves were configured before I made
my changes was correct and would have had no consequence if I hadn't touched
anything. The way the valves were configured when the processor restarted
would have routed the rail car's contents to the wrong tank resulting in a
reaction which would have created a huge plume of highly toxic gas. The way
the wind was blowing this plume would have blown directly to the largest town
in the area and could have killed a ton of people.

The operator heard the valves in question changing position before he opened
the valve on his hose to empty the rail car and figured something was up. When
he saw the whole process had shut down he got really angry because I had
ignored the protocol in place to avoid such a disaster.

I got chewed out and kicked off the site. My boss attributed my mistake to
inexperience and I had to give a safety presentation on what I did wrong.

Lessons learned: Be sure you are aware of any implications your actions have.
If you are unsure or guessing about something stop what you are doing and go
ask someone first.

Don't give people mission critical work on their first project and have them
work unsupervised. Training is important.

Always be aware of safety requirements, especially when you are working with
machinery, automated processes, chemicals or anything else that can hurt, maim
or kill you.

~~~
micro-ram
Rule #1 - Don't make things worse by guessing.

------
edw519
Boss: We have thousands of bad orders that must be fixed now!

Me: No we don't. We have 121 bad orders.

Boss: There are thousands of them!

Me: No there aren't. There are exactly 121 of them. I'm sure.

Boss: I'm not going to argue with you!

Me: Good. Because you'd lose.

I fixed 121 orders that night. The next day my login & password wouldn't work.

~~~
msh
I might be dense, but I dont get it.

~~~
noir_lord
He got fired for the cardinal sin: proving the idiot boss an idiot.

In theory you are supposed to be able to tell your boss he is wrong (and a
good boss will appreciate this); in practice it depends on the boss.

------
rfreytag
About 30 years ago I deleted the JOBCONTROL process on an old VAX 11/780
thinking it might be the reason why someone's process was stuck.

It wasn't but an hour before I lost sysadmin privileges.

Never "experiment" with a production system - ever.

------
frogpelt
Non-tech-related:

I was doing HVAC work while I was in college and we were removing an old air
handler from underneath a house. Just inside the crawl space, under the access
door was a water pipe. My boss told me to make sure I held it down while we
slid the air handler out through the hole. I lost my grip on the pipe and the
air handler snapped it in two, at which point gallons of water began to gush
into the crawl space.

I ran for all I was worth to the road, which in this case was about 600 feet
away, to turn off the water at the water meter. I ran up and down the road in
front of the house and never found the water meter. So I ran back to the house
and inside and told the homeowner who promptly informed me that they used well
water. She called her husband and he told us where to turn off the well pump.

It wasn't really that bad in the grand scheme of things, but letting the
homeowner's water gush under the house for about 15 minutes doesn't look good
when you are supposed to be there to fix problems, not create them.

------
Ecio78
I can't decide between these two:

1) After a few months working in a bank, I was doing some simple admin check
via RDP to a Windows 2003 (no, maybe 2000) server, when I right-clicked the
network icon and, instead of clicking the properties option, I clicked
"disable". I had just enough time to say "oh sh!t", to realise that it was the
production Trading On Line machine, in a remote datacenter, during market
hours, and to discover a couple of minutes later that the KVM-over-IP was
crappy and not working. We had to call the datacenter operators to go back to
the local KVM and re-enable the NIC.

Lesson 1: Move slowly when you're on a production machine (and having plans B
and C to reach your machines is a good idea).

2) Same bank, one or two years later. I was doing some testing on a new mail
system that also integrated VoIP (SIP), running in a VM (I think VMware Server
at the time) in the same remote datacenter as above. So, I enable the SIP
feature and after a few seconds, boom, we lose the whole (production)
datacenter and the connection between the local server room and the
datacenter. Panic. I look at my colleague, WTF in stereo, everything comes
back for a few seconds, then boom, down again. Long story short, the issue was
that that version of Netscreen firewall ScreenOS had a buggy ALG
implementation for SIP that led to core dumps. The fun thing is that we had
two of those in HA, same version of course, so they kept bouncing between
core dumping, the rebooting slave becoming master, and then core dumping
again, etc. We had to ask a datacenter operator to reach the rack, disconnect
one of the cables from the firewall (the one carrying the traffic of the DMZ
where that machine was hosted) and then reach the virtual host to kill the
machine.

Lesson 2: you can segment your network but if everything is connected through
the same device(s), sh!t can still hit the fan...

------
reppic
One time I tried to change a column name in a production database. I learned
that when you change a column name, mysql doesn't just change a string
somewhere, it creates a new table and copies all the values from the old table
into the new one and when that table has millions of rows in it, it really
slows down your production server.

~~~
xutopia
I keep thinking that's the most ridiculous thing ever.

------
contingencies
First job, circa 2000, at an ISP that was very clearly run as a business
cutting corners. Not only was it critically understaffed, but management was
more interested in laughing their way to the bank than in managing. They had me
- with literally no routing protocol experience - manage a live route
advertisement transition between two peering providers. Result: all customers
offline, ~24 hours.

Reaction was standard: mostly to point out I did my best in unfamiliar
territory and things should be sorted soon.

Takeaways were: (1) fewer support calls than expected - users put up with
things; (2) you learn when you fail; (3) always have a backup.

They kept me on at that job but I left pretty soon anyway as I got a 'real'
(as in creative) job hacking perl-powered VPN modules for those Cobalt
Raq/Qube devices, and building a Linux-related online retail venture for the
same employer ... that worked great, but failed commercially.

~~~
penguinlinux
I worked at an ISP in NY right around 1997-2004. We also had the Raq/Qube
devices and I had to manage stuff I was not familiar with :) I learned so
much by trial by fire.

------
JasonFruit
I sent an email to three thousand insurance agents informing them of the
cancellation of policy number 123456789, made out to Someone Funky. I learned
to appreciate Microsoft Outlook's message-recall function, which got most of
them. I also learned that just because you're using the test database instance
doesn't mean nothing can go wrong.

------
tsaoutourpants
Back in my younger days, I once had a project manager who was asking me to
make a significant network infrastructure change but refused to tell me why
the change was necessary and basically told me to do as I was told. I messaged
a coworker to see if he knew what was going on, and dropped in that the PM was
being a "fucking cunt." I was unaware, however, that the co-worker and the PM
were troubleshooting an issue together and the PM was staring at his screen as
my message came through.

The PM brought the issue to the CTO, but somehow I didn't get fired. Ended up
apologizing (obviously a poor choice of words :)) and moved on. Never made
that infrastructure change.

Key takeaway: if you're going to talk shit, don't do so in writing. ;)

~~~
tsaoutourpants
Also, I had a friend with a similar (perhaps worse) story. His company sent
every employee an e-mail about being on time, to which he pushed Reply to All
and typed "FUCK YOU!" He laughed to himself and went to push the Discard
button, but accidentally hit Send in the process. He was about to try to
ExMerge it out of Exchange when he heard a BlackBerry vibrate and realized
that no amount of ExMerge would get it off the BBs. He spent the next hour
going door-to-door apologizing, and also managed to not get fired.

~~~
lostlogin
Damn iMessage is a shocker for this. Coworker sends out group message saying
big boss wants xyz and includes big boss in the group. Someone always misses
that it's a group message and sends an expletive or sarcastic reply. I am also
guilty.

------
cgh
I was in a remote meeting and failed to realise my laptop's camera was
broadcasting. A roomful of people saw me, clad in horrid workout clothes, jam
my finger up my itchy nose and scratch my balls.

Key takeaway: always check the cam.

~~~
xauronx
And make sure your phone is muted. The first conference call is easy. When
you have one every day for a year and it becomes so commonplace... sometimes
you forget. I've heard people on my team coughing obnoxiously, yelling at
people driving, doing the dishes, etc. Mute your shit, and tell your teammates
immediately when they aren't muted.

------
hcarvalhoalves
Happened to a colleague: it was the end of the day, and we were packing up to
leave. He used Ubuntu on his notebook, so he typed "shutdown -h now" on his
shell prior to closing the lid. Seconds later he's groaning, having noticed it
was a SSH session to the production server...

It wouldn't have been a big deal, if not for the fact that it was an EC2
instance, and back then halting the instance was equivalent to deleting it
permanently. We then spent the night at the office recovering and testing the
server. I think we left at 3:00 AM that day.

Lesson #1: it's never a good idea to "shutdown -h now" in a shell. Any shell.

Lesson #2: have the process to spin up a new production server fully automated
_and_ tested

~~~
brey
molly-guard is a good way to prevent this - it forces you to type in the
hostname of the machine you are trying to halt / restart etc.

[http://manpages.ubuntu.com/manpages/trusty/man8/molly-guard.8.html](http://manpages.ubuntu.com/manpages/trusty/man8/molly-guard.8.html)

~~~
scott_karana
Excellent if you have complete and thorough control of every server you touch,
but if you don't, it could be dangerous to rely on. Murphy's Law means it'll
be that one dang machine that gets shut down...

Personally, I'd think that training this is a lot more portable.

    
    
      # hostname
      <foo>
      # shutdown -h now
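
In the same training spirit, a molly-guard-style guard can be a tiny function in root's shell profile (a sketch only; the function deliberately shadows the real command, and `SSH_CONNECTION` is the variable OpenSSH sets in remote sessions):

```shell
# Refuse to run the real shutdown when this shell arrived over SSH; on the
# local console (no SSH_CONNECTION in the environment) it passes through.
shutdown() {
  if [ -n "$SSH_CONNECTION" ]; then
    echo "shutdown refused: this is an SSH session to $(hostname)" >&2
    return 1
  fi
  command shutdown "$@"
}
```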

------
Tloewald
In terms of feeling bad, I once had a client who wanted to demo a multimedia
project that we currently had in alpha on his Windows 3.11 laptop, but the
sound drivers weren't working properly (everything else was fine). He had
about an hour before he had to leave for the airport. I started monkeying with
the four horsemen of the apocalypse (Windows.ini, System.ini, Autoexec.bat,
and Config.sys) as I had many times before, but I screwed up saving backups,
bricked his machine, and couldn't fix it. In the end it was more embarrassing
than anything else, but it was a facepalm-stupid mistake.

The lesson from this is pretty obvious. Backup. Make sure your backup is good
and safe.

My worst work-related mistake was getting into business with a friend. It cost
me the friendship, a very valuable client, and a good portion of my retirement
savings. I'm not sure how related it was, but a few years later my (former)
friend killed himself.

And the lesson here is not to go into business with friends. Or at least to
set up the business as if you're not friends.

------
BjoernKW
Around 2000 my team was responsible for installing and maintaining a large
number of servers in 19" racks in a data centre.

Most servers had those hot swap drive bays for convenient access from the
front while the server was running. You only had to make sure no write
operation occurred while you pulled the drive out of the bay.

So, I had to exchange a backup disk on a database server running quite a few
rather large forums. The server had two disk bays: One for the live hard disk
and one for the backup disk. I was absolutely sure at that time which one was
the backup disk so I didn't bother to shut down the database server and incur
a minimal downtime. Of course, I was wrong and blithely yanked the live disk
from the drive bay.

I spent the rest of the night and most of the following day running various
MySQL database table repair magic. It worked out surprisingly well but having
to admit this error to our forum users was embarrassing, nonetheless.

Lesson: Appropriately label your servers and devices.

------
preinheimer
I ended up as the architect for a new live show we were putting on. You could
either pre-purchase some number of minutes, or pay per minute, it was like
$4.99/minute or something insane.

The billing specs kept changing, as did the specs for the show itself. New
price points, more plans, change the show interface, add another option here,
etc. The plan had been to do a free preview show the day before to work out
the kinks. That didn't happen.

The time leading up to show start was pretty tense, lots of updates, even a
few last minute changes! Then the show actually started, brief relief. The
chat system built in started deleting messages, one of those last minute
feature changes had screwed up automatic old-message deletion. We had a fix
though, update the JS, and bounce everyone out of the show and back in so the
JS updates. Fixed!

Then the CEO pointed out that the quality just kept getting worse. Turns out
that while the video player had both a numeric value and a string description
for the different quality levels, it assumed they were in ascending order. So
once it confirmed it could stream well at a given level, it automatically
tried the next, which worked! Poor quality for everyone. Fixed, and another
bounce.

Then it was over, time to go home. Back in the next day to finish off the
billing code. I decided to approach it like a time card system. Traverse the
logs in order, recording punch in time, when someone punches out, look up
their punch-in times and set that user's time spent to the difference. Remove
punch-in and out from the current record so they're not used again.

Now two facts from above added up to a pretty serious bug. 1) I _set_ the time
spent to the difference between the two times. Not added, set. 2) We bounced
everyone from the show twice to update their JS, and video player. So everyone
had multiple join/parts.

I under-billed customers by tens of thousands of dollars.
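
The "set, not add" bug is easy to sketch: with everyone bounced twice, each user has multiple join/part pairs, and the durations have to be accumulated, not overwritten. The log format below is hypothetical, just to show the shape of the fix:

```shell
# Sum watch time per user across ALL of their sessions; overwriting the
# total with the last session's duration is exactly the under-billing bug.
awk -F, '{ seconds[$1] += $3 - $2 }
         END { for (u in seconds) print u, seconds[u] }' <<'EOF'
alice,1000,1600
bob,1000,1300
alice,2000,2300
EOF
# alice's two sessions total 900 seconds; a buggy "set" would report 300.
```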

Things I learned:

\- Don't just argue that you need a trial run, make sure management
understands the benefits. Why, not What.

\- Duplicate billing code. After that, a co-worker and I wrote two separate
billing parsers, one designed to be different rather than efficient.

\- Give yourself ways to fix problems after they crop up. The bounce killed my
billing code, but not doing it would have damaged the actual product (which
later became a regular feature). Wish that thing had been my idea.

~~~
roryokane
Your “duplicate billing code” strategy is called N-version programming
([https://en.wikipedia.org/wiki/N-version_programming](https://en.wikipedia.org/wiki/N-version_programming)).

------
riquito
Last day of work before moving to the new job: I do some cleanup and rm -fr
my home directory. Seconds pass. Minutes pass. I start to wonder how it can
possibly take so long.

I list the contents of my home directory, trying to understand which folder
was so big. Then I see it. A folder usually empty. Empty because I use it as a
generic mount point. A mount point that the day before had been attached via
sshfs to the production server...

I had a strange feeling, as if I were seeing myself from behind, something
crumbling inside me. And at that moment someone starts to ask, "what's
happened to <hostname>?"

I gather my courage and say "I know"...

That was really hard. The worst day at work in years, and during the last day
too. Luckily we had a good enough backup strategy and the damage was mostly
solved in a couple hours.

There I realized how much of an idiot I was to have mounted the production
server inside my home directory, and I grew a little.
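
Two habits that would have contained this (a sketch; both rely on GNU coreutils / util-linux, so availability varies):

```shell
# 1) Refuse to let rm descend into a different filesystem, so an sshfs
#    mount hiding inside the tree survives the recursive delete:
#      rm -rf --one-file-system "$HOME"
# 2) Check a known mount point explicitly before deleting anything:
#      mountpoint -q "$HOME/mnt" && echo "still mounted - aborting"
# Harmless demonstration of (1) on a scratch directory:
d=$(mktemp -d)
mkdir "$d/doomed"; touch "$d/doomed/file"
rm -rf --one-file-system "$d/doomed"
```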

------
trustfundbaby
rm -rf *.*

yup. that really happened. it was 4-5am and I'd been working all night. I was
on the server trying to set something up and was trying to
blow away a folder ... I did a normal rm and that didn't work (obviously)
because there was crap in the folder. So I pulled out my nuclear weapon to
nuke the folder but left off the preceding ./ (which still wasn't that smart
anyway) ... I sat there for a second wondering why the deletion was taking so
long ... then another 30 then a minute ... then I looked at what I'd just
typed again ... then I realized what had happened.

ctrl-c'ed (or d, can't remember now) out of it. then tried to find root
folders

cd /etc => folder not found

cd /var => folder not found

I'm from a third world country where we laugh at Americans (sorry) for
throwing up when they're nervous or having panic attacks, but at that moment,
I had a full blown panic attack. I'll never forget it.

The work was a subcontract for a client who was doing work for Nike, and it
was a decently sized project that was critical to the success of the firm, and
I'd just blown away their live production server ...

After freaking out and almost crying for 5 minutes, I decided to call media
temple support (we were using one of their vps servers) ... and by the biggest
absolute stroke of luck they'd just backed up the entire server ... not even 2
hours prior to my madness. $100 for a full restore (I don't recall why) and
would I like to do that?

HECK YES I WOULD!

so they restored the server for me. I wrote an email to the head of the small
company I was doing all the work for, explaining what had happened and
telling him I'd sent over a check for $100 to cover the backup because it was
my fault. He was obviously very relieved and never cashed the check I sent.

I still get chills thinking about that exact moment when I thought I'd fucked
up my career and reputation for good.

------
embarrassed99
Leading a group working in an underground bunker on a live military radar site
in the Australian outback, where it rains every few years. We had to open a
rooftop cable duct and when the job ran overtime we closed it up with some
rags that were to hand. That night it rained.

The next morning, the bunker was full to ground level and the automatic power
cutoff had failed, as the float switch was directly under the cable duct and
the water pressure of the deluge kept the float depressed. By the time the
water stopped flowing the float was under a foot of mud. The powered circuits
were undergoing electrolysis and eating themselves away, made worse by the
site managers refusing to drain the bunker or turn off the power until a
week-long arse-covering evaluation had been completed.

A few hundred million dollars of front line radar was out of action for
several months.

Being a naive newly graduated engineer, I wrote a completely honest report and
analysis. My boss said it was one of the best reports he had read and there
was no impact on my career (if anything it got me noticed by the upper
echelons of the organisation).

Lessons:

1\. If you tell the truth you will be respected, even if it is incriminating.

2\. If there is a way for something to go wrong it can do so (slight variation
of Murphy's Law). Even if it's judged to be uneconomic to take preventative
action, be aware of the possibilities, so you can make a conscious decision
about the risk.

------
jamesbrownuhh
Demonstrated SQL injection to a colleague on the live website. Bringing a
sample URL up into the address bar, I explain, "You see, that ASP script takes
the value of ?urlparameter and updates the record - but what if I modify
urlparameter so that instead of 1, it is... (types) semicolon dash dash DROP
TABLE usermaster (presses enter)"

"Shit. Well, as I have just demonstrated, it becomes possible to wipe out a
million user login credentials at the touch of a button. So now we'll be
needing to restore that from the backups which we don't have." Luckily, and
ONLY BY CHANCE, I happened to have a copy of that table exported for other
reasons from a few days back.

Lessons learned: Never press enter.

~~~
Phlarp
The problem was the service being susceptible to injection in the first place.

This wasn't a mistake; just a hilariously successful penetration test!

------
donretag
A long time ago while working on a *nix box logged in as root, I executed a
simple "!find". Basically execute the last find. In root's history, the last
find command was something like "find ... -exec rm ...". The command was run
at the root of the content directory of a CMS, deleting all the content (major
media website). CMS was down while backups were restored.

I now never execute ! commands as root. Actually, nowadays I simply use
CTRL-r.
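
The `:p` modifier and `set +H` are the usual bash defences here (history expansion only bites in interactive shells; this is a sketch):

```shell
# In an interactive bash, "!find" re-runs the last find command unseen.
# Appending :p prints the expansion instead of running it:
#   !find:p      # shows the command; re-run it deliberately if it's right
# Or make '!' inert for the whole (root) session:
set +H 2>/dev/null || true   # harmless in shells without the H option
echo '!find is now just text'
```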

------
onyxraven
My first deploy at a once-top-10 photo hosting site as a developer was a
change to how the DNS silo resolution worked.

Users were mapped into specific silos to separate out each level of the stack
from CDN to storage to db. There was a bit of code executed at the beginning
of each request that figured out if a request was on the proper subdomain for
the resource being requested.

This was a feature that was always tricky to test, and when I joined the
codebase didn't have any real automated tests at all. We were on a deploy
schedule of every morning, first thing (or earlier, sometimes as early as 4am
local time).

By the time the code made it out to all the servers, the ops team was calling
frantically saying the power load on the strips and at the distribution point
was near critical.

What happened: the code caused every user (well upwards of millions daily) to
enter an infinite redirect, very quickly DoSing our servers. It took a second
to realize where the problem was, but I quickly committed the fix and the
issue was resolved.

Why it happened: a pretty simple string comparison was being done improperly;
the fix was at most one line (I can't remember the exact fix). There was no
automation, and testing it was difficult enough that we just didn't test it.

What I learned: if it's complicated enough that you don't want to test it in
a browser, at least build automation to test your assumptions. Or have some
damn tests, period. We also built a procedure for testing those silos with a
real browser.

I got a good bit of teasing for nearly burning down the datacenter on my very
first code deploy, but ever since, it's been assumed that if it's your first
deploy, you're going to break something. It's a rite of passage.

------
rmc
When trying to put our webserver-cum-database-server onto nagios, I tried to
apt-get install nagios-plugins. For some reason when installing that, apt
wanted to remove mysql-server. I just pressed "Y" without thinking (because,
hey, it's like 99.9999999% the right thing to do). So apt dutifully stopped and
uninstalled MySQL in the middle of the day.

Within about 2 minutes CTO strolls in asking about the flood of exception
emails due to each request being unable to connect to the database.

Thankfully, I was able to apt-get install mysql-server, all the data was still
there, and things were back to normal within 5 minutes.

------
ufmace
When I first started my professional career, I was a field engineer in the
oilfield, working on drilling rigs around Texas. There was some amount of
computer stuff, but a lot of hardware work too. One of the things that we had
to do was install a pressure sensor on the drilling mud line, which is
normally pressurized to around 2k psi with water or oil-based drilling fluid.

This sounds like a simple task, but it gets complicated by the variety of pipe
fittings and adapters available. Our sensors are a particular thread type, and
we have to find a free slot to install them, and come up with any pipe
connection converters necessary to install them there. Another tricky part is
that the rig workers who actually know about all of this stuff are often not
particularly eager to help out.

So on one particular job, the only free slot to install the sensor is a male
pipe fitting, capped with some sort of female plug. Our sensors are male in
that pipe size, so I need a female-female adapter to install it. I go looking
around and come up with one, not paying too much attention to it. I install
it, and everything seems to go more or less smoothly. We go on drilling with
this installed for like a week or two.

One day, the rig manager comes to find me and ask me about this adapter that I
used. He tells me that it is meant for drinking water lines, and is only rated
to 200 psi. And had been installed on a 2000 psi line for weeks. My jaw
dropped in shock - I have no idea how that adapter didn't fail, and it's
entirely possible it could have hurt or killed somebody if it did.

They sent one of their guys to find an adapter that was actually rated for
the pressure and replace it, and never said much else about it. No telling how
much trouble I could have been in if anything else had happened. It did make
me a lot more safety-conscious.

------
quackerhacker
I messed up epically on an interview. It was a 3 part interview for a JS/RoR
coder.

1\. I passed the resume and chat portion

2\. I passed the telephone questionnaire and got along great with the
interviewer

3\. (Fail) I scheduled my interview for a Friday at 4:30pm with a 30-minute
travel time. I left 1 hr early... still, it was Memorial Day weekend, so I
took the streets, thinking they would be quicker than the freeway, which was
at a standstill. I was so stressed that I literally had an anxiety attack and
couldn't even find the address. That had never happened to me before, so I'll
never forget it.

~~~
Zikes
I think most interviewers would be understanding if you called to reschedule
in that situation.

------
Debugreality
This one is really embarrassing. I started a new job at a small company as
the only developer, with the aim of creating a new site for them. They
gave me full access to their very small technology stack, which included
one MSSQL server.

One of the first things I wanted to do was set up a development DB, so I
exported the structure from their prod DB. I then changed the name in the
CREATE DATABASE statement at the top to the new dev DB I wanted and ran
the script.

Unfortunately, the prod DB name was still prepended to every DROP and
CREATE TABLE command in the script, so I had just replaced their whole
prod DB with an empty one.

Owning up to that was one of the most embarrassing moments of my career. It
was such a rookie mistake I just wanted to die. Luckily they had daily backups
so I only cost their 4 man business about half a day of work but... it was
enough for me to be a much more careful developer from that day forward!
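In hindsight, the dump could have been scanned before running it. A
minimal Python sketch of that check, with an invented dump and invented
database names:

```python
import re

# Hindsight sketch: before running a schema script generated from prod,
# list which database each destructive statement targets. The dump text
# here is invented for illustration.
dump = """
CREATE DATABASE dev_db;
DROP TABLE prod_db.orders;
CREATE TABLE prod_db.orders (id INT);
"""

# Capture the first identifier after each DROP/CREATE statement.
targets = sorted(set(
    re.findall(r"(?:DROP|CREATE)\s+(?:TABLE|DATABASE)\s+([A-Za-z_]+)", dump)))
print(targets)  # ['dev_db', 'prod_db'] -- prod_db appearing is the red flag
```

Seeing the prod name show up in that list would have been the moment to
stop and re-read the script.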

------
earino
easy:

me: "unix definitely won't just let me cat /dev/urandom > /dev/sda"

other: "sure it will"

me: <presses enter>

what I learned? unix will absolutely let you hang yourself. 1998, production
server for a fortune 5 company.

~~~
krisdol
Why in heaven's name did you try that in production?

~~~
earino
it was 1998. i was young and foolish. no, seriously, i was 19. i was also
_SUPER_ convinced that it wouldn't work :) I have since learned to be
waaaay less convinced of such things.

------
grecy
I added some products to a system on a Thursday, not remembering we added some
new columns to the product definitions, and the columns were nullable.

I was off Friday, so I come in Monday morning to see that ~20k customers have
been getting free stuff since Thursday lunchtime.

Lost something like $200k because of two nullable columns :(
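A schema-level guard against this failure, sketched in Python on SQLite
(table and column names are invented): declaring the new column NOT NULL
makes the forgetful code path fail loudly instead of shipping free
product.

```python
import sqlite3

# Sketch: a new pricing column declared NOT NULL so rows written by older
# code paths can't silently get a NULL (i.e. free) price.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, "
    "price_cents INTEGER NOT NULL)"
)

# An insert that forgets the new column now errors instead of succeeding.
error = None
try:
    conn.execute("INSERT INTO products (name) VALUES ('widget')")
except sqlite3.IntegrityError as exc:
    error = exc
print("rejected:", error)
```

The same idea applies in any SQL dialect: nullable-by-default columns turn
a missed code path into silent bad data, while NOT NULL turns it into an
immediate, visible failure.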

~~~
munificent
Dramatically increased user satisfaction for relatively small marketing budget
of $200k!

~~~
grecy
Actually, the customers get pissed off even when they get something for free,
because it's just another sign of how incompetent we are :(

------
joncooper
There's a saying in the rates market: "don't counter-trend trade the front
end".

I lost $7 million in minutes by being short $700 million of US 2yr notes
when the levees failed during the Hurricane Katrina disaster.

Although my bet that the 2y point would be under pressure in the intermediate
term turned out to be true, I got carried out by fund flows as folks spazzed
out to cut risk by rolling into short duration high quality paper.

To his credit, my boss, who sat across from me, said only: "wouldn't want to
be short 2 years." He let me make the call, which I did, and I covered my
position. (Ouch.)

My book was up considerably on the year already, but this was a huge hit, and
nearing year-end. I dialed back the risk of my portfolio and traded mostly
convex instruments (options) for the remainder of the year.

------
joshbaptiste
In 2001, at my first IT job as a help desk analyst, I heard beeping in the
server room on one of the Solaris/Oracle machines and pressed the power
off/power on button on the chassis. The DBA came running in and I promptly
left, saying "oh, I think it rebooted itself". The company went bankrupt
shortly after, so no huge lashing came my way, but all my more experienced
friends were like "wtf, never do that again!"

~~~
ambiate
Was probably a program running beep codes, haha. I have also had servers
emitting beep codes that brought me great anxiety for endless hours. Turns
out it was just some debug alert triggering the motherboard speaker.

------
SDGT
Ticked a debug output flag on prod for a specific IP (Proprietary CMS,
couldn't replicate the issue on test even with a full codebase and db sync),
brought down the entire server for an hour.

edit: This was after I asked for permission to do this.

Lesson learned: Don't EVER use ColdFusion as a web server.

------
PakG1
My first real summer job was working for a computer store that also did
tech support contracts with local businesses. I'll preface this by saying
the boss should never have given me the responsibilities he did, or should
at least have had me shadow more experienced people, but the shop was tiny
and I was actually the only full-time employee.

We had the tech support contract for the city's Mexican consulate. One of the
things we were doing was patching and updating their server and installing a
tape drive backup system. Server was NT4.

I'm in there doing work after 5pm, and wrongly assume that everyone's gone
home for the day. Install some patches and the server asks me if I want to
reboot. I say yes. Few moments later, a guy sticks his head into the server
room and asks if I'd shut down or rebooted the server. Oh, whoops, someone's
here. Yeah, I just installed some patches. Oh, OK, see ya.

Next day? Turns out he had been doing some work in their database where
they track and manage visa applications. That database got corrupted when
I rebooted the server mid-session. That night, the backup process then
overwrote the previous good copy of the database on the tape with the
newly corrupted one. We had not yet started rotating multiple tapes to
prevent backups of corrupt data, though we were going to purchase tapes
for that purpose shortly.

Summer was ending, and I quit a week later to return to school. Horrible
timing! I have no idea what happened after that, as I had been spending
the summer in a city that was not my own. I do know that the original
database contractor was on vacation at the time, so they couldn't reach
him. I think the consulate was SOL. To this day I regret rebooting that
server without checking whether anyone was working.

Lesson learned? Don't assume anything when doing anything. Carried that lesson
with me for the rest of my life. And find a boss who knows how to guide you if
you don't have much experience in your area. I guess for founding startups, at
least get an advisor.

edit: spelling

------
alok-g
The following was not actually me, but worth sharing.

They had ASIC design runs for research purposes once every three months,
yielding your design on silicon as ten 6" wafers, enough parts for testing
the first revision of a design. This person was carrying the wafers to a
vendor to have them cut into separate ICs and packaged. He gets to the
parking lot and can't find his keys. Puts the wafers on the roof of the
car, finds the keys in his pocket, and starts driving. Boom: the box of
wafers, still on the roof, is now on the ground. All broken. Some $100K in
wafers, plus three months lost, plus losing face with the customer...
Lesson: don't put stuff on the roof of the car!

~~~
10feet
I don't think that's the lesson. The lesson is that it's clearly a two
person job: one person to carry the wafers, the other to remove any
hazards, open doors, and double-check everything.

~~~
alok-g
It's easy to say that after the fact. There are many delicate tasks people
handle as a part of their day-to-day jobs, and not every task can afford more
people to help without increasing the costs.

------
kisamoto
While introducing a master/minion update system at work, I ran a batch
update to take a certain percentage of machines out of the cluster.

Unfortunately I got my selection criteria wrong and pulled out all of one
cluster and half of a second, halting a few thousand operations.

Luckily the monitoring system was very quick to alert me of this and using the
same (wrong) selection criteria it was a fairly simple process to stop the
update and put them all back in the cluster.

Takeaways? The age old cliche of "With great power comes great
responsibility". Oh and have good monitoring!
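That takeaway can be made mechanical. A small Python sketch of a
selection guardrail, with invented host names: compute the selection
first, and refuse to act if it exceeds the fraction of the fleet you
intended to touch.

```python
# Guardrail sketch for batch maintenance: evaluate the selection criteria
# first, and abort if the result is larger than the fraction of the fleet
# the operation was supposed to touch. All names are invented.
def hosts_to_update(hosts, predicate, max_fraction=0.2):
    selected = [h for h in hosts if predicate(h)]
    if len(selected) > max_fraction * len(hosts):
        raise ValueError(
            "selection %d/%d exceeds %.0f%% cap"
            % (len(selected), len(hosts), max_fraction * 100))
    return selected

fleet = ["node%d" % i for i in range(10)]
print(hosts_to_update(fleet, lambda h: h.endswith("3")))  # ['node3']
```

With a cap like this, buggy criteria that match a whole cluster raise an
error instead of draining it.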

------
drdeadringer
I dropped two units of equipment, ~$1.5Mil apiece, each in a separate
incident. No damage at all, but management didn't care. I blamed
myself despite mitigating factors such as impossible schedules, vicious multi-
tasking "to compensate", and less-than-ideal support equipment. At the time, I
didn't handle it very well but I ended up living through it -- first
job//assignment ever in the worst environment I've ever had before or since
with the worst coworker I've ever had before or since, and I mess up in the
millions of dollars. "Lasting Impressions", tonight at 8/7 Central.

I left that job about 3 years later when the metaphorical train stopped at a
nicer place. My name is still known in certain circles for this ["Oh bah, how
could I forget?" one former manager recently stated], but I don't plan to go
back there at this time.

I learned that life's too short for assholes and working in an environment you
don't like. If you don't screw up, your soul will die and you'll become that
former coworker you hated so much and who hated you in return. It's worth
picking and choosing where you work.

~~~
rajacombinator
what kind of equipment costs 1.5Mil and can be carried/dropped by a single
person?

~~~
drdeadringer
The defense kind.

------
bengarvey
I poured gasoline into the tractor's radiator instead of the gas tank.

Thankfully, someone stopped me before I turned it on.

~~~
benburleson
Is this really that bad? IANAMechanic, but if the coolant system is sealed,
shouldn't the gas just work similar to antifreeze?

~~~
seanhandley
Not if it's diesel - it'll be heated up (so it'll partially vaporize) and
it'll be under pressure. There'll be air in there as well. Could quite happily
combust if it got warm enough!

------
maxaf
I spoiled business users by saying "yes" way too often.

------
andy_thorburn
My worst screw up was causing a fire that destroyed one of the two prototype
3D printers my company had built.

I was working at a startup that was trying to create an affordable 3D printer.
We had two working prototypes that were used for everything - demos, print
testing, software testing, PR shoots, everything. Each prototype had cost
hundreds of man hours to build and debug and quite a bit of cash as well.

Among other things I had done all the work on the thermal control system for
the printer, it kept the print heads and build chamber at the correct
temperature. One night while working on one of the printers I hit an edge
case that my control code didn't handle well, and the printer turned all
of its heaters on full-bore. Half an hour later, all the plastics in the
prototype had either melted or burned, and I was left with a room full of
smoke and a pile of scrap aluminum.

------
glazskunrukitis
Two screw ups come to mind.

1. First day at a job. I needed to get familiar with a legacy system and
get a SQL dump from it to create a local copy of the database. After some
SSHing and MySQLing, I confused my two split terminal panes and ended up
importing my local dump to the production server. Of course the database
names and users were the same, so I ended up dropping the database. No
biggie. Backups were available from the previous day.

2. Similar story to the first one. I got a shiny new Zend Studio IDE and
wanted to set up sync with a remote server (just a static company website
with no version control). I filled in all the settings, pressed the sync
button, and what happened? Zend Studio somehow figured that I wanted to
force-sync my local folder, which was empty, to the remote site, so it
just deleted everything in the web root and uploaded my empty folder. Wat.
Should have read the settings twice.

~~~
codygman
This could be a strong example of why "the unix way" is better. You don't
overwrite your server with anything unless you explicitly rsync/scp it.

------
kirkthejerk
I mixed up the meanings of "debit" and "credit", and wrote a credit card
processing app that ended up PAYING $75K to our customers instead of charging
them.

I'm still not sure how this bug slipped past the bank's tough app
certification process, though.

------
aryastark
This wasn't me, but a coworker.

We were rearranging the layout of the office. Coworker was moving in to his
new space, setting up his desk. He boots up his computer, wonders why he has
no network. Looks around, discovers the ethernet cable isn't plugged in. Plugs
it in to the wall, _still_ has no network.

A few minutes pass, and the entire office is running around wondering why the
hell the network isn't working. Maybe an hour passes, the network guys are
losing their shit trying to hunt down what is wrong. I'll give you a hint: the
router was lit up like a Christmas tree, and the aforementioned coworker had
both ends of his ethernet cable plugged in--but neither end was attached to
his computer.

------
webstonne
I asked them for a job in the first place.

------
jmspring
Early on in the implementation of one of the PKCS "standards" at a browser
company many years ago, I misinterpreted a spec that was still in flux.
There wasn't enough testing, and the "release bits" went live.

I had to quickly get a patch in for the broken code, and then had to
maintain that buggy implementation. In addition, the "standard" itself got
a rather scathing write-up from Peter Gutmann, which is completely valid:

[https://www.cs.auckland.ac.nz/~pgut001/pubs/pfx.html](https://www.cs.auckland.ac.nz/~pgut001/pubs/pfx.html)

That critique is of the "standard" itself; the process was just as ugly.

------
dougbarrett
I used to work at Fry's Electronics right before iPhones were released,
when MP3 players had seen better days. Creative had come out with a nice
$300 MP3 player, and I was in charge of creating the sign tags in my
department because I could get them done the quickest. I would do hundreds
a day, and sometimes there would be slight slip-ups. In this case I forgot
a 0, so one lucky customer got a $300 MP3 player for $30 that day.

Luckily, there was no slap on the wrist or anything. The store manager
knew that, after thousands of these cards, this was only one of a few
slip-ups I'd made, so they just brushed it off and moved on.

------
anilshanbhag
This is about [https://dictanote.co](https://dictanote.co). I changed the
login flow to use a different package. After pulling the latest changes on
the server, I restarted Apache, opened the website, saw everything working
smoothly, and went to sleep.

9 hours later I woke up to find 800+ emails in my inbox. Django by default
sends out an email when an error occurs, and the tiny mistake of not
installing a package led to a lot of frustrated customers and, well, a
huge pile of email in my inbox!

Moral of the story: put "pip freeze > requirements.txt" and "pip install
-r requirements.txt" into your deployment flow.

------
it200219
I had installed osCommerce (an open-source e-commerce platform, much like
Magento) for one of our clients, who had >500 transactions a day.

Somehow, in the settings, we had the "Store Credit Card Info" flag set to
"Plain Text". The client's admin/staff could have used this information to
make transactions, as the backend showed full CC info in the order
details.

We didn't realize until we worked on it again for some bug fixes and new
features.

Lesson learned: when transitioning from a DEV to a PROD environment, make
sure all these critical flags are checked and correctly set.

Luckily, the client didn't have any idea what was wrong in the backend.

------
schmichael
I unknowingly reset serial number counters in a bicycle parts database, so
now there are a few hundred people in the world with high-end bike hubs
whose serial numbers overlap.

Lesson: Keep the code that touches production databases as simple as
possible so it's easy to verify exactly what it does. I was using a
framework's database tooling incorrectly because I never dreamed what I
used would touch the database's counters.

(Not my worst mistake in terms of people affected, but it's the only mistake
that was _literally laser etched in metal forever._ )

~~~
tadfisher
Chris King?

~~~
schmichael
I plead the fifth.

~~~
tadfisher
We should grab a beer sometime :)

------
chrislomax
Simple one really, and probably the most common. I noticed a data
integrity issue in the DB, tried to load from backups, and found the
backups had the same integrity issue. I found a backup from about two
weeks earlier where the data was intact and pieced together the good pages
from the daily backups and the two-week-old one.

All in, it took 4 days and a new server; the hard drive had been writing
bad pages to the DB. We lost 2 days of orders (they had been processed
through to the internal systems, though, so not really lost).

Lesson learned: validate backups and check page integrity when backing up.
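The "validate backups" lesson can be sketched in a few lines of Python:
record a checksum when each backup is written, then verify it on a
schedule, long before a restore is ever needed. The data here is a
placeholder:

```python
import hashlib

# Minimal backup-validation sketch: store a checksum alongside each backup
# when it's written, and re-verify during a scheduled pass so corruption
# is found before restore time. The payload is a placeholder.
def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

backup = b"...page data..."            # stands in for a real backup file
manifest = checksum(backup)            # saved next to the backup at write time

# Later, a scheduled verification pass: corrupted bytes would mismatch here.
assert checksum(backup) == manifest
print("backup verified")
```

Real database engines offer built-in equivalents (e.g. page-checksum and
backup-verification options); the point is that a backup you've never
verified is a hope, not a backup.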

------
unfunco
I think everybody has done this at some point, and I'm sure I won't be the
last: leaving the WHERE clause off DELETE and UPDATE statements when
writing SQL. I caused about 45 minutes of downtime on our RDS instance the
last time I did it, but since we had a multi-AZ setup, no data was lost. I
also frequently get mixed up between development and production
environments.

Every database alias I have now has the MySQL --i-am-a-dummy flag appended.
This has been a career-saver in my eyes.
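For reference, --i-am-a-dummy is an alias for the MySQL client's
--safe-updates option, which (per the MySQL client docs) can also be set
permanently in the client option file rather than via a shell alias:

```
# ~/.my.cnf -- reject UPDATE/DELETE statements that lack a key-based
# WHERE clause or LIMIT (same effect as --i-am-a-dummy on the command line)
[mysql]
safe-updates
```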

------
taf2
I hacked our development machines using a rooted RPM. We only had access
to the sudo rpm command, but I wanted to deploy our Rails app using
Capistrano. To work around the sudo-rpm-only access, I added some install
scripts to the RPM, because those run as root. This let me reconfigure
sshd, making it possible to do a local Capistrano deploy. I was smart
about it and reverted the ssh changes after the deploy completed (bash has
a kind of ensure that lets you roll things back like a transaction).

The cool thing was that our ops team was on the ball and detected the
changes to the sshd configuration even though I had restored them. Mind
you, this was all in a staging development environment. The issue was just
how immature it was of me to go this far to cap deploy instead of rpm
install our Rails app. For me, it was a good learning experience in
hacking RPMs and in security: when you run sudo rpm -Uhv package.rpm, you
had better trust package.rpm, because it can execute any shell script it
wants as root.

Also, in the future I would walk away from a company like this much
sooner. I enjoyed everyone I worked with there and would work with any of
them again, but I just wouldn't want to work in such a stress-filled
environment for so long again.

------
danellis
I performed a two-minute manual query on a MySQL database I was told was a
backup. What I didn't realize was that it was a live backup, and that it would
stop the production database from responding to queries for those two minutes,
meaning that authentications failed for two minutes. Several colleagues called
for my employment to be terminated immediately, but luckily they got
distracted by other issues.

------
peterwwillis
"Let go" a few hours into the first day on the job.

A friend had referred me for a sysadmin job opening at a web hosting company
in Florida. After a brief interview I got the job for a pretty decent salary
and was told when I could start. What they hadn't told me was that my schedule
would be Tuesday to Saturday. I had informed the hiring manager of my
preferred schedule (Monday to Friday), but I guess nobody mentioned it to
the manager of the group.

When I got there they told me my schedule and I immediately told them that's
not what I signed up for. So they asked me to sit for a while so they could
figure out what to do next. I took a tour of the NOC, and saw one of their
tier 1 technicians was chatting and watching a movie. I walked up and asked
him "Heyya! Workin' hard, or hardly workin'?" and smiled. He did not smile
back. So I went back to the desk I was assigned to, which was already logged
in - with the credentials of the previous admin.

While I waited I decided to see what other trouble I could get into. Sure
enough, all the old passwords were saved in the old admin's browser with no
master password. I couldn't copy-paste the list, so I took a screenshot and
began to find a way to print the list out to post on my cube wall. Before I
could finish I was asked to leave for the day while they figured out my
schedule changes. I should have gotten the hint when they asked me to leave
the badge there.

Later I got a voicemail telling me they'd pay me for the time I spent there
(about three hours) and they'd no longer require my services. Luckily I got
hired soon after to a different company, which was also hiring away all the
talented people from the place that had let me go, and the web hosting company
eventually went under. So it turned out to be a good thing in the end.

~~~
sejje
Which part was the screw-up?

~~~
peterwwillis
Well, not making sure all the terms of the job were met, being flippant to
somebody I didn't know at a new job, and digging into someone else's
credentials were all pretty murky. But overall it was just getting fired on my
first day vs working out the issue over time.

------
bobdvb
I used to work at a major European telecoms company... Unlike everyone
else, I'm not talking about programming mistakes; these are generally
physical fumbles:

One day I was doing a change control, I was scheduled to change some settings
on the modulator of a satellite system providing internet access to a portion
of the Middle East. I called the satellite operator and told them I wanted to
do the scheduled work, as they would have to confirm I was outputting new
configuration within the constraints of the contract.

I entered the change and re-initiated the modulation, but the operative
said he was seeing nothing. Now, because this signal was going to the
Middle East, I
couldn't see it in Europe and without substantial plumbing I couldn't tap in
to the antenna. My heart started going, I was checking amplifiers, up-
converters, everything and I couldn't see anything wrong. After a few minutes
the client called panicking because the action should have taken moments, not
minutes. After more confirming with the satellite company that I wasn't
transmitting I checked back through my steps and eventually saw I had missed
one crucial thing: changing certain parameters muted the output! A quick
few button presses, the patient man on the phone said "there it is!", and
I could relax again.

Lessons learnt: 1) I should have noticed a critical LED on the modulator was
not lit! 2) This is the reason change controls say "working period 10 minutes"
(time taken 9m50s). 3) A good boss will defend you if you recognise your
mistakes and don't f*ck up too badly. 4) Don't go for a quick drink with your
brother before a night shift.

In the broadcast industry they say you aren't a real engineer until you've
taken a TV network off air. Let's say I am very experienced, but my
employers have never had a problem with me. That's probably because I have
also seen people melt down under the pressure of delivering live services
to millions or even billions of viewers; if you can keep cool, you can
deal with it.

------
famousactress
Accidental sudo chown www-data:www-data /. on the production server.

Thoughtful pause "Why is this taking so long!?"

"OH FUCK"

------
rosser
An UPDATE statement without a WHERE clause.

In production.

I'm the DBA.
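One habit that softens this class of mistake, sketched in Python on
SQLite with an invented schema: run the UPDATE inside a transaction, check
the affected-row count, and only commit when it matches what you expected.

```python
import sqlite3

# Sketch: wrap the risky UPDATE in a transaction and inspect rowcount
# before committing. A forgotten WHERE clause would report every row
# touched, and the change would be rolled back instead of committed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 200)])
conn.commit()

cur = conn.execute("UPDATE accounts SET balance = 0 WHERE id = ?", (1,))
if cur.rowcount == 1:          # exactly the one row we meant to touch
    conn.commit()
else:
    conn.rollback()            # a missing WHERE would have hit every row

print(list(conn.execute("SELECT balance FROM accounts ORDER BY id")))
# [(0,), (200,)]
```

The same pattern works interactively in most databases: BEGIN, run the
statement, eyeball the "N rows affected" message, then COMMIT or ROLLBACK.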

------
a3n
Connected leads on an expensive piece of equipment, power live, being
_very careful_ with a pair of needle-nose pliers. The power switch was way
off in the other room, the tag-out procedure took time, and I was running
late.

Poof. Equipment electronics fried and useless.

I was chewed out. Could have been way worse.

Follow your safety procedures.

------
double051
We shipped an Android app that didn't like the way we had our HTTPS certs
configured, so I had logic in there to accept the connection if the cert
matched the one we had.

Two months later, the certs were expiring soon and we changed our
configuration to something Android liked by default. The bad news was that our
production Android app rejected the new configuration and only wanted to
accept the current certs.

We ended up quickly shipping a hotfix that accepted the current and upcoming
configuration a few days before the certs expired. There technically wasn't
any 'downtime' as long as users updated the app, but this all took place right
before 'holiday vacations', and the QA team had to test the fix while all the
devs were away.
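The eventual fix amounts to pinning a set of certificate fingerprints
rather than a single cert. A hedged Python sketch of the idea, with
placeholder bytes standing in for real DER-encoded certificates:

```python
import hashlib

# Pin-set sketch: the client accepts any certificate whose SHA-256
# fingerprint is in a set containing both the current cert and its
# scheduled replacement, so a rotation doesn't strand deployed apps.
# The byte strings are placeholders, not real DER certificates.
def fingerprint(der_bytes: bytes) -> str:
    return hashlib.sha256(der_bytes).hexdigest()

current_cert = b"current-cert-der"
next_cert = b"next-cert-der"
PINNED = {fingerprint(current_cert), fingerprint(next_cert)}

def cert_allowed(der_bytes: bytes) -> bool:
    return fingerprint(der_bytes) in PINNED

print(cert_allowed(current_cert), cert_allowed(next_cert),
      cert_allowed(b"other"))  # True True False
```

Shipping the next cert's fingerprint ahead of rotation is what lets old
app versions keep connecting after the switch.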

------
jason_slack
I once revoked my boss's e-mail and VPN access because his password was
'password123'. It was my job to keep things safe, after all, and I had
asked him nicely a few times.

EDIT: I proposed a new password of: @$tevezA$$ignedPwD@# (Steve's Assigned
Password)

He said no to that one.

~~~
jlgaddis
Heh, I have a co-worker who recently deployed a production router with a
blank administrator password. Fortunately, I've got automated jobs in
place that find those kinds of things. The password is now a variation of
"John is a dumbass" and I'm just waiting for him to ask for it.

~~~
jason_slack
That's awesome!

------
TheCapn
I may, and/or may not, have caused a production site's PLC to go into STOP
mode during daily operations while making network updates remotely.

Possible outcomes of an unplanned system halt include plugged machinery
that would need to be manually cleared, mixed products that would become
immediate net losses for the company, and damaged motors.

Thankfully no product was being run at the time. I have also implemented
changes across the board to our client sites that prevent this type of shit
from ever happening again. You know when you look at a system and go "this is
going to bite us in the ass eventually?" This was one of those systems, they
just needed a new hire to give them the push.

------
enthdegree
I was this close to setting the asset management server's hostname to
`ASSMAN'

~~~
jlgaddis
Heh, we have a Windows Server that all of our accounting software runs on.
It's named "beancounter".

------
krishnasrinivas
I had done an "rm" of a _big_ log file to free up space on a customer's
server, but our process kept the file open and kept writing to it. I
assumed the disk now had enough free space and got busy with something
else. The space was never actually freed, because the file descriptor was
still held open by the process. Ultimately the entire disk filled up with
the deleted-but-open log file and their server came to a grinding halt. I
think the customer stopped using our product after that, because we never
heard from them again.

Learning from this experience: never "rm" a log file a process still has
open; do "truncate -s 0" on it instead.
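The underlying mechanics can be reproduced in a few lines of Python
(POSIX semantics assumed): removing an open file deletes the directory
entry but not the data held by the descriptor, while truncating through
the descriptor actually frees it.

```python
import os
import tempfile

# Reproduce the failure mode: a process holds a log file open, someone
# rm's the file, and the space isn't reclaimed until the descriptor
# closes -- which is why the disk filled up anyway.
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".log", delete=False)
tmp.write("x" * 1024)
tmp.flush()

os.remove(tmp.name)                    # the "rm": directory entry is gone...
size_after_rm = os.fstat(tmp.fileno()).st_size   # ...but the data is not
print(size_after_rm)                   # 1024

# The `truncate -s 0` equivalent, done through the live descriptor:
os.ftruncate(tmp.fileno(), 0)
size_after_truncate = os.fstat(tmp.fileno()).st_size
print(size_after_truncate)             # 0
tmp.close()
```

This is also why `lsof | grep deleted` is the standard way to find which
process is still holding a "removed" file's space hostage.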

------
peg_leg
This was some time ago, and I've learned a lot since. I mkfs'd the main
disk on our email server. There was no redundancy. There was a new volume
that needed to be formatted, and my superior told me to do it. I protested
that I didn't quite know how; he pushed it. So I wound up wiping the wrong
disk by mistake. Since then I've made it a mission to make the entire
stack at that place resilient and redundant. Now it's virtualized, with
failover file and DB systems, NLB web servers, redundant storage, and
proper backups. It would take a hell of a lot more than what I did to
cause the same outage again.

~~~
FireBeyond
You could ask your boss if you can test it via re-running that fateful command
...

------
erobbins
My first day ever using Unix, I was left with a root shell. Trying things
out, learning, I made a few junk files somewhere or other. I was done with
that and decided to delete them. "Delete everything from that directory,"
I think: rm * /path/towhateveritwas

Now on to my tasks... I had some files to print out. Where did they
g...... FUCK.

I found a box of tapes and some SunOS manuals, spent the next several
hours figuring out how tar and tape drives worked, and got everything
back. Never told a soul.

1992. I've never done anything so careless since.

------
Davertron
I wrote an update script for a database table, not realizing I had the key
wrong (I'm kind of fuzzy on the details, but essentially it was a
composite key and I was only using one of its columns in my WHERE
clause...), and accidentally updated every customer's address in our
database to the address of one account.

Luckily we had backups from that morning so we only lost any address updates
people would have done that day, but it made for some interesting customer
service calls for awhile...
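The bug's shape is easy to reproduce in Python on SQLite (schema
invented): a WHERE clause on part of a composite key fans out, and
checking the affected-row count is what would have caught it.

```python
import sqlite3

# Reproduce the bug shape: a composite key (account_id, line_no), but the
# WHERE clause only uses one column, so the update fans out to every
# matching row. Schema and data are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE addresses (
    account_id INTEGER, line_no INTEGER, street TEXT,
    PRIMARY KEY (account_id, line_no))""")
conn.executemany("INSERT INTO addresses VALUES (?, ?, ?)",
                 [(1, 1, "1 Oak St"), (1, 2, "2 Elm St"), (2, 1, "9 Pine St")])

# Intended: update one row. Actual: the partial key matches two.
cur = conn.execute(
    "UPDATE addresses SET street = '5 Main St' WHERE account_id = 1")
print(cur.rowcount)  # 2, not the expected 1 -- the tell-tale sign
```

Asserting the row count against expectations before committing turns this
from a mass data corruption into an immediate, local failure.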

------
gbasin
Added some additional logging for an edge case, rolled it out to production
and then went camping in the remote wilderness for a week. Two days in, the
edge case got hit and the logging wasn't sufficiently tested. It logged as
intended... and kept logging and logging... until out of disk space :(

Oh yeah, I run a proprietary trading firm (still at the same spot); as a
result of that bug we went down and lost about $250k over the next few
hours. Testing is important in automated trading :)

------
teamcoltra_
I deleted the entire sales team's sales database (for Canada's second largest
cable company) because I was making a minor change and was too lazy to back it
up first.

------
ozten
Many years ago I was being shown the server room for the first time. They
asked me to unplug a certain box. I unplugged everything on one power strip.
Panicking at the drop of ambient noise in the room, I quickly plugged it back
in, but...

I have no idea why they didn't use UPS, but it took many critical servers
offline and caused a few hours of headaches for everyone.

Come to think of it, that was the last time I was allowed in the server room.

Lessons learned - don't let developers in the server room.

------
jonathanjaeger
I often turn on my performance-based ad campaigns before going into the
office, as they are very predictable at the beginning (a slow ramp-up in
spend). This time, though, the CTR was through the roof for something new,
and it had spent $15,000 by the time I got to the office and could turn it
off, while bringing in only about $5,000 in revenue. Not the end of the
world in the grand scheme of the monthly P&L, but still not something to
replicate.

------
wpietri
Long ago when I was, I think, a sophomore in college and worked for the
university IT group, I was trying to add an external drive to an early NeXT
machine [1]. I wanted to try out their fancy GUI development stuff, you see. I
was at best a modestly competent Unix admin, and this was circa NextStep 1.0,
so the OS was... rough. It was in the dark days of SCSI terminators, so just
telling if the drive was properly connected and, if so, how to address it was
challenging.

After a couple hours of swearing, instead of working from a root shell in my
own account, I just logged into the GUI as root. And there was a pretty
interface showing the disks. I could just click on one and format it. Hooray!

Well either the GUI was buggy or I clicked on the wrong disk, because as the
format was going, I realized the external drive wasn't doing anything. I was
formatting the internal boot hard drive. And since nobody but me gave a crap
about this weird free box somebody had given them, they had repurposed it. As
a file server. For the home directories of a bunch of my colleagues. Who were
now collecting around me wondering what was going on. Oops.

No problem, says I. I'll just restore from backups. But this thing used a
weird magneto-optical drive [2]. The only boot media we had was on an MO disk.
The backups were on another. And there was only one of these drives, probably
only one in the whole state. The drives were, of course, incredibly slow,
especially if you needed to swap disks. Which, I eventually discovered, I
would have to do about a million times to have a hope of recovery.

Long story short, I spent 28 hours in a row in that chair. It was my
immersion baptism [3] in the ways of being a sysadmin. The things I
learned:

Fear the root shell. It should be treated with as much caution as a live
snake.

Have backups. People will do dumb things; be ready.

A backup plan where you have never tried restoring anything may lead to more
excitement than you want.

Be suspicious of GUI admin tools. Avoid _new_ GUI admin tools if at all
possible. Let somebody else be the one to discover the dangerous flaws.

If you were smart enough to break something, you're smart enough to fix it.
Don't give up.

When some young idiot fucks up, check to make sure that they are sufficiently
freaked out. If they are, no need to yell at them. Instead support them in
solving the problem.

Seriously, my colleagues were awesome about this. I went on to become an
actual paid sysadmin, and spent many years enjoying the work. The experience
taught me fear, and a level of care that sticks with me today. I'm sure at the
time I was wishing somebody would wave a magic wand and make the problems go
away, but working through it gave me a level of comfort in apparent disasters
that has been helpful many times since.

[1] http://en.wikipedia.org/wiki/NeXTcube
[2] http://en.wikipedia.org/wiki/Magneto-optical_drive
[3] http://en.wikipedia.org/wiki/Immersion_baptism

~~~
georgemcbay
root shell, plus rm with any sort of wildcard matching, plus a bit too much of
a delay before you get your shell prompt back results in a very specific kind
of panicked anxiety that almost anyone who has been programming or sysadmining
for a while can easily relate to.
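
That anxiety mostly comes from not knowing what the wildcard expanded to until
it's too late. A hedged sketch of the obvious antidote, in Python rather than
the shell itself (the function name is invented): expand the pattern and review
the list before anything is removed.

```python
import glob
import os

def preview_delete(pattern, dry_run=True):
    """Expand the wildcard first so you can see exactly what a
    delete would hit; nothing is removed unless dry_run=False."""
    matches = sorted(glob.glob(pattern))
    for path in matches:
        print("would delete" if dry_run else "deleting", path)
        if not dry_run:
            os.remove(path)
    return matches

# First pass: inspect the expansion, remove nothing.
#   preview_delete("/var/log/myapp/*.log")
# Second pass, only after reviewing the printed list:
#   preview_delete("/var/log/myapp/*.log", dry_run=False)
```

In the shell itself, running `echo rm *.log` or `ls *.log` before the real
command buys the same peace of mind.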

~~~
jrabone
I'm sure zfs on Linux has a random sleep built in just to increase my anxiety
levels...

------
_mikelcelestial
I accidentally deleted all data from the live database, thinking it was our
beta database server. Good thing it was synchronized to our beta servers, so I
was able to bring it back in no time. The moment I clicked that delete button
I was facepalming all over. I learned from then on to double-check every time,
especially when switching between production and test servers.

------
highace
Changed the default RDP port on a remote Windows box, but didn't open the port
on the firewall and couldn't get back in. Whoopsie.

------
adw
I've screwed up countless things, many much more expensive than this, but
those stories aren't entirely mine to tell.

But this was one of my first. Years ago, making boot floppies for a physics
lab where I was reinstalling all the servers:

I meant: dd if=/dev/zero of=/dev/fd0

I did: dd if=/dev/zero of=/dev/hda

Oops. Bye, partition table.

(Always double-check everything you type as root.)
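
One transposed character in `of=` is the whole disaster, so destructive device
writes are a good candidate for an explicit allowlist. A minimal sketch,
assuming a hypothetical wrapper that only builds the command line (it never
actually runs `dd`):

```python
def build_dd_command(target, allowed=("/dev/fd0",)):
    """Refuse to construct a dd invocation unless the output device
    was explicitly listed as safe to overwrite."""
    if target not in allowed:
        raise ValueError(
            f"refusing to write to {target}: not in allowlist {allowed}")
    return ["dd", "if=/dev/zero", f"of={target}"]

build_dd_command("/dev/fd0")   # OK: the floppy is on the allowlist
# build_dd_command("/dev/hda") # raises ValueError: the typo is caught
```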

~~~
cjensen
I meant to dd a partition to tape... and ended up doing the reverse on our
fileserver. Tape is slow, so I caught it after a few kilobytes.

Somehow I managed to salvage the situation using the man pages for the
filesystem, the C compiler, and making sure I did not reboot. Really don't
remember much about that late night...

------
deanly
Not me, but a co-worker at an internship I held:

Said person entered the number of metric tons of concrete three orders of
magnitude higher than it should have been. Imagine the cost difference between
1.0 * 10^6 and 1.0 * 10^9 metric tons... Our boss was not pleased, to say the
least.

But imagine how easy it is to enter a few extra zeros in an Excel cell. Yikes!

------
CurtMonash
Short version:

I was a stock analyst, for a firm with dozens of institutional salesmen and
thousands of retail brokers. Some of my recommendations were very, very wrong.

The right thing to do is stand up, take the heat, and explain what you now
know as best you can. I learned that watching a colleague who I thought was
otherwise an unserious ass.

------
rokhayakebe
I built a content site, worked on it for two years and a few months. While
updating the entire codebase to make the site faster and easier to work on for
future updates, I accidentally deleted my database. 2 years gone, SEO traffic
gone.

Takeaway: Sometimes, it takes a disaster to realize you were in another
disaster anyways.

------
arethuza
I led an engineering team that almost sent out a demo on tens of thousands of
IBM CDs (this was 1998) containing test data, some of which had been sourced
from the worst possible alt.* newsgroup.

As it turned out, the only data that did go out was the single word "sheep" in
the search index.

------
alexmarcy
The worst one I ever heard about was while I was at a potato processing plant
in Idaho where they make McDonald's hashbrowns.

After the potatoes are peeled and washed they are run through a pipe with
blades to slice the potatoes into french fries. These blades are sharpened
with lasers and are insanely sharp because they need to cut a lot of potatoes
before being changed.

One day the line was shut down and it was time to change the blades. The lady
doing the change placed the new blades on the table and bumped the table when
she turned to grab a wrench from her toolbox. The new blades started to fall,
and she instinctively reached out to grab them to prevent them from hitting
the floor.

She ended up not grabbing anything, because the blades sliced her fingers
clean off. They took her to the hospital, and due to the blades' extreme
sharpness the cut was so clean that reattachment was a pretty easy procedure.
I don't know if she had any long-term negative effects from the incident.

Safety is important, be aware of your surroundings and don't instinctively
grab things you shouldn't be touching in the first place.

------
jjindev
Once, as a relative UNIX newbie, I "cleaned up" a Sun box until I had moved
things I needed for boot off the boot partition. I got it all back, manually
mounting partitions and so on, but I was certainly in a cold sweat for about
15 minutes.

Perhaps the only lesson is "slow down."

------
Beltiras
I work at a newspaper as a programmer for the website. Mostly my job is
backend programming, plus some HTML and CSS work (mostly left to designers). I
run our local computer infrastructure, manage a cluster for our online
presence, assist with technology-related journalism, and help our CEO manage
the IT budget.

I inherited a mess of an architecture and am finally getting around to
rewriting our deployment process. We buy VM services from a local outfit and
the prices are basically an arm and a leg for rather small machines. Due to
this my predecessor put in place an insane deployment script. It pulls the new
version from github then reloads code on the running dynos, one after another.
Reverting is out of the question with our current approach to VCS (something I
am also fixing). Most of the time this is no problem, all we are changing
really is some template code, or introducing new models and their views.

Thinking back I am quite happy we don't run into more problems than we do, but
also happy that this type of insanity is soon in the rearview mirror.

The worst mistake was recently, cost us about 4 hours of downtime during the
busiest time of the day.

A big feature on all news sites is the lists of stories presented to the user
after they have read what you put in front of them at the moment. They may
take the form of most viral, most read, most commented, sliced by time or
category or many other factors. My predecessor had written all those lists
statically, which made maintenance a nightmare and extension very fragile.

I made a function that generated a _generic_ list of items. You supply basic
parameters, amongst them a QuerySet that would construct the list, and my
function would check whether the result was cached and, if it wasn't, generate
and cache it.

The framework I use (Django) generally uses lazy evaluation for all QuerySets
and I rarely have to think about the size of the list I generate, I just take
care to limit the query before I list() it. During development nothing showed
up as a problem and I deployed this and all seemed to be good with the world.

A week passes by where I made at least 2 minor deploys (small changes to
templates, minor tweaks to list filters) and all seemed to be good with the
world.

Designer sends me a pull request; I look over the code: just some garden-
variety template changes, nothing that should raise an eyebrow. I make the
merge, plan to deploy, and then go to lunch. Deployment done, all seems well
for 2 minutes, but then suddenly the servers light on fire. Pages spew 404s
and 500s like there was no tomorrow.

For 4 hours I tear my hair out, examine every piece of code I was deploying
_that day_ , call in the big gun support (the kind that costs more money than
I care to think about). Everything I was looking at pointed to the caching
agent not working. Too many pageviews requesting the database, too much load
on the servers, reboots made them work fine for about a minute but then
everything became bogged down.

The big gun support finally pointed out something I had missed: traffic from
the database to the dynos was abnormally high. That made me take a look at
code that had been there for a while, and lo and behold: the QuerySet I was
passing as a parameter was being fully evaluated inside the receiving
function! Two lines of code added, one deploy, problem fixed.
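
The fix described above amounts to keeping the query lazy until the cache has
actually missed. Without dragging in Django, the same shape can be sketched
with a zero-argument callable (the cache dict and names here are invented for
illustration):

```python
_cache = {}

def cached_list(key, build_queryset):
    """build_queryset is a zero-arg callable standing in for a lazy
    QuerySet: the expensive query runs only on a cache miss, never
    when the key is already cached."""
    if key not in _cache:
        _cache[key] = list(build_queryset())
    return _cache[key]

calls = 0

def expensive_query():
    """Pretend database hit; counts how often it is evaluated."""
    global calls
    calls += 1
    return range(3)

cached_list("most-read", expensive_query)  # miss: query evaluated once
cached_list("most-read", expensive_query)  # hit: no evaluation at all
```

Passing an already-evaluated list instead of the callable would hit the
database on every request, cached or not, which matches the failure mode
described above.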

I have no idea to this day how this code could be live for a week without
causing problems until an unrelated change triggered the bad behavior. This is
not the first time I've seen strange behavior from code, having once chased a
Heisenbug in Java.

There's a happy ending to this. I made a big mea culpa slideshow where I
pointed out all the flaws and what we needed to do to prevent a recurrence. I
got support to make the changes needed, and my new cluster goes live the day
after tomorrow. Now I can carefully stand up NEW dynos for a deployment,
keeping the old ones around if the shit hits the fan. I got some changes
instituted in how we approach version control, something that's hampered work
for a while. And we save money in the long run because we will no longer be
paying an arm and a leg for the VMs (AND I got to learn about clustering
machines with HA, good stuff with gravy).

~~~
jlgaddis
_> ... and my new cluster goes live day after tomorrow._

On a Friday? :-)

~~~
Beltiras
End-of-month.

------
krak3n_
The worst thing I have done is terminate a running production instance with no
database backups.

Client, not happy.

------
seanhandley
Wrote an article on the company blog and linked it on HN. Traffic brought down
the server -_-

~~~
jhonnycano
that shouldn't count as a screw-up, but as a baptism of fire for your
company's infrastructure

------
slowmover
system("tar --remove-files -czf archive.tar.gz $datadir/");

What could possibly go wrong?
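
The sharpest edge here: if `$datadir` is unset or empty, `$datadir/` silently
collapses to `/`. A sketch of guarding against that before anything
destructive runs (the helper name is made up):

```python
import os

def require_datadir(value):
    """Reject an empty path variable, or one that resolves to the
    filesystem root, before it reaches a destructive command."""
    if not value or os.path.normpath(value) == "/":
        raise SystemExit("refusing to run: datadir is empty or '/'")
    return value.rstrip("/")
```

In the shell itself, `"${datadir:?datadir is unset}"` aborts the script with
an error instead of expanding to nothing.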

~~~
slilo
$datadir could be empty

------
m3mnoch
way back in the late 90s stone age of interactive ad agencies, we were doing
our first really big gig for hp. it was a demo shipping out to retail stores
showcasing one of their products -- a run of 30,000 stamped cd roms.

i was the one developing the macromedia director app running on the cd.

we were on-time.

we were ready to send them out the door.

it was awesome.

and then we tested the rom outside of our network...

in some far-off corner of code, i had baked in a hard reference to one of our
file servers on our network for some streaming assets. the cd failed as soon
as you put it in the drive due to that reference to the missing file.

by the time we discovered this, we'd already glass-mastered and stamped 30,000
discs to the tune of $40k or so. or, about $6k per employee. in a company that
booked about $50k the previous year. where i worked for free for 9 months.

so, my line of code cost our little company the equivalent of almost all of
our previous year's revenue -- not profit, but revenue.

we, of course, had to make the run again -- only this time at the emergency
rush prices. and this time, we were running late.

we managed to book some time in the middle of the night at the stamping plant.
it was 4am. i had a courier standing over my shoulder watching me run the
final build again, this time without the dreaded line of code -- which broke
other things i had to fix when i removed it -- before he could take it.

i finished testing. ejected the disc. handed it to the courier, who started
running as he was placing it into its case. he drove like hell to make it to
the airport where we counter-to-countered it on a 2-hour, 6am flight to vegas
for stamping.

oh, and it almost got even worse from there. almost.

we didn't know if they would be able to stuff the cds into the packaging
because this was an emergency run and they didn't have the people available.

so...

we were actually on our way to rent a uhaul which we calculated we could drive
to vegas just in time for the stamping run to finish. from there, we would
load the discs on their spindles, and 4 of us were going to sit in the back of
the van, stuffing 30,000 discs while we drove the uhaul to palo alto. from
vegas. yes, stuffing discs in the back of a traveling uhaul.

we even had the patio furniture from one of the employees' yards already
picked out to sit on while we were in the back of the truck.

luckily, the plant managed to squeeze in our packaging (at rush pricing, of
course) and all we needed to do was have one of our guys take them as luggage
on a later flight that day to the bay area instead.

as for a couple of big lessons learned?

1) i can honestly tell you, i've never, ever had a hard-coded, local network
link in anything i've shipped since and never will again. always test off-
network. especially these days with mobile apps and their on-off-network
states.

2) a strong, non-finger-pointing team is where you need to be. i felt
appropriately awful, but we handled it as a team and proceeded to grow that
little company to about $40 million a year before a merger.

p.s. oh, and next time, remind me to tell you about the time i ran a database
query on production that nuked the entire website for the publicly-traded
software company which relied on -- wait for it -- the website to do all its
commerce.

~~~
hrrsn
There is no way you can leave a juicy line like that at the end and not expect
me to ask for the story!

------
pasbesoin
Believing the CFO when he made a point of telling me, "If you ever need
anything, let me know."

Gratitude is demonstrated through actions, not vague verbal commitments.

------
typicat
    mysql> drop database PRODUCTION
    Do you really want to drop the 'PRODUCTION' database [y/N] y^Hn
    Database "PRODUCTION" dropped

~~~
tadfisher
The correct answer is always ^C

------
findjashua
In my newbie days of event-driven programming, I forgot to add 'if (err)
{...}' in an express application and crashed the server.

------
slipangel
sudo chown -R myname:myname /

Learned: Learning on the job as you hack away at problems is great, but
recognize that it's one part enthusiasm and one part risk management. Also
learned never to try anything on the command line that I wouldn't want to see
pulled from my bash history and stuck on the breakroom fridge. Also learned to
cope with humiliation well.

~~~
laxk
Put a space in front of a command to keep it out of history (this relies on
bash's HISTCONTROL being set to ignorespace or ignoreboth).

------
blueskin_
Meant to reboot my desktop...

    [root@importantServer]# reboot

"Hmm... This is taking a while..."

------
derwiki
I accidentally brought down yelp.fr by typoing the timezone field in the
database.

------
saganus
So the post is not on the front page anymore, but I guess confessing feels
good, judging by all the people that contributed.

My screwup was at my first "real" job, fresh out of college. I was asked to
free up some space on the production server at $BIGCOMPANY, because it was
already at 99% capacity (it managed to hit 100% for a few minutes before I
"solved" the problem). The thing is, at this $BIGCOMPANY the budget for disk
drives was, for some reason, non-existent, which meant that whenever disk
usage was at or below 95% we were _happy_ because we still had free space...
figure that.

So here I come, armed with the most dangerous tools a newbie can wield: root
access and the drive to impress your boss. I said to myself, "I've used root
on my home machines plenty of times and nothing bad happened, because I've
been using Linux for several years by now and I know I need to be careful...
so I don't get why everyone says you should never log in as root." Oh boy, did
I learn the hard way.

To continue my story, it turns out that the easiest/fastest way to free up
some space was to delete the log files for pretty much everything (except the
last 5 or 10 logs... because we were "careful", in case we ever needed them).
We usually deleted things under certain directories known to hold "useless"
logs. So here comes Mr. Newbie-guy-with-the-need-to-shine, and I thought to
myself, "why keep deleting the logs from the same directories over and over,
if that only buys us about 1 or 2 percentage points, instead of cleaning out
as many logs as possible and freeing up a lot more space?"

After thinking about it for like 10 seconds, the most genius thought of my
career materializes: do an rm -rf *.log on the topmost directory of where we
stored everything (webserver, webservices, databases, etc). I happily pressed
enter, and a couple of minutes later, hooray! I got the disk usage down to a
whopping 90%! I was a hero! That meant we had bought enough time to keep
working without worrying about disk space for at least another month or month
and a half. This was a clear victory and a testament to my superb sysadmin
skills.

Fast forward 4 hours, and the phone starts ringing like crazy as every other
(non-IT) employee started wondering, and then calling us, to figure out why
their data was gone. They did not understand how they could have been working
A-OK so far and then suddenly ALL the data from the sales team, the admin
team, the bosses, etc. was gone. And then a few minutes later... the whole
intranet came down, crashing and burning... then a full stop... nothing was
working.

So we went to the logs directory... oops... no logs there! OK, let's try to
ping the DB. Dead. It's not running, and it's responding with an unknown
error. When I tried to connect it would do so, but then some cryptic
ORA-xxxxx error came up. No problem, says I, I'll just google it and fix it.

Not so fast, young grasshopper. That error meant that the DB was out of sync
with its own files used for, ironically, data corruption prevention and
rollback (or something like that... to this day I still don't fully understand
what those files were used for).

As far as I can remember, those logs were a sort of pre-commit area, where all
changes would be stored and then, every X hours, committed to the actual DB
tables. It was some functionality that supposedly was used to correct
corrupted entries and to recover (figure that...) and roll back data when
lost, or something like that. And unfortunately, bringing the system back in
sync was way out of my league (did I fail to mention that I was by no means a
DBA?).

However, a stroke of good luck came down on me, as the company had a support
contract with Oracle at the Platinum-covered-diamonds level or something. That
meant that after creating a support ticket at like 1 AM, I got a call from one
of the support guys less than 20-30 minutes later. The guy seems calm and
tells me I should not panic; it was just as easy as doing
$crypticOracleStep1, $crypticOracleStep2, $crypticOracleStep3 and voilà! all
would be good again. Except for the fact that I had NO IDEA what those steps
actually required me to do. Almost in tears, I asked the rep to pretty please
SPELL OUT every command I needed to execute, letter by letter. I did not want
to screw up again.

So there I was, at close to 2 AM, with my boss breathing down my neck asking
what every frigging letter of the command I was typing did (which I had no
idea...), all the while trying to keep up with this super friendly guy who was
patient enough to spell everything out twice.

A couple of commands later, behold! The DB could be brought up again! Oh boy,
did I feel relieved. I was jumping up and down because I had fixed my stupid
mistake... or so I thought. After almost causing the support guy to go deaf
with my loud cheering, he says "however...". Wait... what? There's a
"however"?!? Then he continues: "since you deleted the pre-commit file of the
last day, the DB is back in sync... up to yesterday". My jaw dropped to the
floor. That meant that the ENTIRE previous day was utterly lost... sales data,
contracts, customer info, etc.

I thanked the guy for his help, hung up the phone, and turned to my boss,
telling him that I was ready to turn in my resignation letter right after
helping recapture whatever data was actually still there (on paper, by calling
customers and asking them again, etc).

My boss then turns to me and says: don't worry. We've all been through this at
least once in our careers. Even I made a mistake that is terribly similar...
except when I brought down the database, it took us one full week instead of
one day. Rest assured that as I learned my lesson, you did as well. And I need
guys like you, who have the initiative to solve things... and the ability to
learn from mistakes. So don't worry, you are not losing your job. However, you
can't go home until you help everyone get as much data back as you can.

Aw shoot... well... I guess it could've been worse. So after having lunch with
my boss and the other teammates at like 6-7 AM, I went to the sales dept and
started asking around how I could help them get their data back.

Those were the longest 38 continuous work hours I've ever had to endure. I did
not go back home until more than a full day and a half later. I was tired as
hell, to say the least... but to this day I think it was a blessing that I got
to learn such a hard lesson while being backed up by a boss who was very cool
and progressive about it.

Lessons learned:

0) Never ever ever ever use root, especially for deleting files and ESPECIALLY
with the -f flag.

1) Do not assume that something you know will hold. Confirm it on the
particular system you are going to be working with (i.e. do not assume .log
files are always log files just because on your laptop that holds true).

2) Be ready and willing to assume the consequences of your actions. Most of
the time, if you take responsibility for your mistakes, people will forgive
you and even give you a piece of advice.

3) Never ever ever ever use root.

------
mattwritescode
Deleted a database table and not the temporary table I was working on.

------
coherentpony
I shouted at someone.

------
smalu
chmod -R 0777 /* instead of chmod -R 0777 *

~~~
noir_lord
Ouch.

------
kentwistle
git push -f

------
cdelsolar
sudo reboot

------
benched
I once cared about a job to the point of damaging my mental health. I haven't
made that mistake since. I did, however, rather stupidly accomplish the same
thing, years later, by caring too much about an entrepreneurial venture.

------
vacri
Still feeling my way around the new job, I was fiddling with a backup script,
got distracted, turned back, and dropped the production database. Two minutes
later: "Hey, is the website down?" Then I look at the prompt...

I run around like a headless chicken trying to find out who knows the right
backup to use and so forth, and I can't figure out why everyone is so calm and
collected about it. _Production was down._ Shit, I hope I still have a job.
Turns out we had no active clients at the time - no-one was accessing the
site. We'd finished one run and were in 'dead time' before the next. My next
project involved implementing coloured prompts and I no longer leave
production ssh sessions lying around when I've finished with them.

My CTO still has me listed as "database [vacri]" in his phone...

------
ClayFerguson
I didn't do this, but was the one who figured out what happened. I guy wrote
an installation utility for internal use to automate certain software setups.
Part of the program had to clear out a certain directory, where you had to
enter the name of the directory. Problem was, if you leave that field blank
(the default), it converted into c:\ and people would run it, and it would
wipe out their hard drive. After finding the problem, I told only the guy who
did it, and no one else. I didn't have the heart to destroy his reputation by
telling everyone what had done it. I SHOULD HAVE let the chips fall where they
may, because I needed to be sure NO ONE ever ran that EXE "utility" again.
They figured it out pretty quick, but nobody really knew the true problem but
me and the guy who wrote the bug!
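
The underlying bug is a missing containment check on user input. A hedged
sketch of the validation that utility needed (the names and base directory are
invented): resolve the requested directory and refuse anything not strictly
inside the expected sandbox, so a blank field or a drive root fails instead of
wiping the disk.

```python
from pathlib import Path

def validate_clear_target(user_input, base="/opt/installer/work"):
    """Resolve the user-supplied directory and insist it lies
    strictly inside the sandbox base before it may be cleared."""
    if not user_input.strip():
        raise ValueError("directory field is blank")
    target = Path(user_input).resolve()
    base_path = Path(base).resolve()
    if base_path not in target.parents:
        raise ValueError(f"{target} is outside {base_path}")
    return target
```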

------
michaelochurch
Tried to prevent a massive product failure.

It failed anyway, but I wasn't around when it did and there would have been no
"I told you so" credit even if I were.

One of those "big company" lessons, but probably applicable to startups (which
have an even higher ego density).

~~~
kenesom1
Bummer, they should've listened to you...

Some companies are using internal prediction markets, where employees can
speculate on various initiatives:

http://en.wikipedia.org/wiki/Prediction_market#Use_by_corporations

Open allocation seems like the ultimate prediction market - people deciding on
what initiatives to work on and invest their time in is a stronger signal than
dysfunctional internal politics. And people are likely to be more motivated
working on projects they think have value.

------
camperman
dd if=outfile of=infile

Raw unadulterated fear followed by panic.

A full reinstall.

Triple checked dd params ever since.

~~~
xutopia
To me `dd` stands for "destroy disk"

------
elf25
Working Christmas Day at the Liquor store (before cameras were everywhere) and
drinking Tanqueray and Mt Dew ALL day. WHEEE!!

------
failsrails
This one time we used Ruby on Fails. That was the worst screw up ever!

~~~
smoyer
I didn't down-vote you, but I think the key to your mistake is "one time".
You're never going to learn a framework enough to use it the first time
(without an experienced mentor on board).

Those that did down-vote you are responding to the "Ruby-on-Fails" part of the
sentence ... you were the failure, as you would have been with JavaEE, Django
or Drupal. Your comment was needlessly offensive (to some at least) - if
you're going to be critical, you should learn enough to contribute back too.

