
Ask HN: We've all been there, what was your big stuff up? - shermanyo
I applaud the transparency of the GitLab team in their recent outage, but felt
bad for the engineer whose typo was called out. Anyone who's done something
similar will know the feeling immediately after realising their mistake...

To show that this sort of thing happens to the best of us, let's share some of
our horror stories :)

A few months ago, I joined a new team and was still finding my way around the
environments. I was tasked with performing manual deployments to Dev, QA and
Staging environments that weren't wired up to our automation system yet. We'd
scheduled maintenance windows a week apart for the QA and Staging envs, as we
allow customers to test against these.

So the day of my QA deployment, I start by applying the database changes,
which all complete successfully. Next, I upload the new .ear files and deploy
the new build of our web app. Again, all looks good, so I tell the QA team
they can start testing.

Then the alerts started...

I had deployed the app to the Staging env by mistake (and unexpectedly
restarted the app server). I didn't realise the naming scheme of the hostnames
indicated the environment in this case :/

Our UI broke immediately due to the schema changes, so my mistake was _very_
visible. I was lucky I could roll back the change easily, but I don't think
I'll forget that day any time soon.
======
herghost
Working on a software deployment across the whole company but without a
reliable means of distributing software. Using a combination of AD login
scripts where available, but mostly relying on the antivirus product that was
already installed to run scripts locally on each endpoint.

Cut to 1:30am after a full day of eking out 1 or 2 endpoints here or there,
and I've figured out a new method to try. But first I need to test it and make
sure it's not going to break anything else, so I create a separate asset group
in the AV software and add only my machine to it. I add a simple "hello,
world!" type script just to show that the script is executing and wait.

And wait.

No "hello, world!". It's 2am, I'm back in the office at 7am, my new insights
will hold until tomorrow. I'm going to bed.

About 6:45 I'm in the queue at the shop to get coffee and bacon and my boss
walks in for the same. We small talk and then he gets an incident call.

There's a virus affecting all of <locality's> machines. Uh-oh. He's getting
ready to abandon his coffee and bacon aspirations (he's the Head of Security),
when I ask what's actually happened.

"As everyone's logging in in <locality> this morning they're getting a command
prompt pop up that just says 'hello, world!'"

Oh. Fuck.

I abandoned _my_ coffee and bacon aspirations and assured him that this wasn't
a virus, it was a misconfiguration that I'd made only hours before.

It was sorted within minutes and was broadly taken with good humour. But I was
referred to as "World" for a while afterwards when people greeted me.

~~~
passiveincomelg
from the canonical list of Reasons Against Overtime

I surely contributed to that list, but was too tired to remember any details.
:)

------
petecooper
I worked at Tesco, a UK grocery chain, in my teens. I was involved with stock
control, among other things, and was partially responsible for populating
shelves at a new store.

All products at Tesco have an 8-digit product number (SKU) in addition to the
EAN/UPC. There's also a three digit case size number. Like this:

05123456-024

Each product has an estimated weekly sale and a capacity (shelf plus
warehouse) to aid efficient warehousing. Each product has a case size of less
than 1,000. Well, all but one -- white sugar. That has a case size of 1,024.
It's annotated on the shelf ticket as '024', dropping the leading '1'.

I didn't know about this until ~43 tonnes of sugar arrived on 6 trucks the
following day. For a new store. In a small town.

It turns out that my misreading '024' as a case size and over-ordering sugar
by a factor of 43x was enough to have the internal ordering software updated.

~~~
jacobush
See, that is what's wrong with you quiche eating europeans. Fixing the
error... What a twist ending! (Twisted even!) A heartfelt "you're fired!" and
a lawsuit, now that's the American way.

~~~
chewyfruitloop
...isn't that the standard order size for a week in the states :p

~~~
jacobush
Nah, they prefer their sugar hidden in other products. Besides, there's just
no bang for the buck in table sugar. High fructose corn syrup is the sweet
spot.

------
sokoloff
We were moving buildings in 2006. The datacenter was not getting re-IP'd and
did not have cross-connectivity, and some infrastructure was moving ahead of
the final move, including the backup targets.

So, I'd turned off the svn backups (dumps and post-commit incrementals) when
the targets moved about a week before the final people move. We got into the
new building and in the rush of getting everything set up, I'd forgotten to re-
enable backups (had not made a checklist). Sure enough, svn server crashes,
BDB corrupted, last backups about 8 days old.

Fortunately, we had nightly build snapshots, code on dev workstations, etc, so
it was mostly a rock-fetch project to put things back together starting from a
fresh repo. We had other automation that used the repo path and revision, so I
created a "devtemp" repo and restored the backup re-imported all the code
there and then laid on incrementals from nightly builds and dev workstations.
In the process, I checked in the vast majority of our code as the author of
"revision 6".

10 years later, I was _still_ getting svn blame based questions "about this
code you wrote (in -r 6)" "Man, that sokoloff dude wrote a whole lot of crappy
code..."

Now that we've been mostly on git for 2 years and only have those repos for
historical archeology, the questions are finally dying off.

~~~
shermanyo
> 10 years later, I was still getting svn blame based questions "about this
> code you wrote (in -r 6)"

That's fantastic haha

------
phaemon
A hastily and poorly written bash script. These days I start every bash script
with what I now think of as the "Brexit Options":

    
    
      set -eu; set -o pipefail
    

The key missing one in this case was `-u`. That stops the script if you have
an unset variable.

This script would do some stuff, and put a new website in place, and then
remove the old one. So, my bash script had the line:

    
    
      rm -rf /var/www/$olddir
    

You can see it already. I ran it with $olddir unset. I think I had it in my
head that the directory would simply not be found so that was fine. For those
of you unfamiliar with bash, since olddir="", what actually ran was:

    
    
      rm -rf /var/www/
    

Gigabytes lost (back then, a GB was a lot!). We had backups but they took
hours to restore. Horrible, horrible day.
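
For anyone picking up that habit now: `-u` (nounset) makes the script die the
moment it touches an unset variable, and the `${var:?}` expansion gives you a
per-variable guard on top of that. A minimal sketch of the same deploy step,
with made-up paths and a made-up script name:

      #!/usr/bin/env bash
      set -euo pipefail        # stop on errors, unset variables and failed pipes

      # abort with a usage message if no argument was supplied
      olddir="${1:?usage: swap-site.sh <old-dir-name>}"

      # the :? guard means this line can never collapse into "rm -rf /var/www/"
      rm -rf "/var/www/${olddir:?olddir is empty}"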

~~~
nl
I did this same thing, except mine was in effect:

    
    
      sudo rm -rf /home/username/something/$SOME_VAR/*
    

By some circumstance, SOME_VAR ended up being set to a space. Turns out that
rm takes a list of directories to delete, so that deleted everything from the
entire server.

Fortunately I had backups. But yeah.. don't do this.
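
The unquoted expansion is what turns one bad path into many: the space in
SOME_VAR splits the word in two, and the trailing /* becomes its own glob.
printf makes the difference visible without deleting anything (paths as in the
comment above, purely illustrative):

      SOME_VAR=" "     # the variable ended up holding a single space

      # printf prints one line per argument, i.e. exactly what rm would receive

      # unquoted: word splitting breaks the path at the space, and the stray /*
      # glob then expands to every top-level directory
      printf '%s\n' /home/username/something/$SOME_VAR/*

      # quoted: one single (wrong) argument, and the * stays literal
      printf '%s\n' "/home/username/something/$SOME_VAR/*"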

~~~
beaconstudios
that's an argument for quoting parameters if ever I've heard one!

------
gargravarr
On my internship while at uni back in 2010, I was tinkering with the company
SVN server. It was the only machine running Linux in the whole company, and
I'd only learned Linux the year before. If I recall correctly, I was trying to
set up Trac. Back then, it wasn't in the repos so I was having to set it up
from a tarball.

So, what do you know, I broke something in the source folder and the whole
Trac install was unusable. I decided to nuke it and start again.

I'm sure you all know where this is going by now.

Back then, I had a habit of typing ./* for anything in the current directory,
rather than just the bare *.

I forgot the .

Me being a total n00b and naive, I thought the permissions warnings I got were
genuine (I didn't _initially_ run the command as root) and that because I was
chown'ing stuff to www-data... yep. sudo !!

And of course, even though rm's refusal to delete '/' (--preserve-root) was a
thing even back then, that protection _only_ kicks in if the argument given to
rm is literally '/'. Otherwise, bash resolves the wildcard and passes each
entry in separately.

It took about 5 seconds to kill my SSH session, just long enough for me to
notice the missing . and go OHSHI-

Worse, the machine wasn't backed up. It had a reasonably concise wiki on the
company in-house software. On the flipside, that meant the boss shared the
blame with me because there was no backup. We were able to rescue the SVN
repos, but the MySQL data tables were gone.

So I can totally relate to the poor Gitlab sysadmin who's probably suffering
PTSD right now. For want of a single . I managed to trash a production machine
too.

As one of my friends would later tell me, 'root is a state of mind'.
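
A cheap habit that would have caught it: expand the glob with echo first, since
it's the shell and not rm that resolves the wildcard (a minimal sketch; the
directory is hypothetical):

      cd /srv/trac-build     # hypothetical working directory

      echo rm -rf ./*        # prints only the entries under the current directory

      # prints /bin /boot /dev ... -- the shell has already expanded the wildcard,
      # so rm never sees a literal "/" and the preserve-root protection can't help
      echo rm -rf /*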

------
shakna
We were doing a cleanup of VMs.

The network had been rebuilt four times by three different people, and only
half documented each time.

One time, each VM had been named after planetary bodies. Sol was the AD,
Jupiter the print server, etc.

We found one called Mars. Completely undocumented. Doesn't exist so far as the
docs knew. The previous admin didn't remember it.

I ran Wireshark, and got nothing.

So... I didn't just shut it down, but I deleted it.

Took 10 minutes for mass panic to hit the office.

Mars was the gateway for our publicly exposed servers. No website, no email,
no VPN.

Our daily backup only copied data, not actual images.

So, just hoping, I threw a reverse pass-through proxy up on the same IP, with
routes for our servers.

Quiet returned, as I went about recovering the Mars image I had deleted.

Lesson learned: if you are working in unknown territory, let it break before
deleting. Also, add VM images to the backup routine.

~~~
dawnerd
Also name things so they're a little more obvious.

~~~
shermanyo
We had a pair of VMs that switched hostnames in DNS at one point (i.e. the
hostname 'test-1' resolved to the VM with the image named 'test-2').

Someone would inevitably restart the wrong one from the VM host, thinking it
was the one they'd been SSHing to :/

~~~
shakna
_Twitching_, that would make me rather upset, and maybe prompt some host-file
changes.

Though we did, in the same mess of network, have dc01-<domain> and
dc02-<domain>.

dc01 was the Domain Controller.

dc02 was... The backup of ad01-<domain>...

------
partisan
A few weeks into a new job, I was tasked with fixing a bug in an ASP.NET
application. I stepped through the application and was able to reproduce the
issue, which unfortunately for me only happened when a payment was submitted.
So I went through the code, once, twice, thrice, testing various scenarios to
understand why the issue would happen. At some point, I started reading the
code at a line level and realized that embedded right in the middle of that
payment logic was a line of code that sent an email to the customer indicating
that a payment had been made. Then I realized I had sent 10s of emails out
during each debugging pass. I immediately ran over to my new boss and
explained what happened. He smiled when I told him how many emails were sent
and then asked me to get customer service in contact with the customer. He was
obviously unhappy, but he said he was glad that I had realized it and that I
had raised the issue to him immediately.

Lesson: Read the code before jumping headlong into a debugging session. If you
make a mistake, inform someone immediately. They would probably rather hear it
from you than discover it on their own. I stuck to this principle at that job
and it served me really well.

~~~
clooless
Happened to me. Now, my Debug web.config file has a setting that directs all
outgoing email to a local "SpecifiedPickupDirectory".

------
chewyfruitloop
not sure ... was it the time I deleted the table space containing the unbanked
transactions for a local council, which was about £1 million (I did a very
hasty recovery), or when I accidentally deleted the table space of the last 3
years' data for another council ... which took 3 weeks to recover ... or when I
set up an ISDN modem to dial the wrong number every 30 seconds for 6 months,
costing £10k after discount (the bill snapped the table legs when it was
dropped) .....

~~~
shermanyo
any of those will do nicely haha. thanks for sharing :)

------
tangus
A long time ago, working for a BBS, I wrote a nice interactive utility to
review and change user configurations. I named it uc ("user configure"). After
some testing, I installed it in /usr/local/bin.

Time to use it!

    
    
        # uc /bbs/users/*
    

Nothing happens. It needs some time to read all users' configuration files
before displaying the user interface, but it's taking too long. What's
happening? I decide to interrupt it. Shortly after, we find out all user
accounts starting with A, B, and C are wiped out.

Apparently, unbeknown to me, somebody had previously written a utility to
delete user accounts. It was named uc (user clear), and was installed in
/usr/bin. Fortunately we had fairly recent backups.

That day I learned about hash -r.
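
For anyone who hasn't been bitten by this yet: the shell searches PATH in order
and caches what it finds, so a same-named command earlier in the search order
(or an already-hashed one) silently wins. Worth a quick check before trusting a
freshly installed tool (a sketch, reusing the uc name from the story):

      type -a uc       # every uc on PATH, in the order the shell would consider them
      hash -r          # clear bash's cached command locations after installing something new
      command -V uc    # shows which uc will actually run now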

------
synicalx
First week as a network engineer, still on probation. I was given the simple
task of provisioning a new VLAN onto a few switches, one of which was a fairly
large and important aggregation switch.

Everything was going OK, then I get to the big switch and move over to one of
its port channels and start adding the VLAN there:

    
    
      switchport trunk allowed vlan 123
    

Shortly after, my telnet session dropped. How weird, I thought. I tried
reconnecting, no dice. Tried pinging it, nothing.

Then I hear a loud "What the fuck?" from the other side of room, and I look up
to see about 30 bright red alerts on our board and a huge flood of red from
our GLTail monitoring board showing a very large number of PPPoE session
ending suddenly.

I'd missed the word "add" in my command and had wiped the other 200+ vlan's
from that interface which had in turn killed the Internet, Phone, and IPTV of
about 30,000 customers.

After restarting the switch to get it back to its startup config, I returned
to my desk to find the golden pineapple already displayed prominently there. I
also had to wear a cowboy hat for the next 10 changes I made in production.

I'd say lesson learned, but then my co-worker did the exact same thing 2
months later while drunk and on call so I don't think we really learned
anything there.

~~~
amingilani
What's the "golden pineapple"?

~~~
synicalx
Literally a golden pineapple, it's a gold trophy in the shape of a pineapple
awarded to those who stuff something up. Good for a laugh, and also a good
reminder to pay attention and try to screw up less.

------
Intermernet
Many years ago I was asked to image a new Samba server as the old one was
throwing random errors due to age.

I waited until everyone else had left the building, grabbed the disk out of
the old server, stuck it into the shiny new server and proceeded to dd the old
disk to the new disk.

Except I got the devices around the wrong way (/dev/sda, /dev/sdb) and
proceeded to copy the contents of a blank hard drive over the top of the old
server's drive. Didn't notice until the process had finished...

I then discovered the benefit of DR plans the hard way (backups are useless
unless you test a restore).

Long story short, I managed to recover most of the files using a variety of
disk recovery tools, but I was still in the office the next morning when other
people started arriving and began to ask me why, for example, the payroll
application couldn't find its database. I spent the next few days in panicked
forensics mode until the company was operating to everyone's satisfaction.

When I left that company years later I had implemented many redundant layers
of backups, proper DR plans that I religiously followed, and developed a
meticulous habit of testing any commands that needed to be run on any
production server.

~~~
shakna
> dd ... got the devices around the wrong way

dd always makes me nervous as hell when I need to use it. I usually end up
checking four or more times. Still got it wrong a few times.

Nothing like having to recover data with forensics to make you build a
fantastic backup system with great redundancy.
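
The habit that makes dd a little less nerve-wracking: list the disks with
something identifying before touching them, and make the direction explicit (a
sketch with hypothetical device names):

      lsblk -o NAME,SIZE,MODEL,SERIAL,MOUNTPOINT     # confirm which disk is which first

      # spell out source and destination, and watch progress so a wrong guess shows up early
      dd if=/dev/sda of=/dev/sdb bs=4M status=progress conv=fsync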

------
flurdy
Been there, done that. I managed to lose 30 days of billing data without a
working backup.

At a 7 man music streaming startup 10+ years ago, there was an issue with our
production server. The application was working, but the reporting tool on
another server was no longer getting the daily copy of the Firebird database
from the live server.

The database server had run out of disk space, so it was no longer able to
make backups to transfer. So I stopped the apps, stopped the database, cleared
out a lot of old logs and backups that were no longer needed, and brought
everything back up again.

And then I swore, as I am sure YP did at Gitlab when he realised what had just
happened.

The database had started up using, as you would expect, its last persisted
state. In this case, that was its state from 30 days earlier, when it had run
out of disk space. :( Firebird had been happily running in memory since then,
just not able to persist any changes. And since the backup procedure was to
export the database to a local disk file and then scp it to other nodes, it
had been happily transferring 0-byte files for weeks. :/

Had I cleared up some disk space and then exported the database before I shut
it down, there would have been no problem.
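
A guard as small as this on the export step would have flagged the problem
weeks earlier: refuse to ship a dump that comes out empty (a sketch of the
idea, with made-up paths; the export command itself is whatever your database
uses):

      dump=/backups/billing-$(date +%F).fbk     # hypothetical path
      # ... export the database to "$dump" here ...

      if [ ! -s "$dump" ]; then                 # -s: file exists and is non-empty
          echo "backup $dump is empty, refusing to ship it" >&2
          exit 1
      fi

      scp "$dump" backuphost:/backups/          # only reached when the dump has content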

As I realised the severity of this I quickly got hold of our CEO to say I
totally fucked up. We then worked together to piece together the missing data
from access logs, 3rd party purchase records, and other reports and sources
that he had available. We managed to rebuild most of the missing data though
there were some gaps in the last 7 days. Not recovering 30 days at all may
have killed our tiny company.

We learnt from the mistake and I worked there happily for another year before
the company got bought up. Naturally the next project I did was to write a
decent alerting system (I won't go into the 300 duplicate text messages I
received from it during one night whilst on holiday in southern France).

I have made many mistakes since, just never the same mistake. And with the
years I take better and better precautions, scale horizontally, test backup
restores etc. But mistakes still happen, just don't panic, and don't try to be
the midnight oil hero :)

~~~
shermanyo
> I have made many mistakes since, just never the same mistake. And with the
> years I take better and better precautions, scale horizontally, test backup
> restores etc. But mistakes still happen, just don't panic, and don't try to
> be the midnight oil hero :)

Perfectly put. Thanks for sharing.

------
nicostouch
I was browsing through some web services code I had written a few months
prior, doing a bug fix, when I noticed an if statement with a boolean
condition that would be easier to read if it were the other way around. I
modified the condition to improve readability but in doing so actually flipped
the logic. Luckily QA caught it, otherwise it would have broken customer
sign-up through the web portal for a number of clients. Not a great mistake to
make, but it taught me a great lesson - never ever ever refactor something
without the proper tests in place first, it's not worth the risk.

------
davidgerard
My stories to put interviewees at ease:

* (2010) When you're asked to restore last Sunday's backup to the dev CMS, make sure you're actually on the dev instance, and not, say, on the live instance. That literally every editorial person in the company uses. The day before deadline. (I got to restore 36 two-hourly incremental backups in sequence by hand. We lost only a couple of hours' work. But we verified our backups work!!)

* (2005) Never trust a UPS manual. Ever. Particularly, when it says that the "bypass" switch works smoothly, rather than, _e.g._ , glitching the power and taking down all 75+ Windows PCs in the computer room. (The Sun boxes were of course unaffected.) Recovering the Windows network took most of the morning; the NT admins were _less than impressed_. And I was a contractor too. Fortunately working under the direction of the in-house admin.

The important thing being, of course, to recover and learn from the experience
:-)

~~~
shermanyo
thanks for sharing :) it's great when an 'unscheduled verification of backups'
goes well ;)

~~~
davidgerard
You have just gifted me a wonderful new phrase. "Verification of the backups
... er, _unscheduled_ verification of the backups."

------
oompahloompah
I was working support at a VPS provider for my first real-world tech job fresh
out of college and a customer was having issues with their system not booting
correctly. They were smart enough to use our integrated backups service so I
told them that they could delete their current disks and restore from backup.
So they did...

Or at least they tried to.

The backups system was incredibly wobbly at the time and would corrupt its
archives pretty frequently which is exactly what happened. They lost
everything on that server.

Did I mention that was their sole server and they had no other backups?

It turned out that they were a company providing services to a government
entity and had some pretty strict record-keeping requirements which they
relied on our service to fulfill.

I was freaking out thinking I was going to be fired after being there for less
than a year but everything was resolved fairly well (somehow).

I learned to never trust backups and the rule of thumb "two is one, one is
none" as it applies to them.

------
sofaofthedamned
I previously put this up at /r/sysadmin but here goes again:

I was a programmer in my first IT job in 1992 for a large retailer in the UK.
I was working on some stock related code for the branches, of which they had
thousands. They sold a lot of local goods like books which were only sold in a
couple of stores each - think autobiographies of local politicians, local
charity calendars, that sort of thing.

Problem with a lot of these items was that they were not on the central
database. This caused a problem with books especially as you don't pay VAT on
books, but if you can't identify the book then the company had to pay it. This
makes sense because some books or magazines you DID pay VAT on, because they
came with other stuff - think computer magazines with a CD on the front. So my
code looked at different databases and historical info to work out the actual
VAT portion payable, which was usually nil.

I wrote the code (COBOL, kill me now), the testers tested it, and all went OK
until they deployed it, on a Friday night. The first I knew of it was coming in
Monday morning. All the ops had been working throughout the weekend as the
entire stock status for each branch had been wiped. They had to pull a
previous week's backup from storage; this didn't work as they didn't have the
space for both copies to merge, so IBM had to motorcycle courier some hardware
from Amsterdam, etc etc. As this was an IBM mainframe with batch jobs we also
had to stop subsequent jobs in case it made the fuckup worse, so none of the
stock/finance stuff could run at all.

The branches were royally fucked on Monday as, without any stock status to
know what to order, they got nothing - no newspapers, books, anything. We even
made it to the Daily Mail. I think it took at least 3 weeks before ordering
was automatic again. Cost the company literally millions in overtime, not
being able to sell stuff, consultants and reputational damage - it was big
news in the national newspapers.

The root cause? I processed data on a run per-branch. I'd copy the branch data
to a separate area, delete the main data, then stream it back. My SQL, however,
deleted the main data for ALL branches. It didn't get picked up in QA as, like
me, they only tested with a single branch dataset at a time.

I literally spent the week in a daze drinking hard, thinking my career was
over. My boss saved my career and me by being absolutely stellar about it.
Wherever you are Mike Addis, I thank you!

------
sokoloff
I've lost my personal home dir twice in my life:

In 1993, I tar.gz'd it as I was leaving college and ftp'd that file NOT in binary
mode; didn't discover it until too late.

In 1995, I blew away the mount point for my NFS server with all my home dir and
data but had left the server mounted (and was running as root, no root squash,
etc)

At work, training a new operator, I had them run the script that shut down all
web servers rather than regenerating the CMS caches on them. As the alerts
rolled in, I reassured him that we'd done the right thing. Many minutes later,
we looked at the logs and saw "webservers-shutdown-all" instead of
"webservers-regen-all"

~~~
steventhedev
I managed to wipe mine by creating a directory called "~" in a REPL and then
trying to clean up a few days later by running rm -rf ~. Hit Ctrl-C, but it
still managed to chew through most of the dotfiles, and was halfway through a
few checkouts of AOSP before I stopped it.
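
For anyone who ends up with a literal "~" directory: quoting it, or prefixing
./, stops the shell from expanding it to your home directory (a quick sketch):

      rm -rf './~'     # removes the directory literally named ~ in the current dir
      rm -rf ./~       # same thing; tilde expansion doesn't apply after ./
      rm -rf ~         # expands to your home directory -- this is the one that hurts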

~~~
sokoloff
Same "Hmm, that's taking longer than it should..." sinking feeling.

------
davman
I fdisk'd the LVM partition that was used as iSCSI storage for 200+ virtual
machines.

~~~
shermanyo
ouch!

------
rosser
Mine is fairly simple: an UPDATE without a WHERE clause on a table tracking
Other People's Money. Except that was how we discovered that backups weren't
working...
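
One standard guard against that one: run the statement inside a transaction
that rolls back, purely to see the reported row count, before running it for
real. A minimal sketch driven from the shell (database, table and column names
are all made up):

      psql mydb <<'SQL'
      BEGIN;
      UPDATE ledger SET amount = 0 WHERE account_id = 42;  -- psql prints "UPDATE <n>" here
      ROLLBACK;        -- rerun with COMMIT only once <n> matches what you expected
      SQL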

Really posting to share one that happened to me, though:

At a previous job, my PostgreSQL clusters were on metal (blades; not my
choice, but I wasn't given one). We were in the process of replacing the SSDs
in the blades, both for capacity and performance. We'd, if I remember
correctly, replaced one set (the replica, I think, so we could fail over to it
and then upgrade the master).

The lead sysadmin decided this represented a golden opportunity to get a side-
by-side performance analysis of the new and old drives. (I'll leave aside for
the moment the fact that he never said anything about doing this to me, which,
you know...)

So he ran fio. In read-write mode. Against the block device, _not_ the
filesystem.

And he did that on the primary, replica, and performance test (so, identical
to prod) machines at the same time.

About 20 minutes later, one of the developers, who was investigating some
other issue, reported seeing "strange errors" in his psql session. I looked at
the logs, and everything got very, very still for a moment...

We ended up having to rebuild the machines and restore from the previous
night's backups (taken 14h prior), troll the Rails logs to find the affected
orders, and refund them. I did finally get the large box for WAL archiving I'd
long been lobbying for out of the deal, too, so that was nice.

------
tluyben2
In 1999 I ran one of those rm -fR thingies with an unset var on a client
production system. Systems were dog slow in those days, but I only noticed
when the client called to say that his ERP was down. This was one of my
country's most successful car rental companies and everything went through
there. Of course (...) the backups were broken and we did not use CVS yet. The
client, a very nice man, said 'well, that is unfortunate' and that was all. We
restored a very old backup and copied the source files from my dev system to
it. After that we ran a mirror in our office (nightly db copies via ssh), used
CVS and did weekly backup tests. Yikes.

Another one which was less my fault, but which I did blame myself for, was
dropping a server with 200,000 web sites on it because we had to move
datacenters and it was Xmas eve and very, very slippery with ice. We slid and
the server fell, which wrecked the (hardware) RAID disks. This one had working
tape backups, so there were a few hours of downtime, which was going to happen
anyway as we were moving.

Now that I am writing this anyway: the most traumatic was in the mid-80s with
my second computer, when I was 10. I had one disk(!); they were expensive and I
did not get a lot of pocket money. I was learning assembly after Basic became
too slow for what I wanted. I was building a game (a Chuckie Egg rip-off) and,
after a long time not saving, I ran the game and it worked. I was happy and
saved the game on the one disk with all the software I had written, with the
save command. When I pressed enter I remembered, and I remember this very
vividly, that I had used the disk Basic RAM space because I was running out of
memory. The disk started spinning, the computer rebooted and ... the files
(dir) command afterwards gave a disk I/O error... The misery.

Edit: ugh. Just remembered a 1984 one; my father brought home a modem for the
MSX-1 and those things were, for my notion of money, kind of bare gold, price-
wise. It was 100s of guilders. But it could only do Viditel. Which sucked. I
wanted BBS access and that required shoving the thing into the MSX after Basic
was already booted. The MSX cartridge ports are connected intimately to vital
computer parts. So shoving it in crooked had my best friend looking at a
purple screen; after that I decided it would be better to solder in a switch.
I had done that before with stuff I found by the road. I had to cut an IC pin
to do it, which I had done quite often; this time I cut it and it flew off...
Eventually I was forgiven.

------
tobltobs
I once ran a dropped-domain catcher. One evening I was fixing a problem when
the "Dinner is ready" call came from downstairs. Wanting to save some time, I
started a test run and went to dinner. Coming back from dinner I realized that
I had introduced two small errors. The if/else deciding whether the script was
in dev mode was wrong, and the calculation of whether a dropped domain was
worth buying was buggy and resulted in a "yes, buy" every time. I didn't have
a credit card registered and only had a few hundred dollars in my prepaid
account at the registry, which saved my ass.

------
Meltdown
Ran an ALTER TABLE that changed a column type from money to int, causing the
db to round the values in the column and dropping the cent values from all
line items -- had to restore the values from backup.

~~~
gargravarr
I remember hearing a story on my internship that a previous admin (departed
before I joined) did something similar, but intentionally.

The company printed labels using fully computerised process machines. Labels
for things like medicine bottles etc. So obviously they need a lot of
precision.

Now, the company had been an independent UK firm for many years, but had
recently been acquired by a US company doing similar things.

The admin (of all trades, sysadmin, DBA, etc.) was going through the DB and
noticed how the label dimensions were stored to something silly like 6 decimal
places. In the UK, we were using millimeters. That kind of precision was never
needed nor used, so the admin cut the column down, dropping the decimals.

Except he wasn't on the UK server. He was on the US server.

And since the US used inches, even 0.1 of an inch is quite a considerable
difference for printing a small label!!

He went home for the weekend, the Americans came in and outright panicked when
their process machines had useless data. My supervisor got phoned up late at
night and had to remote in and restore the values from backups.

Now, imagine how angry he was when the admin did exactly the same thing the
following Monday!!

Hooray for the metric system :D

------
malux85
A "reboot" in the wrong terminal window took the DB primary offline, when I
meant to reboot a local VirtualBox instance ... that sinking feeling when you
see the vbox instance is still up :/

~~~
gargravarr
One of the first things I install now is molly-guard...
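
For the unfamiliar: molly-guard wraps shutdown/reboot and, on an SSH session,
makes you type the machine's hostname before it will go ahead. Even without the
package, a tiny wrapper in your shell profile gets you most of the way there (a
rough sketch, not the real molly-guard):

      # in ~/.bashrc: make "reboot" ask which machine you think you're on first
      reboot() {
          local answer
          read -r -p "Type this host's name to confirm reboot: " answer
          if [ "$answer" = "$(hostname)" ]; then
              command reboot
          else
              echo "hostname mismatch, not rebooting" >&2
          fi
      }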

------
_ah
Remember when e-cards were the hot new internet thing? I was an intern working
on such a system and, while debugging an issue, ran the dreaded "DELETE FROM
Cards" without that all-important WHERE. Lost the whole table, 10,000s of
cards.
cards.

Of course I was operating directly on production data (dev server? what dev
server?). Of course there were no backups.

Since this was a website catering mainly toward children, I threw up a quick
error page on the website explaining away the problem and inviting users to
create a new card. Nobody ever complained...

------
bryanrasmussen
I had a small Sinatra server I had written to get data from Oracle and place
it into a Resque queue for some machine learning categorization to run.

It was small, but there was code that could be made more DRY by moving it out
into a function, so I did that and called that function whenever I needed to
get new data.

Unfortunately I forgot to close the connections that I was opening with my
refactored function.

Checked it was working, time for the weekend.

At the same company, one time our tech lead got some code from the main Java
dev in Argentina that was supposed to be an improved way of handling customer
alerts. He copy-pasted it and put it into the alerts class without even
thinking about the code. 2 months later there was a major issue being raised
about why nobody was getting any alerts! The tech lead worked from home on
this important issue. There were meetings.

Anyway, I was the one who found the issue, because I thought: let's check the
repository. Hey, the tech lead touched some Java about 2 months ago, which he
shouldn't be doing! Let's look at the code.

It was just obvious that the code would never do what it was supposed to do,
even to me, and I don't do Java.

This is not to say the tech lead was a bad developer, far from it. He probably
just lacked some of the deep paranoia that would prevent me from ever copy
pasting in some code from a trusted colleague without reading it over a couple
times.

------
gvurrdon
A few years ago I worked on a research project at a university where we
generated quite a bit of data, and we would occasionally (every few months)
run an archiving script which would copy data from one MySQL table to another
and then truncate the source table. I once ran this without remembering that
the columns in the source table had been changed a few weeks previously and
the destination table and script had not been updated accordingly. Result: The
insert into the new table failed but the old one was truncated anyway.

This shouldn't have been too much of a problem because a full dump of the
MySQL database was made every night, copied offsite with scp, and kept for a
week. But, when the time came to restore the affected table from one of these
dumps I found that every dump for the last week had timed out silently and was
incomplete.

The result was the best part of a week trying to reconstruct the missing parts
of the table from what could be found in the dumps and to restore it
successfully. I got most of it back, but we couldn't do much work for that
period.
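
A cheap way to catch that failure mode is to check both the dump's exit status
and its trailer before calling the night's backup good; mysqldump only appends
its "Dump completed" comment when it actually reaches the end (a sketch with
hypothetical names):

      dump=/backups/research-$(date +%F).sql     # hypothetical path

      if ! mysqldump --single-transaction researchdb > "$dump"; then
          echo "mysqldump failed, not shipping $dump" >&2
          exit 1
      fi

      # a finished dump ends with "-- Dump completed on ..."; a truncated one won't
      if ! tail -n 1 "$dump" | grep -q 'Dump completed'; then
          echo "$dump looks truncated, not shipping it" >&2
          exit 1
      fi

      scp "$dump" offsite:/backups/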

------
contingencies
Nice idea for a thread. It's never one person's fault.

At the very first job I ever had, with literally hours on the job, my employer
somehow put me (at 17, with zero routing experience) in charge of a major
change to the primary yet supposedly redundant ISDN routes at a small ISP.
Needless to say there were some unhappy customers the next day or two as
incorrectly advertised routes drew traffic down the wrong pipe; we figured
that out, updated correctly and propagated. Luckily most customers were
schools, internet was not yet a critically expected utility, and we buggered
things up on a Saturday, so the majority of the customer flak was only on
Monday morning. I can't remember what the routing protocol was in those days,
but it was probably something early like RIPv1. Meh. Certainly a learning
experience I felt bad about, but in hindsight totally not my fault!

~~~
shermanyo
I've heard some horror stories from mates working in data centers, where a
simple route or firewall change knocks out a huge number of customers :/

------
Boldewyn
Looking for a specific line in the production server's /etc/passwd with this
command:

sed -i -n '/foobar/p' /etc/passwd

trimmed the file to a single line's length. D'oh! Shouldn't have added that
`-i`, should not have done that.

Luckily I could restore it from a (working) backup while still logged in.
Phew!
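
For the record, the -i is the whole problem there: -n suppresses normal output
and p prints only the matching lines, so with in-place editing the file gets
rewritten as just those matches. Dropping the -i (or reaching for grep) answers
the same question harmlessly:

      sed -n '/foobar/p' /etc/passwd    # prints matching lines to the terminal, file untouched
      grep foobar /etc/passwd           # same result, harder to make destructive by accident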

------
JensRantil
It was daytime. One of our MySQL tables was growing at a rate higher than
usual. Easiest way to check the size of the table? Duh, you `cd` into
/var/lib/mysql on the database primary to `du -sh` the InnoDB table
file...`SHOW TABLE STATUS` requires some arithmetic for true file size.

To this day I have no idea why I typed `rm -fr /var/lib/mysql` instead of
`cd`. I blame muscle memory. Good news is MySQL pools its open files. This
means the database was operational (but slowly failing to open less-used
tables) for 15 minutes or so.

We quickly promoted a database slave to primary and took the old one down for
backup restore. I went for a teary walk. Shaken.

At the postmortem we concluded we needed more metrics from MySQL to avoid
SSHing into servers. We also concluded how error prone one-off commands are.
Personally, I'm much more careful nowadays.
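
For anyone tempted to make the same trip into /var/lib/mysql: table sizes are
available over an ordinary client connection from information_schema, so
there's no need to shell into the primary at all. A sketch (the schema name is
made up):

      mysql -e "
        SELECT table_name,
               ROUND((data_length + index_length) / 1024 / 1024, 1) AS size_mb
        FROM   information_schema.tables
        WHERE  table_schema = 'mydb'
        ORDER  BY size_mb DESC;"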

------
TranceMan
Using a bash script I wrote, running as root, calling rsync to back up to
/mnt/backup/blah, which was supposed to be an NFS mount.

At some point [turned out to be 5 months prior to attempting the restore] the
NFS share became unmounted and I was backing up to a dir on the same drive
under /mnt.

Yes the hard drive had died.
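
The classic guard for that one: refuse to run unless the target really is a
mount point (a sketch; the rsync source is made up):

      # abort if /mnt/backup is just a plain directory rather than a mounted filesystem
      mountpoint -q /mnt/backup || { echo "/mnt/backup is not mounted" >&2; exit 1; }

      rsync -a /data/ /mnt/backup/blah/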

------
subutux
I worked at a small local computer shop in Belgium. We ran a local NFS/SMB
server storing all sorts of different ISOs & applications for archiving. All
this ran on an md-raid 10.
At one point I got a notification that one of the 4 disks was failing. I
tried to determine which disk it was by running dd if=/dev/sda of=/dev/null
for each disk and looking at the "activity" light on the chassis. I spotted an
active LED on one disk, executed the mdadm command to remove that disk,
replaced it with a brand new one, and started the mdadm resync.

Turns out I should have disabled networking, because someone was accessing
the datastore & I removed the wrong drive. Lost all of it & 14 hours of my
life, trying to recover it with testdisk.
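
For what it's worth, mdadm will report the state of each member, and the
drive's serial number is a safer way to pick the right physical disk than an
activity light (a sketch with hypothetical device names):

      cat /proc/mdstat          # members md has marked faulty show up with (F)
      mdadm --detail /dev/md0   # lists every member device and its state
      smartctl -i /dev/sdb      # check the serial number before pulling the physical disk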

------
boyter
Made a mistake with a CloudFormation script which promptly removed the svn
server's drives. The worst part was watching it fail the first time, then
retrying, and suddenly the drive is gone.

Thankfully I had been involved in a migration of the svn server between
regions a few months back and in paranoia had tested the hell out of the
backup and restore process, and most repositories had been migrated to git.
Still, it did stop my heart for about 20 seconds when I realised my
mistake.

------
GrumpyNl
Working with VRS and hard disks of 20MB (that was top of the line in those
days). With the disk came a program called cd.exe and we thought it was for
CheckDisk; it wasn't. It was CleanDisk or ClearDisk. A quick format was done.
Spent 24 hours manually restoring all the files on the disk.

------
ddmf
When I was young my dad had an early Compaq laptop - green cathode ray screen,
looked like the case for a sewing machine.

Tried to use some commands, but didn't know DOS - format c: of course doesn't
show the directory in a specific format...

------
clusmore
See also:
https://news.ycombinator.com/item?id=13543872

------
tzs
1. We had servers hosted at hosting company X. X decided to leave the hosting
business, so we had to move. We leased servers at hosting company Y, and set
up mirrors of our X databases at Y, with them replicating from X.

Then we started moving non-database things to Y. Each thing we would move
would first still point to the databases on X, then would be updated so that
it is reading from the mirror at Y but still writing to X. Once everything was
working that way, we changed the configuration so we could write to the
databases on either X or Y, with two way replication between the two, and then
started changing things at Y to write to the Y databases. When that was done
we had everything running on Y, but with the two way replication between X and
Y still going so that if we found a serious problem we could revert the
problem application back to X.

We then ran this way for a while until we were happy with the results. The
powers that be then authorized shutting down everything at X. Since we had
sensitive data at X (such as on-file credit cards), we needed to make sure we
deleted everything before relinquishing the servers. Deleting the data was my
task. I got the go-ahead to start, and fired up my scripts.

A little while later a high level manager comes running to my office yelling
"stop the wipe!".

It turned out that the IT folks had not turned off the two way replication
before turning the servers over to me for wiping. My scripts included deleting
all the data from within the database, so that got replicated over to the live
servers and wiped out everything in our customer database. Oops!

We were able to reconstruct it from the latest daily backup plus the
replication logs, but it did lead to a few hours outage in the middle of the
day.

(This wasn't actually my stuff up, as it was IT that omitted the "stop
replication" step, but I had seen their migration checklist and should have
caught that they did not have such a step).

2. In my college days, working part time as a sysadmin at Caltech High Energy
Physics, I wrote a batch processing system for the physicists. (I don't recall
why they wanted some new system instead of just using cron or at, but they
did). There were two programs in my system, "batman" (BATch MANager) for the
physicists to use to manage their jobs, and a daemon "robin" (Run Overnight
Batch INput) to actually do the work. There was a configuration file,
/etc/batman.

While working on batman and robin, I had many occasions to replace the
configuration file, or to delete it. In other words, I had many occasions to
type, as root, "cp new-config /etc/batman" or "rm /etc/batman".

Out of habit I had a tendency to automatically type "passwd" whenever I typed
"/etc/". So one day I ended up typing "rm /etc/passwd" instead of "rm
/etc/batman". Oops. I contacted the other system administrator, the one who
handled backups, and he came in and restored /etc/passwd.

Then I did it again! He restored it again, and this time also made a hard link
/etc/safe-from-tzs to /etc/passwd, so that if I did it again we could just
relink instead of restoring from backups. This was quite embarrassing.

That worked fine, until instead of it being an rm command that I did the
batman/passwd switch on, it was a cp command. That led to another safe-from-
tzs that was a copy, not a link. Even more embarrassing.

(I know that there are a couple more incidents I left out, but I don't
remember the timing. One of them I was able to recover without bothering the
other sysadmin, because it was an early Sunday morning and I was sure no one
else was on and no batch processes were running, and I was able to cut power
to the VAX before the periodic cache flush committed my error to disk).

3. This one wasn't mine, but I had to figure out what happened and fix it.
The place I worked sold a subscription product. One day the re-billing somehow
ended up getting big runs of off-by-ones on the credit cards. What I mean by
that is instead of charging account A1's credit card, C1, and charging account
A2's credit card, C2, and so on, we ended up putting A1's charge on C2, A2's
charge on C3, and so on.

Fortunately we only had a small number of subscription plans, and one was by
far the most popular. That resulted in most of these shifted charges still
putting the right amount on most cards, so only a few transactions had to be
voided and redone due to the wrong amount. We got very lucky there.

I was given the task of figuring out what had gone wrong, since the billing
process and databases all ran on Unix systems and I had by far the most Unix
experience (I'd been a Unix kernel hacker at Interactive System Corporation
and at Callan Data Systems in prior lives). I fairly quickly found the
problem. The billing process was forking off children so that it could process
several accounts in parallel. That was fine, except that it was opening its
database connections in the parent process before the fork. Different children
ended up sharing a single database connection, and so one child could end up
reading results that were from another child's query. I restructured it a bit
so that it didn't need to use the database until after the fork, and moved the
database opening into the children and all was well.

~~~
shermanyo
oh wow, number 2 was amazing. I know the exact feeling.

That (well deserved) condescending "just-for-you" protection, then the
sheepish admission that "I did it again anyway, I'm so sorry" :P

------
gargravarr
Got another long story. This one wasn't me personally, but affected the team I
was in at the time. The responsible party has moved on (left for another
company).

Background: I work for a company making business expense software - yes, it
really is as dull as it sounds. I used to be a SQL dev, now a DevOps engineer.
At the time, I was still a dev. Our software does all manner of expense
tracking, including requests and procurement. As a result, the customer's
individual site is usually seeded with a dump of their HR data. The team I
worked for managed the importers to deal with all sorts of weird home-grown HR
files.

One thing the company did right was building a very powerful generic import
engine - for something built in-house, it's very flexible, easy to use and
works with just about every format I've ever seen. We'd define the rules for
transforming the data to our systems and let the import engine chew through
it. Obviously this sort of thing lends itself to automation, and we built
exactly that. The system defined clients, file drop locations and the import
rules to use, then would check regularly for any files to import.

One of our clients spotted a bug with their HR importer, so that got switched
off in the config table while we worked through it, but the client continued
to send us regular HR dumps via SFTP. Not a problem.

Fast forward a couple of weeks. We have a new starter who's being trained on
the stuff our team manages. The person doing the training, himself the
previous most recent joiner to the team (something of a tradition, and hasn't
resulted in too much Chinese Whispers) demonstrated the automation engine by
creating a new job. In doing so, he manages to commit the dreaded forgotten-
WHERE-clause boner. He accidentally switches every single automation job ON.

By coincidence, the aforementioned client is alphabetically at the top of the
list. And the bug with their importer is unfixed. Nonetheless, the importer
starts up, spots 2 weeks' worth of HR dumps for this client, and gets to work.

6 hours later, I spot that clients further down the alphabet haven't had their
automated runs start. At the time, the import engine ran single-threaded; I'd
complained about it as inefficient, but it's probably this fact that let us
realise what was going on. These are big HR dump files, so the importer is
taking upwards of an hour per file to process them, but it's not moved past
this job. We as a team piece together what's going on behind the scenes.
Unfortunately, the client had been using the site at the time, and had run
through some very large/expensive procurements, of which we as a company take
a cut (our business model).

The bug in the HR importer? It was incorrectly closing people's accounts and
trashing their records. So while the client had been merrily ordering
expensive stuff on an unaffected account, the importer was inadvertently
corrupting the rest.

Now, we do have very thorough backups in place - daily full backups and
15-minute incrementals for production. And they work. Unfortunately, this is
where the CTO stepped in. Rather than just roll back the site to the point six
hours ago before the import ran, he directed that we fix the HR data instead.
The reasoning being, the requests linked out to third parties and couldn't be
copied - new requests would have to be made, stopping us replaying the valid
data in those 6 hours. Personally I don't know how accurate these claims are,
as I wasn't directly involved in what happened next, but the net result was
that the guy who committed the booboo, another of my colleagues and my
immediate boss were given orders to fix the HR data, writing scripts by hand.

It took 3 developers one and a half days to write the scripts to massage the
HR data back together. I have to wonder if our cut of this client's bill
actually exceeds the daily cost of 3 devs. And there was much frustration
along the way, when they started using an internal-only dev site to test the
scripts on, only for one of the support personnel to unknowingly overwrite it
while troubleshooting an unrelated issue, nuking all their work for half a
day. It was a total farce, done purely to save face with the client. Rather
than rolling back to a known-good state, we spent a lot of time and effort
trying to take a known-bad state back to a might-be-good one. Idiocy.

And of course as a result I have to sit through meetings where my boss
describes how we can't let developers have direct access to tables in case
this happens again, so there's now a suite of stored procedures managing all
the config and triggers on the table to stop any other queries. There were no
repercussions for the guy who made the mistake because it was a genuine
mistake, but it strikes me as a typical knee-jerk overreaction.

It did affect me indirectly, because for that day and a half, I was the only
one able to work tickets!!

------
wantoncl
Kinda long setup but background is needed...

We're a Microsoft shop, .Net application with SQL Server RDBMS on Azure
Virtual Machines. We use Availability Groups (AG) for failover/redundancy and
read-only copies. AG uses Windows clustering, and the recommended MS
configuration for clustered Azure VMs is a storage pool of the attached VHDs,
with OS virtual disks on top of that. It looks and acts the same as physical
hardware. One thing I didn't fully appreciate is that the Windows cluster
still aggregates all the storage across nodes even if they aren't cluster
resources. Remember this.

About 8-9 months ago we wanted to migrate our databases to a new VM config
without incurring any downtime. Standard method with AG is to add a new node
to the cluster, include it in the AG config, and manually synchronize the
databases to that node. Once done we fail over to the new nodes and evict the
old ones. We did several tests and all went well, so we scheduled the
production migration for 9:00 pm that night.

We added the new nodes to prod during the day and made a small change in
drive letters (originally R:, S: and T:; the new nodes were SSD based, so we
consolidated it all on Z:). Turns out that while AG will replicate with
database files on
different drive letters, it won't fail over unless the drive layout is
identical. While this was technically a mistake it saved my ass.

We had all the drive letters on the new nodes and decided to prep early for
the migration. I open our PowerShell script to manage the storage pool and
virtual disks. I run Get-VirtualDisk and see all the volumes listed, but each
name shows 3 times (we had 3 nodes). I've noticed this before but didn't
really grasp the implication, because all the volumes had the same names on
each node. I decided to remove the old drive letters from the new node and run
Remove-VirtualDisk for each of them (local backup drive first, then
transaction log drive, then data file drive). Literally 5 seconds later:

"Hey ___, the application isn't responding. Is there something going on with
the database?"

Turns out, there was. Since the volumes were all the same name, and it was a
Windows cluster, drives on all nodes were dropped.

Fuck.

As this was a storage pool with striped disks, rebuilding it was going to be
tricky at best, and if it could be done at all, would need manual work and a
lot of time. And no one had any experience doing that. And this was during
peak business hours.

Fuck fuck fuck fuck...

You notice the sequence I dropped drives? All local backups were gone. Our
cluster disk redundancy consisted of copying to every node in the cluster. Our
Azure cloud backups were 12-24 hours old, and would take minimum 4-6 hours to
restore.

Fuck ^ Graham's number. (yes, all fucks were spoken out loud during this
episode)

Since MS recommends not keeping the system databases on the C: drive, we had
moved them to one of the drives that no longer existed, so SQL Server wouldn't
even start on the primary node. I'm actually sweating at this point and
thinking I'm going to get fired. In any case we get the word out to our
support folks and they notify customers.

Fortunately that new Z: drive with all the database files on it was still
intact. After the fastest round of Googling ever, I ended up having to evict
the dead nodes from the cluster and force a failover (the option for
this is helpfully and deliberately named FORCE_FAILOVER_ALLOW_DATA_LOSS).
First time I ever had to do this BTW. (have done many planned failovers
though)

Everything came back up, and from what we could tell no actual data was lost,
any in-flight transactions were rolled back when the disks went away. In
effect we performed our migration in the middle of the day instead of that
night, and were down for 23 minutes. (We had planned on 1-2 hours)*

Lessons learned and/or reinforced:

0. You CAN recover from catastrophic data loss. Don't panic. Or, panic in a
controlled and practiced fashion.

1. Understand your product's features and implement them properly.
Fortunately SQL Server's features are designed to preserve data as best it
can, even if you fuck up.

2. You cannot have too much redundancy. We now copy our backup files to
independent local storage, and have a disaster recovery site in another data
center, log shipped on 15 minute intervals. (Sadly we have yet to test failing
over to it)

2a. All our production database changes now require additional backups
immediately prior. (You should also verify that they can be restored)

3. On a Windows cluster, ensure every disk volume has a unique name that
includes the node in it.

4. Practice disaster recovery. And that means true disaster, unexpected
issues. Look at these and other threads for examples, and ACTUALLY DO THEM.

5. Practice all production changes multiple times in a separate environment.
WRITE DOWN the procedures and DO NOT DEVIATE FROM THEM when actually changing
production.

Thanks for reading this long thing.

* Our customers were actually happy after the migration because the SSD performance was much faster than the original disk setup.

~~~
gargravarr
> Don't panic. Or, panic in a controlled and practiced fashion.

Douglas Adams Seal of Approval.

