
An administrator accidentally deleted the production database - 3stripe
http://support.gliffy.com/entries/98911057--Gliffy-Online-System-Outage
======
arethuza
My very first job - ~25 years ago.

Destroyed the production payroll database for a customer with a bug in a shell
script.

No problem - they had 3 backup tapes.

First tape - read fails.

Second tape - read fails.

Third tape - worked.... (very nervous at this point).

I think most people have an equivalent educational experience at some point in
their careers.

Edit: Had a project cancelled for one customer because they lost the database
of test results..... 4 months work! Their COO (quite a large company) actually
apologised to me in person!

Edit: Also had someone from Oracle break a financial consolidation system for
a billion dollar company - his last words were "you need to restore from tape"
and then he disappeared. I was _not_ happy as it was his attempts at
"improving" things were the cause of the incident! Wouldn't have been angry if
he had admitted he had made a mistake and worked with us to fix it - simply
saying "restore from tape" and running away was not a good approach.

~~~
roflc0ptic
A coworker of mine used to say "It's not the backup, it's the restore."
Meaning your backup process isn't meaningful unless you have a tested and
effective means for recreating your system from that backup. It has stuck with
me.

~~~
slavik81
How do you test your restore without destroying the only trustworthy copy
of your data?

~~~
ziziyO
Do it in the Staging or User Acceptance Testing environment.

~~~
twic
This is the right answer! You want a staging/acceptance/mirror environment
that's the same as production, right? So make it with nightly restores of the
production backups. You get a crisp, fresh staging environment, and regular
validation of your backups too. Just remember to run full production
monitoring against your staging environment too.
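A minimal sketch of what that nightly job could look like (hostname, paths and
database name are all hypothetical):

    #!/bin/sh -e
    # Nightly: rebuild staging from the newest production backup.
    LATEST=$(ls -t /backups/prod/*.sql.gz | head -1)
    mysql -h staging-db -e "DROP DATABASE IF EXISTS app; CREATE DATABASE app;"
    gunzip -c "$LATEST" | mysql -h staging-db app
    # fail loudly if the restore produced an empty database
    TABLES=$(mysql -h staging-db -N -e "SHOW TABLES" app | wc -l)
    [ "$TABLES" -gt 0 ] || { echo "restore validation failed" >&2; exit 1; }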

------
steven2012
This is what happens when you don't have a disaster recovery plan, or if you
have one but never test it out. You need to test your disaster recovery plans
to actually know if things work. Database backups are notoriously unreliable,
especially ones that are as large as the one this post is talking about. Had
they known it would take 2-3 days to recover from a disaster I'm sure they
would have done something to mitigate this. This falls squarely on the
shoulders of the VP of Engineering and frankly it's unacceptable.

I worked at a company that was like this. My first question when I joined was,
"do we have a disaster recovery plan?" The VP of engineering did some hand
waving, saying that it would take about 8 hrs to restore and transfer the
data. But he also never tested it. Thankfully we never had a database problem
but had we encountered one we would have lost huge customers and probably
would have failed as a business.

I also worked at a company that specializes in disaster recovery, but our
global email went down after a power outage. The entire company was down for a
day. There were diesel generators, but they had never been tested, and when
the power outage occurred they didn't kick in.

Case in point: Test your damn disaster recovery plans!!!

~~~
sqldba
Speaking from firsthand experience, business doesn't care. The managers know
that a) no matter what happens it's likely that their jobs aren't on the line,
b) they already have stockholder money and so can just hand-wave any problems
away as "once-off and we've fired the staff responsible".

So what I get to see are DR plans that are obviously faulty, where they cannot
be tested for a reason as simple as not having 20TB of extra disk handy to do
a single failover.

"That's okay", the boss will say, "as long as we have it on paper."

Okay dude. As long as I have your comment in an email to protect myself. I'm
okay with being fired for something I warned everyone about, as long as I can
also show that to my next boss, to prove my common
sense^H^H^H^H^H^H^H^H^H^Hexpert advice gets overridden.

------
Smerity
I was testing disaster recovery for the database cluster I was managing. Spun
up new instances on AWS, pulled down production data, created various
disasters, tested recovery.

Surprisingly it all seemed to work well. These disaster recovery steps weren't
heavily tested before. Brilliant! I went to shut down the AWS instances. Kill
DB group. Wait. Wait... The DB group? Wasn't it DB-test group...

I'd just killed all the production databases. And the streaming replicas.
And... everything... All at the busiest time of day for our site.

Panic arose in my chest. Eyes glazed over. It's one thing to test disaster
recovery when it doesn't matter, but when it suddenly does matter... I turned
to the disaster recovery code I'd just been testing. I was reasonably sure it
all worked... Reasonably...

Less than five minutes later, I'd spun up a brand new database cluster. The
only loss was a minute or two of user transactions, which for our site wasn't
too problematic.

My friends joked later that at least we now knew for sure that disaster
recovery worked in production...

Lesson: When testing disaster recovery, ensure you're not actually creating a
disaster in production.

(repeating my old story from
[https://news.ycombinator.com/item?id=7147108](https://news.ycombinator.com/item?id=7147108))

------
Rezo
Treating app servers as cattle, i.e. if there's a problem just shoot & replace
it, is easy nowadays if you're running any kind of blue/green automated
deployment best practices. But DBs remain problematic and pet-like in that you
may find yourself nursing them back to health. Even if you're using a managed
DB service, do you know exactly what to do and how long it will take to
restore when there's corruption or data loss? Managed RDS replication, for
example, doesn't help a bit when it happily replicates your latest app
version's deletion of a bunch of data in prod.

Some policies I've personally adopted, having worked with sensitive data at
past jobs:

- If the dev team needs to investigate an issue in the prod data, do it on a
staging DB instance that is restored from the latest backup. You gain several
advantages: confidence your backups work (otherwise you only have what's
called a Schrödinger's Backup in the biz), confidence you can quickly rebuild
the basic server itself (try not to have pets, remember), and an incentive for
the dev team to make restores go faster! Simply knowing how long it will take
already puts you ahead of most teams, unfortunately.

- Have you considered the data security of your backup artifacts as well? If
your data is valuable, consider storing it with something like
[https://www.tarsnap.com](https://www.tarsnap.com) (highly recommended!)

- In the case of a total data loss, is your data retention policy sufficient?
If you have some standard setup of 30 days of daily backups, are you sure
losing a day's worth of data isn't going to be catastrophic for your business?
Personally I deploy a great little tool called Tarsnapper (can you tell I like
Tarsnap?) that implements an automatic 1H-1D-30D-360D backup rotation policy
for me (cron sketch below). This way I have hourly backups for the most
valuable last 24 hours, 30 days of daily backups, and monthly backups for a
year to easily compare month-to-month data.
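The hourly leg of that rotation is just a cron entry; the archive name and
path below are hypothetical, with Tarsnapper handling expiry separately:

    # crontab entry: hourly tarsnap archive of the dump directory
    # (% must be escaped as \% inside crontab)
    0 * * * * tarsnap -c -f "db-$(date +\%Y\%m\%d-\%H\%M)" /var/backups/db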

Shameless plug: If you're looking to draw some AWS diagrams while Gliffy is
down, check out [https://cloudcraft.co](https://cloudcraft.co) a free diagram
tool I made. Backed up hourly with Tarsnap ;)

~~~
calpaterson
I've found tarsnap to be slow at restoring in the past. My recollection is a
few hours for a ~1GB maildir. I was using it for my personal things, but I
would (as with anything) test restore times if I were using it for serious
stuff.

~~~
Rezo
The amount of de-duplication performed by Tarsnap, and the number of files
(which for a maildir I imagine is a lot of tiny files), probably negatively
impact it. Dealing with a single DB dump file, the performance is fine so far
at least. I imagine one could also partition the data into multiple
independent dumps that can be run in parallel during the restore if speed
became a concern.

~~~
rsync
You can also ZFS send an entire filesystem snapshot, very efficiently, to
rsync.net:

arstechnica.com/information-technology/2015/12/rsync-net-zfs-replication-to-the-cloud-is-finally-here-and-its-fast/

[http://www.rsync.net/products/zfsintro.html](http://www.rsync.net/products/zfsintro.html)
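The mechanics look roughly like this (pool, dataset and login names are
hypothetical):

    # one-off: send a full snapshot to the remote pool
    zfs send tank/db@monday | ssh user@usw-s001.rsync.net zfs recv data/db
    # thereafter: send only the incremental delta between snapshots
    zfs send -i tank/db@monday tank/db@tuesday | \
        ssh user@usw-s001.rsync.net zfs recv data/db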

~~~
RKearney
What is the benefit of rsync.net's $0.20/GB pricing over any other cloud
storage solution that costs $0.01-$0.03/GB?

~~~
marklyon
I've been a customer of theirs for a long while and will note that their
customer service is amazing. They helped implement what I needed and offer
support any time I need it.

Now that S3 has matured and prices have continued to drop, though, I am going
to be moving to trim costs. I actually kicked off backups to S3 earlier this
month and am backing up to both S3 and rsync.net at the moment, with the plan
of ending rsync once I've tested restores and made it through a billing cycle
at Amazon.

~~~
marklyon
Wow. To stress the amazing level of customer service, someone there just ran
across my comment and reached out - noting that my pricing was set for an
older structure, updating me to a far more competitive rate and offering a
retroactive credit.

While Amazon has offered some great service, it's never been as good as that.

They really do stand by and provide a superior level of support and assistance
if you need it on the technical side as well.

I highly recommend them.

------
SimplyUseless
Been there Done that :)

I was once on-call working for one of the leading organizations. I got a call
in the middle of the night that some critical job had failed and, due to the
significant data load, it was imperative to restart the processing.

I logged in to the system with a privileged account and restarted the job with
new parameters. Since I didn't want to see the ugly logs, I meant to redirect
the output to /dev/null.

Instead, I ran the following command: ./jobname 1>./db-file-name

and there is -THE DISASTER-

For some reason this kept popping in my head - "Bad things happen to Good
people"

We recovered the data but there was some data loss still as the mirror backup
had not run.

Of course, we have come a long way since then. Now there is constant sync
between Prod/DR and a multitude of offline backups, and recovery is possible
for the last 7 days, the month, any month during the year, and the year before.

~~~
hobs
I was doing a favor for a friend: on a referral, I talked to a guy who didn't
have anything but weekly backups and had a corrupt database due to some drive
failures.

I was able to determine that the corrupt data was repairable if we had a copy
of the old db, and since it was a tiny system I asked, "Would you mind
restoring the backup side by side with production so I can do what I need?"

"Sure thing!"

I wait for a minute, and then my connection to the production database dies.

I refresh the client, and now the one database available is restoring from a
backup...

I called him and asked if he had meant to overwrite his production copy with
his backup instead of doing it side by side, and he says petulantly, "I didn't
do that!"

I ask him to check again, and he responds with "I will call you right back!"

Five minutes later I get the call: "How do I roll back my restore partway
through the restore process?"

Oops.

------
ww520
We've all been there. Shit happens. That's what backup is for.

OT: It's probably bad form to publicly blame someone for it, even if he did
it. Suffice it to say "we screwed up but are on our way to recovery." It's
better to follow the practice of praising in public and discussing problems in
private.

~~~
vollmond
I worked on a team that had a list of "breakfastable offences" -- violating
these rules meant you had to bring in breakfast for the whole team (donuts,
bagels, whatever). One of them was "throwing someone under the bus." In
conversations with anyone outside the team, you weren't allowed to single out
a person as responsible for any particular bug/error/etc.

Granted, this is pretty vague (depending on how many "administrators" the
company has), but it's still too specific for me.

~~~
danso
On a public relations note, though: I think a case could be made that it was
important to give some specifics about who is to blame. Consider the
alternative:

"We discovered the production database had been deleted but we are now working
diligently to restore it"

How are people -- both non-technical and the HN crowd -- _not_ supposed to
suspect that this is a result of an external malicious hack?

~~~
vollmond
The organization can take responsibility for the issue. "During a system
update, we mistakenly deleted a production database. We are restoring it and
shoring up our disaster recovery plan."

That's very different from "During a system update, Dave mistakenly deleted a
production database." In an organization with 5 or 10 people, "During a system
update, our administrator mistakenly deleted a production database," is still
identifying.

Like I said, I'm not sure it's an issue in this particular case. I don't
personally know anything about the site in question.

------
dools
This is how I learned about xargs ...

I once typed the following on a client's production mail and web server that
basically ran the whole business for about 50 staff, as root, from the root
directory:

chmod -R 644 /dirname/ *

I seem to recall the reason was that tab completion put a space at the end of
the dirname, and I was expecting there to be multiple files with that name ...
anyway the upshot was that everything broke and some guy had to spend ages
making it right because they didn't have the non-data parts of the file system
backed up.

I learned that whenever you do anything you should:

find . -name "*.whatever you want" | more

then make sure you're looking at expected output, then hit the up arrow and
pipe it into xargs to do the actual operation.
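For example (path and pattern hypothetical):

    # step 1: preview exactly what the find matches
    find /var/www -name '*.orig' | more
    # step 2: same find, now feeding the real operation
    # (-print0/-0 keeps filenames containing spaces intact)
    find /var/www -name '*.orig' -print0 | xargs -0 rm -v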

~~~
verytrivial
When working as root or on prod, the paranoid (like me) always start every
mutating command with a '#' to prevent a sneeze from prematurely sending the
command and doing damage.
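In concrete terms, the buffer looks something like this until you're ready
(path hypothetical):

    #rm -rf /var/lib/app/cache    # inert: the leading '#' makes it a comment
    # re-read it, then strip the '#' and press Enter for real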

~~~
elcct
[ $[ $RANDOM % 6 ] == 0 ] && rm -rf / || echo "You live"

For good measure at the end of the day

~~~
int0x80
Oh, the good ol' shell Russian roulette.

------
rlonstein
BTDT. Got the t-shirt. Early in my career...

* Multiple logins to the conserver, took down the wrong system.

* rm -rf in the wrong directory as root on a dev box, get that sick feeling when it's taking too long.

* Sitting at the console before replacing multiple failed drives in a Sun A5200 storage array under a production Oracle DB, a more senior colleague walks up, says "Just pull it, we've got hot spares" and before I can reply yanks a blinking drive. Except we have only two hot spares left and now three failed drives. Under a RAID5. Legato only took eight hours to restore it.

* Another SA hoses config on one side of a core router pair after hours doing who knows what and leaves telling me to fix it. We've got backups on CF cards, so restore to last good state. Nope, he's managed to trash the backups. Okay, pull config from other side's backup. Nope, he told me the wrong side and now I've copied the bad config. Restore? Nope, that backup was trashed by some other admin. Spent the night going through change logs to rebuild config.

There were a few others over the years, but all had in common not
having/knowing/following procedure, lacking tooling, and good old human error.

~~~
chris_wot
Is it really a good idea to use RAID 5 on a database? If the database is large
enough rebuild time can be more lengthy than a straight restore and under many
RAID 5 setups you have the added problem of slower write performance.

~~~
rlonstein
> Is it really a good idea to use RAID 5 on a database

Hell no. Had I been involved in that setup it would have been RAID 10 or
RAID 50. Actually, had there been some planning there would have been a second
array and it would not have been physically co-located in the same rack as the
first so when the cooling or power inevitably fails it won't take out both.
But, you know, not my circus.

------
_spoonman
If that administrator is reading this, chin up ... it happens to the best of
us.

~~~
rpgmaker
When I was relatively new in my first job I forgot to include the WHERE clause
in an update, essentially resetting the value for the entire table. Needless
to say I felt awful and I was ready to hand in my resignation right after the
issue was sorted out (I even printed my resignation letter). Luckily there was
a relatively recent backup (not as recent as it should've been though... but I
obviously wasn't the DBA) and things went back to normal relatively soon.
Throughout the process the team shared their DB-related war stories with me.
Everyone seemed to have had a similar experience happen at some point during
their careers and knowing that made me feel a lot better. I ended up changing
my mind and decided not to quit.

~~~
hatter
Much like the folks putting echoes into find and wildcarded shell commands to
check the output, I'll often start manual SQL updates by doing 'SELECT
something FROM database WHERE...' and checking that the output rows match my
expectations before hitting the up arrow to replace the SELECT with an UPDATE.

For bigger tables, I use COUNT(something) if I expect the output to be long
but have an idea of the rows affected, or LIMIT if that's going to give me an
idea that it's doing the right thing.
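Concretely, the flow might look like this (MySQL, and the table and columns
are hypothetical):

    # 1. scope check: which rows does the WHERE clause actually match?
    mysql app -e "SELECT id, plan FROM accounts WHERE trial_ends < NOW();"
    # 1b. on big tables, count instead of listing
    mysql app -e "SELECT COUNT(*) FROM accounts WHERE trial_ends < NOW();"
    # 2. up arrow, replace the SELECT with the UPDATE, same WHERE clause
    mysql app -e "UPDATE accounts SET plan = 'free' WHERE trial_ends < NOW();"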

------
innertracks
Not long ago I discovered backups don't do any good if you delete them. The
incident went down while I was wiping out my hard drive to do a fresh install
of Fedora. I believe what happened may have been due to sleep fatigue.

Everything is a bit hazy. At one point in my wandering on the command line I
found the mount point for my external backup drive. "What's this doing here?"
and decide to remove it.

At some point I woke up in a panic and yanked the USB drive off my laptop.
Heart pounding. "Oh shit."

I actually felt like I was going to get sick. Tax records, client contact
info, you name it, all gone. Except, basically, the pictures of my kids,
mozilla profile, and my resume files.

While I reconstructed some of the missing files, there are a bunch that would
be nice to have back. All of the business records, though, have had to be
reconstructed by hand. By the next day I realized I really only cared about
the pictures of my kids in the end. And those were somehow saved from my
blunder.

Work flow change: backup drive is only connected to laptop while backups are
being made or restored. Disconnected at all other times. A third backup drive
for backups of backups is on the todo list.

~~~
steven2012
I have about 5 backups of things that I need. I just buy a new external drive,
copy everything over, and leave it in the closet. And then in a year I buy
another one. $200 a year for a backup of my photos is worth it.

~~~
static_noise
If you don't have at least 3 copies in at least two different locations, your
data is already vapourizing.

So 5 inexpensive backups of important data sounds just about reasonable.

------
sqldba
The Enterprise I work for is currently implementing a new idea - where they
hire a crack team of generalists - and give them complete and utter unfettered
access to production (including databases).

This is despite our databases being controlled by my team and having the best
uptime and least problems of anything in the entire business. Networks?
Fucked. Infrastructure? Fucked. Storage? Fucked. But the databases roll on,
get backed up, get their integrity checks, and get monitored while everyone
else ignores their own alarms.

The reasoning for this is (wait for it...) because it will improve the quality
of our work by forcing us to write our instructions/changes 3 MONTHS IN
ADVANCE for generalists to carry out rather than doing it ourselves. 3 MONTHS.
I AM NOT MAKING THIS UP. AND THIS IS PART OF AN EFFICIENCY STRATEGY TO STEM
BILLIONS OF DOLLARS IN LOSSES.

Needless to say the idea is fucking stupid. But yeah, some fucking yahoo
meddling with the shit I spent my entire career getting right, is sure to drop
a fucking Production database by accident. I can guarantee it. Your data is
never safe when you have idiots in management making decisions.

~~~
bpchaps
Eh, yes and no.

You're too focused on the idea that those generalists are a bunch of skill-less
dipshits. As one of those generalist skill-less dipshits, my calloused
perspective is that DBAs are the absolute most obstinate, narrow-minded twats
that exist in any sort of enterprise arena - worse than that PM you probably
hate. They just suck! I can think of maybe one DBA who didn't flat-out stink of
the 20 or so I've worked with. For some reason, there's just a complete lack
of understanding of anything that's NOT a database, even though their database
understanding is so incredibly deep. Y'all could use some more generalists.

An example of an obstinate DBA is one from my last place, who I wanted to take
root access from. She had root ssh keys all over the place, sudoers entries in
random places, passwords in her history, etc. It was a security nightmare. She
absolutely refused to allow me to take away her root access. She wouldn't even
allow any discussion. Her reason? "I need root to install mysql". Management
agreed.

There's a reason "That's something a DBA would do." has become a running joke
at multiple places I've worked at.

Edit to add: These problems could easily be solved if there was less silo'ing
going on. If everything but the database is awful, then that's an indication
of deeper, awful and likely legacy problems, not just with the generalists.

~~~
sqldba
> You're too focused on the idea that those generalists are a bunch of
> skill-less dipshits.

In this specific case it's because I've been working with the quality of
dipshits in the departments they are being pulled from, over the past few
years. They are going to be cross-trained by dipshits from those other
departments, so that they can become even worse generalists.

Hmmm. They don't care about backing up servers. They don't care about HA
cluster alarms or failovers. They don't notice or proactively monitor disks
filling, despite being the sole custodians of the Enterprise monitoring
solution. They don't care about Windows security logging policies or even the
power plans. They manage AD but let service accounts expire all the time
instead of following up with anyone first, leading to many outages.

I'm struggling to think of anything good they do. There's no quality or pride
in their work; they use GUIs. They get by because the few times I've seen other
managers criticise their boss, that boss has then filed official complaints of
harassment - and then everything quiets down and goes back to the status quo.

> there's just a complete lack of understanding of anything that's NOT a
> database

Guilty as charged. I don't care about anything outside of the database because
it's not my job ;-) However I do know a little about the server level backups,
clustering, performance counters, security settings, and such - anything that
affects my uptime - and I monitor it, unlike the people who are paid to do so.

> An example of an obstinate DBA is one from my last place, who I wanted to
> take root access from.

Oracle people have root access on Oracle boxes. We have admin access on
Windows boxes. It's extremely difficult for just a few staff to manage
hundreds of servers in a high quality fashion otherwise.

> There's a reason "That's something a DBA would do." has become a running
> joke at multiple places I've worked at.

There are plenty of shit DBAs, and obviously there are good Infrastructure
people as well, especially on HN. I hope you realise that the DBA you were
talking about likely isn't bothering to read HN either. I am somewhere near
the top middle of my profession.

> These problems could easily be solved if there was less silo'ing going on

Totally agreed.

> then that's an indication of deeper, awful and likely legacy problems, not
> just with the generalists.

Entrenched management and yes-men-or-you're-fired culture.

~~~
bpchaps
Ah, 100% fair enough! Didn't really mean to come off as critical if it came
across that way. The problems are so systemic that it's worth mentioning, I
suppose.

Do you work in finance? These problems you're describing are all too familiar.

------
brainbrane
About 15 years ago, my school's electrical engineering lab had a fleet of HP-
UX boxen that were configured by default to dump huge core files all over the
NFS shares whenever programs crashed. Two weeks before the end of the semester
a junior lab assistant noticed all the core files eating a huge chunk of
shared disk space and decided to slap together a script to recursively delete
files named "core" in all the students' directories.

After flinging together a recursive delete command that he thought would maybe
work, he fired it off with sudo at 9:00pm just before heading out for the
night. The next morning everyone discovered that all their work over the
semester had been summarily blown away.

No problem, we could just restore from backups, right? Oh, well, there was
just one minor problem. The backup system had been broken since before the
start of the semester. And nobody prioritized fixing it.

Created quite the scenario for professors who were suddenly confronted with
the entire class not having any code for their final projects.

They talked about firing the kid who wrote and ran the script. I was asking
why the head of I.T. wasn't on the chopping block for failing to prioritize a
working backup system.

------
DennisP
One time the DBA and I were looking at our production database, and one by one
the tables started disappearing. Turned out one of the devs had tried out a
Microsoft sample script illustrating how to iterate through all the tables in
the database, without realizing that the script was written to delete each
table.

~~~
brianwawok
"And this is why devs lost prod ddl access"

~~~
dorfsmay
Nobody should have write access to prod when wearing their dev hat. Only
scripts tested against a copy of the DB should be run with write in prod.

------
Jedd
[https://www.gliffy.com/examples/](https://www.gliffy.com/examples/)

First graphic on this page includes a bright red box asking: "Is your data
safe online?"

Evidently not a rhetorical question.

~~~
GedByrne
It also says the answer is NO if you have "services administered by meat-based
lifeforms"

------
blantonl
If the gentleman who did this loses his job, then those looking for a new
sysadmin should definitely give this guy some serious consideration.

Because I guarantee you he'll never, ever, let this happen again.

~~~
jawns
Your comment cracked me up.

But one thing worth quibbling over:

It might not have been a gentleman who did this. Might have been a lady. Or
might have been a not-very-gentle man.

~~~
treebeard901
Whoever it was they were not very gentle on the delete key.

------
bliti
The official rite of passage that turns anyone into a bona-fide sys admin. The
equivalent of running your production server in debug mode. D:

~~~
vidarh
I had a client call me in panic after he'd run a unit test script that started
by wiping and recreating the database from scratch, and he'd run it against
the wrong server.

Thankfully he'd just run it against a dev environment where the loss wasn't
particularly severe (the prod environment is firewalled off, so he couldn't
have done the same thing against that), but from the panicked tone of his
messages before it was clear what had happened, I'm sure he's come to be extra
careful about database credentials going forward....

~~~
bliti
We seem to have shared clients at some point in time. ;)

------
krzrak
Once I asked a server support guy to move a database from production to dev.
He did - without any question or doubt - exactly that: copied the database to
the dev environment and deleted it from production. (Note: in my language the
word for "move" is more ambiguous than in English - depending on the context
it may mean "move" or "copy".)

~~~
sqldba
Your server support person did wrong and language is no excuse for them. If
there's any ambiguity, if data may be lost, if something may cause an outage -
then they have a duty to tell you and then ask if you're sure.

I do it all the time. "Give this access to this user to this db." 'Okay, but
do you know they'll be able to drop your db?' "Oh shit okay wait a sec..."

------
alistproducer2
Last week I deleted a large portion of our pre-production ldap. I use the
jXplorer ldap client, and for some reason the control-d (delete) confirm
dialog defaults to "Ok" instead of Cancel. I'm used to hitting control-f
(search) and then enter to repeat the last search, and when I hit d instead of
f I deleted a bunch of stuff. The silver lining is I patched the problem in
jXplorer and submitted it. It's my first legit contribution to a project.

------
amelius
In my opinion it is way too easy in Unix to accidentally delete stuff (even
for experienced users). Having a filesystem with good (per-user) rollback
support is, imho, more than just a luxury.

~~~
moviuro
The people writing UNIX put all their brains into writing a stupid system.

Also, there are now CoW Filesystems that prevent data loss because of human
error. Btrfs and ZFS are good examples.

~~~
gnur
But you really don't want to use those when working with databases, the
performance loss is severe.

------
d0m
So.. story time. While at the university, there was that project where we had
to create an elevator simulator in C as a way to learn threading and mutexes.
All the tmp files were stored in ./tmp/.

In between build/run/debug cycle, I would "rm -fr ./tmp". But once, I did "rm
-fr . /tmp". At that time I didn't know any better and had no version control.

I had to redo those 2 weeks in a night, which turned out to be easier than
expected considering I had just written the code.

My lessons from that:

    
    
      A) Version control, pushed somewhere else.
      B) Use simple build scripts.

~~~
abluecloud
just run `!rm` and hope for the best

------
Yhippa
"Tell me about a time where something didn't go the way you planned it at
work."

------
zimpenfish
I've done this - ran out of space on /home for the mSQL database (~1996 era),
so I moved it to /tmp, which had plenty free. I suspect most people can now
guess which OS this was on and what happened when the machine rebooted some
weeks later...

(Hint: Solaris)

~~~
ska
but ... but ... but ... it's right in the _name_ !

~~~
mikestew
That didn't stop a tester on a team I managed. Server was getting low on
space, I found a bunch of crap in a sub-directory named _tmp_. Deleted said
crap. Tester complains shortly after, I explain what happened and why one
shouldn't put stuff in directories named "tmp". He retorted that those were
_production_ tests. Okay, maybe one can't be expected to remember 30 or more
years of Unix naming conventions, but I should challenge you to battle simply
for using crappy, non-descriptive directory names. I'm supposed to find
production tests in a directory named "tmp"? Clean out your desk.

Of course there was no backup, because the sysadmin doesn't back up things in
_tmp_ directories.

------
orbitingpluto
I was forced to train someone so cocky that he ended up doing an rm -rf / on
our production server a month after I quit. He also accidentally euthanized a
legacy server, deleted the accounting database while trying to do a hardware
RAID rebuild, completely destroyed the Windows domain server, and mocked my
daily tape backup regimen - opting instead to ship consumer-grade USB hard
drives to off-site storage in an unpadded steel box... The list goes on. He
literally destroyed everything he touched. The only reason he wasn't fired was
because he was a pretty man.

~~~
chris_wot
Oh man, this was my last boss who literally destroyed every one of the
critical reports to key clients. I quit when I realised he was undermining me
to the CEO, and the CEO was listening to him. He then completely borked the
entire system so badly that he was forced to resign 6 weeks later.

The company went from 25 clients to 4 last month.

~~~
orbitingpluto
I think tech workers should have a 'strategic vacation reserve' specifically
to deal with this situation.

~~~
chris_wot
Actually, that's what I'm planning on doing from now on. However, I'm going to
have to make sure all my own processes are _completely_ bulletproof and
documented.

------
linsomniac
One dark and snowy night, a bunch of databases on a server just vanished. This
was on a server that was still in development, but was part of the billing
system for a huuuge company, and it was under a lot of scrutiny. The files are
just gone. So I contact the DBA and the backup group. For whatever reason,
they can't pull it off local backups, so tapes had to be pulled in from Iron
Mountain.

As I said above, a dark and snowy night. Took Iron Mountain 4 hours to get the
tapes across town. The DBA and I finally get the database up around 8am the
next morning. I investigate, but can't find any system reason for the
databases vanishing, the DBA can't either.

2 weeks later, the same thing happens.

I eventually track it down to a junior developer who has been logged in and
has on several occasions run this: "cd /" followed by "rm -rf
/home/username/projectname/ *" Note the space before the star. On further
investigation, I find the database group installed all the Oracle data
directories with mode 777.

------
fortpoint
Sounds like a terrible situation. I wish those guys luck.

One useful sys ops practice is the creation and yearly validation of disaster
recovery runbooks. We have a validated catalog of runbooks that describe the
recovery process for each part of our infrastructure. The validation process
involves provoking a failure (eliminate a master database), running the
documented recovery steps and then validating the result. The validation
process is a lot easier if you're in the cloud since it's cheap and easy to
set up a validation environment that mirrors your production env.

------
3stripe
Posting as a reminder to myself that "in the cloud" != safe

There's always room for computer error, and more likely, human error.

Imagine if something like this happened to Dropbox? Ooooft.

~~~
noneshallpass6
This gives most SaaS providers a bad name. The error here is not the engineer
deleting the db, it's the complete lack of data restore testing.

A complex restore never works well when its first real run is under the
pressure of an actual event. Other SaaS providers will be cursing such a
big-name tool making such a public mess.

~~~
pmlnr
DR testing is hard, complicated and costs a lot. Yes, it should be done
regularly, but it's not an easy task; I believe small(ish) companies simply
can't afford it.

~~~
notalaser
Then perhaps small(ish) companies shouldn't hold data that's critical to their
customers. Just like real engineering companies don't build nuclear reactors
if they can't test their safety systems, cars if they can't afford to do crash
testing and so on.

DR is not a luxury. Systems that don't properly do DR aren't unoptimized or
something, they're badly engineered.

~~~
olalonde
Doubt any business will go down as a result of a web based diagram tool being
unavailable for a few hours.

~~~
notalaser
I wasn't referring to this specific case. Also, "don't worry, our services may
suck, but not to the point where they bring down your business" is not exactly
the kind of reliability one would want to aspire to.

------
xnohat
Every system admin could have a bad day like this :) Some years ago I deleted
an entire production server with the very simple command "rm -rf /" instead of
"rm -rf ./", while logged in with the root account. No words can explain the
feeling at that time. Thanks to backups - without them, I would have been
killed a thousand times by my customers.

~~~
mootothemax
> deleted an entire production server with the very simple command "rm -rf /"
instead of "rm -rf ./"

It's absolutely a rite of passage, fun times!

My personal favourite is similar:

    
    
        rm -fr .*
    

_ouch_ (the .* glob matches .. as well, so the recursion climbs into the
parent directory)

------
jon-wood
I'll join the chorus of people who've done something similar. In my case it
was the database of a small e-commerce site, where I'd taken a backup and then
formatted the database server to reinstall it.

What I hadn't realised was that the backup script was set to dump to the OS
drive, so in the process I'd also just formatted the backup. Thankfully one of
our developers had a recent copy of the database locally, but it definitely
wasn't my finest hour.

------
cyberferret
I am actually gladdened by reading the posts by others on here mentioning how
they did the same thing. I've been kicking myself for decades over a similar
thing I did when I was starting out as a programmer.

Not as big as some of those here, but back in the late 80's I was a self
employed programmer writing DOS apps for local businesses to help them run
more efficiently.

There was a local martial arts supply shop whose owner was sort of a friend of
mine, and he engaged me to write a stock control and hire database for him,
which I did. When it came time to implement, he told me that there was a LOT
of data to enter, so he would hire a couple of young students of his to sit
down for an entire week and key in the data, which was all good.

After they had finished, he called me back in to 'go live', and I sat down in
front of his server PC and began to check that everything was OK. Normally, it
is my habit to take a backup of the entire app directory before working on it,
but I think I was going through a break up with my then girlfriend and was a
little sleep deprived.

I noticed that some temporary indexes had been created during the data entry
and I went to quickly delete them (thinking to rebuild all the indexes for
best performance), but typed in 'DEL *.DAT' instead of 'DEL *.KEY'.

I still remember that sinking feeling as I sat there looking at the blinking
'C:\>' prompt, knowing I had wiped out all his work. Telling the owner was
also one of the hardest things I have done, and I fully expected him to pull
down one of the sharp oriental weapons from the wall and take me apart.

But he was really cool and understanding about it. He refused my offer to pay
for the students to come back in and re-key the data again, which actually
made me feel worse, because I knew he wasn't having the easiest time at that
point making ends meet in his business.

End of the day, we got it all working and he used the system for many, many
years. But to this day, I still make a copy of anything I am about to touch,
before I work on it.

------
ghamrick
In prehistoric times, on an OS named CTOS, a distributed client/server OS, I
was charged with making tape backups of users' local workstations, IVOLing
(formatting) the disk, and restoring from tape. The contract specced that 2
tape backups were to be made, but of course in the interest of expediency, I
only made one. And then I encountered the user's tape that wouldn't restore. I
remember thinking that losing a user's data is the biggest crime a sysadmin
can possibly commit, and it taught me a great lesson on the value of backups
and their integrity. Fortunately, I swapped out tape drives like a mad man
until one managed to restore the tape.

------
Illniyar
I must say that's really the most transparent way to handle a downtime I've
ever seen.

I would be scared shitless to expose for all to see what really happened and
what is happening, even more so when it makes them look like they don't know
what they are doing.

I must applaud them for that. If I ever get into such a nasty situation, I
hope I'll be able to do what they did.

------
jjuhl
This reminds me of something I did at a previous employer (an ISP), many, many
years ago.

I needed to do an update in a SQL database to fix some customer issue - the
statement should just update one row but seemed to take a looong time to run,
which seemed strange. When it finished and printed something like "700000 rows
updated" I noticed I had forgotten the WHERE clause and I had also not started
a transaction that I could roll back. Whoops!

That's when our support got really busy answering customer phone calls and I
started asking who was in charge of our backups.

That was _not_ a good day.

~~~
sqldba
As a DBA, that's pretty bad. I have seen someone do that, and was able to
restore a copy of the database side by side, and then update that column of
each row back (luckily it wasn't a high-traffic table, so nobody would even
know).

But I was also not happy they were doing this stuff straight into production.
And even when they do it in dev/test/qa, when I ask them how they can verify
that what they did did what they wanted, I can see that they really don't know
(especially when we go beyond single-line statements where you can see a row
count, and into a couple dozen lines of stored procedures).

But then you need to start controlling it through a web front end to allow
that operation to happen. Or a secured PowerShell interface (which are time
consuming to put in place and then maintain let alone secure and also train
people on). And I don't have the energy for that with all of the other fires
I'm fighting.

So yeah. Understood. But not good.

------
creullin
Sucks, but we've all been there. If the admin is reading this, it's all going
to be ok! Just remember, life sucks, then you die...

------
shubb
Poor guys. Really interesting reading though.

I initially thought it was weird they had to run several "processes" in case 1
failed. But running out of space or something correctable is actually
something likely to happen. Is this standard? It's quite smart.

Anyway, assuming they get the data back, I think they've done pretty well - 0
data loss and a day's downtime isn't bad given this is a true disaster.

It would be nice if they'd let us know how the db got deleted, and what they
suggest to mitigate in a blog after.

~~~
ndespres
In my experience it's good to have a few different restore attempts running in
parallel, so long as they won't conflict with one another. For example, the
restores may be to different hardware, from a different point in time, in a
different datacenter, or from a different backup system (image-based,
file-based, transaction logs only, etc). One of their "restore" attempts could even be a
scan of the original disk for recoverable data, though that seems unlikely.

In a critical outage like the one Gliffy is experiencing, I take the same
approach. Outline all your restore options, estimate time for each restore,
drawbacks of each approach, etc and take every available angle.

If you want to know how the database was deleted, read through some of the
horror stories posted in the comments here and assume it was probably one of
those!

------
ZeWaren
That reminds me of the time when I imported the nightly dump of a database
TWICE into the same server.

Dropping an entire database brings problems; having duplicate content and
deleted content coming back brings a whole new realm of other good times.

------
moviuro
My mentor told me: "get everything wrong, but get the backups right", as he
was busy debugging the backup solution he had in place at my college (ZFS +
NetApp + rsync + sh + perl + tape).

On my own, I'd put CoW wherever possible. It's so easy to delete something on
UNIX that it should also be easy to restore, and CoW is without a doubt a
no-brainer for this.

~~~
debacle
CoW?

~~~
moviuro
Copy on Write, see
[https://en.wikipedia.org/wiki/Copy-on-write](https://en.wikipedia.org/wiki/Copy-on-write)
and
[http://arstechnica.com/information-technology/2014/01/bitrot-and-atomic-cows-inside-next-gen-filesystems/](http://arstechnica.com/information-technology/2014/01/bitrot-and-atomic-cows-inside-next-gen-filesystems/)

------
return0
> We are working hard to retrieve all of your data.

Given that in most cases where a backup exists, the user data is not lost,
it's a bit unsettling to say that (and also, in most cases admins are not
working, they are mostly waiting). It's more reassuring to the user to say "we
are verifying that all data is restored correctly" or something.

------
novaleaf
My first real job was as a DBA at Microsoft, on a very large marketing
database (approx. 1.5TB in 2000).

That experience, and how much work is required for "real" production
databases, left a bad taste in my mouth. I stay away from self-hosted DBs to
this day. (For example, I use Google Cloud Datastore nowadays.)

~~~
sqldba
IMHO that's still a very large database today. Most people don't appreciate
that restoring that is likely going to be 4-8 hours given SAN and network
speeds unless you're in a super modern SSD-driven environment.

When it goes higher, 5TB, 10TB, 15TB, that's out of my league, and I just say
days of downtime. Also remembering those are often spread across two servers
like in an Availability Group, which means two restores...

I know people will pipe in and talk about partitioning and read-only file
groups and partial restores. Except that in the real world I've seen zero of
it, and it's not general DBA knowledge, it's likely the top 1% or even 0.1%.

And even then you'll still have massive downtime (depending on how the
application is architected to support this), and you'd better have bullet-
proof testing (and 20TB spare disk space to repeatedly test it with), to make
sure a complex restore like that is easy to carry out while under the pump of
an outage.

------
kchoudhu
I did this to the trading database back in 2008 while supporting the mortgage
desk of a major investment bank, a day before Lehman went down.

Thank god for backups and translog replays.

------
dkopi
"The good news is that we have copies of our database that are replicated
daily, up until the exact point of time when the database was deleted. We are
working hard to retrieve all of your data."

Better news would be if every user had local copies of their work too, both in
local storage and on a cloud storage provider of their choice. Preferably in a
non-proprietary format.

This isn't just about getting me to trust your site if you crash or have a
tragic mistake. This is also about getting me to trust your site if you go out
of business (as too many startups unfortunately do).

------
tobinharris
In 2002 I accidentally executed

DROP TABLE HOTELS;

whilst working on the Virgin Holidays website. We managed to get it back from
backup, but it made me shart.

------
wazoox
Ah, that moment when we needed to copy 1 master disk drive to 80 PCs urgently
using Ghost, and my boss said "I'll take care of it, I'm very familiar with
Ghost" - and with the first PC proceeded to copy the blank disk onto the
master.

Problem was: creating the master drive was the job of someone else 1000 km
away, with special pieces of tailor-made software... The guy ended up at the
airport trying to get someone on a departing plane to courier the disk drive
(fortunately for us, some lady accepted; this was still possible in 1998).

------
linsomniac
I once had a client we were running their office Linux server for. They needed
more storage, so they asked me to come in and put in some larger drives on the
RAID array. Somehow during this, the old drives freaked out and the data was
just gone.

So, we go to the backup tapes. Turns out that something changed in the few
years since we set up backups, and the incrementals were being written at the
beginning of the tape instead of appending. These were DDS tapes, and there is
a header that stores how much data is on the tape, so you can't just go to the
end and keep reading.

Now, we had been recommending to them every month for a year or more that a
backup audit should be done, but they didn't want to spend the money on it.

They contacted a data recovery company who could stream the data off the tape
after the "end of media", and I wrote a letter to go with the tape: "Data on
this tape is compressed on a per-file basis, please just stream the whole tape
off to disk and I'll take it from there." We overnight it to them and a week
later they e-mail back saying "The tape was compressed, so there is no usable
data on it." I call them up and tell them "No, the compression re-starts at
every file, so overwriting the beginning is fine, we can just pick up at the
next file. Can you just stream it off to disc?" "Oh. Welllll, we sent the tape
back to you, it should be there in a week." They shipped it ground. We shipped
it back, they did the recovery, and we got basically all the data back.

------
hiperlink
~20 years ago I was working for a relatively small banking software company in
Hungary (it was a really good job from a learning point of view, but really
underpaid).

One Monday afternoon one of our clients called: the bank's officers suddenly
couldn't log in, random strange errors were being displayed to them, etc.

OK, our support team tried to check: we couldn't log in either, strange error.

"Did you do anything special, [name of the bank's main sysadmin]?"

"Well, nothing special, I just cleaned up the disks as usual."

"How did you do it?"

"As usual: 'mc', sort by file size in the INTERFACE/ folder, marked the files
and F8".

That's normal.

OK, since we had the same user account (I knoooow), launch 'mc'. Looks normal.
Except... In the left panel the APP/DB directory is open... Check... Appears
normal... At first... But... WAIT. Where is the <BANKNAME>.DB1 file?

"<ADMIN>, how long time did it take?"

"Dunno, I went for my coffee, etc."

Apparently he had deleted the production system's main DB file. It got
resolved by restoring the backup from Saturday; every file and input
transaction had to be re-entered based on the printed receipts, the officers
stayed late into the night, etc. He is still the head of IT at the same bank.
(Yeah, everyone makes mistakes; it wasn't his only one, but likely the
biggest.)

------
aNoob7000
I would really love to get more detail about how they structured the full
backups and transaction log backups for the database. Are the backups dumped
to disk before being picked up on tape? Or are they streamed directly to the
backup system?

I'd also love to know how large the deleted database is. Doing a point-in-time
restore of a database that's a couple of hundred gigs should be relatively
fast (depending on what hardware you are running on).

------
peterwwillis
Serious question: Do modern "all in the cloud" tech companies actually have DR
plans?

All the presentations I've seen about people deploying in the cloud leave out
any DR site, replication process, turnover time for the DR site taking
production traffic, etc. It's like they believe redundant machines will save
them from an admin accidentally hosing their prod site and having to take 3+
days to recover.

~~~
twunde
It really depends on the company. My suspicion is that companies only get
serious about DR once they reach a fairly big size. When I interviewed at
Squarespace they were just finishing building out their second datacenter. At
that point it was a DR data center but they were planning on switching all
production traffic to it in the coming months. Most startups/midsize companies
have relied on a third party for backups, e.g. Rackspace.

~~~
peterwwillis
That's what strikes me as odd. Even small and mid-size companies can do this
for cheap if they're mostly cloud-based.

Stand up the environment at another cloud provider, keep resource use at 1%
that of your current provider, implement a continuous replication procedure,
document the failover procedure, and test-run once a month. Much less work
than actually buying and organizing some small colo space in another DC, and
way faster than scrambling to recover. Yet I don't know of a single cloud-
dependent company that does this unless it's for performance reasons.

------
girkyturkey
My first internship used Google Drive for their database (small start up) and
there have been numerous times where I have almost lost a substantial amount
of work/information. This article brought back that feeling of anxiety. But
that is a lesson to be learned, even if it was the hard way. Everyone goes
through that at some point in their career.

------
jestar_jokin
Earlier in my career, I worked in prod support for an insurance web
application. It had a DB containing reference data. This reference data was
maintained in an Excel spreadsheet; a macro would then spit out CSV files,
which would be used by command line scripts to populate databases in different
environments (test, staging, pre-production). The DB data was totally replaced
each time. Pre-production data would be copied into production, every night or
so.

One time, I ran the staging and pre-production scripts at the same time. This
had the unusual effect of producing an empty CSV file for pre-production.

When I got in the next day, I discovered all of the production data had been
wiped out overnight...

Thankfully, it was all reference data, so it was just a matter of re-running
the export macros, and pleading with a DBA to run the data import job during
business hours.

I ended up writing a replacement using generated SQL, so we could apply
incremental updates (and integrate better with a custom ticketing system).

------
okket
These days it should be possible to roll back a few steps (every 15 min / 1
hour) with a copy-on-write filesystem like ZFS. A full-scale restore from
backup should only be necessary if the storage hardware fails (IMHO).
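A sketch with ZFS (pool and dataset names hypothetical):

    # from cron, every 15 minutes
    zfs snapshot tank/db@$(date +%Y%m%d-%H%M)
    # after an accidental delete: stop the DB, roll back to the newest
    # snapshot (rolling back further needs -r and destroys later snapshots)
    zfs rollback tank/db@20160326-1015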

You still need to apologize for some data loss, though. So make sure that
everything you do has one or two safety nets before it hits the customer.

------
alphacome
I am wondering why OSes don't introduce a policy to protect important
files/directories. For example, we could mark something as important; then if
someone tries to delete it, it would ask the person to input some key (at
least 20 characters), and if the key is incorrect, the operation would be
canceled.

~~~
codys
Yes, we call those marks "permissions". Most operating systems have them. The
problem is putting the right process in place around using them.

~~~
justinclift
There are also (extended) attributes, such as the "immutable" flag, available
on some OSs/filesystems.

If you're interested, look into chattr for Linux, and chflags for BSDs.

[https://en.wikipedia.org/wiki/Chattr](https://en.wikipedia.org/wiki/Chattr)
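Quick example (path hypothetical, root required):

    chattr +i /var/backups/db.dump    # Linux: immutable; even root's rm fails
    chattr -i /var/backups/db.dump    # clear the flag before legitimate changes
    chflags schg /var/backups/db.dump # BSD equivalent (system immutable flag)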

------
donatj
New devops guy at my work a few years ago somehow completely blows away the
CDN. Of course we have all of the data locally but it took almost a full day
to reupload. I believe this is our longest downtime to date.

------
lasermike026
Just reading this headline makes me queasy.

------
Joyfield
I once accidentally moved the cgi-bin (long time ago) on one of Sweden's
biggest websites. Moved it back pretty quick, so it was "only" down for a
couple of seconds.

------
BinaryIdiot
My very first commercial experience doing development was as an intern at Polk
Audio. At the time their online solution was pretty immature (no version
control and no development environments; everything was coded up in
production).

I was working on a very important, high traffic form and...accidentally
deleted it. Their backup consisted of paying another company to back up each
file. Fortunately they came through but it took a full day to restore a single
file.

------
forgottenacc56
Good management blames management for this. Bad management blames the sysadmin
and publicly says that "the sysadmin did it".

------
gtrubetskoy
This is where delayed replicas come in very handy:
[https://dev.mysql.com/doc/refman/5.6/en/replication-delayed.html](https://dev.mysql.com/doc/refman/5.6/en/replication-delayed.html)
(I don't know whether they're running on MySQL though...)
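For reference, on MySQL 5.6+ it's a single statement on the replica (the
one-hour delay here is arbitrary):

    # keep the replica an hour behind the master, as a standing undo window
    mysql -e "STOP SLAVE; CHANGE MASTER TO MASTER_DELAY = 3600; START SLAVE;"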

------
iamleppert
One time I restored a table that held about 10,000 employee pay rates from a
MySQL binary log. Unfortunately, the log was shifted a few rows, and the
mistake wasn't noticed until a few weeks later, when the CEO and some
high-level directors noticed their pay had gone from the high six figures to
an hourly rate.

What a mess!

------
mrlyc
I've found that it's important to do my own backups and not rely on IT to do
them. I once returned from my holiday to find that the sysadmin had wiped my
hard drive. He said he thought I had left the company. Fortunately, I had
backups on computers in other states that he didn't know about.

------
unfunco
Have done this and similar. And now I have aliases in my zshrc:

    
    
        alias db="mysql --i-am-a-dummy"
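For anyone wondering, --i-am-a-dummy is an alias for --safe-updates, which
makes the session refuse UPDATE and DELETE statements that don't use a key in
the WHERE clause or a LIMIT:

    # longhand equivalent of the alias above
    alias db="mysql --safe-updates"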

------
w8rbt
People who do things make mistakes. It's the ones who don't make mistakes that
should be of concern.

------
ausjke
Knew one sysadmin who was fired for his "rm -rf /" fat finger without a
working backup tape scheme.

Also, once we had to retrieve some code from tapes, which were just stacked in
a messy back room; nobody could ever find it, but nobody was fired for that
either.

------
pc86
Looks like the pricing page is 404ing right now as well (but all other pages
seem to be fine).

------
noir-york
Admit it - who here has read this and not gone straight back to test their
restores?

------
PaulHoule
Last time I did that the chief sysadmin had my back and we had it restored in
5 min.

------
nwatson
Sorry for those who lost information; personally I'm glad it didn't involve
Atlassian Confluence-hosted Gliffy illustrations... I have a lot of those, and
the tool is great for quick shareable embedded engineering sketches.

------
alienbaby
Very early in my career, I wrote a script that had rm -rf in it. I knew this
was dangerous, so the script asked, 3 times, if you were sure you were in the
right place.

That was the problem: asking 3 times... people just spammed Enter x3 at that
point in the script.

Someone using it came over to me one day: "Hey, look what's going on with this
system. I can't do ls?"

There was no system, pretty much. The script had rm -rf'd while he was root
and running the script from root.

The job of the script? Installing and configuring the backups for a system. So
yeah, there were no backups for this system at this point in time!

~~~
csours
We had a similar script in production, it was password protected with a
password of "badidea".

------
matchagaucho
I only store... IDK.... about 80% of my system architecture diagrams on
Gliffy.

FML :-/

------
keitmo
The "Other" Moore's Law:

Backups always work. Restores, not so much.

------
manishsharan
This is my biggest fear when I use my production Redis

------
sirpogo
And Gliffy is back up.

[https://www.gliffy.com/apology/](https://www.gliffy.com/apology/)

------
daodedickinson
Are there any more sites like gliffy and draw.io?

~~~
dmgrow
Yes, [https://www.lucidchart.com](https://www.lucidchart.com) is very popular.

------
Sujan
Poor guys...

------
Raed667
Shit happens =)

------
hathym
don't laugh, this can happen to you

------
yitchelle
Just going to add the obligatory
[http://thedailywtf.com/](http://thedailywtf.com/)

------
xg15
I accidentally all the data...

------
owenwil
Anyone have a screenshot?

------
odinduty
Well, who hasn't done a DELETE without a WHERE clause? ;P

~~~
traviscj
I have a very strict habit after doing this once or twice: always start all
DELETE commands with "--" (that is, comment it out) until I have written a
where-clause.

My command in the buffer might go through these steps:

1. "Delete from"
2. "-- delete from"
3. "-- delete from table where condition limit n;"
4. (Generally either ask a co-worker or make a Jira ticket with the exact
command I have at this point, so there's a sanity check and/or permanent
record; but for very low-risk/especially mundane/especially time-critical
updates, just do it.)
5. Delete the "--" and run it.
6. Think hard about adding some functionality in the app for doing it in app
code instead of in database code.

Generally I do the same for "UPDATE".

~~~
spacecowboy_lon
I tend to write my DELETE statements by writing a SELECT first, then editing
that into the delete version.

~~~
cwilkes
I've turned to creating a temp table and putting the primary keys from the
select statement into it, so I can guard against a statement that was meant to
delete a handful of rows deleting everything. Plus, with that you can do
another join and see if those rows have some sort of value in them that you
didn't expect.
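Roughly, assuming MySQL and hypothetical table names:

    mysql app <<'SQL'
    -- stage the keys you intend to delete
    CREATE TEMPORARY TABLE doomed AS
        SELECT id FROM orders WHERE status = 'test';
    -- sanity-check the staged rows, joined back for context
    SELECT o.* FROM orders o JOIN doomed d ON o.id = d.id LIMIT 20;
    -- delete exactly those keys and nothing else
    DELETE o FROM orders o JOIN doomed d ON o.id = d.id;
    SQL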

