
How I Fired Myself - mkrecny
http://edu.mkrecny.com/thoughts/how-i-fired-myself
======
bguthrie
More than anything else, this describes an appalling failure at every level of
the company's technical infrastructure to ensure even a basic degree of
engineering rigor and fault tolerance. It's noble of the author to quit, but
it's not his fault. I cannot believe they would have the gall to point the
blame at a junior developer. You should expect humans to fail: humans are
fallible. That's why you automate.

~~~
potatolicious
More than that, it's telling that the company threw him under the bus when it
happened. I've been through major fuckups before, and in all cases the team
presents a united front - the _company_ fucked up, not an individual.

Which is, if you think about it, true, given the series of events leading up
to the disaster (the lack of a testing environment, working with prod
databases, lack of safeties in the tools used to connect to the database, etc.).

The correct way to respond to disasters like this is "we fucked up", not
"someone fucked up".

~~~
j_baker
_I've been through major fuckups before, and in all cases the team presents a
united front - the company fucked up, not an individual_

You should consider yourself _very_ lucky. Or very savvy at knowing which
companies to avoid.

~~~
silverbax88
I have to chime in and completely agree. Very lucky. Most people who survive
for years at companies have learned to either stay out of sight, or navigate
the Treacherous Waters of Blame whenever things go wrong.

This is actually one of the things most employees who have never been managers
don't understand.

~~~
blablabla123
Your comment makes me think. Are you implying that this is a good practice?

I mean, in fact I do something similar. At our company a lot of stuff goes
wrong too. It somewhat surprises me that there hasn't been a major fuckup yet.
But I do realize that I need to watch out at all times so that blame never
concentrates on me.

It is so easy to blame individuals; it suffices to have participated somehow
in a task that fucked up. Given that all the other participants keep a low
profile, one needs to learn how to defend (or attack) in times of blame.

~~~
ahoyhere
You (and the other commenters with similar strategies) are wasting productive
years of your life at jobs like these. You should go on a serious job hunt for
a new position, and leave these toxic wastelands before they permanently
affect your ability to work in a _good_ environment.

~~~
j_baker
As soon as you find an environment where no one ever plays the blame game, let
me know.

------
columbo
News flash,

If you are a CEO you should be asking this question: "How many people in this
company can unilaterally destroy our entire business model?"

If you are a CTO you should be asking this question: "How quickly can we
recover from a perfect storm?"

They didn't ask those questions, they couldn't take responsibility, they
blamed the junior developer. I think I know who the real fuckups are.

As an aside: Way back in time I caused about ten thousand companies to have to
refile some pretty important government documents because I was double-encoding
XML (&amp; became &amp;amp;). My boss actually laughed and was like "we
should have caught this a long time ago"... by "we" he actually meant himself
and support.

~~~
lsc
>If you are a CEO you should be asking this question: "How many people in this
company can unilaterally destroy our entire business model?"

This is a question that the person in charge of backups needs to think about,
too. I mean, rephrase it as "Is there any one person who can write to both
production and backup copies of critical data?" but it means the same thing as
what you said.

(And if the CTO, or whoever is in charge of backups, screws up this question?
The 'perfect storm' means "all your data is gone" - dunno about you, but my
plan for that involves bankruptcy court and a whole lot of personal shame.
Someone coming in and stealing all the hardware? Not nearly as big a deal,
as long as I've still got the data. My own 'backup' house is not in order,
for lots of reasons, mostly having to do with performance, so I live with
this low-level fear every day.)

Seriously, think, for a moment. There's at least one kid with root on
production /and/ access to the backups, right? At most small companies, that
is all your 'root-level' sysadmins.

That's bad. What if his (or her) account credentials get compromised? (Or what
if they go rogue? It happens. Not often, and usually when it does it's a case
of "but this is really best for the company." It's pretty rare that a SysAdmin
actively and directly attempts to destroy a company.)

(SysAdmins going fully rogue is pretty rare, but I think it's still a good
thought experiment. If there is no way for the user to destroy something when
they are actively hostile, you /know/ they can't destroy it by accident. It's
the only way to be sure.)

The point of backups is primarily to cover your ass when someone screws up.
(RAID, on the other hand, is primarily to cover your ass when hardware fails.)
RAID is not backup and backup is not RAID. You need to keep this in mind when
designing your backup, and when designing your RAID.

(Yes, backup is also nice when the hardware failure gets so bad that RAID
can't save you; but you know what? that's pretty goddamn rare, compared to
'someone fucked up.')

I mean, the worst case backup system would be a system that remotely writes
all local data off site, without keeping snapshots or some way of reverting.
That's not a backup at all; that's a RAID.

The best case backup is some sort of remote backup where you physically can't
overwrite the goddamn thing for X days. Traditionally, this is done with off-
site tape. I (or rather, your junior sysadmin monkey) write the backup to
tape, then test the tape, then give the tape to the Iron Mountain truck to
stick in a safe. (If your company has money; if not, the safe is under the
owner's bed.)

I think that with modern snapshots, it would be interesting to create a 'cloud
backup' service where you have a 'do not allow overwrite before date X'
parameter, and it wouldn't be that hard to implement, but I don't know of
anyone that does it. The hard part about doing it in house is that the person
who manages the backup server couldn't have root on production and vice versa,
or you defeat the point, so this is one case where outsourcing is very likely
to be better than anything you could do yourself.
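
A minimal sketch of what that receive-side guard could look like (every path
and the retention period are made up; the point is only that the receiving
script refuses to overwrite, and a separate account does the pruning):

    #!/bin/sh
    # store incoming dumps under a dated name and never overwrite an existing one
    INCOMING=/backups/incoming
    STORE=/backups/store
    for f in "$INCOMING"/*.tar.gz; do
        [ -e "$f" ] || continue
        dest="$STORE/$(date +%F)-$(basename "$f")"
        if [ -e "$dest" ]; then
            echo "refusing to overwrite existing backup: $dest" >&2
            continue
        fi
        mv "$f" "$dest"
        chmod a-w "$dest"    # guard against casual edits; deletion is the prune job's problem
    done
    # prune job, run from a *different* account's crontab:
    #   find /backups/store -type f -mtime +14 -delete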

~~~
Piskvorrr
> If there is no way for the user to destroy something when they are actively
> hostile, you /know/ they can't destroy it by accident.

Which also means they can't _fix_ something in case of a catastrophic event.
"Recover a file deleted from ext3? Fix a borked NTFS partition? Salvage a
crashed MySQL table? Sorry boss, no can do - my admin powers have been
neutered so that I don't break something 'by accident, wink wink nudge
nudge'." This is, ultimately, an issue of _trust_ , not of artificial
technical limitations.

> one case where outsourcing is very likely to be better than anything you
> could do yourself.

Hm. Your idea that "cloud is actually pixie dust magically solving all
problems" seems to fail your very own test. Is there a way to prevent the
outsourced admins from, um, destroying something when they are actively
hostile? Nope, you've only added a layer of indirection.

(also, "rouge" is "#993366", not "sabotage")

~~~
lsc
>> If there is no way for the user to destroy something when they are actively
hostile, you /know/ they can't destroy it by accident.

>Which also means they can't fix something in case of a catastrophic event.
"Recover a file deleted from ext3? Fix a borked NTFS partition? Salvage a
crashed MySQL table? Sorry boss, no can do - my admin powers have been
neutered so that I don't break something 'by accident, wink wink nudge
nudge'." This is, ultimately, an issue of trust, not of artificial technical
limitations.

All of the problems you describe can be solved by spare hardware and _read
only_ access to the backups. I mean, your SysAdmin needs control over the
production environment, right? To do his or her job. But a sysadmin can
function just fine without being able to overwrite backups. (Assuming there is
someone else around to admin the backup server.)

fixing my spelling now.

Yes, it's about trust. But anyone who demands absolute trust is, well, at the
very least an overconfident asshole. I mean, in a properly designed backup
system (and I don't have anything at all like this at the moment) _I_ would
not have write-access to the backups, and I'm majority shareholder _and_ lead
sysadmin.

That's what I'm saying... backups are primarily there for when someone screwed
up... in other words, when someone was trusted (or trusted themselves) _too
much_.

~~~
Piskvorrr
Okay, now I think I understand you, and it seems we're actually in agreement -
there is still absolute power, but it's not all concentrated in one user :)

(that rouge/rogue thing is my pet peeve)

------
xentronium
This is certainly a monumental fuckup, but these things inevitably happen even
with better development practices. This is why you need backups (preferably
daily) and as much separation of concerns and responsibilities as humanly
possible.

Anecdote:

I am working for a company that does some data analysis for marketers,
aggregated from a vast number of sources. There was a giant legacy MyISAM
(this becomes important later) table with lots of imported data. One day, I
made a trivial-looking migration (added a flag column to that table). I
tested it locally and rolled it out to the staging server. Everything seemed
A-OK until we started the migration on the production server. Suddenly,
everything broke. By everything, I mean EVERYTHING: our web application showed
massive 500s, total DEFCON 1 across the whole company. It turned out we had
run out of disk space, since MyISAM tables are altered the following way:
first a new table is created with the updated schema, then it is populated
with data from the old table. MyISAM ran out of disk space and somehow
corrupted the existing tables; the MySQL server would start with blank tables,
all data lost.
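
A rough pre-flight check along those lines would have caught it; this is only
a sketch (the database, table, and datadir below are placeholders):

    #!/bin/sh
    # ALTER TABLE on MyISAM rebuilds the table as a copy, so make sure the
    # datadir has at least as much free space as the table currently occupies.
    TABLE_BYTES=$(mysql -N -e "SELECT data_length + index_length
                               FROM information_schema.tables
                               WHERE table_schema='mydb' AND table_name='big_table'")
    FREE_BYTES=$(df --output=avail -B1 /var/lib/mysql | tail -1)
    if [ "$FREE_BYTES" -le "$TABLE_BYTES" ]; then
        echo "not enough free disk for the table copy, aborting" >&2
        exit 1
    fi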

I can confirm this very feeling: "The implications of what I'd just done
didn't immediately hit me. I first had a truly out-of-body experience, seeming
to hover above the darkened room of hackers, each hunched over glowing
terminals." Also, I distinctly remember how I shivered and my hands shook. It
felt like my body temperature fell by several degrees.

Fortunately for me, there was a daily backup routine in place. Still, it meant
a several-hour outage and lots of apologies to angry clients.

"There are two types of people in this world, those who have lost data, and
those who are going to lose data"

~~~
perlgeek
Reading those stories makes me realize how well thought-out the process at my
work is:

We have dev databases (one of which was recently empty, nobody knows why; but
that's another matter), then a staging environment, and finally production.
And the database in the staging environment runs on a weaker machine than the
prod database. So before any schema change goes into production, we do a time
measurement in the staging environment to have a rough upper bound for how
long it will take, how much disc space it uses etc.

And we have a monthly sync from prod to staging, so the staging db isn't much
smaller than prod db.

And the small team of developers occasionally decides to do a restore of the
prod db in the development environment.

The downside is that we can't easily keep sensitive production data from
finding its way into the development environment.

~~~
wpietri
When moving data from prod to other environments, consider a scrambler. E.g.,
replace all customer names with names generated from census data.

I try to keep the data in the same form (e.g., length, number of records,
similar relationships; it looks like production data), but it's random enough
that if the data ever leaks, we don't have to apologize to everybody.

Since your handle is perlgeek, you're already well equipped to do a streaming
transformation of your SQL dump. :)

~~~
Domenic_S
Yep. For x.com I wrote a simple cron job that sterilizes the automated
database dump and sends it to the dev server. Roughly, it's like this:

-cp the dump to a new working copy

-sed out cache and tmp tables

-Replace all personal user data with placeholders. This part can be tricky, because you have to find everywhere this lives (are form submissions stored and do they have PII?)

-Some more sed to deal with actions/triggers that are linked to production's db user specifically.

-Finally, scp the sanitized dump to the dev server, where it awaits a Jenkins job to import the new dump.

The cron job happens on the production DB server itself overnight (keeping the
PII exposure at the same level it is already), so we don't even have to think
about it. We've got a working, sanitized database dump ready and waiting every
morning, and a fresh prod-like environment built for us when we log on. It's a
beautiful thing.
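
Sketched as a shell script, the job looks roughly like this (every path,
pattern, and hostname is a placeholder, and the PII scrub in particular has to
be tailored to where PII actually lives):

    #!/bin/sh
    set -e
    cp /var/backups/prod-dump.sql /tmp/dev-dump.sql                               # work on a copy
    sed -i '/^INSERT INTO `cache_/d; /^INSERT INTO `tmp_/d' /tmp/dev-dump.sql     # drop cache/tmp rows
    sed -i 's/[A-Za-z0-9._%+-]\+@[A-Za-z0-9.-]\+/user@example.com/g' /tmp/dev-dump.sql   # crude PII scrub
    sed -i 's/DEFINER=`produser`@`%`/DEFINER=`devuser`@`%`/g' /tmp/dev-dump.sql   # fix trigger definers
    scp /tmp/dev-dump.sql dev.example.com:/srv/dumps/latest.sql                   # hand off to dev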

~~~
pc86
This sounds like it'd make a good blog post.

------
grey-area
Tens of thousands of paying customers and _no backups_?

No staging environment (from which ad-hoc backups could have been
restored)!?!?

No regular testing of backups to ensure they work?

No local backups on dev machines?!?

Using a GUI tool for db management on the live db?!?!?

No migrations!?!?!

Junior devs (or any devs) _testing_ changes on the live db and wiping
tables?!?!?!

What an astonishing failure of process. The higher-ups are definitely far more
responsible for this than some junior developer. He shouldn't have been
allowed near the live database in the first place until he was ready to take
changes live, and then only onto a staging environment, using migrations of
some kind which could then be replayed on live.

They need one of these to start with, then some process:

<http://www.bnj.com/cowboy-coding-pink-sombrero/>

~~~
jakejake
My hypothesis is that it's a game company and all of the focus was on the game
code. The lowly job of maintaining the state server was punted off to the
"junior dev" just out of school. Nobody was paying attention. It was something
that just ran.

They paid the price of ignoring what was actually the most critical part of
their business.

~~~
ptaipale
I disagree slightly. If you're a game company, the most critical part of your
business is the game.

Even if you have rock-solid database management, backup, auditing, etc.
processes, if your game is not playable, you won't have any data that you
could lose by having a DB admin mis-click.

Still, not handling your next-most-critical data properly is monumentally
stupid and a collective failure of everyone who should have known.

------
cmos
When I was 18 I took out half my town's power for 30 minutes with a bad SCADA
command. It was my summer job before college, and I went from cleaning the
warehouse to programming the main SCADA control system in a couple of weeks.

Alarms went off, people came running in freaking out, trucks started rolling
out to survey the damage, hospitals started calling about people on life
support and how the backup generators were not operational, and old people
started calling about how they required AC to stay alive and whether they
should take their loved ones on machines to the hospital soon.

My boss was pretty chill about it. "Now you know not to do that" were his
words of wisdom, and I continued programming the system for the next 4 summers
with no real mistakes.

~~~
gnarbarian
I'm interested in knowing some more details regarding the architectural setup
and organizational structure that would allow something like this to happen.

~~~
sophacles
Honestly, you don't want to know. The IT engineering in power and other SCADA
systems is downright scary.

~~~
gnarbarian
sounds like an opportunity to me.

~~~
sophacles
It's a hard space to break into. The businesses are conservative about any
change, and new vendors are fairly untrusted. Further, it is a bit different a
world from normal software realms, due to crazy long legacy lifetimes. Another
problem is that the money isn't as big as you'd think; it's a surprisingly
small field.

All that being said, there is opportunity, just not easy opportunity. And a
huge number of the people in it are boomers, so there are going to be big
shake-ups in the next decade or two.

------
cedsav
Whoever was your boss should have taken responsibility. Someone gave you
access to the production database instead of setting up a proper development
and testing environment. For a company doing "millions" in revenues, it's odd
that they wouldn't think of getting someone with a tiny bit of experience to
manage the development team.

~~~
tseabrooks
We sell middleware to a number of customers with millions of dollars in
revenue who don't have backups, don't have testbeds for rolling out to "dev"
before pushing to "prod" and don't have someone with any expertise in managing
their IT / infrastructure needs.

My experience is that this is the norm, not the exception.

~~~
jzelinskie
As a college student, this fucking horrifies me. Is there any way I can
guarantee I don't end up at someplace as unprofessional as this? I want to
learn at my first job, not teach/lead.

~~~
potatolicious
The interview advice here is excellent. Ask questions - in the current climate
they're hunting you, not the other way around.

Additionally, start networking now. Get to know ace developers in your area,
and you will start hearing about top-level development shops. Go to meetups
and other events where strong developers are likely to gather (or really,
developers who give a shit about proper engineering) and meet people there.

It's next to impossible to know, walking into an office building, whether the
company is a fucked up joke or good at what it does - people will tell you.

------
mootothemax
_The CEO leaned across the table, got in my face, and said, "this, is a
monumental fuck up. You're gonna cost us millions in revenue"._

No, the CEO was at fault, as was whoever let you develop against the
production database.

If the CEO had any sense, he should have put you in charge of fixing the issue
and then _making sure it could never happen again_. Taking things further,
they could have asked you to find other worrying areas, and come up with fixes
for those before something else bad happens.

I have no doubt that you would have taken the task extremely seriously, and
the company would have ended up in a better place.

Instead, they're down an employee, and the remaining employees know that if
they make a mistake, they'll be out of the door.

And they still have an empty users table.

~~~
j_baker
To be fair, if the CEO were willing to take those steps, the company would
probably not have a deleted USERS table.

------
hackoder
I was in a situation very similar to yours. Also a game dev company, also lots
of user data etc etc. We did have test/backup databases for testing, but some
data was just on live and there was no way for me to build those reports other
than to query the live database when the load was lower.

In any case, I did a few things to make sure I never ended up destroying any
data. Creating temporary tables and then manipulating those.. reading over my
scripts for hours.. dumping table backups before executing any scripts.. not
executing scripts in the middle/end of the day, only mornings when I was fresh
etc etc.
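
For instance, the "dump a table backup before executing any script" habit can
be a two-liner (the table, condition, and script name are made up):

    # snapshot exactly the rows the script is about to touch, then run it
    mysqldump --where="credits < 0" mydb players > /tmp/players-before.sql
    mysql mydb < fix_negative_credits.sql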

I didn't mess up, but I remember how incredibly nerve-wracking that was, and I
can relate to the massive amount of responsibility it places on a "junior"
programmer. It just should never be done. Like others have said, you should
never have been in that position. Yes, it was your fault, but this kind of
responsibility should never have been placed on you (or anyone, really).
Backing up all critical data (what kind of company doesn't back up its users
table?! What if there had been hard disk corruption?), and being able to
restore in minimum time, should have been dealt with by someone above your pay
grade.

~~~
arethuza
Out of interest, why not create a database user account that is read only and
use that?

~~~
petriw
Just remember to always verify it's still read only.

Or a coworker will find the login in your scripts, repurpose it, then notice
they need more rights and "fix" the account for you.
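
For reference, a minimal sketch of creating such an account and re-checking its
grants later (the user, host, and database names here are made up):

    # create a SELECT-only account...
    mysql -e "CREATE USER 'report_ro'@'10.0.%' IDENTIFIED BY 'changeme';
              GRANT SELECT ON analytics.* TO 'report_ro'@'10.0.%';"
    # ...and periodically confirm nothing has widened its privileges
    mysql -e "SHOW GRANTS FOR 'report_ro'@'10.0.%'"   # should list GRANT SELECT and nothing more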

~~~
MBCook
Plus read-only isn't a guarantee. You can't _write_ data, but you can run a
bad select or join that ends up effectively locking the database.

SELECT * FROM my_200_GB_table will always be there.

~~~
michaelt
Why should a select or join lock a database? Surely no database lets one query
starve another of IO or CPU?

------
Yare
If it helps explain things, the only experience the CEO had before this social
game shop was running a literal one-man yogurt shop.

This happened a week before I started as a Senior Software Engineer. I
remember getting pulled into a meeting where several managers who knew nothing
about technology were desperately trying to place blame, figure out how to
avoid this in the future, and so on.

"There should have been automated backups. That's really the only thing
inexcusable here.", I said.

The "producer" (no experience, is now a director of operations, I think?)
running the meeting said that was all well and good, but what else could we do
to ensure that nobody makes this mistake again? "People are going to make
mistakes", I said, "what you need to focus on is how to prevent it from
sinking the company. All you need for that is backups. It's not the engineer's
fault.". I was largely ignored (which eventually proved to be a pattern) and
so went on about my business.

And business was dumb. I had to fix an awful lot of technical things in my
time there.

When I started, only half of the client code was in version control. And it
wasn't even the most recent shipped version. Where was the most recent
version? On a Mac Mini that floated around the office somewhere. People did
their AS3 programming in Notepad or directly on the timeline. There were no
automated builds, and builds were pushed from people's local machines - often
contaminated by other stuff they were working on. Art content live on our CDN
may have had source (PSD/FLA) distributed among a dozen artist machines, or
else the source for it was completely lost.

That was just the technical side. The business/management side was and is
actually _more hilarious_. I have enough stories from that place to fill a
hundred posts, but you can probably get a pretty good idea by imagining a
yogurt-salesman-cum-CEO, his disbarred ebay art fraudster partner, and other
friends directing the efforts of senior software engineers, artists, and other
game developers. It was a god damn sitcom every day. Not to mention all of the
labor law violations. Post-acquisition is a whole 'nother anthology of tales
of hilarious incompetence. I should write a book.

I recall having lunch with the author when he asked me "What should I do?". I
told him that he should leave. In hindsight, it might have been the best
advice I ever gave.

------
Morendil
So the person who made a split-second mistake while doing his all for the
business was pressured into resigning - basically, got fired.

What I want to know is what happened to whoever decided that backups were a
dispensable luxury? In 2010?

There's a rule that appears in Jerry Weinberg's writings - the person
responsible for an X million dollar mistake (and who should be fired over such
a mistake) is whoever has controlling authority over X million dollars' worth
of the company's activities.

A company-killing mistake should result in the firing of the CEO, not in that
of the low-level employee who committed the mistake. That's what C-level
responsibility means.

(I had the same thing happen to me in the late 1990's, got fired over it. Sued
my employer, who opted to settle out of court for a good sum of money to me.
They knew full well they had no leg to stand on.)

------
lkrubner
Klicknation is hiring. Of themselves, they say:

"We make astonishingly fun, ferociously addictive games that run on social
networks. ...KlickNation boasts a team of extremely smart, interesting people
who have, between them, built several startups (successful and otherwise);
written a novel; directed music videos; run game fan sites; illustrated for
Marvel Comics and Dynamite Entertainment with franchises like Xmen, Punisher,
and Red Sonja; worked on hit games like Tony Hawk and X-Men games; performed
in rock bands; worked for independent and major record labels; attended
universities like Harvard, Stanford, Dartmouth, UC Berkeley; received a PhD
and other fancy degrees; and built a fully-functional MAME arcade machine."

And this is hilarious: their "careers" page gives me a 404:

<http://www.klicknation.com/careers/>

That link to "careers" is from this page:

<http://www.klicknation.com/contact/>

I am tempted to apply simply to be able to ask them about this. It would be
interesting to hear if they have a different version of this story, if it is
all true.

------
caseysoftware
One of the things I like asking candidates is "Tell me about a time you
screwed up so royally that you were sure you were getting fired."

Let's be honest, we all have one or two.. and if you don't, then your one or
two are coming. It's what you learned to do differently that I care about.

And if you don't have one, you're either a) incredibly lucky, b) too new to
the industry, or c) lying.

~~~
blowski
This. We all mess up, but only the best ones will deal with it professionally
and learn from it. Sounds like the OP is in that group. He didn't try to hide
it, blame it on anyone else, or make excuses. He just did what he could to fix
his mistake.

When people say "making mistakes is unacceptable - imagine if doctors made
mistakes" they ignore three facts:

1. Doctors do make mistakes. Lots of them. All the time.

2. Even an average doctor is paid an awful lot more than me.

3. Doctors have other people analysing where things can go wrong, and
recommending fixes.

If you want fewer development mistakes, as a company you have to accept it
will cost money and take more time. It's for a manager to decide where the
optimal tradeoff exists.

~~~
jiggy2011
> _If you want fewer development mistakes, as a company you have to accept it
> will cost money and take more time. It's for a manager to decide where the
> optimal tradeoff exists._

This is absolutely it. Of course, it is possible to become so risk-averse that
you never actually succeed in getting anything done, and there are certainly
organisations that suffer from that (usually larger ones).

However, some people seem to take the view that because it is impossible to
protect oneself from all risks, it is pointless to protect against any of them.

The good news is that protecting against risks tends to get exponentially more
expensive as you add "nines", therefore a 99% guarantee against data loss is
_a lot_ cheaper than a 99.999% guarantee.

Having a cronjob that does a mysqldump of an entire database, emails some
administrator and then does rsync to some other location (even just a dropbox
folder) is something that is probably only a couple of hours work.
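
For example, something along these lines (the hosts, paths, and addresses are
invented) gets you most of the way there:

    #!/bin/sh
    # nightly-backup.sh - run from cron, e.g.: 30 3 * * * /usr/local/bin/nightly-backup.sh
    set -e
    STAMP=$(date +%F)
    mysqldump --single-transaction --all-databases | gzip > /var/backups/db-$STAMP.sql.gz
    rsync -a /var/backups/ backup@offsite.example.com:/srv/backups/
    echo "database backup $STAMP completed" | mail -s "nightly db backup" admin@example.com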

------
kibwen
_"I found myself on the phone to Rackspace, leaning on a desk for support,
listening to their engineer patiently explain that backups for this MySQL
instance had been cancelled over 2 months ago."_

Here's something I don't get: didn't Rackspace have _their own_ daily backups
of the production server, e.g. in case their primary facility was annihilated
by a meteor (or some more mundane reason, like hard drive corruption)?

Regardless, here's a thought experiment: suppose that Rackspace _did_ keep
daily backups of _every_ MySQL instance in their care, even if you're not
paying for the backup service. Now suppose they get a frantic call from a
client who's not paying for backups, asking if they have any. How much of a
ridiculous markup would Rackspace need to charge to give the client access to
this unpaid-for backup, in order to make the back-up-every-database policy
profitable? I'm guessing this depends on 1) the frequency of frantic phone
calls, 2) the average size of a database that they aren't being paid to back
up, and 3) the importance and irreplaceability of the data that they're
handling (and 4) the irresponsibility of their major clients).

~~~
pbhjpbhj
Nope, not going to happen. There's at least one good reason, and that is that
if Rackspace leaks your data via a backup, they're going down to the tune of
millions.

Yes, it would be nice if Rackspace could speculatively create backups, but
they'd be dancing on ice doing so.

------
laumars
I really feel sorry for this guy. Accidents happen, which is why development
happens in a sandboxed copy of the live system and why backups are essential.
It simply shouldn't be possible (or at least, that easy) for human error to
put an entire company in jeopardy.

Take my own company: I've accidentally deleted /dev on development servers
(not that major of an issue thanks to udev, but the timing of the mistake was
lousy), a co-worker recently dropped a critical table on a dev database, and
we've had other engineers break Solaris by carelessly punching in _chmod -R /_
as root (we've since revised engineers' permissions so this is no longer
possible). As much as those errors are stupid, and as much as engineers of our
calibre should know better, it only takes a minor lack of concentration at the
wrong moment to make a major fsck-up. Which is doubly scary when you consider
how many interruptions the average engineer gets a day.

So I think the real guilt belongs to the entire technical staff, as this was a
cascade of minor fsck-ups that led to something catastrophic.

------
cantlin
Last year I worked at a start-up that had manually created accounts for a few
celebrities when they launched, in a gutsy and legally grey bid to improve
their proposition†. While refactoring the code that handled email opt-out
lists I missed a && at the end of a long conditional and failed to notice a
second, otherwise unused opt-out system that dealt specifically with these
users. It was there to ensure they really, really never got emailed. The
result?

<http://krugman.blogs.nytimes.com/2011/08/11/academia-nuts/>

What a screw up!

These mistakes are almost without fail a healthy mix of individual
incompetence and organisational failure. Many things - mostly my paying better
attention to functionality I rewrite, but also the company not having multiple
undocumented systems for one task, or code review, or automated testing -
might have saved the day.

 _[†] They've long been removed._

------
bambax
Once, a long time ago, I spent the best part of a night writing a report for
college, on an Amstrad PPC640 (<http://en.wikipedia.org/wiki/PPC_512>).

Once I was finished, I saved the document -- "Save" took around two minutes
(which is why I rarely saved).

I had an external monitor that was sitting next to the PC; while the saving
operation was under way, I decided I should move the monitor.

The power switch was on top of the machine (an unusual design). While moving
the monitor I inadvertently touched this switch and turned the PC off... while
it was writing the file.

The file was gone, there was no backup, no previous version, nothing.

I had moved the monitor in order to go to bed, but I didn't go to bed that
night. I moved the monitor back to where it was, and spent the rest of the
night recreating the report, doing frequent backups on floppy disks, with
incremental version names.

This was in 1989. I've never lost a file since.

~~~
hga
Yeah; I was lucky that my first experience where I could lose data like that
(before that it was on punched cards) was a nice UNIX(TM) V6 system on a
PDP-11/70 that had user-accessible DECtapes. Because I found the concept
interesting, I bought one tape, played around with it including backing up all
my files ... and then I learned the -rf flags to rm ^_^.

That was back in the summer of 1978; today I have an LTO-4 tape drive driven
by Bacula and backup the most critical stuff to rsync.net, the latter of which
saved my email archive when the Joplin tornado roared next to my apartment
complex and mostly took out a system I had next to my balcony sliding glass
doors and the disks in another room with my BackupPC backups.

As long as we're talking about screwups, my ... favorite was typing kill % 1,
not kill %1, as root, on the main system the EECS department was transitioning
to (that kills the initializer "init", from which all child processes are
forked). Fortunately it wasn't under really serious heavy use yet, but it was
embarrassing.

------
JohnBooty
This happened to me once on a much smaller scale. Forgot the "where" clause on
a DELETE statement. My screwup, obviously.

We actually had a continuous internal backup plan, but when I requested a
restore, the IT guy told me they were backing up everything _but_ the
databases, since "they were always in use."

(Let that sink in for a second. The IT team actually thought that was an
acceptable state of affairs: "Uh, yeah! We're backing up! Yeah! Well, some
things. Most things. The files that don't get like, used and stuff.")

That day was one of the lowest feelings I ever had, and that screwup "only"
cost us a few thousand dollars as opposed to the millions of dollars the blog
post author's mistake cost the company. I literally can't imagine how he felt.

~~~
anywherenotes
That is pretty hilarious. I guess you can save a lot of money on tapes if you
do incremental backups only on files that never change.

Personally, I felt bad when I deleted some files that were recovered within
the hour, and learned from that experience. But when you create a monumental
setback like the OP did by a simple mistake, that's an issue with people at
higher ranks.

------
islon
I know how you felt. Many years ago, when I was a junior working in a casual
game company, I was to add a bunch of credits to a poker player (fake money).
I forgot the WHERE clause in the SQL and added credits to every player in our
database. Lucky for me it was an add and not a set, so I could revert it.
Another time I was going to shut down my PC (a Debian box) using "shutdown -h
now" and totally forgot that I was in an SSH session to our main game server.
I had to call the tech support overseas and ask him to physically turn the
server back on...

~~~
hollerith
To avoid mistakes like that is why I put the hostname _and only the hostname
plus one character_ in my shell prompt.

(The other character is a # or $ depending on whether the user is root or
not.)
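
In bash that can be a one-line prompt setting (a sketch):

    # in ~/.bashrc: \h is the hostname, \$ prints # for root and $ for everyone else
    PS1='\h\$ '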

~~~
protomyth
At one job I went with this scheme for terminal background color: green screen
for development, blue for testing, yellow for stage / system test, and red for
production. This saved a lot of problems because I knew to be very careful
when typing in the red.

~~~
rurounijones
I have done this on a few servers but found that it always screws up
formatting of the lines in bash when they are long and you are hitting up and
going back through the history.

Did you change the $PS1 variable? Can you share your config?

~~~
mgedmin
You need to wrap the escapes in \[ \] to tell bash (actually readline) that
these characters do not advance the cursor when printed.
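
For example, a red production prompt with the colour escapes wrapped correctly
might look like this (the colours and layout are just an illustration):

    # \[ and \] mark the escapes as zero-width, so readline wraps long lines correctly
    PS1='\[\e[41m\]\u@\h\[\e[0m\]:\w\$ '   # red background on user@host, then reset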

------
newishuser
You did them more good than harm.

1) Not having backups is an excuse-less monumental fuckup.

2) Giving anyone delete access to your production db, especially a junior dev
through a GUI tool, is an excuse-less monumental fuckup.

Hopefully they rectified these two problems and are now a stronger company for
it.

~~~
generalpf
I think it's a bit extreme to say he did more good than harm. He might have
done some long-term good by having the company re-examine permissions and
environments, but he probably did a lot of long-term harm by alienating
current and future customers.

~~~
newishuser
Better that it happened 2 months after backups were canceled than 6 months or
later. If you're going to cancel your backups you're begging for disaster.

~~~
Piskvorrr
"But but but but...that item in the expense report is HUUUGE, and what revenue
did we get out of having backups lately? Or ever? I say we drop it, nothing
could possibly happen."

Some experiences are non-transferable. This identical conversation has taken
place millions of times, but noooo: every penny-wise-pound-foolish CEO wants
to experience the real thing, apparently.

------
mootothemax
If you ever notice that your employer or client isn't backing up important
data, take a tip from me: do a backup, today, in your free time, and if
possible (again in your free time) create the most basic regular backup system
you can.

When the time comes, and someone screws up, you will seem like a god when you
deliver your backup, whether it's a 3-month-old one-off, or from your crappy
daily backup system.

~~~
danielh
That is good advice, just make sure that it doesn't look like you are stealing
your customers/clients data.

~~~
mootothemax
_That is good advice, just make sure that it doesn't look like you are
stealing your customers/clients data._

Excellent point! Any tips on how to avoid that, other than not taking the data
home / copying to personal Dropbox-type things?

~~~
danielh
Well, IANAL. I think you already covered the most important point: store
backups on hardware/services under the control of your employer/client.

I would document the backup process and communicate it to my manager/client
with a mail like "hey, I set up backups, they are stored at <server>, docs are
in the wiki".

Other potential issues: causing unauthorized costs ("who stored 10TB on S3?")
or privacy violations, e.g. when working with healthcare or payment data.

------
munificent
> The CEO leaned across the table, got in my face, and said, "this, is a
> monumental fuck up. You're gonna cost us millions in revenue".

Yes, it is a monumental fuck-up. You put a button in front of a junior
developer that can cost the company millions if he accidentally clicks it, and
it _doesn't even have undo_.

------
leothekim
Mistakes happen, and there should have been better safeguards -- backups,
locking down production, management oversight.

But, I actually applaud how he tried to take responsibility for his actions
and apologized. Both "junior" AND "senior" people have a hard time doing this.
I've seen experienced people shrug and unapologetically go home at 6pm after
doing something equivalent to this.

The unfortunate thing here seems to be that he took his own actions so
personally. He made an honest mistake, and certainly there were devastating
consequences, but it's important to separate the behavior from the person. I
hope he realizes this in time and forgives himself.

------
pmelendez
There are several reasons why you should _not_ feel guilty. The company was
asking for trouble and you just happened to be the trigger. These are the top
three things that could have prevented that incident:

1) A cron job for the manual task you were doing.

2) Not working directly on production.

3) Having daily backups.

And this could have happened to anybody. After midnight, any of us is at
junior level and very prone to this kind of mistake.

~~~
tlrobinson
Hopefully prioritized in reverse order

~~~
pmelendez
Touché! I didn't intend to prioritize my list, but you are right :)

------
alan_cx
I cannot believe that people still don't have reliable backups in place.

My feeling is this: If you are in any way responsible for data that is not
backed up, you should be fired or resign right now. You should never work in
IT, in any way, ever again. If you are the CEO of a company in a similar
state, again, fire yourself right now. Vow to never ever run a business again.
This is 2013. And guess what? You still can't buy your unique data back from
PCWorld. Your data is "the precious".

As for the treatment of this guy, IMHO, his employers were the worst kind of
spineless cowards. This was 100% the fault of the management, and you know
what? They know it. To not have backups is negligent, and should result in
high-up firings. Yet these limp cowards sought to blame this kid. Pure
corporate filth of the lowest order. Even the fact he was junior is
irrelevant; anyone could have done that, more likely a cocky senior taking
some shortcut. Let me tell you now, I have made a similar cock-up, and I think
I know it all. But I had backups, and lucky for me, it was out of business
hours. Quick restore, and the users never knew. I did fess up to my team since
I thought it had direct value as a cautionary tale.

Frankly, I am utterly amazed and gutted that such a thing can still happen.
The corporate cowardice is sadly expected, but to not have backups is
literally unforgivable negligence.

Yeah, I'm quite fundamentalist about data and backups. I'd almost refer to
myself as a backup jihadist.

------
Tichy
Just wondering: when consulting, I usually take care that there are
appropriate clauses in the contract so that I am not liable. But what is the
rule for employees, are they automatically insured?

In Germany there is the concept of "Fahrlässigkeit" (negligence) and "grobe
Fahrlässigkeit" (gross negligence). By law you are already liable if you are
merely negligent, but it is possible to limit liability to gross negligence in
the contract. That is my understanding anyway (not a lawyer). Usually I also
try to kind of weasel out of it by saying the client is responsible for
appropriate testing and stuff like that... Overall it is a huge problem,
though, especially if the client has a law department. Getting insurance is
quite expensive because it's easy to create millions of dollars in damages in
IT.

Before court "standard best practices" can become an issue, too. This worries
me because I don't agree with all the latest fads in software development. It
seems possible that in the future x% test coverage could be required by law,
for example. Or even today a client could argue that I didn't adhere to
standard best practices if I don't have at least 80% test coverage (or
whatever, not sure what a reasonable number would be).

------
malux85
Whoever cancelled the backups was equally responsible

~~~
lmm
More responsible, I would say. You expect a junior to make mistakes; the
company should be structured to handle that happening.

Though I would look askance at whoever hired a philosophy grad as well, to be
perfectly honest. The author admits he didn't have the experience to spot bad
practice at the time.

~~~
eloisant
Actually even senior developers or architects make mistakes. Philosophy grad
or not, it doesn't matter. That's to be expected.

What's more questionable is:

* Developers have access to the production database from their machines, while it should only be accessible to the front machines within the datacenter.

* Junior developers don't need access to production machines; only sysops and maybe the technical PM do.

* No backup of the production database. WTF???

If they had had a hardware failure, they would have been in the same shit.

~~~
mbell
I'll add another one:

* No Foreign Keys

Attempting to clear the table should have just thrown a constraint violation
error.

~~~
eloisant
Well, depending on how you configure your cascades, clearing the user table
could have cleared all the other tables too :)

"on delete cascade"!

~~~
mbell
True on MySQL with InnoDB; it wouldn't be true with Postgres.

You'd have to use TRUNCATE CASCADE on Postgres to avoid the foreign key error.
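
A small sketch of the first case (MySQL/InnoDB; the tables are made up):

    mysql mydb <<'SQL'
    CREATE TABLE users (id INT PRIMARY KEY) ENGINE=InnoDB;
    CREATE TABLE raids (id INT PRIMARY KEY,
                        user_id INT,
                        FOREIGN KEY (user_id) REFERENCES users(id)) ENGINE=InnoDB;
    -- With the plain foreign key above, DELETE FROM users; fails with a
    -- constraint error while raids still has rows. Declared as ON DELETE
    -- CASCADE instead, the same statement would silently empty raids too.
    SQL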

------
kyllo
Sounds like no one else at that company had any more of a clue what they were
doing than you did. The whole scenario is horrifying.

------
Kesty
I did a very similar thing after one year working at my company: instead of
clearing the whole user table, I replaced every user's information with my
account information.

I forgot to copy the WHERE part of the query .....

The only difference is that it was policy to manually do a backup before doing
anything on production, and the data was restored in less than 10 minutes.
Even if I had forgotten to make a backup manually, we had a complete daily
backup and an incremental one every couple of hours.

------
Spooky23
If I were the author, I would rewrite this and reflect on what was actually
wrong here. At the end of the day, you resigned out of shame for a serious
incident that you triggered.

But the fact that the organization allowed you to get to that point is the
issue. Forget about the engineering issues and general organizational
incompetence... the human side is the most incredibly, amazingly ridiculous
part.

I respect your restraint. If I was singled out with physical intimidation by
some asshat boss while getting browbeaten by some other asshat via Skype, I
probably would have taken a swing at the guy.

Competent leadership would start with a 5-whys exercise. Find out why it
happened, why even the simplest controls were not implemented. I've worked in
places running on a shoestring, but the people in charge were well aware of
what that meant.

------
rhizome
_The CEO leaned across the table, got in my face, and said, "this, is a
monumental fuck up. You're gonna cost us millions in revenue". His co-founder
(remotely present via Skype) chimed in "you're lucky to still be here"._

This is when you should have left. That's no way to manage a crisis.

------
eric_bullington
Wow, I'm sorry you had to experience that. I'm sure it was traumatic -- or
perhaps you took it better than I would have. It must be of some comfort to
look back now and realize that you only bore a small part of the blame, and
that ultimately a large portion of the responsibility lies on the shoulders of
whoever set up the dev environment like that, as well as whoever cancelled the
backups.

------
hcarvalhoalves
You should fire the company for not having a staging environment nor up-to-
date backups.

------
lallouz
I would love to see some reflection on this story from OP. What do you think
you learned from this experience? Do you think your response was appropriate?
What would you have done differently? Are you forever afraid of Prod env?

Many, many, many of us have been in this situation before, whether as
'monumental' or not. So it is interesting to hear how others handle it.

~~~
mkrecny
OP here.

I realize that the dev environment was a recipe for disaster, and I was simply
the one to step on the mine .. but I believe my guilt about leaving the
company is 'quite right'. Thankfully I'm not forever afraid of Prod env - I
still do a lot of risky stuff .. but I always have nightly backups, and other
'recreate the data' strategies in place.

~~~
Kesty
Everyone makes, has made and will make mistakes. Junior/Senior is not
important.

You could also set up 20 layers of dev environments and it still wouldn't
matter; mistakes can still reach the outer layer.

You need to have the ability to recover from any problem quickly, and with the
data as up to date as you need it to be.

~~~
konstruktor
Risk avoidance (decent staging) and risk mitigation (backups) are two mostly
orthogonal aspects of risk management. Often, a backup will be a good first
step for a totally messed-up system. However, saying that mistakes will always
reach the outer layer, in order to discount the value of risk avoidance, is
talking about the possibility of risk realisation when what matters is the
probability.

~~~
Kesty
I'm not trying to discount the value of risk avoidance; they are both
important and should both be used, but mitigation should always be the
priority of the two.

1) When you have neither, you should focus on risk mitigation first.

2) Having a great and complex risk avoidance policy in place is a good thing,
but doesn't mean that you need a lesser mitigation system.

------
aidos
Ah ha ha ah yeah.... I've done that.

Something similar anyway (was deleting rows from production and hadn't
selected the where clause of my query before I ran it).

It was on my VERY FIRST DAY of a new job.

Fortunately they were able to restore a several-hours-old copy from a sync to
dev, but there wasn't a real plan in place for dealing with such a situation.
There could just as easily not have been a recent backup.

This was in a company with 1,000 employees (dev team of 50) and millions in
turnover. I've worked other places that are in such a precarious position too.

At least my boss at the time took responsibility for it - new dev (junior),
first day, production db = bad idea.

------
Yhippa
"The implications of what I'd just done didn't immediately hit me. I first had
a truly out-of-body experience, seeming to hover above the darkened room of
hackers, each hunched over glowing terminals."

Holy crap. I know that _exact_ same feeling. I had to laugh. I know that out-
of-body feeling all too well.

------
superflit
I would fire the CTO for canceling the backups.

NEVER... NEVER go to production without backups.

Backups are not only for 'recovery' but also to have 'historical' data for
audits, intrusion checks, etc.

And the 'other' guy on Skype?

'You are lucky to be here..'

Seriously? You are lucky to still be talking over Skype, because I am sure
Skype has some kind of backup of their user table..
------
ownagefool
I worked at a small web hosting company that did probably £2m in revenue a
year in my first programming job. They had me spending part of my time as
support and the other part on projects.

After about 3 or so months they took me out of support and literally placed my
desk next to the only full-time programmer the company had.

They made all changes directly on live servers. I'd already raised this as a
concern, and now that this became my full-time job, it was agreed that I'd be
allowed to create a dev environment.

Long story short, I exported the structure of our MySQL database and imported
it into dev. Some variable was wrong so it didn't all import, so I changed the
variable and dropped the schema to redo the import.

Yeah, that was the live database I had just dropped. After a horrible feeling
that I can't really explain, I fessed up. I had dropped it during lunch, so it
took about two hours to get a restore.

The owner went mad, but most other people were sympathetic, telling me about
their big mistakes and telling me that's what backups were for.

The owner was going crazy about losing money or something, and the COO pulled
me into a room. I thought I was getting fired, but he just asked me what had
happened and said "yeah, we all make mistakes, that's fair enough, just try
not to do it again".

I was then told to get on with it, and it must have taken me a day to finish
what would have taken me an hour, but I got it done, and now we had a process
and a simple dev environment. I lasted another two years there. I left over
money.

------
vinceguidry
I used to be a freelance web developer/tech guy with one client, a designer.
What made me quit was an incident where his client's Wordpress site hadn't
been moved properly to its new hosting (not by me).

The DB needed a search-and-replace to remove all the old URLs. After doing so,
the wp_options cell on the production site holding much of the customization
reverted to the theme's defaults; the serialized data format being used was
sensitive to brute DB-level changes.
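
A tiny illustration of why (the URLs are made up): WordPress stores many
options as PHP-serialized values, which embed each string's byte length, so a
raw search-and-replace corrupts them:

    # the declared length (18) matches the old URL...
    echo 's:18:"http://oldsite.com";' \
      | sed 's|http://oldsite.com|http://my-much-longer-domain.example|'
    # ...but not the replacement, so PHP's unserialize() rejects the value:
    #   s:18:"http://my-much-longer-domain.example";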

I had talked to my client before about putting together a decent process
including dev databases, scheduled backups, everything needed to prevent just
such a screwup, but he waffled. Then blamed me when things went wrong.

I'd had enough and told him to do his own tech work, leaving him to fix his
client's website himself. Being that I didn't build it, I didn't know which
settings to flip back. I left freelance work and never looked back.

People and companies do this all the time, refuse to spend the time and money
ensuring their systems won't break when you need them the most, then scapegoat
the poor little technician when it does.

I'd like to say the answer is "don't work in such environments," but there's
really no saying that it won't be this way at the next job you work, either.

I certainly wouldn't internalize any guilt being handed down, ultimately it's
the founders' jobs to make sure that the proper systems are in place, after
all, they have much more on the line than you do. Count it a blessing that you
can just walk away and find another job.

------
KenL
I agree with the comments here that spread the blame past this author.

I manage a large number of people at a news .com site and know that screw-ups
are always a combination of two factors: people & systems.

People are human and will make mistakes. We as upper management have to
understand that and create systems, of various tolerance, that deal with those
mistakes.

If you're running a system that allows a low-level kid to erase your data,
that is your fault.

I'd never fire someone for making a stupid mistake unless it was a pattern.

------
johngalt
"How I was setup to fail."

Who asks junior engineer to develop directly on live systems with write access
and no backup? Are you kidding me?

Edit: No one ever builds a business thinking about this stuff, until something
like this happens. There are people who have learned about operations
practices the hard way, and those who are about to. They hung the author out
to dry for a collective failure and it shows that this shop is going to be
taught another expensive lesson.

------
jtchang
I'm with everyone else in this thread: you screwed up but in reality it is
EXPECTED.

Do you know why I have backups? Because I'm not perfect and I know one day I
will screw up and somehow drop the production database. Or mess up a
migration. Or someone else will. This is stuff that happens ALL THE TIME.

Your CEO/CTO should have been fired instead. It is up to the leadership to
ensure that proper safeguards are in place to avoid these difficult
conversations.

------
greghinch
Whoever a) gave production db access to a "junior" engineer and b) disabled
backups of said database is at fault. I hope the author takes this more as a
learning experience of how to (not) run a tech department than any personal
fault.

Someone who has to use a GUI to manage a db at a company of that scale
shouldn't have access to prod.

------
chris_mahan
Let me make it really simple: Anything that happens in a company is always,
always management's fault. The investors hire the management team to turn a
pile of money into a bigger pile of money, and if management fails, it is
management's fault, because it can do whatever it needs to do (within the law)
to make that happen. That they failed to hire, train, motivate, fire, promote,
follow the law, develop the right products, market them well, ensure
reliability, ensure business sustainability, ensure reinvestment in research
and development, and ultimately satisfy investors, is their fault, and they
further demonstrate their failure by not taking responsibility for their own
failure and blaming others.

------
noonespecial
This was a "sword of Damocles" situation. No backups, no recovery plan, and
now clue how important any of these things were.

A thousand things can make an SQL table unreadable. "What do we do _when_ this
happens" is what managers are for, not finding someone to blame for it.

------
ferrouswheel
Ah, I remember being called away from my new year holiday when an engineer
dropped our entire database.

This happened because they didn't realise they were connected to the
production database (rather than their local dev instance). We were a business
intelligence company, so that data was vital. Luckily we had an analysis
cluster we could restore from, but afterwards I ensured that backups were
happening... never again.

(Why were the backups not already set up? Because they were not trivial due to
the size of the cluster, and having only been CTO for a few months, there was
a long list of things that were urgently needed.)

~~~
greyboy
This brings to mind one of my common responses: if it's not important enough
to back up, it's not important!

It may be expensive, either in complexity, costs of storage/services, etc, but
it's a necessity.

I'm curious about many of the comments in this thread - why are people logging
in as table owners? It's not too difficult (for talented data-driven
companies) to create roles or accounts that, while powerful, still make it
difficult to drop a table and such.

~~~
ferrouswheel
One answer... MongoDB.

------
KVFinn
>I was 22 and working at a Social Gaming startup in California.

>Part of my naive testing process involved manually clearing the RAIDS table,
to then recreate it programatically.

>Listening to their engineer patiently explain that backups for this MySQL
instance had been cancelled over 2 months ago.

"The CEO leaned across the table, got in my face, and said, "this, is a
monumental fuck up. You're gonna cost us millions in revenue".

What. The. Fuck.

The LAST person I would blame is the brand new programmer. They don't back
up their production database? If it wasn't this particular incident it would
have been someone else, or a hardware failure.

------
desireco42
I was working two years ago in very successful, billion dollar startup. All
developers had production access, but then, if you didn't know what you were
doing, you would not be working there. Also, we didn't routinely access
production and when we did, mostly for support issues on which we rotated, we
did it through a 'rails console' environment that enforced business rules. In
theory you could delete all data, but only in theory, and even then, we could
restore it with minimal downtime.

I think it is obvious that the CEO/CTO are the ones to be held responsible here.

~~~
desireco42
To add to this. I work again at a billion dollar company (I think, they are
really big). I don't work on their main property, but I have production db access.
This is something senior developers definitely should have access to, but also
with great privilege comes appropriate responsibility.

I routinely run reports, and sometimes I would wipe out a spammer that passed
our filters, etc.

------
systematical
Your CEO was correct. He should have said the same thing to the guy who
cancelled backups as well... and to the guy who never put in place and
periodically tested a disaster recovery plan. So much fail in this story, but
mistakes happen and I've had my share as well.

I once (nah twice) left a piece of credit card processing code in "dev mode"
and it wasn't caught until a day later costing the company over 60k initially.
Though they were able to recover some of the money getting the loss down to
20k. Sheesh.

------
S_A_P
Sounds to me like this operation was second rate and not run professionally.
If this sort of incident is even able to happen, you're doing it wrong. Maybe
it's just my experience with highly bureaucratic oil and gas companies, but
the customer database has no backup for 2 months?!?!?!?!?!?!

That is asinine. What would they have done if they couldn't pin it on a junior
engineer? A disk failure would have blown them out of the water. I think he
did them a favor, and hopefully they learned from that.

------
mac1175
Wow. This reminds me of a time in which I used to work for a consulting
agency. It was back in 2003 and I was working on some database development
for one of the company's biggest clients. One day, I noticed the msdb database
had a strange icon telling me it was corrupted. I went onto MSDN and followed
some instructions to fix it and, BAM, the database I had been working on for months
was gone (I was running SQL Server 2000 locally where this all happened and I
was very junior as a SQL developer). I was silently freaking out knowing this
could cost me my job. I got up from my desk and took a walk. On that walk, I
contemplated my resignation. When I got back from my walk, a thought occurred
to me that maybe the database file is still there (I had zero clue at the time
that msdb's main purpose was just cataloguing the existing databases among
other things). I did a file search in the MSSQL folders and found a file named
with my database's name. So, that day I learned to attach a database, what
msdb's role is, and to make sure to take precautions before making a fix!
However, OP's post shows that this company had no processes in place to
control levels of access or disaster recovery. That shows the company's faults
more than OP's.

------
samstave
This was clearly a lack of oversight and sound engineering practices.

Who cancelled the backups? Why were they cancelled? Was it for non-payment of
that service?

\---

I worked for an absolutely terrible company as Director of IT. The CEO and CTO
were clueless douchebags when it came to running a sound production operation.

The CTO would make patches to the production system on a REGULAR basis and
break everything, with the statement "that's funny... that shouldn't have
happened"

I had been pushing for dev|test|prod instances for a long time - and at first
they appeared to be on-board.

When I put the budget and plan together, they scoffed at the cost, and
reiterated how we needed to maintain better up-time metrics. Yet they refused
to remove Dave's access to the production systems.

After a few more outages, and my very loud complaining to them that they were
farking up the system by their inability to control access - I saw that they
had been hunting for my replacement.

They were trying to blame me for the outages and ignoring their own
operational faults.

I found another job and left - they offered me $5,000 to not disparage them
after I left. I refused the money and told them to fark off. I was not going
to lose professional credibility to their idiocy.

Worst company I have ever worked for.

------
fotoblur
I think that everyone does this at some point in their career. Don't let this
single event define you. The most important thing to ask yourself is what was
the lesson learned...not only from your standpoint but also from the
business'.

In addition, to heal your pain it's best to hear that you're not the only one
who has ever done this. Trust me, all engineers I know have a story like this.
(Please share yours HN - Here I even started a thread for it:
<http://news.ycombinator.com/item?id=5295262>)

Here is mine: When I worked for a financial institution my manager gave me a
production level username and password to help me get through the mounds of
red tape which usually prevented any real work from getting done. We were
idealists at the time. Well I ended up typing that password wrong more than 3
times... shit, I locked the account. Apparently half of production's apps were
using this same account to access various parts of the network. Essentially, I
brought down half our infrastructure in one afternoon.

Lesson learned: Don't use the same account for half your production apps. Not
really my fault :).

------
niggler
If you want to see monumental screw-up, look at knight capital group (they
accumulated a multi billion dollar position in the span of minutes, losing
upwards of $440M, and ended up having to accept a bailout and sell themselves to
GETCO):

[http://dealbook.nytimes.com/2012/08/03/trading-program-
ran-a...](http://dealbook.nytimes.com/2012/08/03/trading-program-ran-amok-
with-no-off-switch/)

------
blisterpeanuts
Good lord, that's unbelievable! If millions of dollars are riding on a
database, they should have spent a few thousand to replicate the database, run
daily backups and maintain large enough rollback buffers to reverse an
accidental DROP or DELETE.

We've all screwed up at various times (sometimes well beyond junior phase),
but not to have backups.... That's the senior management's fault.

------
gearoidoc
This post just made me feel 10x smarter (not that I blame the author - the
blame here lies at the feet of the "Senior" Devs).

------
cmbaus
Any manager who doesn't take responsibility for this isn't a manager you'd
want to work for. The manager should be fired.

~~~
hakaaaaak
Agreed, but if the manager took responsibility for this he or she probably
would be fired. Still, it is the only way to be; otherwise, you're not a real
leader.

------
doktrin
I found myself doing very much this my very first day on the job working for a
software startup.

We had a Grails app that acted as a front end for a number of common DB
interactions, which were selected via a drop down. One of these actions (in
fact, the default) was titled "init DB". Of course, this would drop any existing
database and initialize a new one.

When running through the operational workflow with our COO on the largest
production database we had, I found myself sleepily clicking through the menu
options without changing the default value. I vividly remember the out of body
experience the OP describes, and in fact offered to fire myself on the spot
shortly thereafter.

It's fun to laugh about in hindsight, but utterly terrifying in the moment -
to say nothing of the highly destructive impact it had on my self confidence.

~~~
roryokane
How did the company deal with the loss of that database? Did they actually
have backups, and just restored the data? Did they reconstruct the data from
other sources?

~~~
doktrin
In our case we had periodic backups, and together with filesystem logs we were
able to restore most of the data. However, we were hosting _highly_ sensitive
data and the work being done was time critical. The downtime was therefore not
popular with our clients, who were losing ~$15k per hour offline.

------
jgeerts
This article sounds so incredible to me, I think I might have been holding my
breath reading it. There are two major mistakes here that the company is
responsible for, not the author. First, why would they let anyone in on the
production password and run direct queries against that database instead of
working in a different environment? It's laughable that they sent this to
their customers, admitting their amateurism. Secondly, no backups? At my
previous project, a similar thing happened to our scrum master: he accidentally
dropped the whole production database in much the same situation. The
database was back up in less than 10 minutes with an earlier version. It's
still a mistake that should not be possible to make, but when it happens you
should have a backup.

------
tetsuseus
I once fired everyone at a nonprofit foster care company with a careless
query.

I cried to the sysops guy, and he gave me a full backup from 12 hours before,
and before any cronjobs ran I had the database back in order.

Backups are free. It was their fault for not securing a critical asset to
their business model.

------
alyrik
Oh dear... I once logged into the postgresql database of a very busy hosted
service in order to manually reset a user's password. So I started to write
the query:

UPDATE principals SET password='

Then I went and did all the stuff required to work out the correctly hashed
and salted password format, then finally triumphantly pasted it in, followed
by '; and newline.

FORGOT THE WHERE CLAUSE.

(Luckily, we had nightly backups as pg_dump files so I could just find the
section full of "INSERT INTO principals..." lines and paste in a rename of the
old table, the CREATE TABLE from the dump, and all the INSERT INTOs, and it
was back in under a minute - short enough that anybody who got a login failure
tried again and then it worked, as we didn't get any phone calls). It was a
most upsetting experience for me, however...
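
The boring guard that catches this class of mistake is to run one-off updates
inside an explicit transaction and check the reported row count before
committing. A minimal sketch in Postgres, assuming a principals table with an
id column as in the story above (the id and hash value are placeholders):

    BEGIN;
    UPDATE principals SET password = 'hashed+salted-value' WHERE id = 42;
    -- psql prints the affected row count; "UPDATE 1" is what we want here.
    -- If it says "UPDATE 250000", a ROLLBACK undoes it and nothing is lost.
    COMMIT;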

------
fayyazkl
While I fully agree with the position that the author is not entirely
responsible, I find it hard to believe it happened the way it appears to.

It could be true, but there are a bunch of holes. I can believe that he was
lousy enough to click "delete" on the users table. I can believe that when
the dialog box asked "are you sure you want to drop this table" he clicked
yes. I can believe that after deleting he "committed" the transaction. But
what I can't believe is that the database let him delete a table which every
other table referenced through a foreign key constraint. It could be argued
that for efficiency they hadn't put constraints on the table, but it's hard
to digest.

Probably the story is somewhat tailored to fit a post.

------
clavalle
Wow. So many mistakes.

Working in production database? Bad.

No backups of mission critical data? Super bad.

Using a relational database as a flat data store? Super bad.

Honestly...I think this company deserved what they got. Good thing the author
got out of there. Hopefully in their new position they will learn better
practices.

~~~
jpalacios
Yah this story made me cringe. What exactly do you mean by:

"Using a relational database as a flat data store? Super bad."

Are you referring to the users table? I am not too accustomed to using flat
files, so I am curious.

~~~
clavalle
Users is a bit of a core table in most applications. If they were using the
relational database as it should be used there would be references to the user
table elsewhere in the database.

If you tried to delete the table, it would fail stating that a deletion would
violate the constraints assuming you didn't have deletions cascade
automatically (which would be equally bad).

On the other hand (and it probably happened here) there will be one table with
all sorts of data bolted on.

So say you want a user to have multiple pieces of armor (following the spirit
of this post). You should have an armor table and a user to armor many to many
table. But instead you just add an Armor column to the user record and create
a new user record (with the same username for example but with a different
unique artificial key) with the new piece of armor in the armor column. Then
to retrieve it you just select armor where username = whatever and iterate
through the list. Adds and deletions are just as easy. So, why not? Well,
duplication of data, for one thing. And no referential integrity protection
for another. Delete a username and everything is deleted. Forget a where
clause and you are sunk.
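
To make the referential-integrity point concrete, here's roughly what that
normalized layout looks like in MySQL/InnoDB - table and column names are
purely illustrative, not taken from the article:

    CREATE TABLE users (
      id       INT AUTO_INCREMENT PRIMARY KEY,
      username VARCHAR(64) NOT NULL UNIQUE
    ) ENGINE=InnoDB;

    CREATE TABLE armor (
      id   INT AUTO_INCREMENT PRIMARY KEY,
      name VARCHAR(64) NOT NULL
    ) ENGINE=InnoDB;

    -- The many-to-many link table; RESTRICT blocks deletes of referenced users.
    CREATE TABLE user_armor (
      user_id  INT NOT NULL,
      armor_id INT NOT NULL,
      PRIMARY KEY (user_id, armor_id),
      CONSTRAINT fk_user_armor_user
        FOREIGN KEY (user_id)  REFERENCES users(id)  ON DELETE RESTRICT,
      CONSTRAINT fk_user_armor_armor
        FOREIGN KEY (armor_id) REFERENCES armor(id) ON DELETE RESTRICT
    ) ENGINE=InnoDB;

With that in place, dropping the users table fails with a foreign key error,
and a blanket DELETE FROM users fails as soon as any armor rows reference it,
instead of silently succeeding.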

~~~
jpalacios
Ah I see.. I misunderstood the first time around. I thought you meant to store
the user table in a flat file. Thank you for the explanation. That reminds me,
I need to convert to Innodb one of these days.

------
pja
I doubt the problems this company had started when they employed the author of
this blog post!

------
anton-107
It's unlikely that the database had no foreign keys referencing the users
table. And if there were, the DBMS should have prevented deleting all users
from the table.

Perhaps the database designer also failed at his job. As did the guys who
cancelled the backups and set up the dev environment.

~~~
beering
Well, the foreign key could have been set up to cascade deletes, in which case
they would have been extra-screwed.
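
Building on the hypothetical user_armor sketch a few comments up, that's just
a different choice on the same constraint (the constraint name comes from that
made-up example):

    -- RESTRICT (the default) blocks a delete of referenced users;
    -- CASCADE silently propagates it to every referencing row instead.
    ALTER TABLE user_armor
      DROP FOREIGN KEY fk_user_armor_user,
      ADD CONSTRAINT fk_user_armor_user
        FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE;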

------
elomarns
Although the author of the post obviously made a huge mistake, he is far from
being the one actually responsible for the problem that followed from it. It's
the job of the CTO to make sure no one can harm the company's main product this
way, accidentally or not.

He should never have been writing code against the production database when
developing new features. And if he was doing it, it wasn't his fault,
considering he was a junior developer.

And who the hell is stupid enough not to have any recent backup of the database
used by a piece of software that brings in millions in revenue?

In the end, when you do such a shitty job protecting your main product, shit
will eventually happen. The author of the post was merely an agent of destiny.

------
tn13
I don't think this is the author's fault. These kinds of human mistakes are
more than common. It is sad that the top management actually assigned the blame
to this young man. This was an engineering failure.

I can understand what this person must have gone through.

------
jdmaresco
I had a similar situation when collaborating with a team on a video project
during a high school internship. Somehow I managed to delete the entire
timeline accounting for hours of editing work that my boss had put in. To this
day I don't know how it happened, I just looked down and all the clips were
gone from the timeline. In the end, I think we found some semblance of a
backup, and at least we didn't lose the raw data/video content, but I can
relate to the out-of-body experience that hits you when you realize you just
royally screwed up your team's progress and there's nothing you can do about
it.

------
zulfishah
Every engineer's worst nightmare. I've worked at one of the biggest software
companies in the world, and I'm working on my own self-funded one-person
startup: the panic before doing anything remotely involving production user
data is still always nerve-wracking to me. But I agree with everyone's
assessments here of the failure of the whole company to prevent this. A
hardware failure could just as likely have wiped out all their data. If
you're going to cut corners with backing up user data, then you should be
prepared to suffer the consequences.

Thanks for sharing this. Took real guts to put it out there.

------
navid_dichols
If your senior management/devs are worth anything, they were already aware
that this was a possibility. There is no excuse for what ostensibly appears to
be a total lack of a fully functioning development & staging environment--not
to mention any semblance of a disaster recovery plan.

My feeling is that whatever post-incident anger you got from them was a
manifestation of the stress that comes from actively taking money from
customers with full knowledge that Armageddon was a few keystrokes away. You
were just Shaggy pulling-off their monster mask at the end of that week's
episode of Scooby Doo.

------
logn
Your response should have been: "With all due respect sirs, I agree that I am
still lucky to be here, that the company is still here being that it's so
poorly managed, that they cancelled their only backups with rackspace, that
they had no contingency plans, and that you were one click from losing
millions of dollars--in your estimate. It makes me wonder what other bills
aren't being paid and what other procedures are woefully lacking. I will agree
to help you through this mess and then we should all analyze every point of
failure from all departments, and go from there."

------
monkeyonahill
That wasn't your failure per se. But the failure of pretty much everyone above
you. That they treated you like that after the fact is pretty shitty. In
hindsight I'd say that you are much better off by not being there, where you
would learn bad practices.

No Stage Environment. Proactively Cancelled Backups on a Business Critical
System. Arbitrarily implementing features 'because they have it' rather than it
having some purpose in the business model. No Test Drills of disaster
scenarios. The list goes on. As I say, and you probably realise now, you
are lucky to no longer be there.

------
flayman
This is not your fault. Not really. And it's galling that the company blamed
the incident on the workings of a 'junior engineer'. There was NO DATABASE
BACKUP! For Christ's sake. This is live commercial production data. No
disaster recovery plan at all. Zilch. And to make matters worse, you were
expected to work with a production database when doing development work. This
company has not done nearly enough to mitigate serious risks. I don't blame
you for quitting. I would. I hope you have found or manage to find a good
replacement role.

------
banachtarski
What company that makes millions in revenue doesn't replicate their database
or at least have snapshots?

What engineer uses any GUI to administrate MySQL?

This story feels totally unreal to me (unreal as in just crazy, not
disbelief).

------
developingJim
Know a lot of others have said it, but no production backups? Blame a junior
dev for a mistake that almost 100% of the people I've worked with have made at
some point or another (including me)? I feel horrible for the author, it's
sickening the way he was treated. Now they'll just move on, hire another
junior, never mention this, and guess what? The next guy will do the same
thing and there probably still aren't any backups. Didn't learn anything,
well, other than how easy it is to blame one person for everyone's failure.

------
TheTechBox
A lot of people have said it before on here but really?! The company is
blaming one person. Whilst yes, it was technically his fault, why in the first
place was he allowed on the production database, and why wasn't the company
keeping very regular backups of all this mission critical data?

If the company saw that the data contained in this live database was so
critical, you would have thought they would not have given the keys to everyone
and that if they did, they would at least make sure that they could recover
from this, and fast.

------
stretchwithme
While working for a large computer company in the late 90s, I joined a team
that ran the company store on the web. The store used the company's own
e-commerce system, which it was also selling.

The very first day, at home in the evening, I went to the production site to
see if I could log in as root using the default password. Not a problem.

Anyone with any experience with the product could easily have deleted the
entire database. I immediately changed the password and emailed the whole
team.

No one ever responded.

------
lucb1e
Let me get this straight.

\- Tens of thousands of paying customers

\- No backups

\- Working in a production database

\- Having the permissions to empty that table

\- Even having read access to that table with customer info...

You are hardly responsible. Yeah you fucked up badly, but everyone makes
mistakes. This was a big impact one and it sucks, but the effect it had was in
no way your fault. The worst-case scenario should have been two hours of
downtime and 1-day-old data being put back in the table, and even that could
have been prevented easily with decent management.

------
ThePhysicist
The only people that should have gotten fired for this are:

* The person responsible for the database backup (no backup plan for your production DB!? wtf)

* The person having designed the SQL admin tool (not putting an irreversible DELETE operation behind a confirmation dialogue!? wtf)

* The person giving full write access to the company's production database to a junior developer (data security!? wtf)

Sure, the employee made a mistake, but most of the failure here is due to the
bad management and bad organizational design.

------
danielna
I still remember the all-consuming dread I felt as an intern when I ran an
UPDATE and forgot the WHERE clause. I consider it part of the rite of passage
in the web developer world. Kind of like using an image or text in a dev
environment that you never expect a client to see.

Luckily the company I was at (like any rational company) backed up their db
and worked in different environments, so it was more of a thing my coworkers
teased me for than an apocalyptic event.

------
hkmurakami
I'm a little worried for OP because he obviously took the time to keep the
characters in this article anonymous, but we now know who this CEO with
ridiculous behavior must have been, since we know the name of OP's former
company from his profile. Not sure what said former CEO of the now-acquired
company can do, but this is the kind of thing I fear happening to me when/if I
write something negative about a past employer, being a blogger myself.

------
jami
Awesomely honest and painful story.

This happened somewhat in reverse to someone I worked with. He was restoring
from a backup. He didn't notice the "drop tables" aspect, assuming, as one
might, that a backup would simply supplement new stuff rather than wipe it
clean and go back in time to a few weeks ago.

He is (still) well-liked, and we all felt sick for him for a few days. Our
boss had enough of a soul to admit that we should have had more frequent
backups.

------
VexXtreme
In the author's defense, it wasn't all his fault. Whoever thought it was a
good idea to:

1\. Work directly on the production database

2\. Not have daily backups

3\. Not have data migrations in place for these kinds of situations

needs to be fired immediately. My guess is it was one of the 'senior'
engineers and that the author only worked with what they gave him.

I've worked with all kinds of bozos but I've never seen this kind of
incompetence. Ridiculous.

------
imsofuture
Wow, that's terrible. Mistakes happen, and for the notion of 'blame' to
surface requires some monumentally incompetent management... the exact kind
that would have their junior programmers developing against an un-backed-up
production database.

The immediate takeaway from a disaster should _always_ be 'How can we make
sure this doesn't happen again?' not 'Man, I can't believe Fred did that, what
an idiot.'

------
anovikov
LOL, a gaming startup I worked for in 2010 had the same fuckup! But nobody was
fired or quit; there was just total anger around the place for a few days,
and almost all data was eventually recovered. The startup still flopped about
one year after that with ever-falling user retention rates - the marketplace
was increasingly flooded with more and more similar games.

------
toomuchcoffee
_The CEO leaned across the table, got in my face, and said, "this is a
MONUMENTAL fuck up."_

It certainly was -- on multiple levels, but ultimately up at the C-level.
Blaming a single person (let alone a junior engineer) for it just perpetuates
the inbred culture of clusterfuckitude and cover-ass-ery which no doubt was
the true root cause of the fuck-up in the first place.

------
ry0ohki
I think all developers have to do something like this at some point to develop
the compulsion I have, which is backups to the extreme. I can never have enough
backups. Before any DROP/ALTER type change I make a backup. (I've also learned to
pretty much never work on a production db directly, and in the event I need
to, doing a complete test of a script in staging first...)
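
In MySQL the cheapest version of that pre-change backup is a one-statement
table copy - the table name and suffix here are made up:

    -- Snapshot the rows before a risky DROP/ALTER/migration (CREATE TABLE ... AS
    -- SELECT copies data but not indexes or constraints, fine for a safety copy).
    CREATE TABLE users_backup_2013_03_01 AS SELECT * FROM users;
    -- If the change goes wrong, the data is one statement away:
    --   INSERT INTO users SELECT * FROM users_backup_2013_03_01;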

------
gte910h
He worked at a company stupid enough to test on the prod databases without
tools to safely clear them. The former is stupid, the latter is REALLY stupid.

This is a multi-layer failure and almost none of the blame falls on him.
Stupid compounded stupid, and this guy did nothing more than trip over the
server cord that several people who knew better stupidly ran past his cube exit.

------
Shorel
Awesome tale.

However, I think the CTO was the one who deserved to be fired.

Not having, at the very least, separate development and production environments
is the higher-ups' fault.

Where I work, developers can't even touch production systems, there's a
separate team responsible for that.

I even have a solr, nginx, php, mysql, etc separate install of almost
everything in my workstation, so I only touch test servers when doing testing.

------
andyhmltn
Stuff like this happens. The best thing to prevent something like this is to
completely sever the line between production and development. I've worked with
companies that work directly on the production database. It's horrible. How
can the person in charge of managing the workflow expect something like this
not to happen eventually?

------
msdet11
Things like this fall on the shoulders of the team as a whole. Certainly a
tough pill to swallow for a junior engineer, but a more senior developer or PM
or PM should've also realized you were working on prod and tried to remedy
that situation. Humans are notoriously prone to fat fingering stuff. Minimize
risk wherever you can!

------
friendly_chap
I think it is entirely clear from the writing that the author is a humble
being. I feel sorry for him, from the writing it seems he is a much better
person and engineer than most of the others at that company, pointing fingers
at him.

The guy may be absentminded, but that is a trait of some of the brightest
people on earth.

------
snambi
This company sucks. You are out of college and doing your first job. Are they
stupid enough to give you direct access to the production database? If they are
making millions in revenue, where were their DBAs? Obviously the management got
what they deserved. It's unfortunate that it happened through you.

------
bart42_0
You can't make an omelette without breaking eggs.

Clicking on 'delete' with the user table selected was not very wise. The
software maybe even asked 'Are you sure?' and of course you reply 'yes'.

But operating your company without proper recovery tools is a bit like climbing
Mount Everest without a rope.

If something goes wrong you are in deep sh.t.

------
danielweber
I feel tremendous empathy for this guy.

Not because I've done this. But because there but for the grace of God go I.
It wouldn't take much changing in the universe for me to be this guy.

I'm very glad he's posting it, and I hope everyone reads it, so you can learn
from his very painful mistake instead of committing it yourself.

------
praptak
They should reward him. Seriously, anyone who exposed such a huge weakness
deserves a reward. He limited the damage to only 10k users' data loss. With
such abysmally crappy practices the damage would have happened anyway, only
perhaps with 30k users and who knows what else instead of a mere 10k.

------
bfrenchak
This was not a junior engineer's fault, but the DBA's fault. Any company
should be backing up their database regularly, and then testing the restores
regularly. Also don't give people access to drop tables, etc. This was a very
poor setup on the part of the company/DBA not the engineer.

------
elhowell
Wow dude, that's quite a story. That must have been an awful feeling, I hope
you're doing better now

------
taurussai
Bold on your part to own up and offer a resignation. (The "higher ups" should
have recognized that and not accepted it). From the movie, "Social Network"
<http://www.youtube.com/watch?v=6bahX2rrT1I>

------
jiggy2011
Wow, that's quite a story. If your company is ever one button press away from
destruction, know that this will eventually happen.

I'm quite surprised stuff like this hadn't happened earlier. When I am doing
development with a database I will quite often trash the data when writing
code.

------
ArenaSource
I can't believe this story, but if it's true, don't worry, you just gave them
what they deserve

------
jinfiesto
This has been stated by others, but it's not the author's fault. It's totally
idiotic for a database like that not to have been regularly backed up. At
worst, this should have been only a couple hours of down time while the
database was restored.

------
shaurz
It's an organisational failure if a junior employee can bring down the company
in a few clicks. No backups, testing on the production database, this is no
way to run a company. Feel sorry for the guy who made a simple mistake.

------
lexilewtan
So many structures in life are based around 'not fucking up.' We protect our
assets & our dignity as if they mean anything; and yet at the end of the day
nobody knows what the fuck is going on.

really simple, revealing story. kudos.

------
zaidf
It seems insane that you still worked 3 days in a row after a gigantic mistake
that can be attributed in good part to being overworked.

Once the damage was done, I would have sent you home instead of overworking
you further.

------
SonicSoul
this is insanity! it was already pointed out in comments but i still can't
believe a company that mature (actually having 1000's of users and millions in
revenue!!!) would omit such a basic security precaution. Giving [junior?!]
developers free rein in the production database and no backups???? seriously,
the CTO should have been fired on the spot instead of putting blame on the
developer.

no matter how careful you are (i'm extremely careful) when working with data,
if you're working across dev/qa/uat/prd, sooner or later someone on the dev
team will execute against the wrong environment.

------
misleading_name
It's not your fault. It's the fault of the person who cancelled backups, the
person who didn't check that backups were being created, the "senior" people
who let you work on the production database.... etc.

------
bwb
It was a mistake, but not a huge one. You should never have been without
backups, and that wasn't your responsibility. Plus, there should have been dev
instances and a proper coding environment.

So don't blame yourself there!

------
ballstothewalls
This is the most ridiculous thing ever. Why weren't there backups? Sure, the
author was the one who "pulled the trigger" but the management "loaded the
gun" by not making sure there were back-ups.

------
elicash
I think it's admirable that you stayed long enough to help fix everything
before quitting, despite it being rough -- even though, as others have said,
others screwed up even bigger than you did.

------
kunil
But why did you clear the users table in the first place? I don't get it.

------
scottschulthess
In my opinion, it's the fault of the whole organization, or at least the
engineering team, for making it so likely that something like this would
happen.

Database backups would've solved the problem

------
mydpy
My name is Myles. I read this and felt like I was looking into a crystal ball.
Fortunately, my work doesn't require I interact with the production database
(yet). Gulp.

------
Nikolas0
To be honest, as a CEO I would fire myself for letting someone on the team work
like that (I mean on the production server)

Plus there is no excuse for not having backups...

------
seivan
I find it disgusting that the "game designers" are the so called overlords.
Fuck them. If you're a developer and a gamer then you're practically a game
designer. Whatever "education" they had is bullshit. You can go from
imagination to reality with just you alone. And perhaps an artist to do the
drawing. All those "idea" fuckers a.k.a game designers are just bullshitters.

And yeah this wasn't your fault. It was the CTO's fault. He holds
responsibility.

"They didn't ask those questions, they couldn't take responsibility, they
blamed the junior developer. I think I know who the real fuckups are."

------
nighthawk
the monumental fuck up was cancelling the mysql backups and having all
engineers work directly with the production database; what you did was
INEVITABLE..

------
nekitamo
Using LinkedIn, you can easily figure out the name of the company and the name
of the game. Using CrunchBase you can figure out the name of the CEO.

------
deciob
This is the most incredible story I have read in a long time. To have such a
business relying entirely on one database with no backups... unbelievable!

------
engtech
I'm guessing the game was one of these? Likely Age of Champions.

<http://www.klicknation.com/games/>

------
zensavona
How is it that a company with 'millions in revenue' is directing a junior
developer to develop on a production database with no backups?

------
Aardwolf
In the article it says your coworkers looked differently at YOU. Did anyone
look differently at their database without backups though?

------
coldclimate
Did you click the wrong button - yes. Was this your fault - no. So many things
wrong here.

I hope he came out ok in the long run, it's a hell of a story.

------
cafard
He did well to leave a company that a) had such practices in place and b)
would hang out an inexperienced employee to dry like that.

------
capex
The senior engineers have got to own the mistakes of their juniors. That's how
teams are built. This clearly didn't happen in this case.

------
ommunist
You are an accidental hero and should be proud! You freed thousands of souls
from one of the worst digital addictions.

------
meshko
I really hope this is some kind of hoax and no real company was operating like
that in 2010.

------
kordless
I'm getting internal server error.

~~~
namityadav
Me too. Here's the cached version:
[http://webcache.googleusercontent.com/search?q=cache:http://...](http://webcache.googleusercontent.com/search?q=cache:http://edu.mkrecny.com/thoughts/how-
i-fired-myself)

------
madao
What I want to know is, why didn't the guy who canceled the database backups
get fired also?

------
geldedus
It is the fault of those responsible for creating a regular backup procedure
and/or a hot-swap database server.

And developing on the production database speaks volumes about the
incompetency level of that company, and of the "developer" in particular,
after all.

------
shocks
I'm getting a 500 error. Anyone else? Anyone got a mirror or able to paste the
content?

~~~
shocks
Finally got through. Mirrored here in case other people get 500 errors.
<http://pastie.org/6348763>

------
stretchwithme
This is handing a heart for transplant to the Post Office and hoping for the
best.

------
joeblau
On the bright side, now you know not to test on the production database :).

------
alexrson
If your data is not backed up it may as well not exist.

------
outside1234
this is a great example of why you run the "five whys?" after a failure like
this.

The CEO/CTO should have fired himself as the answer to one of those.

------
smallegan
This reads like a PSA for backups and RI.

------
coolSCV
This is why you have backups.

------
jblotus
just awesome

------
hawleyal
TIFU

------
daemonfire300
This could've happened to anyone. It's a huge shame for those in charge, not
for you. Any business letting such operations happen without having backups or
proper user-rights management should consider why they still exist, if they
really make huge amounts of money as you mentioned.

------
rorrr
I don't see how it's your fault, other than making a slight error of clicking
on the wrong table name

1) Senior developers / CTO letting anybody mess with the prod DB should be
grounds for their firing. It's so incompetent, it's insane.

2) No backups. How is this even possible. You even had paying customers.

