Which is, if you think about it, true, given the series of events leading up to the disaster (the lack of a testing environment, working with prod databases, lack of safeties in the tools used to connect to the database, and so on).
The correct way to respond to disasters like this is "we fucked up", not "someone fucked up".
There is the truth that someone truncated the users table and, because of that, caused the company great harm.
Here's another 'truth'.
1. The company lacked backups
2. The junior developer was on a production database.
Note: I'm from the old school of sysops who feel that you don't give every employee the keys to your kingdom. Not every employee needs access to every server, nor do they need the passwords for them.
3. Was there a process change? I doubt it. More likely they made the employee feel like a failure every day and reminded him that he could be fired at any moment. So he did the only thing he could do to escape: quit!
Horrible and wrong; if there were a good ethics lawyer around, he would say it reeks of a lawsuit.
... That said, that lawyer and lawsuit won't fix anything.
If it hadn't been a junior developer who nobody noticed or cared was using the prod DB for dev work, it would eventually have been an outright DB failure. DBs fail, and if you don't have backups you are not doing your damn job.
The CEO should be ashamed of himself, but the lead engineers and the people at the company who were nasty to this guy should all be even more ashamed of themselves.
Giving keys to irresponsible people seems irresponsible ;)
You should consider yourself very lucky. Or very savvy at knowing which companies to avoid.
This is actually one of the things most employees who have never been managers don't understand.
I mean, in fact, I do something similar. At our company a lot of stuff goes wrong too. Somehow it surprises me that there hasn't been a major fuckup yet. But I do realize that I need to watch out at all times that blame never concentrates on me.
It is so easy to blame individuals; it suffices to have somehow participated in a task that fucked up. Given that all the other participants keep a low profile, one needs to learn how to defend (and attack) when blame is being assigned.
The reality is that fear is a greater motivator than any other emotion - greater than anger, sorrow, or happiness. So companies create cultures of fear, which produce productivity (at least a baseline, 'do what I need to do so I don't get fired' productivity) but little innovation, and often at the expense of growth.
Plus, it's just hell. You want to do great things, but know you are stepping into the abyss every time you try.
I've worked as a Software Engineer at TechStarsNYC, Klicknation (acquired), and Shelby, and I'm currently working on Followgen.
It's somewhat fascinating to hear that a company that damaged still managed to build a product that actually had users. And I thought my impression of the social gaming segment couldn't go any lower.
Not surprised that this was a gaming company because lack of teamwork seems to be endemic in that industry.
> The correct way to respond to disasters like this is "we fucked up", not "someone fucked up".
It's a fail on every possible level, the technical part being only a minor one.
I've had more than my fair share of failures in the start up world. It always drives me crazy to see internet companies that have absolutely no technical ability succeed, while competing services that are far superior technically never get any traction.
"Why are we paying for backups? The database has never failed (yet)"!
One interesting observation I can make: no correlation between excellence in operations and commercial success!
But if no one is watching to ensure the move was finished (or they got distracted), then something many people treat as set-it-and-forget-it could easily get into that state.
It's not just noble. Considering how everyone treated him, and the company's attitude in general, he had no future there, and neither should anyone else.
This only happened because nobody even asked "What happens if I press this button?"
Not once have I seen a full backup restore tested. Not once have I seen a network failure simulated (though I've seen several system failures due to "kicking out a cable" that sort of acts as a proxy for that technique). On multiple occasions I've seen systems taken down by single points of failure that weren't foreseen, but probably could have been.
 My favorite: the whole closet went down once because everything was plugged into a single, very expensive, giant UPS that went poof. $40/system for consumer batteries from Office Depot would have been a much better bet. And the best part? Once the customer service engineer replaced whatever doodad failed and brought the thing back up? They plugged everything right back into it.
This took down news information for a good chunk of Wellington finance for about half a day. (Fortunately Wellington, NZ is a tiny corner of the finance world).
Hilarious! But I admit I was super glad it was the boss playing chaos monkey, not me.
It turns out the emergency stop button was pushed in. Easy enough for us to fix then, but if the power had gone out at 4am it would have been quite another matter.
After that incident, we turned off the main breaker to the building weekly. It was great fun, as most of our offices were in the same building. We had complaints for the first couple of months until everyone got used to it and had installed mini UPS's for their office equipment.
We did actually have to use the generator for real a while later. Someone had driven their car into the local power substation, and it was at least a month until it was fixed. Electricity was restored through re-routing fairly quickly, but until the substation was repaired we were getting a reduced voltage that caused the UPSs to slowly drain...
Reminds me of the webserver UPS setup at a previous company.
The router (for the incoming T1) and the webserver were plugged in to the UPS.
UPS connected (via serial port) to webserver. Stuff running on webserver to poll whether UPS running from mains power or batteries and send panic emails if on batteries (for more than 60 seconds) and eventually shutdown the webserver cleanly if UPS power dropped below 25%.
Thing not plugged in to UPS: DMZ Network switch (that provided the connectivity between webserver and router).
Really, this attitude (that things have to be done right) is part of the problem here. To a seasoned IT wonk, the only alternative to doing something "The Right Way" is not doing it at all. And that's a killer in situations like these.
Don't hack your systems to make them work. Absolutely do hack at them to test.
For example, our recovery plan for a financial mainframe in case of most major disasters was to restore the daily backup to off-site hardware identical to production hw; however, the (expensive) hardware wasn't "empty" but used as an acceptance test environment.
Doing a full test of the restore would be possible, but it would be a very costly disruption; it would take multiple days of work for the actual environment restoration, deployment, and testing, and then all of this once more to rebuild a proper acceptance-test environment. It would also destroy a few man-months' worth of long tests-in-progress and prevent any change deployments while this is happening.
All of this would be reasonable in any real disaster, but such costs and disruptions aren't acceptable for routine testing.
"Chaos Monkey" works only if your infrastucture is built on cheap unstable and massively redundant items. You can also get excellent uptime with expensive, stable, massively controled environment with limited redundancy (100% guaranteed recovery, but not "hot failover") - but you can't afford chaos there.
How is this distinct from "Don't hack your systems to make them work. Absolutely do hack at them to test."? I don't see it.
This just sounds like "my business doesn't have the financial capacity to engineer data recovery processes". Well, OK then. Just don't claim to be doing it.
But a successful, efficient disaster recovery plan doesn't always mean "no damage" - it often means damage mitigation; i.e., we can fix this with available resources while meeting our legal obligations so that our customers don't suffer, not that there are no consequences at all. Valid data recovery plans ensure that data recovery really is possible and detail how it happens, but that recovery can be expensive. And while you can plan, document, train and test activities like "those 100 people will do X, and those 10 sales reps will call the involved customers and give them $X credit", you really don't want to put the plan into action without a damn good reason.
For example, a recovery plan for a bunch of disasters that are likely to cut all data lines from a remote branch to HQ involves documenting, printing & verifying a large pile of deal documents of the day, having them shipped physically and handled by a designated unit in the HQ. The process has been tested both as a practice and in real historical events.
However, if you "pull a wire in the closet" and cause this to happen just so, then you've just 'gifted' a lot of people a full night of emergency overtime work, and deserve a kick in the face.
All I can say is that you're very lucky to have a working system (and probably a company to work for), and I'm very lucky not to work where you do. Seriously, your "test" of a full disaster recovery was an actual disaster! More than one!
And frankly, if your response to the idea of implementing dynamic failure testing is that someone doing that should be "kicked in the face" (seriously, wtf? even the image is just evil), then shame on you. That's just way beyond "mistaken engineering practice" and well on the way to "Kafkaesque caricature of a bad IT department". Yikes.
Admittedly: you have existing constraints that make moving in the right direction expensive and painful. But rather than admit that you have a fragile system that you can't afford to engineer properly you flame on the internet against people who, quite frankly, do know how to do this properly. Stop.
For example, any reasonable factory also has a disaster recovery process to handle equipment damage/downtime - some redundant gear, backup power, an inventory of spare parts, guaranteed SLAs for shipping replacements, etc. But still, someone intentionally throwing a wrench in the machine isn't "dynamic failure testing" but sabotage that will result in anger from the coworkers who'll have to fix it. Should their system be called "improperly engineered"?
We had great engineers implementing failover for a few 'hot' systems, but after much analysis we knowingly chose not to do it 'your way' for most of them since it wasn't actually the best choice.
I agree, in 99% of companies talked about on HN your way is undoubtedly better, and in tech startups it should be the default option. But there, much of the business process was people & phone & signed legalese, unlike any "software-first" business; and the tech part usually didn't do anything better than the employees could do themselves, it simply was faster/cheaper/automated. So we chose functional manual recoveries instead of technical duplication. And you have to anyway - if your HQ burns down, who cares if your IT systems still work if your employees don't have planned backup office space to do their tasks? IT stuff was only about half of the whole disaster recovery problem.
In effect, we had an available "redundant failover system" all the time; it was manual instead of digital. It wasn't fragile (it never broke - as I said, we tried) and it was fully functional (customers wouldn't notice), but it was very expensive to run - every hour of running the 'redundant system' meant hundreds of man-hours of overtime pay and hundreds of unhappy employees.
So, in such cases, you do scheduled disaster testing and budget the costs of these disruptions as necessary tests - but if someone intentionally hurts his coworkers by creating random unauthorised disruptions, then it's not welcome.
The big disadvantage of this is actually not the data recovery or systems engineering, but the fact that it hurts the development culture. I left because in such a place you can't "move fast and break things"; instead everyone tends to make sure that every deployment really, really doesn't break anything. So we got very good system stability, but all the testing / QA usually meant at least 1-2 months before any finished feature went live - which fits their business goals (stability & cost efficiency rather than shiny features) but demotivates developers.
And almost no one is willing to put up the money to do complete testing of restore paths, let alone statistically making sure they continue to work.
Another story supporting Chaos Monkey is what the Obama team did for their Narwhal infrastructure - they staged outages and random failures to prepare for their big day, while Romney's team, which outspent the Obama team by at least an order of magnitude, had their system fail on election day.
The short of it is: they used static HTML generated by Jekyll and stored on S3.
> ... what the Obama team did for their Narwhal infrastructure - they staged outages and random failures to prepare for their big day, while Romney's team, which outspent the Obama team by at least an order of magnitude, had their system fail on election day.
If there were only two junior folks, what were the senior folks doing?
Part of your job as a senior developer is to ensure this very scenario doesn't happen at all, let alone to someone on your watch.
I'm assuming the senior devs were 23 and had a year on the job. The principal devs were a ripe 24, otherwise known as old men. :-)
rm * .doc
Or hires them and then adequately educates them in software development.
Of course the person saying that is likely to care about technical infrastructure when it costs them money and/or customers due to being hacked-together.
The whole company fucked this one up pretty badly. NO excuses.
In reality, this should have been refactored to use the dev db.
If it couldn't be, the junior dev should have been given access to the raids table alone for writes.
Lastly, the developer who didn't back up this table is the one MOST to blame. Money was paid for the state in that table. That means people TRUST you to keep it safe.
I count tons of people to blame here. I don't really see the junior dev as one of them.
The next morning he told us to restore the site from backups and fix the security hole. That's when we reminded him, again, that he had refused to pay for backup services for a site of that size.
We all ended up looking for new jobs within a couple of days.
CEO sounds incompetent as hell.
I automated ours on S3 with email notifications in under an hour...
The guy who should be falling on the sword, if anyone, is the person in charge of backups.
Better yet, the CEO or CTO should have made this a learning opportunity and taken the blame for the oversight + praised the team for banding together and coming up with a solution + a private chat with the OP.
This is a truly amazing story if this system was really supporting millions in revenue.
The kind of table manipulations he mentions would be unspeakable in most companies. Someone changing the wrong table would be inevitable. If I were an auditor, I would rake them all over the coals.
There's a big difference between being told "Do X, don't do Y" and that sinking feeling you get when you realize a big problem exists, regardless of the eventual outcome.
I feel for him, but at the same time there's a point at which you have to ask whether testing guns by shooting them near (but not specifically at) your coworkers is actually a good idea.
When people screw up and there are no processes to hold them back, they can really, really screw it up.
Once you hit millions in revenue it's probably wise to put a couple of fallback processes in place; reliability becomes as important as agility.
If you are a CEO you should be asking this question: "How many people in this company can unilaterally destroy our entire business model?"
If you are a CTO you should be asking this question: "How quickly can we recover from a perfect storm?"
They didn't ask those questions, they couldn't take responsibility, they blamed the junior developer. I think I know who the real fuckups are.
As an aside: way back in time I caused about ten thousand companies to have to refile some pretty important government documents because I was double-encoding XML (& became &amp;). My boss actually laughed and said "we should have caught this a long time ago"... by "we" he actually meant himself and support.
In high tech this can get really messy; these are frequently inherently more fragile companies. My favorite example is from Robert X. Cringely in this great book: http://www.amazon.com/Accidental-Empires-Silicon-Millions-Co... ; from memory:
One day Intel's yields suddenly went to hell (that's the ratio of working die on a wafer to non-working, and it's a key to profitability). And no matter how hard they tried, they could only narrow it down to the wafers being contaminated, but the wafer supplier swore up and down they were shipping good stuff, and they were. So eventually they tasked a guy with following packages from the supplier all the way to the fab lines, and he found the problem in Intel's receiving department, where a clerk was breaking open the sealed packages and counting out the wafers on his desk to make damned sure Intel was getting its money's worth....
His point is that you can have a Fortune 500 company, normally thought to be stable companies that won't go "poof" without ample warning, in which there are many more people than in previous kinds of companies who can very quickly kill it dead.
One of Toyota's mantras is "If the student has failed to learn, the teacher has failed to teach." Their point is that managers are responsible for solving issues that come from employee ignorance, not line workers.
I do understand that humans don't have an instinctual understanding of using nipples; human babies have a physical sucking reflex that kicks in when you put anything near their mouths. They usually quickly learn that sucking a nipple in a particular way yields lots of yummy milk.
I find this hard to believe. At some point a person in a space suit was introducing them into a clean room; she should have noticed that the packages were not sealed.
There's a good chance a tamper-revealing seal would have stopped the clerk from opening the containers, and of course if they'd been broken before reaching the people at the fab lines who were supposed to open them, that would have clued Intel in to the problem before any went into production and would have allowed them to quickly trace the problem back upstream.
Or, in this case, energy, ignorance, not learning enough of the big picture, and not wondering why these wafers were sealed in airtight containers in the first place.
It's the well meaning mistakes that tend to be the most dangerous since most people are of good will, or at least don't want to kill the company and lose their jobs.
A mere few months into my current job, I ran an SQL query that updated some rows without properly checking one of the subqueries. Long story short - I fucked up an entire attribute table on a production machine (client insisted they didn't need a dev clone). I literally just broke down and started panicking, sure that I'd just lost my new job and wouldn't even have a reference to run on for a new one.
After a few minutes of me freaking out, my boss just says: "You fucked up. I've fucked up before, probably worse than this. Let's go fix it." And that was it. We fixed it (made it even better than it was, as we fixed the issue I was working on in the process), and instead of feeling like an outcast who didn't belong, I learned something and became a better dev for it.
I was handed a legacy codebase with zero tests. I left a few small bugs in production, and got absolutely chewed out for it. It was never an issue with our processes, it was obviously an issue with the guy they hired who had 1 intro CS class and 1 rails hobby project on his resume. The lead dev never really cared that we didn't have tests, or a careful deploy process. He just got angry when things went wrong. And even gave one more dev with even less experience than I had access to the code base.
It was a mess and the only thing I was "learning" was "don't touch any code you don't have to because who knows what you might break" which is obviously a terrible way to learn as a junior dev (forget "move fast and break things" we "move slowly and work in fear!"). So I quit and moved on, it was one of the better decisions I've ever made.
Seven and a half years later I make sure that I pass that knowledge on to my junior colleagues. I'm proud to say that just in the past 2 weeks I've said this twice to one of my younger team-mates, a recent hire, "don't be afraid to break things!"
"Don't be afraid to break things as long as you have a backup".
It might be a simple version of the previous code, a database copy, or even the entire application. Do not forget to back up. If everything fails, we can quickly restore the previous working version.
Every production deployment should involve blowing away the prior instance, rebuilding from scratch, and restarting the service; you are effectively doing a near-full "restore" for every deployment, which forces you to have everything fully backed up and accessible...
Any failure to maintain good business continuity practices will manifest early for a product / employee / team, which allows you to prevent larger failures...
In the world where data needs to be maintained, this is not necessarily an option. In the bank where I work, we deploy new code without taking any outage (provide a new set of stored procedures in the database, then deploy a second set of middleware, then a new set of front-end servers, test it, then begin starting new user sessions on the new system; when all old user sessions have completed the old version can be turned off). Taking down the database AT ALL would require user outages. Restoring the database from backup is VERY HARD and would take hours (LOTS of hours).
That being said, we do NOT test our disaster-recovery and restore procedures well enough.
The bug was in the part I thought was so obvious that I missed a check.
It first builds inefficiency at the technical level, then at the business level; finally it causes issues at the cultural level, and that's when the smart people start leaving.
This is a question that the person in charge of backups needs to think about, too. I mean, rephrase it as "Is there any one person who can write to both production and backup copies of critical data?" but it means the same thing as what you said.
(And if the CTO, or whoever is in charge of backups, screws up this question? The 'perfect storm' means "all your data is gone" - I dunno about you, but my plan for that involves bankruptcy court and a whole lot of personal shame. Someone coming in and stealing all the hardware? Not nearly as big of a deal, as long as I've still got the data. My own 'backup' house is not in order, well, for lots of reasons, mostly having to do with performance, so I live with this low-level fear every day.)
Seriously, think, for a moment. There's at least one kid with root on production /and/ access to the backups, right? At most small companies, that is all your 'root-level' sysadmins.
That's bad. What if his (or her) account credentials get compromised? (Or what if they go rogue? It happens. Not often, and usually when it does it's a case of "but this is really best for the company." It's pretty rare that a SysAdmin actively and directly attempts to destroy a company.)
(SysAdmins going fully rogue is pretty rare, but I think it's still a good thought experiment. If there is no way for the user to destroy something when they are actively hostile, you /know/ they can't destroy it by accident. It's the only way to be sure.)
The point of backups, primarily, is to cover your ass when someone screws up. (RAID, on the other hand, is primarily to cover your ass when hardware fails.) RAID is not backup and backup is not RAID. You need to keep this in mind when designing your backup, and when designing your RAID.
(Yes, backup is also nice when the hardware failure gets so bad that RAID can't save you; but you know what? that's pretty goddamn rare, compared to 'someone fucked up.')
I mean, the worst case backup system would be a system that remotely writes all local data off site, without keeping snapshots or some way of reverting. That's not a backup at all; that's a RAID.
The best-case backup is some sort of remote backup where you physically can't overwrite the goddamn thing for X days. Traditionally, this is done with off-site tape. I (or rather, your junior sysadmin monkey) write the backup to tape, then test the tape, then give the tape to the Iron Mountain truck to stick in a safe. (If your company has money; if not, the safe is under the owner's bed.)
I think that with modern snapshots it would be interesting to create a 'cloud backup' service with a 'do not allow overwrite before date X' parameter, and it wouldn't be that hard to implement, but I don't know of anyone that does it. The hard part about doing it in house is that the person who manages the backup server couldn't have root on production and vice versa, or you defeat the point, so this is one case where outsourcing is very likely to be better than anything you could do yourself.
Which also means they can't fix something in case of a catastrophic event. "Recover a file deleted from ext3? Fix a borked NTFS partition? Salvage a crashed MySQL table? Sorry boss, no can do - my admin powers have been neutered so that I don't break something 'by accident, wink wink nudge nudge'." This is, ultimately, an issue of trust, not of artificial technical limitations.
> one case where outsourcing is very likely to be better than anything you could do yourself.
Hm. Your idea that "cloud is actually pixie dust magically solving all problems" seems to fail your very own test. Is there a way to prevent the outsourced admins from, um, destroying something when they are actively hostile? Nope, you've only added a layer of indirection.
(also, "rouge" is "#993366", not "sabotage")
>Which also means they can't fix something in case of a catastrophic event. "Recover a file deleted from ext3? Fix a borked NTFS partition? Salvage a crashed MySQL table? Sorry boss, no can do - my admin powers have been neutered so that I don't break something 'by accident, wink wink nudge nudge'." This is, ultimately, an issue of trust, not of artificial technical limitations.
All of the problems you describe can be solved by spare hardware and read only access to the backups. I mean, your SysAdmin needs control over the production environment, right? to do his or her job. but a sysadmin can function just fine without being able to overwrite backups. (assuming there is someone else around to admin the backup server.)
fixing my spelling now.
Yes, it's about trust. But anyone who demands absolute trust is, well, at the very least an overconfident asshole. I mean, in a properly designed backup system (and I don't have anything at all like this at the moment) I would not have write access to the backups, and I'm the majority shareholder and lead sysadmin.
That's what I'm saying... backups are primarily there when someone screwed it up... in other words, when someone was trusted (or trusted themselves) too much.
(that rouge/rogue thing is my pet peeve)
The idea here is to make sure that the people with write access to production don't have write access to the backups and vice versa. The point is that two people now have to screw up before I lose data.
Outsourcing has its place. You are an idiot if you outsource production and backups to the same people, though. This is why I think "the cloud" is a bad way of thinking about it. Linode and Rackspace are completely different companies... one of them screwing up is not going to affect the other.
I test backups for F500 companies on a daily basis (IT risk consulting) - this would be missing the point, really; the business process around this problem is moving towards live mirrored replication. That allows much faster recall times and also mitigates many of the risks of the conventional 'snapshot' method, whether tapes, cloud, etc.
Does Amazon Glacier offer this?
Redundancy = Reduce the number of component failures that can lead to system failure (RAID, live replication, hot standby).
Backup = Recover from an obvious failure or overwrite (weekly full backups, daily differentials).
Archival = Recover from a non-obvious failure and/or malicious activity (WORM tapes, offsite backup).
A failsafe against malicious sysadmins is to split up the responsibilities: the guy handling backups isn't handling archival, etc.
I am working for a company that does some data analysis for marketers, aggregated from a vast number of sources. There was a giant legacy MyISAM (this becomes important later) table with lots of imported data. One day, I made a trivial-looking migration (added a flag column to that table). I tested it locally and rolled it out to the staging server. Everything seemed A-OK until we started the migration on the production server. Suddenly, everything broke. By everything, I mean EVERYTHING: our web application showed massive 500s, total DEFCON 1 across the whole company. It turned out we ran out of disk space, since apparently MyISAM tables are altered the following way: first the table is created with the updated schema, then it is populated with data from the old table. MySQL ran out of disk space and somehow corrupted the existing tables; the server would start with blank tables, with all data lost.
I can confirm this very feeling: "The implications of what I'd just done didn't immediately hit me. I first had a truly out-of-body experience, seeming to hover above the darkened room of hackers, each hunched over glowing terminals." Also, I distinctly remember how I shivered and my hands shook. It felt like my body temperature fell by several degrees.
Fortunately for me, there was a daily backup routine in place. Still, it meant a several-hour outage and lots of apologies to angry clients.
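In hindsight a cheap pre-flight check would have caught it. A sketch (the schema/table names and datadir path below are made up): since MyISAM rebuilds the whole table as a copy during an ALTER, compare the table's on-disk size against the free space first.

    # how big is the table we're about to ALTER? (hypothetical names)
    mysql -N -e "SELECT ROUND((data_length + index_length)/1024/1024) AS size_mb
                 FROM information_schema.tables
                 WHERE table_schema = 'analytics' AND table_name = 'imported_data';"
    # the ALTER builds a full copy of the table, so at least that many MB must be free here:
    df -m /var/lib/mysql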
"There are two types of people in this world, those who have lost data, and those who are going to lose data"
We have dev databases (one of which was recently empty, nobody knows why; but that's another matter), then a staging environment, and finally production. And the database in the staging environment runs on a weaker machine than the prod database. So before any schema change goes into production, we do a time measurement in the staging environment to have a rough upper bound for how long it will take, how much disc space it uses etc.
And we have a monthly sync from prod to staging, so the staging db isn't much smaller than prod db.
And the small team of developers occasionally decides to do a restore of the prod db in the development environment.
The downside is that we can't easily keep sensitive production data from finding its way into the development environment.
I try to keep the data in the same form (e.g., length, number of records, similar relationships - it looks like production data). But it's random enough that if the data ever leaks, we don't have to apologize to everybody.
Since your handle is perlgeek, you're already well equipped to do a streaming transformation of your SQL dump. :)
- cp the dump to a new working copy
- sed out cache and tmp tables
- Replace all personal user data with placeholders. This part can be tricky, because you have to find everywhere this lives (are form submissions stored, and do they have PII?)
- Some more sed to deal with actions/triggers that are linked to production's db user specifically.
- Finally, scp the sanitized dump to the dev server, where it awaits a Jenkins job to import the new dump.
The cron job happens on the production DB server itself overnight (keeping the PII exposure at the same level it is already), so we don't even have to think about it. We've got a working, sanitized database dump ready and waiting every morning, and a fresh prod-like environment built for us when we log on. It's a beautiful thing.
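For what it's worth, here's a very rough sketch of that kind of overnight job (every table pattern, user name, and host below is invented; the real PII scrubbing is the hard, application-specific part):

    #!/bin/sh
    # hypothetical nightly sanitise-and-ship job, run on the prod DB server
    set -e
    cp /var/backups/prod.sql /tmp/dev.sql
    # drop the contents of cache/tmp tables from the dump
    sed -i '/^INSERT INTO `cache/d; /^INSERT INTO `tmp_/d' /tmp/dev.sql
    # crude PII scrub: blank out anything that looks like an email address
    sed -i "s/'[A-Za-z0-9._%+-]*@[A-Za-z0-9.-]*'/'user@example.invalid'/g" /tmp/dev.sql
    # retarget triggers/views defined by the production db user
    sed -i 's/DEFINER=`produser`@/DEFINER=`devuser`@/g' /tmp/dev.sql
    # ship it; a Jenkins job on the dev box imports whatever lands here
    scp /tmp/dev.sql devbox:/var/imports/latest.sql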
More than once (over the last few years) I've had to do some important update. I tend to do it the same way: open a transaction, run the statements, check the results, and then
COMMIT; or ABORT TRANSACTION;
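Roughly this, I assume (a generic sketch; the table and condition are invented, and it presumes a transactional engine such as PostgreSQL or InnoDB):

    BEGIN;
    UPDATE users SET plan = 'pro' WHERE signup_date < '2013-01-01';
    -- sanity-check the result while the transaction is still open
    SELECT plan, count(*) FROM users GROUP BY plan;
    COMMIT;   -- or ABORT; (i.e. ROLLBACK;) if the numbers look wrong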
So you always have to make a backup and check that the tables are defined the way they should be. Nothing is quite as much fun as the first time you delete a bunch of critical data off a production system. Luckily the table was small, basically static, and we had backups, so it was only a problem for ~5 minutes.
PostgreSQL can do DDL inside a transaction though.
But as far as I know, nobody was fired for this. Because, yes, things like this can just happen. And eventually it got fixed anyway.
No staging environment (from which ad-hoc backups could have been restored)!?!?
No regular testing of backups to ensure they work?
No local backups on dev machines?!?
Using a GUI tool for db management on the live db?!?!?
Junior devs (or any devs) testing changes on the live db and wiping tables?!?!?!
What an astonishing failure of process. The higher ups are definitely far more responsible than some junior developer for this, he shouldn't have been allowed near the live database in the first place until he was ready to take changes live, and then only on to a staging environment using migrations of some kind which could then be replayed on live.
They need one of these to start with, then some process.
They paid the price of ignoring what was actually the most critical part of their business.
Even if you have a rock-solid database management, backup, auditing etc process, if your game is not playable, you won't have any data that you could lose by having a DB admin mis-click.
Still, not handling your next-most-critical data properly is monumentally stupid and a collective failure of everyone who should have known.
What I'd normally do is have a production server, which has daily backups, copies of which are used for dev on local machines, and then pushed to a dev server with separate dev db which is wiped periodically with that production data (a useful test of restoring backups), and has no connection with the production server or db.
Can't work out why they would possibly be doing development on a live db like this, that's insanity.
I still use the mysql CLI and have for 10 years plus-or-minus, but I actually use Sequel Pro a lot. If I'm perusing tables with millions of rows, or I want to quickly see a schema, or whatever, it's been a net gain in productivity.
A migration enables you to track the changes you made and possibly rollback to previous migrations (database states) if ever required.
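Even without a framework, the idea can be as simple as paired up/down SQL files applied in order (a sketch; the file and database names are invented):

    # apply migrations in filename order; a real tool would also record which
    # ones have already run, but each NNN_*.up.sql here ships with a matching
    # NNN_*.down.sql that reverses it
    for f in migrations/*.up.sql; do
        mysql mygame < "$f"
    done
    # rolling back the latest change is then just:
    #   mysql mygame < migrations/003_add_flag_to_raids.down.sql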
1. "Oops, I wrote TRUNCATE TABLE User instead of TRUNCATE TABLE Raids"
2. Transaction complete. ... ... "oops!"
This is an easy mistake to make on command line. I hate GUIs too but not having one doesn't really help when your fundamental operating model is wrong.
Alarms went off, people came running in freaking out, trucks started rolling out to survey the damage, hospitals started calling about people on life support and how the backup generators were not operational, old people started calling about how they require AC to stay alive and should they take their loved ones on machines to the hospital soon.
My boss was pretty chill about it. "Now you know not to do that" were his words of wisdom, and I continued programming the system for the next 4 summers with no real mistakes.
I was setting up a system to detect the peak usage and alert the attendant to send a fax (1992) to our local college generator plant to switch over some of their power to us, thus reducing our yearly power bill by a lot.
There was a title grabber in charge of the department, and a highly competent engineer running everything and keeping the crews going. That was my boss. Each summer I was doing new and crazy things, from house calls for bad TV signals due to grounding to mapping out the entire grid. Oddly enough there was no documentation of which house was on which circuit... it was in the heads of the old linesmen. Most of the time when they got a call about power outages they drove around looking at which houses were dark and which were not.
Sometimes the older linesmen would call me on the radio to meet them somewhere, and we'd end up in their back yards taking a break having some beers. I learned a lot from those old curmudgeonly linesmen. They made fun of me non stop, but always with a wink and roaring laughter. Especially when I cut power to half the town.
All that being said, there is opportunity, just not easy opportunity. And a huge number of the people in it are boomers, so there is going to be big shake-ups in the next decade or two.
My experience is that this is the norm, not the exception.
Additionally, start networking now. Get to know ace developers in your area, and you will start hearing about top-level development shops. Go to meetups and other events where strong developers are likely to gather (or really, developers who give a shit about proper engineering) and meet people there.
It's next to impossible to know, walking into an office building, whether the company is a fucked up joke or good at what it does - people will tell you.
Also, you have a choice: you can leave if you don't like the job or don't find the practices in place to be any good, or you can fix them.
It doesn't cover every last thing, but a team following these practices is the kind of team you're looking for. Ask these questions at your interviews.
I honestly think the test is a little out of date, but if they say, "Well, instead of X we're doing better thing Y", that's a great answer.
But things are not always perfect, even on great teams. I'm not saying it's normal to destroy your production database! But even in good shops it's a constant challenge to stay organized and do great work. Look for a team that is at least trying to do great work, rather than a complacent team.
No, the CEO was at fault, as was whoever let you develop against the production database.
If the CEO had any sense, he should have put you in charge of fixing the issue and then making sure it could never happen again. Taking things further, they could have asked you to find other worrying areas, and come up with fixes for those before something else bad happens.
I have no doubt that you would have taken the task extremely seriously, and the company would have ended up in a better place.
Instead, they're down an employee, and the remaining employees know that if they make a mistake, they'll be out of the door.
And they still have an empty users table.
In any case, I did a few things to make sure I never ended up destroying any data. Creating temporary tables and then manipulating those.. reading over my scripts for hours.. dumping table backups before executing any scripts.. not executing scripts in the middle/end of the day, only mornings when I was fresh etc etc.
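That "dump table backups before executing any scripts" habit is cheap insurance. A sketch of it (database and table names are placeholders):

    # snapshot just the table the script is about to touch
    mysqldump mygame raids > raids-$(date +%Y%m%d-%H%M).sql
    # ...run the risky script...
    # worst case, the dump (which includes DROP/CREATE TABLE) puts it back:
    #   mysql mygame < raids-20100601-0900.sql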
I didn't mess up, but I remember how incredibly nerve-wracking that was, and I can relate to the massive amount of responsibility it places on a "junior" programmer. It just should never be done. Like others have said, you should never have been in that position. Yes, it was your fault, but this kind of responsibility should never have been placed on you (or anyone, really). Backing up all critical data (what kind of company doesn't back up its users table?! What if there had been hard disk corruption?) and being able to restore in minimum time should have been dealt with by someone above your pay grade.
Just to add some more thoughts based on other comments.. yes a lot of companies do stuff like this, particularly startups. The upside in these situations is that you end up learning things extremely quickly which wouldn't be possible in a more controlled environment. However not having backup and restore working is just ridiculous and I keep shaking my head at how they blamed the OP for this mistake. Unbelievable.
Or a coworker will find the login in your scripts, repurpose it, then notice they need more rights and "fix" the account for you.
SELECT * FROM my_200_GB_table will always be there.
It's actually quite nice using a database server that doesn't require explicit credentials to be used.
This happened a week before I started as a Senior Software Engineer. I remember getting pulled into a meeting where several managers who knew nothing about technology were desperately trying to place blame, figure out how to avoid this in the future, and so on.
"There should have been automated backups. That's really the only thing inexcusable here.", I said.
The "producer" (no experience, is now a director of operations, I think?) running the meeting said that was all well and good, but what else could we do to ensure that nobody makes this mistake again? "People are going to make mistakes", I said, "what you need to focus on is how to prevent it from sinking the company. All you need for that is backups. It's not the engineer's fault.". I was largely ignored (which eventually proved to be a pattern) and so went on about my business.
And business was dumb. I had to fix an awful lot of technical things in my time there.
When I started, only half of the client code was in version control. And it wasn't even the most recent shipped version. Where was the most recent version? On a Mac Mini that floated around the office somewhere. People did their AS3 programming in notepad or directly on the timeline. There were no automated builds, and builds were pushed from peoples' local machines -often contaminated by other stuff they were working on. Art content live on our CDN may have had source (PSD/FLA) distributed among a dozen artist machines, or else the source for it was completely lost.
That was just the technical side. The business/management side was and is actually more hilarious. I have enough stories from that place to fill a hundred posts, but you can probably get a pretty good idea by imagining a yogurt-salesman-cum-CEO, his disbarred ebay art fraudster partner, and other friends directing the efforts of senior software engineers, artists, and other game developers. It was a god damn sitcom every day. Not to mention all of the labor law violations. Post-acquisition is a whole 'nother anthology of tales of hilarious incompetence. I should write a book.
I recall having lunch with the author when he asked me "What should I do?". I told him that he should leave. In hindsight, it might have been the best advice I ever gave.
What I want to know is what happened to whoever decided that backups were a dispensable luxury? In 2010?
There's a rule that appears in Jerry Weinberg's writings: the person responsible for an X-million-dollar mistake (and who should be fired over such a mistake) is whoever has controlling authority over X million dollars' worth of the company's activities.
A company-killing mistake should result in the firing of the CEO, not in that of the low-level employee who committed the mistake. That's what C-level responsibility means.
(I had the same thing happen to me in the late 1990's, got fired over it. Sued my employer, who opted to settle out of court for a good sum of money to me. They knew full well they had no leg to stand on.)
"We make astonishingly fun, ferociously addictive games that run on social networks.
...KlickNation boasts a team of extremely smart, interesting people who have, between them, built several startups (successful and otherwise); written a novel; directed music videos; run game fan sites; illustrated for Marvel Comics and Dynamite Entertainment with franchises like Xmen, Punisher, and Red Sonja; worked on hit games like Tony Hawk and X-Men games; performed in rock bands; worked for independent and major record lables; attended universities like Harvard, Stanford, Dartmouth, UC Berkeley; received a PhD and other fancy degrees; and built a fully-functional MAME arcade machine."
And this is hilarious: their "careers" page gives me a 404:
That link to "careers" is from this page:
I am tempted to apply simply to be able to ask them about this. It would be interesting to hear if they have a different version of this story, if it is all true.
Let's be honest, we all have one or two.. and if you don't, then your one or two are coming. It's what you learned to do differently that I care about.
And if you don't have one, you're either a) incredibly lucky, b) too new to the industry, or c) lying.
When people say "making mistakes is unacceptable - imagine if doctors made mistakes" they ignore three facts:
1. Doctors do make mistakes. Lots of them. All the time.
2. Even an average doctor is paid an awful lot more than me.
3. Doctors have other people analysing where things can go wrong, and recommending fixes.
If you want fewer development mistakes, as a company you have to accept it will cost money and take more time. It's for a manager to decide where the optimal tradeoff exists.
This is absolutely it. Of course it is possible to become so risk-averse that you never actually succeed in getting anything done, and there are certainly organisations that suffer from that (usually larger ones).
However, some people seem to take the view that because it is impossible to protect oneself from all risks, it is pointless to protect against any of them.
The good news is that protecting against risks tends to get exponentially more expensive as you add "nines", so a 99% guarantee against data loss is a lot cheaper than a 99.999% guarantee.
Having a cronjob that does a mysqldump of an entire database, emails some administrator and then does rsync to some other location (even just a dropbox folder) is something that is probably only a couple of hours work.
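A minimal sketch of exactly that (the hosts, paths, and addresses are invented; the point is that the copy lands somewhere the production box can't quietly overwrite and a human hears about it every day):

    #!/bin/sh
    # nightly: dump everything, ship it off-box, tell a human it happened
    # (run from cron, e.g. 30 3 * * * /usr/local/bin/nightly-backup.sh)
    set -e
    STAMP=$(date +%Y%m%d)
    DUMP=/var/backups/all-$STAMP.sql.gz
    mysqldump --single-transaction --all-databases | gzip > "$DUMP"
    rsync -a "$DUMP" backup@offsite.example.com:/srv/db-backups/
    echo "backup $STAMP ok, $(du -h "$DUMP" | cut -f1)" \
        | mail -s "nightly db backup OK" admin@example.com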
This. I don't regret my life's many failures. I regret the times I've flamed out, blamed others, or ran away.
Doing things means making mistakes. You can spot a professional by how they deal with mistakes.
I asked similar, and agree that it's a really useful question.
I think it's an especially great one for startups, as successful candidates are more likely to come into contact with production systems.
For these positions you not only want people capable of recovering from accidents, but also people who have screwed up because, conversely, they've been trusted not to screw up. Those who've never been trusted enough to be in a position to damage a system are unlikely to be of much use.
In most companies there's enough process to protect you from the big screw ups.
Here's something I don't get: didn't Rackspace have their own daily backups of the production server, e.g. in case their primary facility was annihilated by a meteor (or some more mundane reason, like hard drive corruption)?
Regardless, here's a thought experiment: suppose that Rackspace did keep daily backups of every MySQL instance in their care, even if you're not paying for the backup service. Now suppose they get a frantic call from a client who's not paying for backups, asking if they have any. How much of a ridiculous markup would Rackspace need to charge to give the client access to this unpaid-for backup, in order to make the back-up-every-database policy profitable? I'm guessing this depends on 1) the frequency of frantic phone calls, 2) the average size of a database that they aren't being paid to back up, 3) the importance and irreplaceability of the data that they're handling, and 4) the irresponsibility of their major clients.
Yes it would be nice if Rackspace could speculatively create a backup but they'd be dancing on ice doing so.
Take my own company: I've accidentally deleted /dev on development servers (not that major an issue thanks to udev, but the timing of the mistake was lousy), a co-worker recently dropped a critical table on the dev database, and we've had other engineers break Solaris by carelessly punching in chmod -R / as root (we've since revised engineers' permissions so this is no longer possible). As much as those errors are stupid, and as much as engineers of our calibre should know better, it only takes a minor lack of concentration at the wrong moment to make a major fsck up. Which is doubly scary when you consider how many interruptions the average engineer gets a day.
So I think the real guilt belongs to the entire technical staff, as this was a cascade of minor fsck ups that led to something catastrophic.
What a screw up!
These mistakes are almost without fail a healthy mix of individual incompetence and organisational failure. Many things - mostly my paying better attention to functionality I rewrite, but also the company not having multiple undocumented systems for one task, or code review, or automated testing - might have saved the day.
[†] They've long been removed.
Once I was finished, I saved the document -- "Save" took around two minutes (which is why I rarely saved).
I had an external monitor that was sitting next to the PC; while the saving operation was under way, I decided I should move the monitor.
The power switch was on top of the machine (an unusual design). While moving the monitor I inadvertently touched this switch and turned the PC off... while it was writing the file.
The file was gone, there was no backup, no previous version, nothing.
I had moved the monitor in order to go to bed, but I didn't go to bed that night. I moved the monitor back to where it was, and spent the rest of the night recreating the report, doing frequent backups on floppy disks, with incremental version names.
This was in 1989. I've never lost a file since.
That was back in the summer of 1978; today I have an LTO-4 tape drive driven by Bacula and back up the most critical stuff to rsync.net, the latter of which saved my email archive when the Joplin tornado roared next to my apartment complex and mostly took out a system I had next to my balcony sliding glass doors, plus the disks in another room with my BackupPC backups.
As long as we're talking about screwups, my ... favorite was typing kill % 1, not kill %1, as root, on the main system the EECS department was transitioning to (that kills the initializer "init", from which all child processes are forked). Fortunately it wasn't under really serious heavy use yet, but it was embarrassing.
We actually had a continuous internal backup plan, but when I requested a restore, the IT guy told me they were backing up everything but the databases, since "they were always in use."
(Let that sink in for a second. The IT team actually thought that was an acceptable state of affairs: "Uh, yeah! We're backing up! Yeah! Well, some things. Most things. The files that don't get like, used and stuff.")
That day was one of the lowest feelings I ever had, and that screwup "only" cost us a few thousand dollars as opposed to the millions of dollars the blog post author's mistake cost the company. I literally can't imagine how he felt.
Personally, I felt bad when I deleted some files that were recovered within the hour, and I learned from that experience. But when you can create a monumental setback, as the OP did, with a simple mistake, that's an issue with the people at higher ranks.
(The other character is a # or $ depending on whether the user is root or not.)
Did you change the $PS1 variable? Can you share your config?
It means having a slightly difference .bashrc for each machine, but it's trivial.
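Nothing fancy; something along these lines (the hostname and colour are just examples):

    # ~/.bashrc -- make it painfully obvious which machine this shell is on
    if [ "$(hostname -s)" = "prod-db1" ]; then
        PS1='\[\e[1;31m\]\u@\h:\w\$ \[\e[0m\]'   # bright red prompt on production
    else
        PS1='\u@\h:\w\$ '
    fi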
(molly-guard makes you type in the hostname before a halt/shutdown/reboot command.)
Have a wonderful day! (And I'll definitely look at installing molly-guard on my production Debian servers.)
1) Not having backups is an excuse-less monumental fuckup.
2) Giving anyone delete access to your production db, especially a junior dev through a GUI tool, is an excuse-less monumental fuckup.
Hopefully they rectified these two problems and are now a stronger company for it.
Some experiences are non-transferable. This identical conversation has taken place millions of times, but noooo: every penny-wise-pound-foolish CEO wants to experience the real thing, apparently.