More than anything else, this describes an appalling failure at every level of the company's technical infrastructure to ensure even a basic degree of engineering rigor and fault tolerance. It's noble of the author to quit, but it's not his fault. I cannot believe they would have the gall to point the blame at a junior developer. You should expect humans to fail: humans are fallible. That's why you automate.
More than that, it's telling that the company threw him under the bus when it happened. I've been through major fuckups before, and in all cases the team presents a united front - the company fucked up, not an individual.
Which is, if you think about it, true, given the series of events leading up to the disaster (the lack of a testing environment, working with prod databases, the lack of safeties in the tools used to connect to the database, etc.).
The correct way to respond to disasters like this is "we fucked up", not "someone fucked up".
When things like this happen, you have to realize there is more than one 'truth'.
There is the truth that someone truncated the users table and, in doing so, caused the company great harm.
Here's another 'truth'.
1. The company lacked backups
2. The junior developer was on a production database.
Note: I'm from the old school of sysops who feel that you don't give every employee the keys to your kingdom. Not every employee needs access to every server, nor do they need the passwords for such.
3. Was there a process change? I doubt it; more likely they made the employee feel like a failure every day and reminded him that he could be fired at any moment. So he did the only thing he could do to escape: quit!
Horrible and wrong; if there were a good ethics lawyer around, he would say it smells ripe for a lawsuit.
... That said, that lawyer and lawsuit won't fix anything.
The problem isn't who has the keys, it's how they're used. I don't care as much if a junior developer has the prod password; I care more about building an engineering and ops team that understands that dicking around with the prod database isn't okay. Sysops and DBAs are fallible too--I've seen a lot of old school shops that relied heavily on manual migration and configuration. Automate, test, isolate and expect failure!
And, most importantly, shake off the "agile" ethos when doing DB migrations. Just forget they exist and triple-check every character you type into the console.
What, exactly, do you think the "agile" ethos is?
We are agile, but that does _not_ mean that we don't triple-check everything we type into production consoles.
It does, however, mean that everything we do in production has been done before in at least one test environment. Why would we want to forget about doing that?
The fact that the cheap ass idiots at the company had cancelled their backup protection at Rackspace, and lacked any other form of backup is just complete incompetence.
If it hadn't been a junior developer whom nobody noticed or cared was using the prod DB for dev work, it would have been an outright hardware failure eventually. DBs fail, and if you don't have backups you are not doing your damn job.
The CEO should be ashamed of himself, but the lead engineers and the people at the company who were nasty to this guy should all be even more ashamed of themselves.
I have to chime in and completely agree. Very lucky. Most people who survive for years at companies have learned to either stay out of sight, or navigate the Treacherous Waters of Blame whenever things go wrong.
This is actually one of the things most employees who have never been managers don't understand.
Your comment makes me think. Are you implying that this is a good practice?
I mean, in fact I do something similar. At our company, a lot of stuff goes wrong too. Somehow it surprises me that there hasn't been a major fuckup yet. But I do realize that I need to watch out at all times so that blame never concentrates on me.
It is so easy to blame individuals; it suffices just to have participated somehow in a task that fucked up. Given that all the other participants keep a low profile, one needs to learn how to defend/attack in times of blame.
I absolutely do NOT think it is a good practice. I think it is what lazy companies full of people afraid to lose their jobs do. I think it's most companies.
The reality is that fear is a greater motivator than any other emotion - over anger, sorrow, happiness. So companies create cultures of fear which results in productivity (at least a baseline, 'do what I need to do or not get fired' productivity), but little innovation and often at the expense of growth.
Plus, it's just hell. You want to do great things, but know you are stepping into the abyss every time you try.
You (and the other commenters with similar strategies) are wasting productive years of your life at jobs like these. You should go on a serious job hunt for a new position, and leave these toxic wastelands before they permanently affect your ability to work in a good environment.
I wish I could say I have some sort of prescience about bad working environments. In reality I'm just good at getting out while the gettin' is still good.
Apparently their logo is ";-)". At first I thought they just had this annoying tendency to overuse the wink emoticon and found it a bit creepy. Then I saw it all over their menus and found it a bit creepy. Then I realized they appeared to be using it as their logo and I found it creepy.
When you think about it it's almost a logical certainty that he'd take the fall. Any company collectively able to understand that the actual failure was inadequate safeguards would have been able to see it coming and presumably would have prevented it from happening. If you're so inexperienced that you expect no one will ever make a mistake you'll obviously assume that the only problem was that someone made one.
It's somewhat fascinating to hear that a company this damaged managed to build a product that actually had users. And I thought my impression of the social gaming segment couldn't go any lower.
Indeed. The news here is not that a junior developer made an error with SQL db (everyone makes them, eventually), nor that the company did not have proper safeguards against problems (this happens far too often). The news is that despite such basic management incompetence, the company had been able to get a large number of paying customers.
Just think, the company spent millions of dollars training this guy in protecting production data, and then instead of treating him like gold, and putting him in charge of fixing all the weaknesses that allowed this to happen, they pushed him out of the company. Stupidity beyond measure.
Not surprised that this was a gaming company because lack of teamwork seems to be endemic in that industry.
> The correct way to respond to disasters like this is "we fucked up", not "someone fucked up".
Can't agree more! A company that starts playing the blame game - and even communicates this to the outside - a) looks unprofessional, b) is unprofessional, and c) poisons the corporate culture.
It's a fail on every possible level, the technical part being only minor.
Agreed. How could a company with "millions in revenue" not backup critical databases? Not only were they exposed to the threat of human error, but hardware failures, hackers, etc. When he submitted his resignation, the company should have encouraged him to stay. Instead, anyone at the company that had anything to do with the failure to implement regular database backups and the use of redundant databases should have been fired.
I've had more than my fair share of failures in the start up world. It always drives me crazy to see internet companies that have absolutely no technical ability succeed, while competing services that are far superior technically never get any traction.
It's the difference between understanding the incident (programmer makes mistake) and the root cause (failing at data integrity 101). Not just no backups, but no audit log of what's happened in the game - I'd expect an MMO to have an append-only event history quite apart from their state information.
At least all transactions for purchased game items should have been logged in a separate database. There's nothing worse for a digital content company than forgetting who bought what.
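A minimal sketch of what that could look like, assuming PostgreSQL and psycopg2 (the table, columns, and app_role are made up for illustration): the application role can only insert and read, so even if someone truncates the state tables, who bought what can be rebuilt from the log.

```python
# Minimal sketch of an append-only purchase log, assuming PostgreSQL via psycopg2.
# Table, column, and role names are hypothetical; app_role is assumed to exist.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS purchase_log (
    id          BIGSERIAL PRIMARY KEY,
    user_id     BIGINT      NOT NULL,
    item_id     BIGINT      NOT NULL,
    price_cents INTEGER     NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Keep the log append-only for the application role.
REVOKE UPDATE, DELETE ON purchase_log FROM app_role;
GRANT INSERT, SELECT ON purchase_log TO app_role;
GRANT USAGE ON SEQUENCE purchase_log_id_seq TO app_role;
"""

def record_purchase(conn, user_id: int, item_id: int, price_cents: int) -> None:
    """Append one purchase event; rows are never updated or deleted."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO purchase_log (user_id, item_id, price_cents) VALUES (%s, %s, %s)",
            (user_id, item_id, price_cents),
        )
    conn.commit()

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=game")  # hypothetical DSN
    with conn.cursor() as cur:
        cur.execute(DDL)
    conn.commit()
    record_purchase(conn, user_id=42, item_id=7, price_cents=499)
```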
Spending $5k on a UPS is very different from not having backups of your production database which runs your multimillion dollar business. This story just doesn't add up.
You misunderstood the comment. Angersock is saying that the networking equipment cost more than $5k and the friend was unwilling to buy a UPS to protect the equipment.
Sounds completely possible to me - I have worked in a lot of environments and at some places I've seen decisions that make cancelling backups look like an act of genius.
One interesting observation I can make: no correlation between excellence in operations and commercial success!
I've seen something like this. It could be a situation where they were transitioning from that backup service to their own server or another service, and the move was never completed due to some hiccup or priority change.
But if no one is watching to ensure the move was finished (or they got distracted), then something many people treat as set-it-and-forget-it could easily get into that state.
If you are transitioning from one backup service to another, shouldn't you only cancel your old service after you've successfully set up and tested the new one?
I don't think it is that far-fetched. I've had experience with a well known company that does millions in revenue per week (web based shopping cart) that just FTP's everything with no DVCS. Designers, developers, managers all have access to the server and db.
It's not just noble. Considering how everyone treated him, and the company's attitude in general, he had no future there, and neither should anyone else.
This really needs to be more of a standard thing. I've been near (but as an engineer, never responsible for) production systems my whole career. None of these systems were as terribly maintained as the one in the linked article. Production data was isolated. Backups were done regularly. Systems were provisioned with fault tolerance in mind.
Not once have I seen a full backup restore tested. Not once have I seen a network failure simulated (though I've seen several system failures due to "kicking out a cable" that sort of act as a proxy for that technique). On multiple occasions I've seen systems taken down by single points of failure[1] that weren't foreseen, but probably could have been.
[1] My favorite: the whole closet went down once because everything was plugged into a single, very expensive, giant UPS that went poof. $40/system for consumer batteries from Office Depot would have been a much better bet. And the best part? Once the customer service engineer replaced whatever doodad failed and brought the thing back up? They plugged everything right back into it.
I'll never forget when my boss was showing the girl scouts (literally) our very expensive UPS room. He explained how even if the power goes off we'll switch to batteries then switch over to generator power. See watch, he says - then flicked the switch. Fooomm... our entire office goes dark.
This took down news information for a good chunk of Wellington finance for about half a day. (Fortunately Wellington, NZ is a tiny corner of the finance world).
Hilarious! But I admit I was super glad it was the boss playing chaos monkey, not me.
Back when I worked for a small ISP, we had a diesel generator in case the power went out longer than our UPS batteries would last. This provided a great sense of security until we decided to test the system by powering off the main breaker and... it didn't start!
It turns out the emergency stop button was pushed in. Easy enough for us to fix then, but if the power had gone out at 4am it would have been quite another matter.
After that incident, we turned off the main breaker to the building weekly. It was great fun, as most of our offices were in the same building. We had complaints for the first couple of months until everyone got used to it and had installed mini UPS's for their office equipment.
We did actually have to use the generator for real a while later. Someone had driven their car into the local power substation, and it was at least a month until it was fixed. Electricity was restored through re-routing fairly quickly, but until the substation was repaired we were getting a reduced voltage that caused the UPSs to slowly drain...
The last time they tested the diesel generator failover at a customer's site, the generator went on just fine, but then it did not want to switch to mains again. The whole building was powered by the generator for almost two days, until they managed to convince the generator to switch.
> Not once have I seen a network failure simulated.
Reminds me of the webserver UPS setup at a previous company.
The router (for the incoming T1) and the webserver were plugged in to the UPS.
UPS connected (via serial port) to webserver. Stuff running on webserver to poll whether UPS running from mains power or batteries and send panic emails if on batteries (for more than 60 seconds) and eventually shutdown the webserver cleanly if UPS power dropped below 25%.
Thing not plugged in to UPS: DMZ Network switch (that provided the connectivity between webserver and router).
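(For anyone who wants the monitoring half of that today: a rough sketch using Network UPS Tools' `upsc`, rather than whatever serial protocol we were parsing back then. The UPS name, thresholds, and mail addresses are assumptions.)

```python
# Rough sketch of the polling idea described above, using NUT's `upsc` command.
# UPS name, thresholds, addresses, and the local MTA are all assumptions.
import os
import smtplib
import subprocess
import time
from email.message import EmailMessage

UPS = "myups@localhost"          # hypothetical NUT UPS name
ON_BATTERY_ALERT_AFTER = 60      # seconds, as in the setup above
SHUTDOWN_BELOW_CHARGE = 25       # percent

def read_ups():
    out = subprocess.check_output(["upsc", UPS], text=True)
    vals = dict(line.split(": ", 1) for line in out.splitlines() if ": " in line)
    on_battery = "OB" in vals.get("ups.status", "")      # "OB" = on battery
    charge = int(float(vals.get("battery.charge", "100")))
    return on_battery, charge

def panic_mail(subject: str) -> None:
    msg = EmailMessage()
    msg["Subject"], msg["From"], msg["To"] = subject, "ups@example.com", "ops@example.com"
    with smtplib.SMTP("localhost") as s:                  # assumes a local MTA
        s.send_message(msg)

battery_since = None
while True:
    on_battery, charge = read_ups()
    if on_battery:
        battery_since = battery_since or time.time()
        if time.time() - battery_since > ON_BATTERY_ALERT_AFTER:
            panic_mail(f"UPS on battery, charge {charge}%")
        if charge < SHUTDOWN_BELOW_CHARGE:
            os.system("shutdown -h now")                  # clean shutdown before the batteries die
    else:
        battery_since = None
    time.sleep(30)
```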
Doing that kind of testing is hard. It costs time and effort. If you want to see it done on a truly awe-inspiring scale (whole data centers being taken down by zombies ;) : http://queue.acm.org/detail.cfm?id=2371516
Doing this kind of testing in a gold-plated, heavily-engineered way is hard. But that's not an excuse for not doing it at all. Just walking into your closet and pulling a cable gets you 80-95% of the testing you need, and is free. Setting up a sandbox and "restoring" a backup onto it and then doing some quick queries is likewise easy to do and eliminates huge chunks of the failure space of "bad backups".
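To make the "restore onto a sandbox and run some quick queries" part concrete, here's a minimal sketch, assuming PostgreSQL tooling (createdb/pg_restore/dropdb) and a custom-format dump; the paths and table names are made up.

```python
# Minimal sketch of a throwaway restore-and-sanity-check. Assumes PostgreSQL client
# tools on PATH and a custom-format dump; all names below are made up.
import subprocess
import psycopg2

DUMP = "/backups/prod_latest.dump"   # hypothetical backup file
SANDBOX = "restore_test"

def run(*cmd):
    subprocess.run(cmd, check=True)

run("createdb", SANDBOX)
try:
    run("pg_restore", "--no-owner", "--dbname", SANDBOX, DUMP)

    conn = psycopg2.connect(dbname=SANDBOX)
    with conn.cursor() as cur:
        # The "quick queries": the tables exist and aren't suspiciously empty.
        cur.execute("SELECT count(*) FROM users")
        users = cur.fetchone()[0]
        cur.execute("SELECT max(created_at) FROM purchases")
        newest = cur.fetchone()[0]
    conn.close()
    assert users > 0, "users table restored but empty"
    print(f"restore OK: {users} users, newest purchase {newest}")
finally:
    run("dropdb", SANDBOX)           # throw the sandbox away either way
```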
Really, this attitude (that things have to be done right) is part of the problem here. To a seasoned IT wonk, the only alternative to doing something "The Right Way" is not doing it at all. And that's a killer in situations like these.
Don't hack your systems to make them work. Absolutely do hack at them to test.
"walking into your closet and pulling a cable" is not free, if your planned disaster recovery is not a seamless failover, but a process to recover data with some work and limited (nonzero) downtime/cost to business.
For example, our recovery plan for a financial mainframe in case of most major disasters was to restore the daily backup to off-site hardware identical to production hw; however, the (expensive) hardware wasn't "empty" but used as an acceptance test environment.
Doing a full test of the restore would be possible, but it would be a very costly disruption: multiple days of work for the actual environment restoration, deployment and testing, and then all of this once more to build a proper acceptance-test environment. It would also destroy a few man-months' worth of long tests-in-progress and prevent any change deployments while this is happening.
All of this would be reasonable in any real disaster, but such costs and disruptions aren't acceptable for routine testing.
"Chaos Monkey" works only if your infrastucture is built on cheap unstable and massively redundant items. You can also get excellent uptime with expensive, stable, massively controled environment with limited redundancy (100% guaranteed recovery, but not "hot failover") - but you can't afford chaos there.
To paraphrase: if you go with an awful hack job for your disaster recovery plan, testing is more expensive. And to extend: you won't actually test because it's "too expensive", and your disaster recovery plan won't work.
How is this distinct from "Don't hack your systems to make them work. Absolutely do hack at them to test."? I don't see it.
This just sounds like "my business doesn't have the financial capacity to engineer data recovery processes". Well, OK then. Just don't claim to be doing it.
We did know that we can recover backups because we did it for small parts of data, and we know that we can do disaster recovery because (a) we did test this, though very rarely; and (b) we had successfully recovered from actual full-scale disasters twice over ~7 years.
But a successful, efficient disaster recovery plan doesn't always mean "no damage" - it often means damage mitigation; i.e., we can fix this with available resources while meeting our legal obligations so that our customers don't suffer, not that there are no consequences at all. Valid data recovery plans ensure that data recovery really is possible and detail how it happens, but that recovery can be expensive. And while you can plan, document, train and test activities like "those 100 people will do X, and those 10 sales reps will call the involved customers and give them $X credit", you really don't want to put the plan into action without a damn good reason.
For example, a recovery plan for a bunch of disasters that are likely to cut all data lines from a remote branch to HQ involves documenting, printing & verifying a large pile of deal documents of the day, having them shipped physically and handled by a designated unit in the HQ. The process has been tested both as a practice and in real historical events.
However, if you "pull a wire in the closet" and cause this to happen just so, then you've just 'gifted' a lot of people a full night of emergency overtime work, and deserve a kick in the face.
All I can say is that you're very lucky to have a working system (and probably a company to work for), and I'm very lucky not to work where you do. Seriously, your "test" of a full disaster recovery was an actual disaster! More than one!
And frankly, if your response to the idea of implementing dynamic failure testing is that someone doing that should be "kicked in the face" (seriously, wtf? even the image is just evil), then shame on you. That's just way beyond "mistaken engineering practice" and well on the way to "Kafkaesque caricature of a bad IT department". Yikes.
Admittedly: you have existing constraints that make moving in the right direction expensive and painful. But rather than admit that you have a fragile system that you can't afford to engineer properly you flame on the internet against people who, quite frankly, do know how to do this properly. Stop.
I'd like not to stop, but continue exploring the viewpoints. And I'd like you and others to try and consider also less-tech solutions to tech problems if they meet the needs instead of automatically assuming that we made stupid decisions.
For example, any reasonable factory also has a disaster recovery process to handle equipment damage/downtime - some redundant gear, backup power, inventory of spare parts, guaranteed SLA's for shipping replacement, etc; But still, someone intentionally throwing a wrench in the machine isn't "dynamic failure testing" but sabotage that will result in anger from coworkers who'll have to fix this. Should their system be called "improperly engineered"?
We had great engineers implementing failover for a few 'hot' systems, but after much analysis we knowingly chose not to do it 'your way' for most of them since it wasn't actually the best choice.
I agree, in 99% of companies talked about in HN your way is undoubtedly better, and in tech startups it should be the default option. But there, much of the business process was people & phone & signed legalese, unlike any "software-first" businesses; and the tech part usually didn't do anything better than the employees could do themselves, but it simply was faster/cheaper/automated. So we chose functional manual recoveries instead of technical duplications. And you have to anyway - if your HQ burns down, who cares if your IT systems still work if your employees don't have planned backup office space to do their tasks? IT stuff was only about half of the whole disaster recovery problems.
In effect, we had an available "redundant failover system" at all times; it was just manual instead of digital. It wasn't fragile (it didn't break, ever - as I said, we tried), and it was fully functional (customers wouldn't notice), but it was very expensive to run - every hour of running the 'redundant system' meant hundreds of man-hours of overtime pay and hundreds of unhappy employees.
So, in such cases, you do scheduled disaster-testing and budget the costs of these disruptions as neccessary tests - but if someone intentionally hurts his coworkers by creating random unauthorised disruptions, then it's not welcome.
The big disadvantage of this actually is not the data recovery or systems engineering, but the fact that it hurts the development culture. I left there because in such a place you can't "move fast and break things"; instead everyone tends to ensure that every deployment really, really doesn't break anything. So we got very good system stability, but all the testing / QA usually required at least 1-2 months for any finished feature to go live - which fits their business goals (stability & cost efficiency rather than shiny features) but demotivates developers.
My favourite way to test restores is to do them frequently to the dev server from the production backups - this keeps the dev data set up to date, and works as a handy test of the restore mechanism. Of course if you have huge amounts of data or files on production this becomes more difficult, but not impossible, to manage.
This works well, though you may need an "anonymizer" (and maybe some extra compliance testing) if your systems have PCI or HIPAA data on them. We have federal restrictions against storing certain types of data on servers outside the US. Cloud computing sounds great, but neither Amazon nor Google will guarantee the data stays within the country's borders.
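A crude sketch of what such an anonymizer pass might look like, run against the dev copy after the restore (PostgreSQL via psycopg2 assumed; the table and column names are made up, and this alone won't satisfy PCI or HIPAA):

```python
# Crude anonymization pass run against the dev copy *after* restore, never against prod.
# Assumes PostgreSQL via psycopg2; table and column names are made up, and this is far
# from sufficient for real PCI/HIPAA compliance on its own.
import psycopg2

SCRUB = [
    # Replace emails with unique throwaway addresses; ids keep referential integrity.
    "UPDATE users SET email = 'user' || id || '@example.invalid'",
    # Wipe anything resembling cardholder data outright.
    "UPDATE payment_methods SET card_number = NULL, cvv = NULL",
    # Hash names so support can still tell rows apart without seeing real PII.
    "UPDATE users SET full_name = md5(full_name)",
]

conn = psycopg2.connect(dbname="dev_copy")   # hypothetical dev database
with conn.cursor() as cur:
    for stmt in SCRUB:
        cur.execute(stmt)
conn.commit()
conn.close()
```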
Minor correction: the Chaos Monkey was Netflix's innovation. It just happened to be implemented on Amazon's cloud. It would have been just as useful if they had their own colocated servers or used a different cloud computing provider.
Apple did this before Amazon or Netflix in this regard [1], but the point needs to be made that a system needs to be tested and not just in a controlled aseptic way, because the real world isn't.
Another story supporting Chaos Monkey is what the Obama team did for their Narwhal infrastructure - they staged outages and random failures to prepare for their big day, meanwhile Romney's team, who outspent the Obama team by at least an order of magnitude, had their system fail on e-day.
I'd like to see a source for Romney outspending the Obama team "at least" 10x, because while I can speak from experience that ORCA was a gigantic piece of shit, it's not like the Obama people were struggling to pay their bills.
I don't know what metric the parent comment is referring to, but in terms of technology stack, I can fully believe that the Romney team spent more than Obama's team. Here's a post by one of the creators of the fundraising platform:
I actually had that post in my mind when writing my reply, but I assumed r00fus was referring to ORCA and Narwhal specifically.
> ... what the Obama team did for their Narwhal infrastructure - they staged outages and random failures to prepare for their big day, meanwhile Romney's team who outspent the Obama team at least an order of magnitude, had their system fail on e-day.
Exactly what you said. Rigor is truly the right word to use here. Cancelling your db backups is basically asking for a disaster. I'm not sure I have ever been at a job that didn't require a db backup for some reason at some time.
This was my question. Usually the "junior" folks are shown how to do things by the senior engineers. The fact they threw this guy under the bus while letting the rest of the senior guys skate is appalling.
Part of your job as a senior developer is to ensure this very scenario doesn't happen, let alone to someone on your watch.
The author mentions using a UI to connect to their db. If I were in a position over there, I could see myself writing a script to clear out the tables I wanted. This reduces errors, but not the risk.
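Something along these lines, say - a rough sketch with made-up names, whitelisted so a typo or misclick can't reach the users table (though, as the reply below points out, a script is still a human action):

```python
#!/usr/bin/env python3
# Rough sketch of a "clear only this table" helper; all names are made up.
# The whitelist keeps a slip from touching users; it doesn't make working
# directly against prod a good idea.
import sys
import psycopg2

ALLOWED_TABLES = {"raids"}          # the only table this script may ever clear

def clear_table(table: str) -> None:
    if table not in ALLOWED_TABLES:
        sys.exit(f"refusing to clear '{table}': not in whitelist {sorted(ALLOWED_TABLES)}")
    if input(f"Type the table name again to confirm clearing '{table}': ") != table:
        sys.exit("confirmation mismatch, aborting")
    conn = psycopg2.connect(dbname="game")       # hypothetical DSN
    with conn.cursor() as cur:
        cur.execute(f"TRUNCATE TABLE {table}")   # table name came from the whitelist
    conn.commit()
    conn.close()

if __name__ == "__main__":
    clear_table(sys.argv[1] if len(sys.argv) > 1 else "raids")
```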
And yet, the net result is the same; right-click, clear table, or ./clearTable.sh. Both are human actions, and both are fallible. What if some prankster edited clearTable.sh to do the users table instead of the raids table? What if he did it himself to test something?
Heck when you put it that way, this guy actually did them a FAVOR. He ONLY wiped out the User table. The company was able to learn the value of backups, and they had enough data left to be able to partially recover it from the remaining tables, which is much better than the worst case scenario.
Do you think the company learned the value of backups, or do you think they learned to blame junior devs for fuckups? Sounds like they learned nothing, because no attempt was made to determine the root cause.
Exactly. I'm ashamed that my first reaction reading this was to blame OP. But in the 2 min it took to read the post I had come full circle to wondering what kind of terribly run company would allow this to happen--I guess the type that hires philosophy majors straight out of college without vetting their engineering skills.
"Nobody cares about technical infrastructure. Our customers don't pay us for engineering rigor. We need to just ship!"
Of course the person saying that is likely to care about technical infrastructure when it costs them money and/or customers due to being hacked-together.
When I was a junior analyst, I once deleted the main table that contained all 70k+ users, passwords, etc. The problem was fixed in 15 minutes after the DBA was engaged to copy all the data back from the QA environment that was synchronized every X minutes. Or we could have restored a backup from a few hours ago.
The whole company fucked this one up pretty badly. NO excuses.
Indeed. If it had been a sporadic hardware failure, they would have been exactly as screwed. The fact that they gave an overworked junior dev direct read/write access to the production database is astounding.
It's not that they gave him r/w at all that's so criminally stupid. It's that they required him to clear a table manually, using generic full-access tools, over and over.
In reality, this should have been re-factored to the dev db.
If it couldn't be, the junior dev should have been given write access to the raids table alone (a sketch of what that grant looks like follows this comment).
Lastly the developer who didn't back up this table is the MOST to blame. Money was paid for the state in that table. That means people TRUST you to keep it safe.
I count tons of people to blame here. I don't really see the junior dev as one of them.
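For what it's worth, a narrow grant like the one suggested above is only a few statements - a sketch assuming PostgreSQL, with hypothetical role, database, and table names:

```python
# Sketch of a narrowly scoped account for the junior dev's recurring task.
# Assumes PostgreSQL via psycopg2; role, database, and table names are hypothetical.
import psycopg2

GRANTS = [
    "CREATE ROLE junior_dev LOGIN PASSWORD 'change-me'",
    "GRANT CONNECT ON DATABASE game TO junior_dev",
    "GRANT USAGE ON SCHEMA public TO junior_dev",
    "GRANT SELECT ON ALL TABLES IN SCHEMA public TO junior_dev",       # read-only everywhere
    "GRANT INSERT, UPDATE, DELETE, TRUNCATE ON raids TO junior_dev",   # writes on raids only
]

conn = psycopg2.connect(dbname="game")   # run as an admin role
with conn.cursor() as cur:
    for stmt in GRANTS:
        cur.execute(stmt)
conn.commit()
conn.close()
```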
Yeah I think him manually clearing a table over and over again was the big problem here. The amount of entropy that had to be introduced into the process to turn a routine task into millions of dollars of loss was tiny. He just needed to click in a slightly different spot on the webpage to bring up USERS instead of RAIDS.
I worked for a startup a few years back where the CEO deleted the entire 1+TB database/website when the server was hacked and being used as a spam server, because he 1. didn't know how to disable the site short of deleting it and 2. couldn't reach anyone who did know how.
The next morning he told us to restore the site from backups and fix the security hole. That's when we reminded him, again, that he had refused to pay for backup services for a site of that size.
We all ended up looking for new jobs within a couple of days.
If you have only one copy of data, especially if it is important, the chance of something happening to that copy, either hardware, software, or human error, is always big enough to justify a backup. No hindsight needed for that.
Thinking about this just makes my mind go numb. All someone had to do was have the idea that they should back their database up. It would be done in 5 minutes, tops!
I automated ours on S3 with email notifications in under an hour...
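For scale, a bare-bones version of that kind of automation (pg_dump to S3 plus an email either way) really does fit in an hour or so - a sketch assuming a configured AWS CLI, a local MTA, and made-up bucket and database names:

```python
#!/usr/bin/env python3
# Bare-bones nightly backup: pg_dump -> gzip -> S3, then an email either way.
# Assumes a configured AWS CLI, a local MTA, and made-up bucket/database names.
# Run from cron, e.g.:  0 3 * * * /usr/local/bin/backup_db.py
import datetime
import gzip
import smtplib
import subprocess
from email.message import EmailMessage

DB, BUCKET = "game", "s3://example-db-backups"

def notify(subject: str) -> None:
    msg = EmailMessage()
    msg["Subject"], msg["From"], msg["To"] = subject, "backups@example.com", "ops@example.com"
    with smtplib.SMTP("localhost") as s:
        s.send_message(msg)

def main() -> None:
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    path = f"/tmp/{DB}-{stamp}.sql.gz"
    dump = subprocess.run(["pg_dump", DB], check=True, capture_output=True).stdout
    with gzip.open(path, "wb") as f:
        f.write(dump)
    subprocess.run(["aws", "s3", "cp", path, f"{BUCKET}/{DB}-{stamp}.sql.gz"], check=True)
    notify(f"DB backup OK: {DB}-{stamp}.sql.gz")

if __name__ == "__main__":
    try:
        main()
    except Exception as exc:
        notify(f"DB backup FAILED: {exc}")
        raise
```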
Exactly. Problems with production databases are inevitable. It's just a matter of time.
The guy who should be falling on the sword, if anyone, is the person in charge of backups.
Better yet, the CEO or CTO should have made this a learning opportunity and taken the blame for the oversight + praised the team for banding together and coming up with a solution + a private chat with the OP.
Unbelievable. Even at my 3 person startup, back in 2010, with thousands in revenue, not millions, we had development environments with test databases and automated daily database snapshotting. Sure I've accidentally truncated a few tables in my time, but luckily I wasn't dumb enough to be developing on a production server.
I cannot even begin to fathom how they functioned without a working development environment for testing, let alone let their backups lapse.
The kind of table manipulations he mentions would be unspeakable in most companies. Someone changing the wrong table would be inevitable. If I were an auditor, I would rake them all over the coals.
Absolutely. Your immediate technical management sucked, and you were made the scapegoat for your management's failure. Welcome to the real world.
Don't get me wrong, you should feel bad, very bad, bad enough that you never do that again. But you shouldn't feel guilty, nor rethink your career.
A little bit of feeling guilty is in order; the author "didn't know that he didn't know", and I'm sure this motivated him to learn a lot more about proper engineering processes ... something I'll note aren't particularly a focus in CS degrees. Especially since I haven't come across anyone who's really dedicated to them who hadn't first gotten burned in one way or another.
There's a big difference between being told "Do X, don't do Y" and that sinking feeling you get when you realize a big problem exists, regardless of the eventual outcome.
You're forgetting the guy who didn't speak up to say "Hey, maybe we shouldn't do this in prod?".
I feel for him, but at the same time there's a point at which you have to ask whether testing guns by shooting them near (but not specifically at) your coworkers is actually a good idea.
Agree with the point that humans are fallible. We should always have backups. A company with 1000s of paying customers should at least take steps to protect itself from this sort of catastrophe.
To the credit of the management, they did not fire him. He resigned. But the coworkers felt he was personally responsible. That makes for an uneasy work environment.